Last updated on September 13, 2025. This conference program is tentative and subject to change.
Technical Program for Wednesday October 22, 2025

WeAT1
401
Award Finalists 5
Regular Session

10:30-10:35, Paper WeAT1.1
Robust Ladder Climbing with a Quadrupedal Robot

Vogel, Dylan | ETH Zurich |
Baines, Robert Lawrence | ETH Zurich |
Church, Joseph | ETH RSL |
Lotzer, Julian | ETH Zurich |
Werner, Karl | ETH Zurich |
Hutter, Marco | ETH Zurich |
Keywords: Field Robots, Legged Robots, Industrial Robots
Abstract: Quadruped robots are proliferating in industrial environments where they carry sensor payloads and serve as autonomous inspection platforms. Despite the advantages of legged robots over their wheeled counterparts on rough and uneven terrain, they are still unable to reliably negotiate a ubiquitous feature of industrial infrastructure: ladders. Inability to traverse ladders prevents quadrupeds from inspecting dangerous locations, puts humans in harm's way, and reduces industrial site productivity. In this paper, we learn quadrupedal ladder climbing via a reinforcement learning-based control policy and a complementary hooked end effector. We evaluate the robustness in simulation across different ladder inclinations, rung geometries, and inter-rung spacings. On hardware, we demonstrate zero-shot transfer with an overall 90% success rate at ladder angles ranging from 70° to 90°, consistent climbing performance during unmodeled perturbations, and climbing speeds 232x faster than the state of the art. This work expands the scope of industrial quadruped robot applications beyond inspection on nominal terrains to challenging infrastructural features in the environment, highlighting synergies between robot morphology and control policy when performing complex skills. More information can be found at the project website: https://sites.google.com/leggedrobotics.com/climbingladders .
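
As a toy illustration of the randomized robustness sweep the abstract describes, the following sketch draws a ladder inclination, rung radius, and inter-rung spacing per episode; everything here is hypothetical, and simulate_climb merely stands in for a physics rollout of the learned policy.

```python
import random

def simulate_climb(incline_deg, rung_radius_m, rung_spacing_m):
    """Stand-in for a simulator rollout of the learned climbing policy."""
    # Toy model: success probability drops as the ladder approaches vertical.
    return random.random() < 1.0 - 0.01 * (incline_deg - 70.0)

def robustness_sweep(n_trials=1000):
    successes = 0
    for _ in range(n_trials):
        incline = random.uniform(70.0, 90.0)   # inclination (deg), per the abstract
        radius = random.uniform(0.01, 0.03)    # rung radius (m), assumed range
        spacing = random.uniform(0.25, 0.35)   # inter-rung spacing (m), assumed range
        successes += simulate_climb(incline, radius, spacing)
    return successes / n_trials

print(f"Success rate: {robustness_sweep():.1%}")
```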

10:35-10:40, Paper WeAT1.2
FlipWalker: Jacob's Ladder Toy-Inspired Robot for Locomotion across Diverse, Complex Terrain

Li, Diancheng | Queen’s University |
Ralston, Nia | Queen's University |
Hagen, Bastiaan | Queen's University |
Tan, Phoebe | Queen's University |
Robertson, Matthew | Queen's University |
Keywords: Climbing Robots, Legged Robots, Flexible Robotics
Abstract: This paper introduces FlipWalker, a novel underactuated robot locomotion system inspired by the Jacob's Ladder illusion toy, designed to traverse challenging terrains where wheeled robots often struggle. Like the Jacob’s Ladder toy, FlipWalker features two interconnected segments joined by flexible cables, enabling it to pivot and flip around singularities in a manner reminiscent of the toy’s cascading motion. Actuation is provided by motor-driven legs within each segment that push off either the ground or the opposing segment, depending on the robot’s current configuration. A physics-based model of the underactuated flipping dynamics is formulated to elucidate the critical design parameters governing forward motion and obstacle clearance or climbing. The untethered prototype weighs 0.78 kg and achieves a maximum flipping speed of 0.2 body lengths per second. Experimental trials on artificial grass, river rocks, and snow demonstrate that FlipWalker’s flipping strategy, which relies on ground reaction forces applied normal to the surface, offers a promising alternative to traditional locomotion for navigating irregular outdoor terrain.

10:40-10:45, Paper WeAT1.3
PainDiffusion: Learning to Express Pain

Dam, Quang Tien | Ritsumeikan University |
Nguyen, Tri Tung Nguyen | Ritsumeikan University |
Endo, Yuuki | Ritsumeikan University |
Tran, Dinh Tuan | Shiga University |
Lee, Joo-Ho | Ritsumeikan University |
Keywords: Gesture, Posture and Facial Expressions, Emotional Robotics, Perception-Action Coupling
Abstract: Accurate pain expression synthesis is essential for improving clinical training and human-robot interaction. Current Robotic Patient Simulators (RPSs) lack realistic pain facial expressions, limiting their effectiveness in medical training. In this work, we introduce PainDiffusion, a generative model that synthesizes naturalistic facial pain expressions. Unlike traditional heuristic or autoregressive methods, PainDiffusion operates in a continuous latent space, ensuring smoother and more natural facial motion while supporting indefinite-length generation via diffusion forcing. Our approach incorporates intrinsic characteristics such as pain expressiveness and emotion, allowing for personalized and controllable pain expression synthesis. We train and evaluate our model using the BioVid HeatPain Database. Additionally, we integrate PainDiffusion into a robotic system to assess its applicability in real-time rehabilitation exercises. Qualitative studies with clinicians reveal that PainDiffusion produces realistic pain expressions, with a 31.2% ± 4.8% preference rate against ground-truth recordings. Our results suggest that PainDiffusion can serve as a viable alternative to real patients in clinical training and simulation, bridging the gap between synthetic and naturalistic pain expression.

10:45-10:50, Paper WeAT1.4
Learning from Human Conversations: A Seq2Seq Based Multi-Modal Robot Facial Expression Reaction Framework in HRI

Shangguan, Zhegong | University of Manchester |
Hei, Xiaoxuan | ENSTA Paris, Institut Polytechnique De Paris |
Li, Fangjun | University of Manchester |
Yu, Chuang | University College London |
Song, Siyang | University of Exeter |
Zhao, Jianzhuang | Istituto Italiano Di Tecnologia |
Cangelosi, Angelo | University of Manchester |
Tapus, Adriana | ENSTA Paris, Institut Polytechnique De Paris |
Keywords: Social HRI, Humanoid Robot Systems, Imitation Learning
Abstract: Nonverbal communication plays a crucial role in both human-human and human-robot interactions (HRIs), where facial expressions convey emotions, intentions and trust. Enabling humanoid robots to generate human-like facial reactions in response to human speech and facial behaviours remains a significant challenge. In this work, we leverage human-human interaction (HHI) datasets to train a humanoid robot, allowing it to learn and imitate facial reactions to both speech and facial expression inputs. Specifically, we extend a sequence-to-sequence (Seq2Seq)-based framework that enables robots to simulate human-like virtual facial expressions that are appropriate for responding to the perceived human user behaviours. Then, we propose a deep neural network-based motor mapping model to translate these expressions into physical robot movements. Experiments demonstrate that our facial reaction–motor mapping framework successfully enables robotic self-reactions to various human behaviours, where our model can best predict 50 frames (two seconds) of facial reactions in response to the input user behaviour of the same duration, aligning with human cognitive and neuromuscular processes. Our code is provided at https://github.com/mrsgzg/Robot_Face_Reaction.
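
A minimal sketch of the Seq2Seq idea described above, with assumed feature sizes (the authors' exact architecture is not specified here): an encoder summarizes 50 frames of perceived user behaviour, and a decoder autoregressively emits 50 frames (two seconds) of facial-reaction parameters, which the separate motor-mapping network would then translate into robot movements.

```python
import torch
import torch.nn as nn

class ReactionSeq2Seq(nn.Module):
    def __init__(self, in_dim=136, out_dim=52, hidden=256, horizon=50):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.GRU(in_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(out_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, user_seq):                       # (B, 50, in_dim) user features
        _, h = self.encoder(user_seq)                  # context from observed behaviour
        frame = torch.zeros(user_seq.size(0), 1, self.head.out_features)
        outputs = []
        for _ in range(self.horizon):                  # autoregressive decoding
            out, h = self.decoder(frame, h)
            frame = self.head(out)                     # next facial-parameter frame
            outputs.append(frame)
        return torch.cat(outputs, dim=1)               # (B, 50, out_dim) reaction

model = ReactionSeq2Seq()
reaction = model(torch.randn(2, 50, 136))              # two-second reaction sequence
print(reaction.shape)
```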

10:50-10:55, Paper WeAT1.5
AeroBuoy: A Drone Deployable, 3D Printed, Autonomous Robotic Buoy for Environmental Inspection in Remote and Hazardous River Systems

O’Brien, Reuben | The University of Auckland |
Lynch, Angus | University of Auckland |
Liarokapis, Minas | The University of Auckland |
Keywords: Field Robots, Marine Robotics, Surveillance Robotic Systems
Abstract: Monitoring waterways such as remote and hazardous rivers and streams is important for assessing the impact of external factors, including construction runoff and climate change. Versatile, autonomous robotic boats can offer excellent environmental inspection and monitoring solutions for remote, dangerous, or access-protected water bodies, but they have several shortcomings in terms of maneuverability. This paper proposes an environmental inspection system consisting of an autonomous data collection buoy which is designed to be deployed to inaccessible river systems using a drone. The system can perform a drop-off and pickup of the buoy depending on the requirements of a particular location and monitoring task. Utilising the natural flow of the river, the buoy autonomously steers downstream, using GPS and magnetometers to maintain the desired trajectory. The buoy is capable of measuring water temperature, but it can also be equipped with a range of sensors, such as a dissolved-oxygen meter, sonar for riverbed inspection, or a turbidity sensor for water clarity. This paper describes the system design, presents an analysis of the self-righting capabilities of the buoy, and shows a full system demonstration at the Orewa River in Auckland, New Zealand.
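
A simplified sketch of the steering loop described above, with assumed interfaces and gains: GPS gives the course to the next waypoint, the magnetometer gives the current heading, and a proportional rudder command keeps the drifting buoy on the desired trajectory.

```python
import math

def wrap(angle):
    """Wrap an angle to [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi

def bearing_to(lat, lon, wp_lat, wp_lon):
    """Flat-earth approximation of the course to the next GPS waypoint."""
    return math.atan2(wp_lon - lon, wp_lat - lat)

def rudder_command(heading, lat, lon, wp_lat, wp_lon, k_p=0.8):
    """Proportional steering on the wrapped heading error."""
    error = wrap(bearing_to(lat, lon, wp_lat, wp_lon) - heading)
    return max(-1.0, min(1.0, k_p * error))   # normalized rudder deflection

print(rudder_command(heading=0.1, lat=-36.59, lon=174.69,
                     wp_lat=-36.58, wp_lon=174.70))
```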

10:55-11:00, Paper WeAT1.6
A Hybrid Mapping Method: Balancing Efficiency and Intuitiveness in Lateral Teleoperation

Xie, Yuwei | Zhejiang University |
Wang, Ruize | Zhejiang University |
Chen, Jiming | Zhejiang University |
Li, Gaofeng | Zhejiang University |
Keywords: Telerobotics and Teleoperation, Robotics in Hazardous Fields, Search and Rescue Robots
Abstract: Mobile manipulators integrate the locomotion flexibility of quadruped robots with the operational capabilities of robotic manipulators. This integrated system is particularly effective for teleoperating explosive ordnance disposal (EOD) tasks in hazardous environments, enabling the safe handling of explosive devices. However, when the quadruped operates in narrow corridors or cluttered spaces, its ability to reposition is limited. This limitation, combined with targets located laterally relative to the robot, poses critical challenges for achieving rapid and intuitive teleoperation of the manipulator. Existing manipulator mapping methods either fail to support lateral teleoperation or lack proper coordinate transformations, leading to mismatches between the intended and actual movement directions of the leader and follower devices. This reduces operational intuitiveness and increases the cognitive load on human operators. To overcome these issues, we propose a hybrid mapping method that combines joint-space velocity control with Cartesian-space control. This method leverages joint-space velocity commands for rapid manipulator reorientation, while employing Cartesian space commands to achieve precise end-effector teleoperation. Furthermore, we introduce a virtual base coordinate frame that adaptively adjusts in response to the manipulator's reorientation. This adaptive compensation ensures that the visual feedback from the camera mounted on the end-effector remains consistent and intuitive. The proposed method was validated through experiments on a quadruped robot equipped with a manipulator in an EOD scenario. Results demonstrated significant improvements, including a 100% success rate, 43.9% task duration reduction, and 31.7% NASA-TLX score decrease, indicating decreased cognitive load and enhanced task efficiency compared to baseline methods.
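
A schematic sketch of the hybrid mapping (interfaces and gains assumed, not the authors' implementation, and the adaptive virtual base frame is not reproduced): leader input is interpreted as joint velocities during rapid reorientation, and as a Cartesian twist mapped through a damped Jacobian pseudoinverse during fine end-effector teleoperation.

```python
import numpy as np

def hybrid_command(mode, leader_input, jacobian=None, lam=0.01):
    if mode == "reorient":
        # Coarse mode: treat leader input directly as joint velocities (rad/s).
        return leader_input
    # Fine mode: map a 6-D Cartesian twist to joint velocities via the
    # damped pseudoinverse of the manipulator Jacobian.
    J = jacobian
    J_pinv = J.T @ np.linalg.inv(J @ J.T + lam * np.eye(J.shape[0]))
    return J_pinv @ leader_input

J = np.random.randn(6, 7)                        # example 7-DoF arm Jacobian
twist = np.array([0.0, 0.02, 0.0, 0.0, 0.0, 0.0])  # lateral end-effector motion
print(hybrid_command("fine", twist, J))
```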

11:00-11:05, Paper WeAT1.7
Detecting Obstacles on Railroads Using Computer Vision on UAVs

Anand, Aryan | University of South Carolina |
Krishna, Nikhil | University of South Carolina |
Vitzilaios, Nikolaos | University of South Carolina |
Keywords: Object Detection, Segmentation and Categorization, Computer Vision for Transportation, Recognition
Abstract: Obstacles on railroads significantly increase the risk of travel, with many train accidents caused by undetected obstacles. Obstacles disrupt both the shipment of goods and the transportation of people, leading to delays and damage that result in substantial financial losses. Following natural disasters, manually locating and removing obstacles is not only time-consuming but also hazardous for the personnel involved. To address these challenges, this paper proposes an object detection system that can be implemented on an aerial drone to detect obstacles on the railway. This approach aims to enhance railway safety, reduce costs, and ensure the timely delivery of essential goods such as food and medical supplies during emergencies.

11:05-11:10, Paper WeAT1.8
C-TRAC: Terrain-Adaptive Control for Articulated Tracked Robots Via Contact-Aware Reinforcement Learning

Pan, Hainan | National University of Defense Technology |
Huang, Kaihong | National University of Defense Technology |
Chen, Xieyuanli | National University of Defense Technology |
Zhang, Hongchuan | National University of Defense Technology |
Shi, Junfeng | National University of Defense Technology |
Cheng, Chuang | National University of Defense Technology |
Chen, Bailiang | National University of Defense Technology |
Lu, Huimin | National University of Defense Technology |
Keywords: Reinforcement Learning, AI-Enabled Robotics, Motion Control
Abstract: Articulated tracked robots face significant challenges in maintaining stable locomotion over uneven terrain due to unknown contact points between tracks and ground, which are critical for dynamic control. Unlike legged robots where contact locations can be predicted, tracked systems require real-time adaptation to varying terrains. This paper presents C-TRAC, a terrain-adaptive control framework that integrates reinforcement learning with a contact-modeling variational autoencoder (C-VAE) to enable robust obstacle traversal. We first train a C-VAE in simulation to reconstruct high-fidelity contact information (position and binary probability) from noisy sensor measurements. This model learns a latent representation of terrain contacts, capturing complex interactions between the robot's kinematics and environment. Subsequently, we employ an asymmetric Soft Actor-Critic (SAC) algorithm to optimize a control policy that leverages the predicted contact data for adaptive track control during locomotion. Extensive experiments validate C-TRAC in both simulated and real-world scenarios. In benchmark tests against state-of-the-art (SOTA) methods using RoboCup Rescue Robot League environments, our approach achieves superior obstacle traversal speed (up to 66.67% faster on a 45° staircase) and stability (up to 47.53% more stable on the oblique terrace) compared to contact-agnostic RL baselines and model-based methods. Notably, zero-shot sim-to-real transfer demonstrates consistent performance in unstructured outdoor ruins, also confirming the framework's practicality.
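
A condensed sketch of a contact-modeling VAE in the spirit of the C-VAE above, with assumed dimensions: noisy proprioceptive measurements are encoded into a latent contact representation and decoded into contact positions plus binary contact probabilities, trained with a reconstruction-plus-KL objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContactVAE(nn.Module):
    def __init__(self, obs_dim=64, latent=16, n_contacts=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * latent))
        self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(),
                                 nn.Linear(128, n_contacts * 4))
        self.n = n_contacts

    def forward(self, obs):
        mu, logvar = self.enc(obs).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        out = self.dec(z).view(-1, self.n, 4)
        pos, logit = out[..., :3], out[..., 3]   # xyz contact position + contact logit
        return pos, logit, mu, logvar

def vae_loss(pos, logit, mu, logvar, pos_gt, contact_gt):
    rec = F.mse_loss(pos, pos_gt) + F.binary_cross_entropy_with_logits(logit, contact_gt)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + 1e-3 * kl

model = ContactVAE()
pos, logit, mu, logvar = model(torch.randn(8, 64))
print(vae_loss(pos, logit, mu, logvar, torch.zeros(8, 4, 3), torch.zeros(8, 4)))
```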

WeAT2
402
Modeling, Control, and Learning for Soft Robots 3
Regular Session

10:30-10:35, Paper WeAT2.1
Mobile Manipulator for Robotic Lacrosse: Learning to Pass the Ball

Huang, Xinchi | Stevens Institute of Technology |
Mao, Yifan | Stevens Institute of Technology |
Yang, Guang | Stevens Institute of Technology |
Guo, Yi | Stevens Institute of Technology |
Keywords: Mobile Manipulation, Modeling, Control, and Learning for Soft Robots, Planning, Scheduling and Coordination
Abstract: This paper introduces the lacrosse mobile manipulator, a robotic system designed to play lacrosse. We focus on the task of ball passing between two robots, where challenges exist, including managing a small ball in the soft lacrosse head and interacting with a fast-moving ball. In this study, we develop innovative neural-network-based learning approaches to enhance performance in dynamic environments. The robots are autonomously controlled in a decentralized manner. A combination of analytical physics-based and machine learning methods is employed to refine the motion planner and ball-landing predictions in both the throwing and catching processes. The system achieves satisfactory performance in real-world ball-passing experiments.

10:35-10:40, Paper WeAT2.2
Accelerating Inverse Kinematic Solutions for a Cable-Driven Soft Robotic Manipulator Via Physics-Informed Neural Network

Lin, Rui | Beihang University (BUAA) |
He, Shuyou | Beihang University (BUAA) |
Xu, Ming | Beihang University |
Fu, Kangjia | Beihang University |
Wu, Xuesong | National University of Defense Technology |
Sun, Xiucong | Beihang University |
Zhang, Qi | Defense Innovation Institute, Academy of Military Science |
Yu, Sunquan | Defense Innovation Institute |
Zhang, Xiang | Defense Innovation Institute |
Keywords: Model Learning for Control, Modeling, Control, and Learning for Soft Robots
Abstract: Cable-driven soft manipulators, with inherent compliance and hyper-redundancy, offer significant advantages in unstructured environments but present formidable challenges in inverse kinematics modeling due to nonlinear deformations and underactuation. In this paper, building on a modified forward kinematic model, a physics-informed neural network (PINN) framework based on spatiotemporal data is proposed for efficient inverse kinematics computation of cable-driven soft robotic manipulators. A geometrically exact forward kinematic model is constructed under the Piecewise Constant Curvature (PCC) assumption, extended to multi-section configurations, and enhanced by cable deflection compensation to account for practical routing constraints. Experimental validation shows a 40.11% reduction in end-effector positioning error (average 15.98 mm) when deflection effects are included. The proposed PINN architecture takes time and section count as inputs and outputs the corresponding manipulator configuration, enabling unified spatiotemporal trajectory tracking by minimizing elastic energy while satisfying kinematic constraints. Compared to particle swarm optimization (PSO), which requires iterative computation for each trajectory sample, the proposed method reduces computational time by over 71.9%, demonstrating superior efficiency in solving redundant inverse kinematics problems. This work bridges data-driven and mechanics-based approaches, offering a scalable solution for real-time control of soft manipulators.
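
A simplified sketch of the physics-informed objective (loss weights, network shape, and the forward model are placeholders, not the paper's): the network maps a (time, section) query to arc parameters, and training minimizes an elastic-energy proxy plus the kinematic constraint residual on the tip position.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 2))       # (curvature, bending plane angle) per query

def forward_kinematics(params):
    """Placeholder for the PCC forward model mapping arc parameters to tip xyz."""
    return params.sum(dim=-1, keepdim=True).repeat(1, 3)   # toy stand-in

def pinn_loss(t, section, target_xyz, k_elastic=1.0, k_task=10.0):
    params = net(torch.stack([t, section], dim=-1))
    elastic = k_elastic * (params ** 2).sum(dim=-1).mean()            # bending-energy proxy
    task = k_task * ((forward_kinematics(params) - target_xyz) ** 2).mean()  # constraint residual
    return elastic + task

t = torch.linspace(0.0, 1.0, 32)      # normalized trajectory time
section = torch.ones(32)              # section index of the query
print(pinn_loss(t, section, torch.zeros(32, 3)))
```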

10:40-10:45, Paper WeAT2.3
Unified Framework of Gradient Computation for Hybrid-Link System and Its Dynamical Simulation by Implicit Method

Ishigaki, Taiki | Tokyo University of Science |
Ayusawa, Ko | National Institute of Advanced Industrial Science and Technology |
Yamamoto, Ko | University of Tokyo |
Keywords: Modeling, Control, and Learning for Soft Robots, Dynamics, Simulation and Animation
Abstract: Flexible tools such as golf clubs and sports prostheses used for exercise are generally constructed from materials that are both strong and lightweight enough to withstand the weight and speed of human movement. The authors have previously proposed a hybrid-link system that integrates a rigid-link system with a flexible structure, achieving forward dynamics simulations of the hybrid-link system with a floating base link, such as a human or a humanoid. However, numerical simulations of models with stiff and light flexible structures diverge and are difficult to realize. Using implicit integration as a forward dynamics calculation method is expected to improve the stability of the calculation; however, it requires information on the gradient of the equations of motion. Therefore, this study extends the comprehensive dynamic gradient calculation method proposed for rigid-link systems to hybrid-link systems with a floating base link; moreover, a forward dynamics simulation with contact force calculation is realized using implicit integration on a robot arm and a humanoid with a flexible structure.
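
To see why the gradient of the equations of motion is needed, consider a generic implicit-Euler step: the next state is the root of a nonlinear equation, and Newton's method requires the dynamics Jacobian. A minimal sketch on a stiff toy system (not the authors' hybrid-link formulation):

```python
import numpy as np

def f(x):
    """Placeholder dynamics x_dot = f(x); here a stiff linear test system."""
    A = np.array([[0.0, 1.0], [-1000.0, -10.0]])
    return A @ x

def dfdx(x):
    """Gradient of the equations of motion (analytic for this toy system)."""
    return np.array([[0.0, 1.0], [-1000.0, -10.0]])

def implicit_euler_step(x, h, iters=5):
    """Solve g(x') = x' - x - h*f(x') = 0 with Newton's method."""
    x_next = x.copy()
    for _ in range(iters):
        g = x_next - x - h * f(x_next)
        J = np.eye(len(x)) - h * dfdx(x_next)   # needs the dynamics gradient
        x_next -= np.linalg.solve(J, g)
    return x_next

print(implicit_euler_step(np.array([1.0, 0.0]), h=0.01))
```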

10:45-10:50, Paper WeAT2.4
Pose Estimation of a Cable-Driven Serpentine Manipulator Utilizing Intrinsic Dynamics Via Physical Reservoir Computing

Tanaka, Kazutoshi | OMRON SINIC X Corporation |
Takahashi, Tomoya | OMRON SINIC X Corporation |
Hamaya, Masashi | OMRON SINIC X Corporation |
Keywords: Modeling, Control, and Learning for Soft Robots, AI-Based Methods, Tendon/Wire Mechanism
Abstract: Cable-driven serpentine manipulators hold great potential in unstructured environments, offering obstacle avoidance, multi-directional force application, and a lightweight design. By placing all motors and sensors at the base and employing plastic links, we can further reduce the arm's weight. To demonstrate this concept, we developed a 9-degree-of-freedom cable-driven serpentine manipulator with an arm length of 545 mm and a total mass of only 308 g. However, this design introduces flexibility-induced variations, such as cable slack, elongation, and link deformation. These variations result in discrepancies between analytical predictions and actual link positions, making pose estimation more challenging. To address this challenge, we propose a physical reservoir computing based pose estimation method that exploits the manipulator's intrinsic nonlinear dynamics as a high-dimensional reservoir. Experimental results show a mean pose error of 4.3 mm using our method, compared to 4.4 mm with a baseline long short-term memory network and 39.5 mm with an analytical approach. This work provides a new direction for control and perception strategies in lightweight cable-driven serpentine manipulators leveraging their intrinsic dynamics.
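
A minimal sketch of the physical-reservoir readout (all data shapes illustrative): the manipulator's own nonlinear dynamics, observed through base-mounted sensor time series, act as the high-dimensional reservoir, and only a linear readout is trained, here by ridge regression.

```python
import numpy as np

rng = np.random.default_rng(0)
reservoir = rng.standard_normal((5000, 40))   # e.g., time-lagged tension/encoder signals
pose = rng.standard_normal((5000, 3))         # ground-truth link position (mm)

lam = 1e-2                                    # ridge regularization strength
W = np.linalg.solve(reservoir.T @ reservoir + lam * np.eye(40),
                    reservoir.T @ pose)       # trained linear readout weights

estimate = reservoir @ W
print("RMSE:", np.sqrt(np.mean((estimate - pose) ** 2)))
```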

10:50-10:55, Paper WeAT2.5
Active Modeling and Compensation Control of Yoshimura Manipulator Using Koopman Operator

Qi, Jiaqing | Nankai University |
Du, Jinyu | Nankai University |
Zhang, Jingyu | Nankai University |
Jiang, Tianyu | PLA General Hospital |
Dang, Yu | Nankai University |
Han, Jianda | Nankai University |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Applications
Abstract: The integration of origami structures into soft robotics has enriched the adaptability and functionality of soft robots. Our research group has developed a cable-driven origami robot attached to an arc frame, which enables its deployment in an MR bore and manipulation of medical tools. However, the control of such origami robots still faces challenges such as nonlinear dynamics, unstructured environments, and external payloads. This paper introduces an active modeling compensation control method using the Koopman operator (K-AMCC) for the Yoshimura origami manipulator, enabling accurate trajectory tracking with payloads under different orientations. This active modeling method exploits Koopman operator theory and a Kalman filter to estimate the model error, and works in synergy with a linear quadratic regulator to compensate for the modeling errors. Rectangular and circular trajectory tracking experiments under varying payloads and orientations were carried out. The results demonstrate the K-AMCC method's ability to significantly improve trajectory accuracy, laying a solid foundation for further medical applications such as needle manipulation and laser ablation in an MR environment.
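
A generic EDMD-style sketch of the Koopman step (observables, data, and dimensions are illustrative, not the paper's): states are lifted into a dictionary of observables and a linear predictor z' ≈ Az + Bu is fitted by least squares, yielding the linear model that a Kalman filter and LQR can then operate on.

```python
import numpy as np

def lift(x):
    """Example dictionary of observables; the real choice is design-specific."""
    return np.concatenate([x, x ** 2, np.sin(x)])

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4))                        # states x_k
U = rng.standard_normal((200, 2))                        # cable inputs u_k
Xp = 0.9 * X + 0.1 * U @ rng.standard_normal((2, 4))     # toy transitions x_{k+1}

Z = np.array([lift(x) for x in X])                       # lifted states
Zp = np.array([lift(x) for x in Xp])
G = np.hstack([Z, U])
AB = np.linalg.lstsq(G, Zp, rcond=None)[0].T             # [A | B] in lifted space
A, B = AB[:, :Z.shape[1]], AB[:, Z.shape[1]:]
print(A.shape, B.shape)
```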

10:55-11:00, Paper WeAT2.6
Tensegrity Robot Proprioceptive State Estimation with Geometric Constraints

Tong, Wenzhe | University of Michigan, Ann Arbor |
Lin, Tzu-Yuan | University of Michigan |
Mi, Jonathan | University of Michigan, Ann Arbor |
Jiang, Yicheng | University of Michigan-Ann Arbor |
Ghaffari, Maani | University of Michigan |
Huang, Xiaonan | University of Michigan |
Keywords: Modeling, Control, and Learning for Soft Robots, Localization, Sensor Fusion
Abstract: Tensegrity robots, characterized by a synergistic assembly of rigid rods and elastic cables, form robust structures that are resistant to impacts. However, this design introduces complexities in kinematics and dynamics, complicating control and state estimation. This work presents a novel proprioceptive state estimator for tensegrity robots. The estimator initially uses the geometric constraints of 3-bar prism tensegrity structures, combined with IMU and motor encoder measurements, to reconstruct the robot’s shape and orientation. It then employs a contact-aided invariant extended Kalman filter with forward kinematics to estimate the global position and orientation of the tensegrity robot. The state estimator’s accuracy is assessed against ground truth data in both simulated environments and real-world tensegrity robot applications. It achieves an average drift percentage of 4.2%, comparable to the state estimation performance of traditional rigid robots. This state estimator advances the state-of-the-art in tensegrity robot state estimation and has the potential to run in real-time using onboard sensors, paving the way for full autonomy of tensegrity robots in unstructured environments.

11:00-11:05, Paper WeAT2.7
Magnetic Continuum Robot with Modular Axial Magnetization: Design, Modeling, Optimization, and Control

Cao, Yanfei | The Chinese University of Hong Kong |
Cai, Mingxue | The Chinese University of Hong Kong (CUHK), Shatin NT, Hong Kong |
Sun, Bonan | The Chinese University of Hong Kong |
Qi, Zhaoyang | The Chinese University of Hong Kong
Xue, Junnan | Harbin Institute of Technology, Shenzhen (HITSZ) |
Yihang, Jiang | The Chinese University of Hong Kong |
Hao, Bo | The Chinese University of Hong Kong |
Zhu, Jiaqi | The Chinese University of Hong Kong |
Liu, Xurui | The Chinese University of Hong Kong |
Yang, Chaoyu | The Chinese University of Hong Kong |
Zhang, Li | The Chinese University of Hong Kong |
Keywords: Modeling, Control, and Learning for Soft Robots, Flexible Robots, Motion Control, Remote Magnetic Actuation
Abstract: Magnetic continuum robots (MCRs) have become popular owing to their inherent advantages of easy miniaturization without requiring complicated transmission structures. The evolution of MCRs has significantly enhanced their dexterity. While much progress has been achieved, the quantitative index-based evaluation of deformability for different MCRs has not been addressed before. Here we use "deformability" to describe the capability for body deflection when an MCR forms different global shapes under an external magnetic field. Therefore, in this paper, we propose methodologies to design and control an MCR composed of modular axially magnetized segments. To guide MCR design, for the first time, we introduce a quantitative index-based evaluation strategy to analyze and optimize robot deformability. Additionally, a control framework with neural network-based controllers is developed to endow the robot with two control modes: the robot tip position and orientation (M1) and the global shape (M2). The excellent performance of the learnt controllers in terms of computation time and accuracy was validated via both simulation and experimental platforms.

11:05-11:10, Paper WeAT2.8
Efficient Real2Sim2Real of Continuum Robots Using Deep Reinforcement Learning with Koopman Operator (I)

Ji, Guanglin | The Chinese University of Hong Kong
Gao, Qian | The Chinese University of Hong Kong, Shenzhen |
Xiao, Yin | The Chinese University of Hong Kong, Shenzhen |
Sun, Zhenglong | Chinese University of Hong Kong, Shenzhen |
Keywords: Modeling, Control, and Learning for Soft Robots, Reinforcement Learning, Machine Learning for Robot Control
Abstract: Accurate control of continuum robots is challenging, especially in the presence of external disturbances. To address this issue, reinforcement learning (RL) has been increasingly investigated in continuum robot control due to its online policy updating capability. However, the Sim2Real transfer gap caused by inaccurate modeling leaves RL implementation in a dilemma. In this article, we propose a safety-critical Real2Sim2Real online RL framework, where the Real2Sim transfer is first achieved by the identification of a continuum robot using the Koopman operator. We further improve the training efficiency by introducing an imperfect demonstration into the RL framework. The offline policy is trained in simulations and then tested on a real continuum robot platform. During tests, the tracking performance is influenced by the hysteresis effect that cannot be captured by the Koopman operator. This results in a millimeter-level tracking root mean square error (RMSE). To address this issue, we update both the policy and the model online, and the RMSE of the online controller outperforms that of the offline controller by 89.16% in free space and 85.70% under external payload, respectively.

WeAT3
403
Soft Sensors and Actuators 1
Regular Session
Co-Chair: Meng, Wei | Wuhan University of Technology

10:30-10:35, Paper WeAT3.1
Stretchable and High-Precision Optical Tactile Sensor for Trajectory Tracking of Parallel Mechanisms

Nie, Yiding | Southern University of Science and Technology |
Fan, Dongliang | Southern University of Science and Technology |
Huang, Jiatai | Southern University of Science and Technology |
Liu, Chunyu | Southern University of Science and Technology |
Dai, Jian | School of Natural and Mathematical Sciences, King's College London
Keywords: Soft Sensors and Actuators, Force and Tactile Sensing
Abstract: Stretchable sensors show promising prospects for soft robotics, medical devices, and human-machine interactions due to the high compliance of soft materials. Discrete sensing strategies, including sensor arrays and distributed sensors, are broadly involved in tactile sensors across versatile applications. However, it remains a challenge for stretchable tactile sensors to achieve high spatial resolution, self-decoupling of stimulus magnitude and position, and insensitivity to off-axis stimuli. Herein, we develop a stretchable tactile sensor with a parallel-assembled structure based on the proposed continuous spectral filtering principle, allowing superhigh spatial and force resolution for applied stimuli. This proposed sensor enables a highly linear spatial response (R^2>0.996) even during stretching and bending, and high continuous spatial (7 μm) and force (5 mN) resolutions with design scalability and interaction robustness. We further demonstrate the sensors' performance by integrating them into a five-bar parallel mechanism for precise trajectory tracking (rotational resolution: 0.02°) in real time.

10:35-10:40, Paper WeAT3.2
SMA-TENG Actuator with Tactile Sensing Capability

Zhang, Yiping | Beihang University |
Liu, Zihe | Beihang University |
Xu, Xi | Beihang University |
Jin, Jiaqi | Beihang University |
Yang, Boan | Beihang University |
Wen, Li | Beihang University |
Ren, Ziyu | Beihang University |
Keywords: Soft Sensors and Actuators, Perception-Action Coupling
Abstract: Shape memory alloy (SMA) is widely employed in developing actuators. However, the lack of sensing capabilities limits its application. This study presents a sensing-actuation integrated device based on SMA and a triboelectric nanogenerator (TENG), achieving tactile sensing while maintaining actuation performance. The proposed core-shell structure not only repurposes the SMA spring as a key component of actuation and sensing, but also effectively isolates the actuation current to prevent interference with the sensing signal. An aerogel-modified silicone composite layer is applied to the SMA to reduce temperature rise by 30.56%, ensuring the sensing performance. With a rapid response time of less than 31 ms and stable sensing performance exceeding 2000 cycles, the SMA-TENG actuator reliably detects dynamically varying forces and bending. Additionally, it generates a maximum actuation force of 3.21 N, which represents a 12.2% increase compared to a standard SMA spring, due to the pre-stress introduced by the composite layer. Moreover, it can produce a displacement of 7.7 cm and exhibits a power density of 7.15 × 10³ W/m³ (at 0.84 V, 6 A). Finally, we validate its haptic sensing capability during actuation, demonstrating its potential towards interactive robotic systems.

10:40-10:45, Paper WeAT3.3
Tactile Sensing Soft Fingertip with Dual Air Bag Structure for an Anthropomorphic Robotic Hand

Yin, Jipeng | Nanjing University of Information Science and Technology |
Wang, Yuchao | Nanjing University of Information Science and Technology |
Liu, Yuwei | Nanjing University of Information Science and Technology |
Wu, Haotian | Meituan Academy of Robotics Shenzhen |
Mao, Yinian | Meituan-Dianping Group |
Qiliang, Zhong | Meituan Academy of Robotics Shenzhen |
Zhen, Ruichen | Meituan |
Yang, Yang | Nanjing University of Information Science and Technology |
Keywords: Soft Sensors and Actuators, Force and Tactile Sensing, Soft Robot Materials and Design
Abstract: Tactile sensing plays a crucial role in empowering robotic hands with improved grasping and manipulation abilities. In this paper, we propose an anthropomorphic robotic hand design with dual air bag sensors integrated into soft fingertips to achieve tactile sensing. The air bag sensor is low-cost, easy to build, and deformable; embedded in the fingertip, it endows the hand with the ability to perceive and gives it mechanical compliance similar to that of a human fingertip. The air bag sensor exhibits high performance metrics, including a sensitivity of ~1.65 kPa/N, a minimum detection force of < 0.01 N, a response time of < 10 ms, and good stability and repeatability. The experimental results show that the proposed robotic hand performs well in surface texture detection, hard inclusion depth detection, and object softness detection, as well as grasping tasks. By applying a machine learning algorithm to the experimental data, accuracies of 0.767 and 0.898 were achieved in predicting hard inclusion depth and object hardness, respectively. This study provides a simple and effective tactile sensing solution for the design of anthropomorphic robotic hands, with possible applications such as end-effectors for humanoid robots or robotic palpation.

10:45-10:50, Paper WeAT3.4
Adaptive Cartesian Position Control with a Switching Strategy for Robotic Manipulator with Mixed Rigid-Elastic Joints

Hua, Minh Tuan | University of Agder |
Gravdahl, Jan Tommy | Norwegian University of Science and Technology |
Sanfilippo, Filippo | University of Agder |
Keywords: Soft Sensors and Actuators, Robust/Adaptive Control, Flexible Robotics
Abstract: With the increasing demand for safe and efficient human-robot interaction in industrial applications, robotic manipulators with mixed rigid-elastic joints have gained significant attention, yet their control remains challenging due to inherent parameter uncertainties and complex dynamics. In this paper, an adaptive Cartesian position control for robotic manipulators with mixed rigid-elastic joints is presented. Adaptive controllers are designed to deal with uncertainties in the parameters of the motors, while robust control signals effectively cope with uncertainties of the links and stiffness of the elastic joints. Furthermore, a switching strategy between Cartesian space position control and joint space position control when the end-effector comes into the vicinity of the target point is proposed. This switching strategy helps to keep the pose of the robotic manipulator stable when the end-effector has approached the target point. Simulation results on a 6-DOF robotic manipulator demonstrate that the proposed control scheme can achieve the desired accuracy in position and maintain a stable pose when the end-effector reaches the target.
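
A schematic sketch of the switching idea (threshold and gains assumed, adaptation omitted): Cartesian position control drives the end-effector toward the target, and control hands over to joint-space position control once the end-effector enters a small vicinity of the target, keeping the final pose stable.

```python
import numpy as np

def select_mode(ee_pos, target_pos, radius=0.02):
    """Switch to joint-space control inside a 2 cm vicinity of the target."""
    return "joint" if np.linalg.norm(ee_pos - target_pos) < radius else "cartesian"

def control(ee_pos, target_pos, q, q_ref, kc=1.0, kj=1.0):
    if select_mode(ee_pos, target_pos) == "cartesian":
        return "cartesian", kc * (target_pos - ee_pos)   # task-space velocity command
    return "joint", kj * (q_ref - q)                     # joint-space correction

mode, u = control(np.array([0.50, 0.0, 0.30]), np.array([0.51, 0.0, 0.30]),
                  np.zeros(6), np.full(6, 0.1))
print(mode, u)
```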

10:50-10:55, Paper WeAT3.5
Optimized Optical Fiber Sensors for Forearm Muscle Deformation Monitoring and Hand Motion Recognition

Liu, Heifu | Wuhan University of Technology |
Ai, Qingsong | Wuhan University of Technology |
Peng, Nian | Wuhan University of Technology |
Liu, Quan | Wuhan University of Technology |
Meng, Wei | Wuhan University of Technology |
Keywords: Soft Sensors and Actuators, Gesture, Posture and Facial Expressions, Intention Recognition
Abstract: Hand motion monitoring plays a crucial role in fields such as human-machine interaction and rehabilitation training. Currently, electronic sensors are commonly used for hand motion monitoring. However, they are confronted with issues such as susceptibility to electromagnetic interference and sweat stains. Fiber Bragg Grating (FBG) sensors are small in size, highly sensitive, and possess good biocompatibility. In this paper, a flexible distributed Fiber Bragg Grating sensor is introduced. Emphasis is laid on the optimization and fabrication of the sensor, and performance tests are carried out on the fabricated sensor. To verify the potential of the sensor in hand motion monitoring, experiments are conducted. In the gesture recognition experiment, the Vision Transformer (ViT) model is utilized to classify eight types of gestures, and the final accuracy reaches 96.5%. In the wrist joint angle measurement experiment, the Pearson correlation coefficient between the physical angle and the measured angle is 0.985. In the grasping experiment, individual differences are reflected by the standard deviation during the grasping process. The experiments have demonstrated that the proposed sensor has the potential for monitoring hand motions.

10:55-11:00, Paper WeAT3.6
Antagonistic Physical-Virtual Framework for the Development of Soft Actuators

Pereira da Fonseca, Diogo Miguel | University of Coimbra |
Neto, Pedro | University of Coimbra |
Keywords: Soft Sensors and Actuators, Performance Evaluation and Benchmarking, Modeling, Control, and Learning for Soft Robots
Abstract: Soft robots rely on soft actuators, whose nonlinear responses are challenging to model, simulate, and integrate into designs. This complexity hinders the development of advanced soft robots and rigid mechanisms that use soft actuators for compliant actuation. To address this, in this study we present a framework for model development and actuator integration that offers both real and virtual environments for testing and validation. This framework includes high-fidelity digital twins and a physical integration bench. The digital twin allows for the testing of virtual actuator models in a validated digital environment, replicating various load profiles. The physical integration bench provides a safe, reproducible platform for validating both models and controllers under different loads, load profiles, and inertial conditions. Together, these tools enable rapid and reliable validation of actuator models and controllers, accelerating the development cycle of complex soft robots. We conclude by demonstrating the proposed workflow using liquid-gas phase transition actuators as a demonstrative test subject.

11:00-11:05, Paper WeAT3.7
Osmosis-Driven Large-Scale Actuation for Shape-Shifting Mechanisms

Challita, Elio | Harvard University |
Chen, Tony G. | Harvard University |
Zoll, Rachel | Harvard University |
Yuen, Michelle C. | Harvard University |
Wood, Robert | Harvard University |
Keywords: Soft Sensors and Actuators, Actuation and Joint Mechanisms, Mechanism Design
Abstract: Osmosis-driven actuation offers a promising strategy for developing untethered, environmentally responsive soft and shape-shifting mechanisms and robots. In this work, we explore the use of superabsorbent polymer (SAP) pellets as large-scale, shape-morphing actuators. Upon exposure to water, these approximately 2 mm diameter spherical pellets undergo a dramatic volumetric expansion, up to 300 times their initial volume, generating actuation forces of approximately 10 N under constrained conditions. We further demonstrate reversible cyclic actuation via controlled swelling-deswelling using ethanol-water solutions. Finally, we integrate these systems into a shape-morphing wheel design to enable adaptive locomotion that passively transitions between terrestrial and aquatic environments. Our findings demonstrate SAP-based osmotic actuators as an environmentally driven solution for soft robotics and shape-shifting soft hybrid mechanisms.

11:05-11:10, Paper WeAT3.8
A Variable Sensing Range Electrical Impedance Tomography Sensor for Robot Electric Skins

Wang, Zhanwei | Vrije Universiteit Brussel |
Huaijin, Chen | Vrije Universiteit Brussel
Wang, Ke | Vrije Universiteit Brussel |
Vanderborght, Bram | Vrije Universiteit Brussel |
Terryn, Seppe | Vrije Universiteit Brussel (VUB) |
Keywords: Soft Sensors and Actuators, Force and Tactile Sensing
Abstract: While stretchable and compressible sensors are commonly used to enhance the proprioception and exteroception of soft robots, their sensitivity and sensing range are constrained by their design and stiffness, often limiting their measurement capabilities. To overcome this limitation, this paper presents a novel approach that integrates stiffness modulation with piezoresistive sensing, specifically Electrical Impedance Tomography (EIT). This approach results in a new type of artificial skin, termed Variable Sensing Range Electrical Impedance Tomography Skin (VEITS), which features an adjustable sensing range. The VEITS comprises two layers: a top layer with variable stiffness achieved through granular jamming and a bottom EIT sensing layer. By applying a vacuum to the granular layer, the sensor stiffness can be increased, enabling it to measure higher contact forces. Specifically, applying a vacuum pressure of -80 kPa extends the contact force range by 37% compared to the unpressurized state. While increased stiffness temporarily reduces sensitivity to small loads, it does not affect the EIT localization accuracy. The VEITS's simple structure makes it suitable for large-surface sensing on robots while effectively expanding the sensing range. Though demonstrated with EIT and granular-jamming stiffness modulation, this novel principle is applicable to other strain-based sensors and stiffness modulation techniques.

WeAT4
404
Surgical Robotics
Regular Session

10:30-10:35, Paper WeAT4.1
Robotic Arm Platform for Multi-View Image Acquisition and 3D Reconstruction in Minimally Invasive Surgery

Saikia, Alexander | University College London |
Di Vece, Chiara | University College London |
Bonilla, Sierra | University College London |
He, Chloe | UCL |
Magbagbeola, Morenike | University College London |
Mennillo, Laurent Adrien | University College London |
Czempiel, Tobias | University College London |
Bano, Sophia | University College London |
Stoyanov, Danail | University College London |
Keywords: Surgical Robotics: Laparoscopy, Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: Minimally invasive surgery (MIS) offers significant benefits, such as reduced recovery time and minimised patient trauma, but poses challenges in visibility and access, making accurate 3D reconstruction a significant tool in surgical planning and navigation. This work introduces a robotic arm platform for efficient multi-view image acquisition and precise 3D reconstruction in MIS settings. We adapted a laparoscope to a robotic arm and captured ex-vivo images of several ovine organs across varying lighting conditions (operating room and laparoscopic) and trajectories (spherical and laparoscopic). We employed recently released learning-based feature matchers combined with COLMAP to produce our reconstructions. The reconstructions were evaluated against high-precision laser scans for quantitative evaluation. Our results show that whilst reconstructions suffer most under realistic MIS lighting and trajectory, two matching methods achieve close to sub-millimetre accuracy with 0.80 and 0.76 mm Chamfer distances and 1.06 and 0.98 mm RMSEs for ALIKED and GIM, respectively. Our best reconstruction results occur with operating room lighting and spherical trajectories. Our robotic platform provides a tool for controlled, repeatable multi-view data acquisition for 3D generation in MIS environments, which can lead to new datasets necessary for novel learning-based surgical models.
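
A sketch of the evaluation protocol (alignment with the laser scan is assumed already done, and this is one common Chamfer convention, not necessarily the authors' exact variant): nearest-neighbour distances between the reconstruction and the reference point cloud yield the Chamfer distance and RMSE.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_rmse(recon, reference):
    d_rr = cKDTree(reference).query(recon)[0]   # recon -> reference distances
    d_rf = cKDTree(recon).query(reference)[0]   # reference -> recon distances
    chamfer = d_rr.mean() + d_rf.mean()         # symmetric mean-distance Chamfer
    rmse = np.sqrt(np.mean(d_rr ** 2))
    return chamfer, rmse

rng = np.random.default_rng(2)
ref = rng.uniform(size=(1000, 3))                      # laser-scan reference (m)
rec = ref + rng.normal(scale=5e-4, size=ref.shape)     # reconstruction, ~0.5 mm noise
print(chamfer_and_rmse(rec, ref))
```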

10:35-10:40, Paper WeAT4.2
SurgEM: A Vision-Based Surgery Environment Modeling Framework for Constructing a Digital Twin Toward Autonomous Soft Tissue Manipulation

Chen, Jiahe | The University of Tokyo |
Kobayashi, Etsuko | The University of Tokyo |
Sakuma, Ichiro | The University of Tokyo |
Tomii, Naoki | The University of Tokyo |
Keywords: Surgical Robotics: Laparoscopy, Computer Vision for Medical Robotics, Computer Vision for Automation
Abstract: Autonomous soft tissue manipulation in robotic surgery remains challenging. Modeling the tool-tissue interaction, analyzing the structural deformation of the tissue, and monitoring the biomechanical status during surgical manipulation may benefit the development of autonomous surgery; however, studies in this area are currently inadequate. We propose a vision-based surgery environment modeling framework to simultaneously reconstruct and track the forceps and the tissue, leveraging model-based pose estimation and scene flow-based mesh optimization. We also propose a digital twin based on the framework for continuously modeling the tool-tissue interaction and monitoring the deformation and strain of the tissue surface. Quantitative and qualitative ex vivo experiments were conducted to evaluate the proposed system from various perspectives. Results show that the deformation recovery accuracy is 0.38±0.30 mm with robustness to occlusion; the instrument pose estimation accuracy is 0.85±0.57 degree in rotation and 2.09±1.41 mm in translation. The relative positioning between the tissue and the forceps can be correctly modeled in terms of contact detection. The system also correctly reveals the differences in strain distributions in two types of tool-tissue interaction, palpation and traction. With the proposed system, surgical robot systems with the perception of tissue deformation can be developed in the future for optimized and autonomous surgery.

10:40-10:45, Paper WeAT4.3
Collaborative Preoperative Planning for Operation-Navigation Dual-Robot Orthopedic Surgery System (I)

Qin, Yanding | Nankai University |
Geng, Pengxiu | Nankai University |
You, Yugen | Shantou University |
Ma, Mingqian | Nankai University |
Wang, Hongpeng | Nankai University |
Han, Jianda | Nankai University |
Keywords: Surgical Robotics: Planning, Medical Robots and Systems
Abstract: Intraoperative optical navigation is widely utilized in robotic surgery systems. Typically, the observation pose of the optical tracking system (OTS) is manually adjusted and then fixed throughout the surgery. However, fixed OTS suffers from limited measurement volume (MV) and visual interferences, making consistent navigation challenging in clinics. In this paper, an operation-navigation dual-robot collaborative system is proposed for orthopedic surgeries. An extra navigation robot is introduced to actively adjust the observation pose of the OTS. A collaborative preoperative planning method is proposed for this dual-robot system, including osteotomy path planning of the operation robot and collaborative planning of the navigation robot. Firstly, osteotomy paths of the operation robot are generated according to the surgery regulations and the geometric features of the vertebral foramen. Secondly, based on the generated osteotomy paths, the collaborative planning of the navigation robot is formulated into a multi-objective optimization problem to find the optimal poses of the OTS for each osteotomy plane. Compared with fixed OTS, active navigation is capable of keeping all the targets within the MV of the OTS throughout the surgery. Semi-laminectomy on a human spine phantom is adopted as an example to experimentally evaluate the effectiveness of the proposed method.

10:45-10:50, Paper WeAT4.4
From Decision to Action in Surgical Autonomy: Multi-Modal Large Language Models for Robot-Assisted Blood Suction

Zargarzadeh, Sadra | University of Alberta |
Mirzaei, Maryam | University of Alberta |
Ou, Yafei | University of Alberta |
Tavakoli, Mahdi | University of Alberta |
Keywords: Surgical Robotics: Planning, Task Planning, Surgical Robotics: Laparoscopy
Abstract: The rise of Large Language Models (LLMs) has impacted research in robotics and automation. While progress has been made in integrating LLMs into general robotics tasks, a noticeable void persists in their adoption in more specific domains such as surgery, where critical factors such as reasoning, explainability, and safety are paramount. Achieving autonomy in robotic surgery, which entails the ability to reason and adapt to changes in the environment, remains a significant challenge. In this work, we propose a multi-modal LLM integration in robot-assisted surgery for autonomous blood suction. The reasoning and prioritization are delegated to the higher-level task-planning LLM, and the motion planning and execution are handled by the lower-level deep reinforcement learning model, creating a distributed agency between the two components. As surgical operations are highly dynamic and may encounter unforeseen circumstances, blood clots and active bleeding were introduced to influence decision-making. Results showed that using a multi-modal LLM as a higher-level reasoning unit can account for these surgical complexities to achieve a level of reasoning previously unattainable in robot-assisted surgeries. These findings demonstrate the potential of multi-modal LLMs to significantly enhance contextual understanding and decision-making in robotic-assisted surgeries, marking a step toward autonomous surgical systems.
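
A schematic sketch of the distributed agency described above. query_llm is a placeholder rather than a real API, and the RL policy is a stub; the point is only the division of labour between high-level reasoning (target prioritization) and low-level execution (motion toward the chosen target).

```python
def query_llm(prompt):
    """Placeholder for a multi-modal LLM call; returns a target identifier."""
    return "active_bleed_1" if "active" in prompt else "pool_2"

def plan_suction(scene):
    """High-level step: ask the LLM to prioritize among detected blood regions."""
    prompt = (f"Blood regions: {scene['regions']}. "
              "Prioritize active bleeding over clots and pooled blood. "
              "Reply with the single region to suction next.")
    return query_llm(prompt)

def rl_policy_step(target, robot_state):
    """Stub for the trained deep-RL motion policy driving the suction tool."""
    return {"target": target, "action": "move_toward"}

scene = {"regions": ["pool_2", "active_bleed_1", "clot_3"]}
target = plan_suction(scene)                    # high-level reasoning
print(rl_policy_step(target, robot_state={}))   # low-level execution
```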

10:50-10:55, Paper WeAT4.5
Modeling and Compensation of Stiffness-Dependent Hysteresis for Stiffness-Tunable Tendon-Sheath Mechanism in Flexible Endoscopic Robots (I)

Gao, Huxin | National University of Singapore |
Hao, Ruoyi | The Chinese University of Hong Kong |
Yang, Xiaoxiao | Qilu Hospital of Shandong University |
Li, Changsheng | Beijing Institute of Technology |
Zhang, Zedong | National University of Singapore |
Zuo, Xiuli | Qilu Hospital of Shandong University
Li, Yanqing | Qilu Hospital of Shandong University |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Tendon/Wire Mechanism
Abstract: Robot-assisted gastrointestinal endoscopic surgery (GES) requires flexible manipulators to possess a compact dimension and stiffness tuning capability. Current stiffness-tunable miniature manipulators (STMM) using tendon-sheath mechanism (TSM) experience the problem of stiffness-influenced hysteresis. To address this, we propose the first stiffness-dependent Modified Generalized Prandtl-Ishlinski (MGPI) model and a specific compensation strategy. Firstly, we analyzed the stiffness tuning mechanism and extracted the stiffness parameter. Based on this, the analytical hysteresis model and its inverse function were built and verified in simulations and experiments, which increases the fitting accuracy of the hysteresis. Then, with the assistance of the compensation strategy, the real-time input-output relationship of the STMM's stiffness-tunable joints can be approximately linear. The average errors of trajectory tracking achieve a significant improvement of over 85%. This work provides a new method to model and compensate for the dynamic hysteresis in flexible endoscopic robots with the stiffness-tunable TSM.
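
For context, a classical Prandtl-Ishlinski model superposes weighted backlash (play) operators; the paper's stiffness-dependent extension would, roughly, make the weights functions of the extracted stiffness parameter. A minimal sketch with fixed, illustrative thresholds and weights:

```python
import numpy as np

def play_operator(u, r, y0=0.0):
    """Backlash (play) operator with threshold r over an input sequence u."""
    y = np.empty_like(u)
    prev = y0
    for i, ui in enumerate(u):
        prev = min(max(prev, ui - r), ui + r)   # output stays within +/- r of input
        y[i] = prev
    return y

def pi_model(u, thresholds, weights):
    """Weighted superposition of play operators (classical PI hysteresis)."""
    return sum(w * play_operator(u, r) for w, r in zip(weights, thresholds))

t = np.linspace(0, 4 * np.pi, 400)
u = np.sin(t)                                    # proximal tendon displacement
out = pi_model(u, thresholds=[0.05, 0.15, 0.3], weights=[0.6, 0.3, 0.1])
print(out[:5])
```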

10:55-11:00, Paper WeAT4.6
Safe Start Regions for Medical Steerable Needle Automation

Hoelscher, Janine | Clemson |
Fried, Inbar | University of North Carolina at Chapel Hill |
Tsalikis, Spiros | University of North Carolina at Chapel Hill |
Akulian, Jason | University of North Carolina at Chapel Hill |
Webster III, Robert James | Vanderbilt University |
Alterovitz, Ron | University of North Carolina at Chapel Hill |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Surgical Robotics: Planning, Motion and Path Planning, Nonholonomic Motion Planning
Abstract: Steerable needles are minimally invasive devices that enable novel medical procedures by following curved paths to avoid critical anatomical obstacles. We introduce a new start pose robustness metric for steerable needle motion plans. A steerable needle deployment typically consists of a physician manually placing a steerable needle at a precomputed start pose on the surface of tissue and handing off control to a robot, which then autonomously steers the needle through the tissue to the target. The handoff between humans and robots is critical for procedure success, as even small deviations from a planned start pose change the steerable needle's workspace. Our metric is based on a novel geometric method to efficiently compute how far the physician can deviate from the planned start pose in both position and orientation such that the steerable needle can still reach the target. We evaluate our metric through simulation in liver and lung scenarios. Our evaluation shows that our metric can be applied to plans computed by different steerable needle motion planners and that it can be used to efficiently select plans with large safe start regions.

11:00-11:05, Paper WeAT4.7
A Safety-Enhanced Autonomous Resection Method for Precision Laparoscopic Surgery Amid Tissue Deformation

Shi, Yudong | Hefei University of Technology |
Mo, Hangjie | City University of HongKong |
Xiao, Xilin | Hefei University of Technology |
Duan, Ruiming | Hefei University of Technology |
Li, Ling | Hefei University of Technology |
Li, Xiaojian | Hefei University of Technology |
Keywords: Robot Safety, Surgical Robotics: Planning, Surgical Robotics: Laparoscopy
Abstract: Resection of pathological tissue is a common procedure in surgical oncology for treating tumors. In robot-assisted electrosurgery, the use of predefined markers to guide autonomous robotic resection is gaining traction. Accurate tracking of these markers and minimizing electrocautery damage are critical for the safe and effective autonomous resection of tumors. This paper introduces a safety-enhanced autonomous resection method for laparoscopic surgery, designed to mitigate the risks posed by tissue deformation during the resection process. Initially, we pre-plan the cutting path and design a switching strategy for navigation waypoints based on a preview tracking mechanism. Then, we develop a depth-fused navigation controller and a safe withdrawal motion controller. Next, an inertial tracking mechanism is established to evaluate tissue deformation over short periods. Finally, we develop a confidence generator to fuse the two controllers, ensuring that tissue deformation during the resection process does not cause additional electrocautery damage. Simulation and phantom experiments were conducted, demonstrating the effectiveness of our proposed method. This work represents a significant step toward achieving autonomous robotic resection.

WeAT5
407
Kinematics, Planning and Control 1
Regular Session
Chair: Zhang, Yongzhou | Karlsruhe University of Applied Sciences

10:30-10:35, Paper WeAT5.1
ETA-IK: Execution-Time-Aware Inverse Kinematics for Dual-Arm Systems

Tang, Yucheng | University of Applied Sciences Karlsruhe |
Huang, Xi | Karlsruhe Institute of Technology |
Zhang, Yongzhou | Karlsruhe University of Applied Sciences |
Chen, Tao | Karlsruher Institut Für Technologie |
Mamaev, Ilshat | Proximity Robotics & Automation GmbH |
Hein, Björn | Karlsruhe University of Applied Sciences |
Keywords: Dual Arm Manipulation, Kinematics, Optimization and Optimal Control
Abstract: This paper presents ETA-IK, a novel Execution-Time-Aware Inverse Kinematics method tailored for dual-arm robotic systems. The primary goal is to optimize motion execution time by leveraging the redundancy of the entire system, specifically in tasks where only the relative pose of the robots is constrained, such as dual-arm scanning of unknown objects. Unlike traditional IK methods using surrogate metrics, our approach directly optimizes execution time while implicitly considering collisions. A neural network based execution time approximator is employed to predict time-efficient joint configurations while accounting for potential collisions. Through experimental evaluation on a system composed of a UR5 and a KUKA iiwa robot, we demonstrate significant reductions in execution time. The proposed method outperforms conventional approaches, showing improved motion efficiency without sacrificing positioning accuracy.
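
A schematic sketch of execution-time-aware solution selection (the IK candidate sampler and the learned approximator are stubs, not the paper's network): among candidate configurations satisfying the relative-pose constraint, pick the one the time approximator predicts is fastest to reach.

```python
import numpy as np

def predict_execution_time(q_start, q_goal):
    """Stand-in for the neural execution-time approximator; here a crude
    proxy where the largest joint excursion dominates execution time."""
    return np.max(np.abs(q_goal - q_start))

def select_goal(q_start, ik_candidates):
    """Choose the IK candidate with the lowest predicted execution time."""
    times = [predict_execution_time(q_start, q) for q in ik_candidates]
    return ik_candidates[int(np.argmin(times))]

rng = np.random.default_rng(3)
candidates = [rng.uniform(-np.pi, np.pi, 13) for _ in range(32)]  # 6+7 DoF dual arm
print(select_goal(np.zeros(13), candidates))
```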

10:35-10:40, Paper WeAT5.2
LBE-DDIK: Is One Model Good Enough to Learn-By-Example the Inverse Kinematics of Multiple Serial Robots?

Demby's, Jacket | University of Missouri-Columbia |
Farag, Ramy | University of Missouri-Columbia |
Tousi, Seyed Mohamad Ali | University of Missouri-Columbia |
Omotara, Gbenga | University of Missouri-Columbia |
DeSouza, Guilherme | University of Missouri-Columbia |
Keywords: Kinematics, Machine Learning for Robot Control, Redundant Robots
Abstract: Data-Driven Inverse Kinematics (DDIK) solvers have emerged as promising Inverse Kinematics (IK) methods for reliably approximating the IK of robotic manipulators. However, these solvers remain heavily robot-dependent, where for each robot of interest, a network needs to be trained in a one-solver-one-robot framework. In this paper, we build on our previous work on Learning-By-Example (LBE) for DDIK, and introduce a one-solver-many-robots framework, where a single neural network is used to predict the IK of multiple robots -- mainly with 6 and 7 Degrees of Freedom (DoF). In our LBE approach, the neural network input includes an example joint-pose tuple (e.g., any previous joint configuration and its corresponding pose along the path) together with the queried pose, and the network outputs the desired robot joint configuration. Here, we investigate five network architectures: a Plain Multilayer Perceptron (MLP), a Residual-based MLP (RMLP), a Densely Connected MLP (DMLP), and two transformers inspired by the Generative Pre-trained Transformer (GPT), and test them using three diverse datasets with 20 real-world robotic arms with 6 and 7 DoF. Our experimental results demonstrate that a single lightweight, LBE-based DDIK solver can reliably predict the IK for multiple and hitherto unseen robots, within each of the 6- or 7-DoF families as well as across both robot families, with position errors below 1 mm and orientation errors below 1 deg. Additionally, we compare all proposed LBE-based DDIK solvers with three established numerical IK solvers: Selectively Damped Least-Squares (SD), Singular Value Filtering (SVF), and Mixed Inverse (MX), and observe that our LBE-based DDIK solvers achieve comparable accuracy, with the advantage of being a one-solver-many-robots framework.
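
A sketch of the learning-by-example input layout described above, with assumed sizes and a position-plus-quaternion pose encoding: one known joint-pose example plus the queried pose go in, and the predicted joint configuration comes out; zero-padding the joint vector lets one network serve both 6- and 7-DoF robots.

```python
import torch
import torch.nn as nn

MAX_DOF = 7
net = nn.Sequential(nn.Linear(MAX_DOF + 7 + 7, 256), nn.ReLU(),
                    nn.Linear(256, 256), nn.ReLU(),
                    nn.Linear(256, MAX_DOF))

def predict_ik(example_q, example_pose, query_pose):
    q = torch.zeros(MAX_DOF)
    q[:example_q.numel()] = example_q             # zero-pad 6-DoF joint vectors
    x = torch.cat([q, example_pose, query_pose])  # example tuple + queried pose
    return net(x)

joints = predict_ik(
    torch.zeros(6),                                              # previous joints on the path
    torch.tensor([0.40, 0.0, 0.50, 1.0, 0.0, 0.0, 0.0]),         # example pose: xyz + quat (assumed)
    torch.tensor([0.42, 0.0, 0.50, 1.0, 0.0, 0.0, 0.0]))         # queried pose
print(joints.shape)
```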
|
|
10:40-10:45, Paper WeAT5.3 | |
A Nonlinear Filter for Pose Estimation Based on Fast Unscented Transform on Lie Groups |
|
Jin, Yuqiang | Zhejiang University of Technology |
Zhang, Wen-An | Zhejiang University of Technology, China |
Tang, Jiawei | Hong Kong University of Science and Technology |
Sun, Hu | Zhejiang University of Technology |
Shi, Ling | The Hong Kong University of Science and Technology |
Keywords: Kinematics, Probability and Statistical Methods, Localization
Abstract: This article presents a nonlinear estimator on matrix Lie groups that performs a fast unscented transformation with natural evolution of sigma points from a geometric perspective. Unlike existing methods, the proposed method preserves the original dynamic equations on the manifold, which greatly reduces the computational time without changing the system configuration space or reducing the number of sigma points. We provide a new state propagation and update method for the UKF on manifolds, where only the mean state is involved and the remaining sigma points are calculated and propagated as incremental information based on the state of the previous step, according to the fundamental property of geometric filtering on Lie groups. Moreover, by decoupling the parameter variables, we investigate the upper limit of the efficiency improvement of the proposed algorithm over the traditional unscented transformation in different situations. Finally, two representative experiments are conducted to validate the proposed theory; they show that the proposed method achieves desirable performance with much higher computational efficiency than existing UKF algorithms on manifolds.
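For context, the standard tangent-space sigma-point construction that Lie-group UKFs build on can be sketched as follows; the paper's fast incremental propagation of sigma points is more involved than this baseline.

```python
# Sketch: sigma points on SO(3) as mean * exp(delta), with delta drawn
# from a Cholesky factor of the (scaled) tangent-space covariance.
import numpy as np
from scipy.spatial.transform import Rotation as R

def sigma_points_so3(mean: R, cov: np.ndarray, kappa: float = 1.0):
    n = 3
    L = np.linalg.cholesky((n + kappa) * cov)
    pts = [mean]
    for i in range(n):
        pts.append(mean * R.from_rotvec(L[:, i]))    # +delta_i
        pts.append(mean * R.from_rotvec(-L[:, i]))   # -delta_i
    return pts

mean = R.from_rotvec([0.1, 0.0, 0.2])
cov = 0.01 * np.eye(3)
for p in sigma_points_so3(mean, cov):
    print(p.as_rotvec())
```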
|
|
10:45-10:50, Paper WeAT5.4 | |
Co-Optimization of Tool Orientations, Kinematic Redundancy, and Waypoint Timing for Robot-Assisted Manufacturing (I) |
|
Chen, Yongxue | The University of Manchester |
Zhang, Tianyu | The University of Manchester |
Huang, Yuming | University of Manchester |
Liu, Tao | University of Manchester |
Wang, Charlie C.L. | The University of Manchester |
Keywords: Kinematics, Additive Manufacturing, Industrial Robots
Abstract: In this paper, we present a concurrent and scalable trajectory optimization method to improve the quality of robot-assisted manufacturing. Our method simultaneously optimizes tool orientations, kinematic redundancy, and waypoint timing on input toolpaths with large numbers of waypoints to improve kinematic smoothness while incorporating manufacturing constraints. In contrast, existing methods determine these variables in a decoupled manner. To deal with the large number of waypoints on a toolpath, we propose a decomposition-based numerical scheme that optimizes the trajectory in an out-of-core manner and can also run in parallel to improve efficiency. Simulations and physical experiments demonstrate the performance of our method on examples of robot-assisted additive manufacturing.
|
|
10:50-10:55, Paper WeAT5.5 | |
New Kinematic Control Scheme for Redundant Robotic Manipulators Perturbed by Harmonic Noise (I) |
|
Zhang, Xiyuan | Hainan University |
Zhang, Zhonghao | Hainan University |
Guo, Dongsheng | Hainan University |
Zhang, Weidong | Shanghai JiaoTong University |
Li, Weibing | Sun Yat-Sen University |
Li, Shuai | University of Oulu |
Keywords: Kinematics, Robust/Adaptive Control, Redundant Robots
Abstract: Kinematic control is one of the fundamental issues in the study of redundant robotic manipulators, and various schemes that utilize the pseudoinverse of the Jacobian have been crafted in this field. However, these schemes may be ineffective due to their inability to suppress additive noise, especially the harmonic noise widely encountered in practice. This article proposes a new kinematic control scheme for redundant robotic manipulators affected by harmonic noise, aiming to overcome the identified limitations. Such a scheme, which incorporates an adaptive learning mechanism based on the harmonic frequency, can simulate the harmonic noise, suppress its disturbance, and eventually achieve effective control. It is then theoretically proven that the Cartesian error generated by the proposed scheme exhibits the convergence property, thus guaranteeing the control performance on redundant robotic manipulators even in the presence of harmonic noise. Simulation and experimental results on the PA10 and Panda robotic manipulators further verify the efficacy and superiority of the proposed kinematic control scheme.
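A schematic of the control structure the abstract describes, pseudoinverse kinematic control with an adaptive harmonic-noise estimate, might look as follows; the adaptive law shown is a simple gradient update standing in for the paper's learning mechanism, and the gains are illustrative.

```python
# Sketch: q_dot = pinv(J) (k*e - n_hat), where n_hat is an adaptive
# estimate of a harmonic disturbance at known frequency w.
import numpy as np

def control_step(J, e, t, w, theta, gamma=0.5, k=2.0):
    basis = np.array([np.sin(w * t), np.cos(w * t)])   # harmonic regressor
    n_hat = theta @ basis                              # estimated noise per task axis
    q_dot = np.linalg.pinv(J) @ (k * e - n_hat)
    theta = theta + gamma * np.outer(e, basis)         # adaptive amplitude update
    return q_dot, theta

J = np.random.randn(3, 7)                              # task Jacobian (toy)
e = np.array([0.01, -0.02, 0.0])                       # Cartesian error
theta = np.zeros((3, 2))                               # harmonic coefficients
q_dot, theta = control_step(J, e, t=0.0, w=2 * np.pi, theta=theta)
print(q_dot.shape)                                     # (7,)
```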
|
|
10:55-11:00, Paper WeAT5.6 | |
Visibility-Aware RRT* for Safety-Critical Navigation of Perception-Limited Robots in Unknown Environments |
|
Kim, Taekyung | University of Michigan |
Panagou, Dimitra | University of Michigan, Ann Arbor |
Keywords: Integrated Planning and Control, Collision Avoidance, Constrained Motion Planning
Abstract: Safe autonomous navigation in unknown environments remains a critical challenge for robots with limited sensing capabilities. While safety-critical control techniques, such as Control Barrier Functions (CBFs), have been proposed to ensure safety, their effectiveness relies on the assumption that the robot has complete knowledge of its surroundings. In reality, robots often operate with a restricted field of view and finite sensing range, which can lead to collisions with unknown obstacles if the planner is agnostic to these limitations. To address this issue, we introduce the Visibility-Aware RRT* algorithm, which combines sampling-based planning with CBFs to generate safe and efficient global reference paths in partially unknown environments. The algorithm incorporates a collision avoidance CBF and a novel visibility CBF, which guarantees that the robot remains within locally collision-free regions, enabling timely detection and avoidance of unknown obstacles. We conduct extensive experiments interfacing the path planners with two different safety-critical controllers, wherein our method outperforms all other compared baselines across both safety and efficiency aspects.
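The edge check combining a collision CBF with a visibility barrier could be sketched as below; both barrier functions here are toy stand-ins (a distance margin and a sensed-radius ball), not the paper's formulations.

```python
# Sketch: accept a tree edge only if every sample keeps both barriers positive.
import numpy as np

def h_col(x, obs, r_safe):
    """Collision barrier: positive while farther than r_safe from the obstacle."""
    return np.linalg.norm(x - obs) - r_safe

def h_vis(x, x_robot, sensed_radius):
    """Toy visibility barrier: positive while x stays inside the region
    already confirmed collision-free by the limited sensor."""
    return sensed_radius - np.linalg.norm(x - x_robot)

def edge_safe(samples, obs, r_safe, x_robot, sensed_radius):
    return all(h_col(x, obs, r_safe) > 0 and h_vis(x, x_robot, sensed_radius) > 0
               for x in samples)

xs = [np.array([0.5, 0.0]), np.array([1.0, 0.0])]
print(edge_safe(xs, obs=np.array([0.0, 2.0]), r_safe=0.5,
                x_robot=np.zeros(2), sensed_radius=3.0))
```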
|
|
11:00-11:05, Paper WeAT5.7 | |
A Learning-Based Hybrid Artificial Bee Colony Algorithm for Energy-Efficient Distributed Heterogeneous Type-2 Fuzzy Welding Shop Scheduling Problem with Factory Eligibility (I) |
|
Yu, Fei | Huazhong University of Science and Technology |
Gao, Liang | Huazhong Univ. of Sci. & Tech |
Lu, Chao | China University of Geosciences (Wuhan) |
Yin, Lvjiang | Hubei University of Automotive Technology |
Keywords: Planning, Scheduling and Coordination, Sustainable Production and Service Automation
Abstract: In the era of economic globalization, the distributed heterogeneous welding shop scheduling problem (DHWSP) has been considered. Meanwhile, in some practical production scenarios, certain jobs can only be processed in designated factories (i.e., factory eligibility), and unavoidable, uncontrollable system disturbances (e.g., machine maintenance and human factors) lead to uncertain job processing times. However, research considering DHWSP with uncertain processing times and factory eligibility remains unexplored. Given the advantages of interval type-2 fuzzy numbers (IT2FNs) in representing high uncertainty, the concept of the IT2FN is introduced to address uncertain processing times. Then, in the context of green manufacturing, this paper studies an energy-efficient distributed heterogeneous type-2 fuzzy welding shop scheduling problem with factory eligibility (EDHFWSP-FE). A learning-based hybrid artificial bee colony algorithm (LHABC) is designed to minimize total energy consumption and makespan in the EDHFWSP-FE. In the LHABC, a collaborative initialization is presented to construct excellent initial solutions. In the employed bee phase, a Q-learning-based method is developed to help solutions select superior neighborhood structures. In the onlooker bee phase, a variable neighborhood search (VNS) is proposed to mine promising neighborhood solutions. In the scout bee phase, an estimation of distribution algorithm (EDA)-based…
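The Q-learning-driven neighborhood selection in the employed bee phase can be sketched as a tabular epsilon-greedy choice over neighborhood structures; the states, rewards, and table sizes below are illustrative assumptions.

```python
# Sketch: tabular Q-learning picks which neighborhood structure to apply.
import random

N_STATES, N_ACTIONS = 4, 6      # e.g., solution-quality buckets x neighborhood structures
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def select_neighborhood(state, eps=0.1):
    """Epsilon-greedy choice of a neighborhood structure for the current solution."""
    if random.random() < eps:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])

def update_q(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard Q-learning update after evaluating the produced neighbor."""
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])

a = select_neighborhood(state=0)
update_q(state=0, action=a, reward=1.0, next_state=1)
```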
|
|
WeAT6 |
301 |
Deep Learning in Grasping and Manipulation 1 |
Regular Session |
|
10:30-10:35, Paper WeAT6.1 | |
Refer and Grasp: Vision-Language Guided Continuous Dexterous Grasping |
|
Huang, Yayu | Institute of Automation, Chinese Academy of Sciences |
Fan, Dongxuan | Chinese Academy of Sciences |
Qi, Wen | University of Chinese Academy of Sciences |
Li, Daheng | University of Chinese Academy of Sciences |
Yang, Yifan | Institute of Automation, Chinese Academy of Sciences |
Luo, Yongkang | Institute of Automation, Chinese Academy of Sciences |
Sun, Jia | Institute of Automation, Chinese Academy of Sciences |
Liu, Qian | Institute of Automation, Chinese Academy of Sciences |
Wang, Peng | Chinese Academy of Sciences |
Keywords: Deep Learning in Grasping and Manipulation, Data Sets for Robot Learning, Grasping
Abstract: Robotic grasping guided by natural language instructions faces challenges due to ambiguities in object descriptions and the need to interpret complex spatial context. Existing visual grounding methods often rely on datasets that fail to capture these complexities, particularly when object categories are vague or undefined. To address these challenges, we make three key contributions. First, we present an automated dataset generation engine for visual grounding in tabletop grasping, combining procedural scene synthesis with template-based referring expression generation, requiring no manual labeling. Second, we introduce the RefGrasp dataset, featuring diverse indoor environments and linguistically challenging expressions for robotic grasping tasks. Third, we propose a visually grounded dexterous grasping framework with continuous grasp generation, validated through extensive real-world robotic experiments. Our work offers a novel approach for language-guided robotic manipulation, providing both a challenging dataset and an effective grasping framework for real-world applications.
|
|
10:35-10:40, Paper WeAT6.2 | |
DB-MPO: Demonstration Boosted Reactive Grasping for Two-Finger Gripper |
|
Zhang, Boya | University of Tübingen |
Zell, Andreas | University of Tübingen |
Martius, Georg | Max Planck Institute for Intelligent Systems |
Keywords: Reinforcement Learning, Learning from Demonstration, Deep Learning in Grasping and Manipulation
Abstract: Prior knowledge is abundant in the automation industry, especially for tasks like pick-and-place, where simple programmatic demonstrations with online generation capability can be acquired easily. How to learn a policy faster, with higher flexibility and generalization ability, based on these demonstrations is an open question. End-to-end target learning and imitation learning have been widely discussed in previous works. Here, we focus on the online generation capability of the demonstrations and propose a demo injection method based on actor-critic off-policy reinforcement learning (RL) for the interaction and policy optimization phases. We conduct experiments and an ablation study around four research questions on a two-finger reactive grasping task with a Panda robot. The results show that our proposed injection method increases training stability, strongly reduces the time to convergence, and benefits sim-to-real transfer with smooth motion.
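One way to read the demo-injection idea is as mixing freshly generated scripted transitions into the off-policy replay buffer; the sketch below uses a hypothetical programmatic demonstrator and omits the actual environment step and algorithm-specific machinery.

```python
# Sketch: with probability demo_ratio, store a scripted demo transition
# instead of a policy rollout transition.
import random

replay_buffer = []

def scripted_pick_place(obs):
    """Hypothetical online demo generator returning one (s, a, r, s') tuple."""
    return (obs, "demo_action", 1.0, obs)

def collect(obs, policy_action_fn, demo_ratio=0.2):
    if random.random() < demo_ratio:
        replay_buffer.append(scripted_pick_place(obs))
    else:
        a = policy_action_fn(obs)
        replay_buffer.append((obs, a, 0.0, obs))   # env step omitted in this sketch

collect({"image": None}, lambda obs: "policy_action")
print(len(replay_buffer))
```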
|
|
10:40-10:45, Paper WeAT6.3 | |
FlingFlow: LLM-Driven Dynamic Strategies for Efficient Cloth Flattening |
|
Fu, Tianyu | Shandong University |
Li, Cheng | Shandong University |
Liu, Jin | Shandong University |
Li, Fengming | Shandong Jianzhu University |
Wang, Chaoqun | Shandong University |
Song, Rui | Shandong University |
Keywords: AI-Enabled Robotics, Manipulation Planning, Deep Learning Methods
Abstract: The proficiency of robots in cloth manipulation is crucial for their potential widespread deployment in household service contexts, with the task of unfolding cloth being particularly indispensable. Unlike rigid objects, cloth has a high-dimensional state space, which poses significant challenges for robotic operations. This paper presents a robotic framework that integrates dynamic and static operations for cloth unfolding. Dynamic operations are introduced in a single-arm scenario, employing gravity to expedite flattening. Initially, we define the classification of cloth states and operational skills. Subsequently, in skill selection, a Large Language Model (LLM) is utilized to make decisions based on the current state, selecting skills appropriate for the given situation. For the determination of operation points, a cloth region segmentation network extracts key features of the cloth, and the final operation points are determined through geometric analysis of the masks. Experiments on a real robot demonstrate that our method can successfully unfold cloths of various initial conditions, colors, sizes, textures, shapes and materials, achieving over 95% coverage - defined as the ratio of the current area of the fabric to its fully expanded area - thereby proving the effectiveness of the combined dynamic and static operation strategy. Furthermore, this method significantly enhances the efficiency of cloth unfolding, completing the task within ten actions, whereas other methods require dozens of operations, greatly reducing the required operational complexity.
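The coverage criterion quoted above (current fabric area over fully expanded area) reduces to a simple mask-area ratio; the sketch below computes it from a binary segmentation mask, with the mask and flat-area values as toy inputs.

```python
# Sketch: coverage = visible cloth pixels / pixel area when fully flat.
import numpy as np

def coverage(mask_now: np.ndarray, area_flat: float) -> float:
    """mask_now: HxW boolean cloth mask; area_flat: area of the flattened cloth."""
    return float(mask_now.sum()) / area_flat

mask = np.zeros((480, 640), dtype=bool)
mask[100:300, 150:450] = True                      # toy cloth region (200 x 300 px)
print(coverage(mask, area_flat=200 * 300 / 0.95))  # -> 0.95
```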
|
|
10:45-10:50, Paper WeAT6.4 | |
Learning Goal-Directed Object Pushing in Cluttered Scenes with Location-Based Attention |
|
Dengler, Nils | University of Bonn |
Del Aguila Ferrandis, Juan | The University of Edinburgh |
Moura, Joao | The University of Edinburgh |
Vijayakumar, Sethu | University of Edinburgh |
Bennewitz, Maren | University of Bonn |
Keywords: Deep Learning in Grasping and Manipulation
Abstract: In complex scenarios where typical pick-and-place techniques are insufficient, non-prehensile manipulation can often ensure that a robot is able to fulfill its task. However, non-prehensile manipulation is challenging due to its underactuated nature with hybrid dynamics, where a robot needs to reason about an object's long-term behavior and contact switching while being robust to contact uncertainty. The presence of clutter in the workspace further complicates this task, introducing the need for more advanced spatial analysis to avoid unwanted collisions. Building upon prior work on reinforcement learning with multimodal categorical exploration for planar pushing, we propose to incorporate location-based attention to enable robust manipulation in cluttered scenes. Unlike previous approaches addressing this obstacle-avoiding pushing task, our framework requires no predefined global paths and considers the desired target orientation of the manipulated object. Experimental results in simulation as well as with a real KUKA iiwa robot arm demonstrate that our learned policy manipulates objects successfully while avoiding collisions through complex obstacle configurations, including dynamic obstacles, to reach the desired target pose.
|
|
10:50-10:55, Paper WeAT6.5 | |
Learning Gentle Grasping Using Vision, Sound, and Touch |
|
Nakahara, Ken | TU Dresden |
Calandra, Roberto | TU Dresden |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Multifingered Hands
Abstract: In our daily life, we often encounter objects that are fragile and can be damaged by excessive grasping force, such as fruits. For these objects, it is paramount to grasp gently—not using the maximum amount of force possible, but rather the minimum amount of force necessary. This paper proposes using visual, tactile, and auditory signals to learn to grasp and regrasp objects stably and gently. Specifically, we use audio signals as an indicator of gentleness during grasping, and then train an end-to-end action-conditional model from raw visuo-tactile inputs that predicts both the stability and the gentleness of future grasping candidates, thus allowing the selection and execution of the most promising action. Experimental results on a multi-fingered hand over 1,500 grasping trials demonstrated that our model is useful for gentle grasping by validating its predictive performance (3.27% higher accuracy than the vision-only variant) and providing interpretations of its behavior. Finally, real-world experiments confirmed that the grasping performance with the trained multi-modal model outperformed other baselines (17% higher rate of stable and gentle grasps than vision-only). Our approach requires neither tactile sensor calibration nor analytical force modeling, drastically reducing the engineering effort needed to grasp fragile objects. Dataset and videos are available at https://lasr.org/research/gentle-grasping.
|
|
10:55-11:00, Paper WeAT6.6 | |
GraphGarment: Learning Garment Dynamics for Bimanual Cloth Manipulation Tasks |
|
Chen, Wei | Imperial College London |
Li, Kelin | Imperial College London |
Lee, Dongmyoung | Imperial College London |
Chen, Xiaoshuai | Imperial College London |
Zong, Rui | Imperial College London |
Kormushev, Petar | Imperial College London |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Service Robotics
Abstract: Physical manipulation of garments is often crucial when performing fabric-related tasks, such as hanging garments. However, due to the deformable nature of fabrics, these operations remain a significant challenge for robots in household, healthcare, and industrial environments. In this paper, we propose GraphGarment, a novel approach that models garment dynamics based on robot control inputs and applies the learned dynamics model to facilitate garment manipulation tasks such as hanging. Specifically, we use graphs to represent the interactions between the robot end-effector and the garment. GraphGarment uses a graph neural network (GNN) to learn a dynamics model that can predict the next garment state given the current state and input action in simulation. To address the substantial sim-to-real gap, we propose a residual model that compensates for garment state prediction errors, thereby improving real-world performance. The garment dynamics model is then applied to a model-based action sampling strategy, where it is utilized to manipulate the garment to a reference pre-hanging configuration for garment-hanging tasks. We conducted four experiments using six types of garments to validate our approach in both simulation and real-world settings. In simulation experiments, GraphGarment achieves better garment state prediction performance, with a prediction error 0.46 cm lower than the best baseline. Our approach also demonstrates improved performance in the garment-hanging simulation experiment, with enhancements of 12%, 24%, and 10%, respectively. Moreover, real-world robot experiments confirm the robustness of sim-to-real transfer, with an error increase of 0.17 cm compared to simulation results. Supplementary material is available at: https://sites.google.com/view/graphgarment.
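The residual correction on top of the learned dynamics can be sketched as a simple additive composition; both models below are numerical placeholders for the paper's GNN and residual network, and the action scoring illustrates the model-based sampling idea.

```python
# Sketch: next_state = GNN(state, action) + residual(state, action),
# then pick the sampled action whose prediction lands closest to the
# reference pre-hanging configuration.
import numpy as np

def gnn_dynamics(state, action):
    return state + 0.1 * action          # placeholder for the sim-trained GNN

def residual(state, action):
    return 0.01 * np.tanh(state)         # placeholder sim-to-real correction

def predict_next(state, action):
    return gnn_dynamics(state, action) + residual(state, action)

state, ref = np.ones(6), np.zeros(6)
actions = [np.random.randn(6) for _ in range(16)]
best = min(actions, key=lambda a: np.linalg.norm(predict_next(state, a) - ref))
print(best)
```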
|
|
11:00-11:05, Paper WeAT6.7 | |
RoboEngine: Plug-And-Play Robot Data Augmentation with Semantic Robot Segmentation and Background Generation |
|
Yuan, Chengbo | Tsinghua University |
Joshi, Suraj | Tsinghua University |
Zhu, Shaoting | Tsinghua University |
Su, Hang | Tsinghua University |
Zhao, Hang | Tsinghua University |
Gao, Yang | Tsinghua University |
Keywords: Deep Learning in Grasping and Manipulation, AI-Based Methods, Deep Learning Methods
Abstract: Visual augmentation has become a crucial technique for enhancing the visual robustness of imitation learning. However, existing methods are often limited by prerequisites such as camera calibration or the need for controlled environments (e.g., green screen setups). In this work, we introduce RoboEngine, the first plug-and-play visual robot data augmentation toolkit. For the first time, users can effortlessly generate physics- and task-aware robot scenes with just a few lines of code. To achieve this, we present a novel robot scene segmentation dataset, a generalizable high-quality robot segmentation model, and a fine-tuned background generation model, which together form the core components of the out-of-the-box toolkit. Using RoboEngine, we demonstrate the ability to generalize robot manipulation tasks across six entirely new scenes, based solely on demonstrations collected from a single scene, achieving a more than 200% performance improvement compared to the no-augmentation baseline. All datasets, model weights, and the toolkit are released at https://roboengine.github.io/.
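The core compositing step behind such augmentation (keep the segmented robot pixels, swap the background) is sketched below in plain NumPy; this is a generic illustration, not the RoboEngine API.

```python
# Sketch: replace everything outside the robot mask with a generated background.
import numpy as np

def composite(frame: np.ndarray, robot_mask: np.ndarray, new_bg: np.ndarray):
    out = new_bg.copy()
    out[robot_mask] = frame[robot_mask]   # keep robot (and task-object) pixels
    return out

frame = np.random.randint(0, 255, (4, 4, 3), dtype=np.uint8)
new_bg = np.zeros_like(frame)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(composite(frame, mask, new_bg)[1, 1])   # original pixel survives
```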
|
|
11:05-11:10, Paper WeAT6.8 | |
Efficient Reinforcement Learning Method for Multi-Phase Robot Manipulation Skill Acquisition Via Human Knowledge, Model-Based and Model-Free Methods (I) |
|
Liu, Xing | Northwestern Polytechnical University |
Liu, Zihao | Northwestern Polytechnical University |
Wang, Gaozhao | Northwestern Polytechnical University |
Liu, Zhengxiong | Northwestern Polytechnical University |
Huang, Panfeng | Northwestern Polytechnical University |
Keywords: Reinforcement Learning, Human Factors and Human-in-the-Loop, Manipulation Planning
Abstract: A novel efficient reinforcement learning paradigm combining human knowledge, model-based, and model-free methods is presented for optimal robot manipulation control during complex multi-phase robot manipulation tasks, e.g., peg-in-hole tasks with a tight fit and nut-and-bolt assembly. Firstly, human demonstrations are conducted to collect data during successful robot manipulation, and a manipulation phase estimation method integrating human knowledge is presented to obtain the higher-level plan of the multi-phase robot manipulation task. Typical robot manipulation tasks can usually be decomposed into three types of phases, namely free motion, discontinuous contact, and continuous contact. For phases with free motion, a motion planning method is utilized to generate smooth trajectories. For phases with discontinuous contact in the axes of interest during the pre-manipulation process, a rule-based model-free method, namely the Policy Gradients with Human-Guided Parameter-based Exploration (PGHGPE) method, is utilized. For manipulation phases with continuous contact, a model-based method is utilized because of its higher sample efficiency. Finally, simulation and experimental studies verify the effectiveness of the presented algorithm.
|
|
WeAT7 |
307 |
Motion and Path Planning 5 |
Regular Session |
Co-Chair: Calinon, Sylvain | Idiap Research Institute |
|
10:30-10:35, Paper WeAT7.1 | |
Swept Volume-Based Continuous Object Gathering Trajectory Generation for Tethered Robot Duo |
|
Du, Yuanyuan | The Chinese University of Hong Kong, Shenzhen |
Zhang, Jianan | Peking University |
Cheng, Xiang | Peking University |
Cui, Shuguang | The Chinese University of Hong Kong, Shenzhen |
Keywords: Motion and Path Planning, Path Planning for Multiple Mobile Robots or Agents, Cooperating Robots
Abstract: We propose a continuous gathering scheme based on the swept volume to address the challenges involved in planning for a tethered robot duo to efficiently collect marine debris. Specifically, we model the tethered robot duo by constructing a double-layer U-shape, and then apply an object-aware optimization approach that leverages the swept volume signed distance field (SVSDF) to guide trajectory optimization, promoting complete object collection while maintaining a continuous and collision-free gathering motion. Existing algorithms either rely on unrealistic assumptions, such as an infinite tether length, or incur high computational costs, and thus fail to fully address these key challenges. In contrast, our proposed method, by adopting the double-layer U-shape technique, effectively manages tether length constraints and preserves the tether shape, ensuring feasible collection. By utilizing the SVSDF technique to guide the trajectory optimization process, we maximize the swept coverage of objects while minimizing that of obstacles. This enables complete object coverage, avoids collisions, and prevents the tether from becoming trapped by obstacles during the collection process. Moreover, we propose a set of metrics for this gathering planning problem and validate the generated trajectories in simulation, using a collision-free multi-UAV information-gathering approach to efficiently estimate the target area. Simulations demonstrate that our proposed method achieves superior, resolution-independent gathering performance compared to existing algorithms.
|
|
10:35-10:40, Paper WeAT7.2 | |
Tactile Ergodic Coverage on Curved Surfaces |
|
Bilaloglu, Cem | Idiap Research Institute, École Polytechnique Fédérale de Lausanne |
Löw, Tobias | Idiap Research Institute, EPFL |
Calinon, Sylvain | Idiap Research Institute |
Keywords: Motion and Path Planning, Reactive and Sensor-Based Planning, Force Control, Ergodic Control
Abstract: In this article, we present a feedback control method for tactile coverage tasks, such as cleaning or surface inspection. These tasks are challenging to plan due to complex, continuous physical interactions, but the coverage target and progress can be easily measured using a camera and encoded in a point cloud. We propose an ergodic coverage method that operates directly on point clouds, guiding the robot to spend more time on regions requiring more coverage. For robot control and contact behavior, we use geometric algebra to formulate a task-space impedance controller that tracks a line while simultaneously exerting a desired force along that line. We evaluate the performance of our method in kinematic simulations and demonstrate its applicability in real-world experiments on kitchenware. Our source code, experimental data, and videos are available as open access at https://sites.google.com/view/tactile-ergodic-control/.
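A plain-vector sketch of the line-tracking impedance behavior described above follows; the paper formulates this in geometric algebra, and the gains and push force here are illustrative.

```python
# Sketch: stiffness acts orthogonal to the line p0 + s*d, while a
# constant force f_push is exerted along the line direction d.
import numpy as np

def impedance_wrench(x, v, p0, d, f_push=5.0, K=400.0, D=40.0):
    r = x - p0
    e_perp = r - np.dot(r, d) * d          # position error orthogonal to the line
    return -K * e_perp - D * v + f_push * d

print(impedance_wrench(np.array([0.1, 0.0, 0.0]), np.zeros(3),
                       np.zeros(3), np.array([0.0, 0.0, 1.0])))
```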
|
|
10:40-10:45, Paper WeAT7.3 | |
A Real-Time Collision-Avoidance Motion Planner for Robot Soccer |
|
Deogan, Aneesh | Eindhoven University of Technology |
Bruijnen, Dennis | Intralox LLC Europe |
van den Brand, Mark | Eindhoven University of Technology |
van de Molengraft, Marinus Jacobus Gerardus | Eindhoven University of Technology |
Keywords: Motion and Path Planning, Reactive and Sensor-Based Planning, Collision Avoidance
Abstract: This work presents the implementation and evaluation of a real-time collision-avoidance motion planning algorithm for highly dynamic environments. By combining short-horizon obstacle estimation with the robot constraints, our method implements collision-avoiding, situation-aware motion planning by heuristically exploring multiple relevant paths. Directly feeding the current motion setpoint of the path into the low-level controller closes the loop between motion planning and low-level control, ensuring constraint-aware execution. Its practical implementation on physical robots in dynamic RoboCup-like scenarios validated its effectiveness, with low computational costs enabling fast adaptation to changing environments. Furthermore, the capabilities of our motion planner were demonstrated during RoboCup 2024 and practice matches.
|
|
10:45-10:50, Paper WeAT7.4 | |
Safe Navigation in Uncertain Crowded Environments Using Risk Adaptive CVaR Barrier Functions |
|
Wang, Xinyi | University of Michigan |
Kim, Taekyung | University of Michigan |
Hoxha, Bardh | Toyota Research Institute of North America |
Fainekos, Georgios | Toyota NA-R&D |
Panagou, Dimitra | University of Michigan, Ann Arbor |
Keywords: Motion and Path Planning, Path Planning for Multiple Mobile Robots or Agents, Probability and Statistical Methods
Abstract: Robot navigation in dynamic crowded environments poses a significant challenge due to the uncertainties inherent in the obstacles' model. In this work, we propose a risk-adaptive approach based on the CVaR Barrier Function (CVaR-BF): 1) We adjust the risk level to adopt the minimally-necessary risk when navigating through crowds, thereby achieving both safety and feasibility under uncertainties. 2) We characterize collision probability by evaluating the relative state between the robot and the obstacle and incorporate this risk assessment into a novel Control Barrier Function (CBF) design, termed the dynamic zone-based CBF. Paired with the risk-level adaptation, this dynamic zone-based CBF yields an enhanced responsiveness to obstacles in highly dynamic scenarios. Comparisons and ablation studies demonstrate that our method outperforms existing social navigation approaches and validate the effectiveness of the individual components of our adaptive strategy.
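The sample-based CVaR evaluation underlying such a barrier condition can be sketched as the mean of the worst alpha-fraction of barrier samples; the risk-level adaptation and dynamic-zone construction are omitted here.

```python
# Sketch: CVaR_alpha over sampled barrier values h(x); safety asks the
# lower-tail average to remain nonnegative.
import numpy as np

def cvar_lower_tail(samples: np.ndarray, alpha: float) -> float:
    k = max(1, int(np.ceil(alpha * len(samples))))
    return float(np.sort(samples)[:k].mean())    # mean of the worst alpha-fraction

h_samples = np.random.normal(loc=0.5, scale=0.2, size=1000)  # toy barrier samples
print(cvar_lower_tail(h_samples, alpha=0.1))     # keep >= 0 for CVaR-safety
```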
|
|
10:50-10:55, Paper WeAT7.5 | |
Fast-Revisit Coverage Path Planning for Autonomous Mobile Patrol Robots Using Long-Range Sensor Information |
|
Kachavarapu, Srinivas | Ostfalia University of Applied Sciences |
Doernbach, Tobias | Ostfalia University of Applied Sciences |
Gerndt, Reinhard | Ostfalia University of Applied Sciences |
Keywords: Motion and Path Planning, Reactive and Sensor-Based Planning, Surveillance Robotic Systems
Abstract: The utilization of Unmanned Ground Vehicles (UGVs) for patrolling industrial sites has expanded significantly. These UGVs are typically equipped with perception systems, e.g., computer vision, with limited range due to sensor limitations or site topology. High-level control of the UGVs requires Coverage Path Planning (CPP) algorithms that navigate all relevant waypoints and promptly start the next cycle. In this paper, we propose the novel Fast-Revisit Coverage Path Planning (FaRe-CPP) algorithm, which uses a greedy heuristic approach to propose waypoints for maximum coverage area and a random-search-based path optimization technique to obtain a path along the proposed waypoints with minimum revisit time. We evaluated the algorithm in a simulated environment using Gazebo and a camera-equipped TurtleBot3 against a number of existing algorithms. Compared to their average path lengths and revisit times, our FaRe-CPP algorithm showed reductions of at least 21% and 33%, respectively, in these highly relevant performance indicators.
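The greedy max-coverage waypoint proposal can be read as a set-cover-style loop; the coverage sets below are toy inputs, and the real algorithm additionally accounts for sensor range and site topology.

```python
# Sketch: repeatedly pick the candidate waypoint covering the most
# still-uncovered target cells.
def greedy_waypoints(candidates, covers, targets):
    """covers[c] = set of target cells observable from candidate c."""
    remaining, chosen = set(targets), []
    while remaining:
        best = max(candidates, key=lambda c: len(covers[c] & remaining))
        if not covers[best] & remaining:
            break                          # leftover targets are unreachable
        chosen.append(best)
        remaining -= covers[best]
    return chosen

covers = {"A": {1, 2, 3}, "B": {3, 4}, "C": {5}}
print(greedy_waypoints(["A", "B", "C"], covers, targets={1, 2, 3, 4, 5}))
```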
|
|
10:55-11:00, Paper WeAT7.6 | |
PIPE Planner: Pathwise Information Gain with Map Predictions for Indoor Robot Exploration |
|
Baek, Seungjae | Ulsan National Institute of Science and Technology |
Moon, Brady | Brigham Young University |
Kim, Seungchan | Carnegie Mellon University |
Cao, Muqing | Carnegie Mellon University |
Ho, Cherie | Carnegie Mellon University |
Scherer, Sebastian | Carnegie Mellon University |
Jeon, Jeong hwan | Ulsan National Institute of Science and Technology |
Keywords: Motion and Path Planning, Reactive and Sensor-Based Planning
Abstract: Autonomous exploration in unknown environments requires estimating the information gain of an action to guide planning decisions. While prior approaches often compute information gain at discrete waypoints, pathwise integration offers a more comprehensive estimation but is often computationally challenging or infeasible and prone to overestimation. In this work, we propose the Pathwise Information Gain with Map Prediction for Exploration (PIPE) planner, which integrates cumulative sensor coverage along planned trajectories while leveraging map prediction to mitigate overestimation. To enable efficient pathwise coverage computation, we introduce a method to efficiently calculate the expected observation mask along the planned path, significantly reducing computational overhead. We validate PIPE on real-world floorplan datasets, demonstrating its superior performance over state-of-the-art baselines. Our results highlight the benefits of integrating predictive mapping with pathwise information gain for efficient and informed exploration.
|
|
11:00-11:05, Paper WeAT7.7 | |
ALVO: Adaptive Learning with Velocity Obstacles for UGV Navigation in Dynamic Scenes |
|
Xie, Yinduo | Shandong University |
Zhao, Yuenan | Shandong University |
Song, Ran | Shandong University |
Li, Zhiheng | Shandong University |
Han, Lei | Tencent Robotics X |
Zhang, Wei | Shandong University |
Keywords: Motion and Path Planning, Reinforcement Learning
Abstract: Autonomous navigation of unmanned ground vehicles (UGVs) in dynamic scenes is a challenging task that requires them to avoid obstacles and move toward the goal simultaneously. This paper proposes ALVO, an adaptive learning policy that leverages velocity obstacles for UGV navigation. ALVO employs an adaptive gating-based mechanism for reactive obstacle avoidance, which enables the UGV to either slow down or proactively navigate around obstacles based on the relative importance of the environmental state and the goal. A reward function based on velocity obstacles is also designed to guide the UGV to navigate toward the goal while avoiding obstacles. Extensive experiments demonstrate that ALVO outperforms the competing approaches in various dynamic environments. We also implemented our method on a real UGV and showed that it performed well in real-world scenarios.
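A basic velocity-obstacle test of the kind such a reward can be built on is sketched below; ALVO's actual reward shaping and gating mechanism are more elaborate.

```python
# Sketch: flag a relative velocity as colliding if the closest approach
# within the horizon comes nearer than the summed radii.
import numpy as np

def in_velocity_obstacle(p_rel, v_rel, r_sum, horizon=5.0):
    t = -np.dot(p_rel, v_rel) / (np.dot(v_rel, v_rel) + 1e-9)  # closest-approach time
    t = np.clip(t, 0.0, horizon)
    return np.linalg.norm(p_rel + t * v_rel) < r_sum

print(in_velocity_obstacle(np.array([3.0, 0.0]),    # obstacle position relative to UGV
                           np.array([-1.0, 0.05]),  # obstacle velocity relative to UGV
                           r_sum=0.6))              # robot radius + obstacle radius
```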
|
|
11:05-11:10, Paper WeAT7.8 | |
A Cosine Similarity Based Multitarget Path Planning Algorithm for Cable-Driven Manipulators (I) |
|
Zhang, Dong | School of Information Science and Technology, Beijing University |
Gai, Yan | Beijing University of Chemical Technology |
Ju, Renjie | Beijing University of Chemical Technology |
Zhou, MengChu | New Jersey Institute of Technology |
Cao, Zhengcai | Harbin Institute of Technology |
Keywords: Motion and Path Planning, Soft Robot Applications, Biologically-Inspired Robots
Abstract: Many path planning algorithms have been proposed and employed for cable-driven manipulators (CDMs). However, most of them only consider single-target-point tasks. For multi-target-point tasks, CDMs need to repeat the planning and following of single-point tasks. This is feasible but not optimal in terms of the distance and time needed by CDMs to complete such tasks. To solve this problem, this work designs a novel two-stage multi-target-point path planning (MPP) method. In the first stage, an improved rapidly exploring random tree (RRT)-A* algorithm that considers CDMs' features is used to preplan passable paths between each target and a start point. In the second stage, to prevent the CDM from repeatedly moving along similar preplanned paths, cosine similarity theory is used, for the first time, to integrate these paths. Furthermore, an indicator named path cost is defined to evaluate paths. This indicator takes into account CDMs' constraints, path lengths, and energy consumption. Simulations are conducted to compare MPP with some classical and recently developed algorithms. The results show that it clearly outperforms them in terms of path length and tracking time. Furthermore, the proposed method is verified by experiments on a 17-degree-of-freedom CDM prototype.
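The cosine-similarity test used to decide whether two preplanned path segments are close enough to merge reduces to the plain vector form below; the 0.95 threshold is an illustrative assumption.

```python
# Sketch: merge decision via cosine similarity of segment directions.
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

seg_a = np.array([1.0, 0.1, 0.0])    # direction of a segment on path A
seg_b = np.array([0.9, 0.2, 0.0])    # direction of a nearby segment on path B
if cos_sim(seg_a, seg_b) > 0.95:     # threshold is an assumption
    print("segments similar enough to share one traversal")
```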
|
|
WeAT8 |
308 |
Medical Robots and Systems 5 |
Regular Session |
Chair: Nguyen, Anh | University of Liverpool |
|
10:30-10:35, Paper WeAT8.1 | |
Registration after Completion: Towards Sparse and Partial Point Set Registration for Computer-Assisted Orthopedic Surgery |
|
Du, Xinzhe | Shandong University |
Ma, Shixing | Shandong University |
Zhang, Zhengyan | Harbin Institute of Technology, Shenzhen |
Song, Rui | Shandong University |
Li, Yibin | Shandong University |
Meng, Max Q.-H. | The Chinese University of Hong Kong |
Min, Zhe | University College London |
Keywords: Computer Vision for Medical Robotics, Medical Robots and Systems
Abstract: In computer-assisted orthopedic surgery (CAOS), accurate point set registration is essential for enhancing surgical accuracy. However, the sparse and low-overlap nature of intraoperative point sets presents significant challenges for reliable registration. To deal with these challenges, we propose a novel registration-after-completion framework, where the intraoperative point set is first completed, after which the two full point sets are registered. Our main contributions are as follows. First, we propose a progressive two-stage strategy to complete the sparse and partial intraoperative point set. Second, considering that 1) the intraoperative point set contains noise, 2) the point completion process is not perfect, and 3) the resolution of the preoperative image is limited, we adopt bidirectional hybrid mixture models (HMMs) to represent the point set pairs and formulate a probabilistic registration network. In the proposed correspondence network, a dual-path cross-attention mechanism is adopted for feature fusion and a clustering mechanism is leveraged for calculating point-to-mixture correspondences. Furthermore, a bidirectional registration mechanism is leveraged to compute the transformation based on the estimated correspondences. Third, we have extensively validated the proposed approach on various datasets and bone phantoms. Our experiments on 1399 human femur and 1301 hip models demonstrate that our method achieves state-of-the-art performance across overlap rates from 15% to 35% and at various point counts (i.e., 25, 50, and 100 points) under conditions with less than 50% overlap. Additionally, real phantom experiments on femur and hip models validate the method's performance in simulated surgical scenarios. Experiments on ModelNet40 further confirmed our method's effectiveness and generalizability.
|
|
10:35-10:40, Paper WeAT8.2 | |
SplineFormer: An Explainable Transformer Network for Autonomous Endovascular Navigation |
|
Jianu, Tudor | University of Liverpool |
Doust, Shayan | University of Liverpool |
Li, Mengyun | University of Liverpool |
Huang, Baoru | Imperial College London |
Do, Tuong | AIOZ |
Nguyen, Hoan | University of Information Technology |
Bates, Karl Thomas | University of Liverpool |
Ta, Tung D. | The University of Tokyo |
Fichera, Sebastiano | University of Liverpool |
Berthet-Rayne, Pierre | King's College London |
Nguyen, Anh | University of Liverpool |
Keywords: Computer Vision for Medical Robotics, Computer Vision for Automation
Abstract: Robot-assisted endovascular navigation provides significant advantages, including reduced radiation exposure for surgeons and improved patient safety. However, a major challenge is to control curvilinear instruments like guidewires precisely for smooth and accurate navigation while adapting to anatomical variations and external forces. Traditional segmentation-based approaches struggle with real-time prediction of the guidewire’s evolving shape, limiting their effectiveness in navigation tasks. In this paper, we propose SplineFormer, an explainable transformer network that predicts the continuous, structured representation of the guidewire as a B-spline. This formulation enables a compact, smooth, and explainable state representation that facilitates downstream navigation. By leveraging SplineFormer’s predictions within an imitation learning framework, our system successfully performs autonomous endovascular navigation. Experimental results show that SplineFormer achieves a 50% success rate when fully autonomously cannulating the Brachiocephalic Artery in a real robotic setup, demonstrating its potential for improved autonomous navigation in endovascular interventions.
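Decoding a guidewire centerline from predicted B-spline control points can be sketched with SciPy as below; the degree, knot vector, and number of control points are common choices, not necessarily the paper's.

```python
# Sketch: clamped cubic B-spline from 5 control points, sampled densely.
import numpy as np
from scipy.interpolate import BSpline

degree = 3
ctrl = np.array([[0, 0], [1, 0.2], [2, 0.8], [3, 1.0], [4, 0.9]], dtype=float)
n = len(ctrl)
# Clamped uniform knots so the curve interpolates its endpoints.
knots = np.concatenate([np.zeros(degree),
                        np.linspace(0.0, 1.0, n - degree + 1),
                        np.ones(degree)])
spline = BSpline(knots, ctrl, degree)
samples = spline(np.linspace(0.0, 1.0, 50))   # dense 2D guidewire centerline
print(samples.shape)                          # (50, 2)
```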
|
|
10:40-10:45, Paper WeAT8.3 | |
Towards Accurate Brain Electrode Implantation Via Cross-Modality Fusion of White-Light and Photoacoustic Microscopy |
|
Liu, Yuxuan | Shanghai Jiao Tong University |
Luo, Yating | Shanghai Jiao Tong University |
Luan, Yunfei | Shanghai Jiao Tong University |
Zhou, Xinyao | Shanghai Jiao Tong University |
Yang, Jianxin | Shanghai Jiao Tong University |
Guo, Yao | Shanghai Jiao Tong University |
Yang, Guang-Zhong | Shanghai Jiao Tong University |
Keywords: Computer Vision for Medical Robotics, Surgical Robotics: Planning, Medical Robots and Systems
Abstract: Invasive flexible neural electrodes are becoming increasingly prevalent in monitoring and modulating brain neural activity, necessitating the precise and minimally invasive implantation of these electrodes to a depth of a few millimeters beneath the cerebral surface. Although Neuralink has pioneered robot-assisted neural electrode implantation guided by microscopy, it currently lacks the ability to detect non-cerebral surface microvessels that are invisible under the white-light microscope, leading to inaccurate implantation planning and a high risk of trauma. To address this limitation, we introduce a vascular-enhanced strategy that fuses intraoperative white-light microscopy and preoperative photoacoustic microscopy and applies the fusion results to our established microsurgical robotic system for brain electrode implantation. Specifically, a multi-modality data preprocessing pipeline is devised to extract representative features, and a 2.5D fusion network that incorporates a depth encoding mechanism is proposed to predict cross-modality correspondence. The enhanced fusion results are utilized for implantation planning and intraoperative guidance during in vivo surgical procedures. Both quantitative and qualitative results are presented to demonstrate the effectiveness of our proposed cross-modality fusion methods. Furthermore, in vivo surgical implementations on mice underscore the potential of the proposed approach for achieving more precise and minimally invasive brain electrode implantation.
|
|
10:45-10:50, Paper WeAT8.4 | |
An Evidence-Based Tri-Branch Cross-Pseudo Supervision Method for Semi-Supervised Medical Image Segmentation |
|
Li, Dongyue | Peking University |
Luo, Aocheng | Peking University |
Wang, Shaoan | Peking University |
Hu, Yaoqing | Peking University |
Pan, Jie | Nanyang Technological University |
Wang, Yifei | Peking University |
Yu, Junzhi | Chinese Academy of Sciences |
Keywords: Computer Vision for Medical Robotics, Object Detection, Segmentation and Categorization, Deep Learning Methods
Abstract: Semi-supervised medical image segmentation with a few annotated data can provide significant help in robot-assisted surgery. This step plays a pivotal role in identifying pathological regions and planning surgical procedures more appropriately. In this work, we develop an evidence-based tri-branch cross-pseudo supervision model, which integrates evidence-based uncertainty estimation and multi-branch cross supervision to bolster the effectiveness of semi-supervised learning. The overall framework consists of a vanilla network and an evidential dual-branch network. Two evidential branches, EPB and ERB, are proposed to complement each other and improve the quality of pseudo-labels. The EPB places more focus on classification accuracy at the pixel level, while the ERB emphasizes the similarity and overall integrity of the segmented regions. Then, a novel cross-pseudo supervision strategy among the three branches is designed to guarantee that valuable and diverse unlabeled knowledge is explored and transferred for segmentation improvement. The effectiveness of the proposed method was verified on the ACDC dataset, achieving outstanding performance compared with other state-of-the-art methods. In addition, we conducted an ablation study to validate the effectiveness of the evidential branches (EPB and ERB) and the tri-branch cross-supervision strategy, respectively.
|
|
10:50-10:55, Paper WeAT8.5 | |
EndoMUST: Monocular Depth Estimation for Robotic Endoscopy Via End-To-End Multi-Step Self-Supervised Training |
|
Shao, Liangjing | Fudan University |
Bai, Linxin | Fudan University |
Du, Chenkang | Fudan University |
Chen, Xinrong | Fudan University |
Keywords: Computer Vision for Medical Robotics, Deep Learning for Visual Perception, Vision-Based Navigation
Abstract: Monocular depth estimation and ego-motion estimation are significant tasks for scene perception and navigation in stable, accurate, and efficient robot-assisted endoscopy. To tackle lighting variations and sparse textures in endoscopic scenes, multiple techniques, including optical flow, appearance flow, and intrinsic image decomposition, have been introduced into existing methods. However, an effective training strategy for these multiple modules remains critical for handling both illumination reflectance and information interference in self-supervised depth estimation for endoscopy. Therefore, a novel framework with multi-step efficient finetuning is proposed in this work. In each epoch of end-to-end training, the process is divided into three steps: optical flow registration, multi-scale image decomposition, and multiple transformation alignments. At each step, only the related networks are trained, without interference from irrelevant information. Based on parameter-efficient finetuning of the foundation model, the proposed method achieves state-of-the-art performance on self-supervised depth estimation on the SCARED dataset and zero-shot depth estimation on the Hamlyn dataset, with 4%-10% lower error. The code for this work has been pre-released at https://github.com/BaymaxShao/EndoMUST.
|
|
10:55-11:00, Paper WeAT8.6 | |
TTTFusion: A Test-Time Training-Based Strategy for Multimodal Medical Image Fusion in Surgical Robots |
|
Xie, Qinhua | East China Normal University |
Tang, Hao | Peking University |
Keywords: Computer Vision for Medical Robotics, Medical Robots and Systems
Abstract: As surgical robots are increasingly used in clinical practice, enhancing their ability to process multimodal medical images has become a key research challenge. Although traditional medical image fusion methods have made progress in improving fusion accuracy, they still face significant challenges in real-time performance, fine-grained feature extraction, and edge preservation. In this paper, we introduce TTTFusion, a Test-Time Training (TTT)-based image fusion strategy that dynamically adjusts model parameters during inference to effectively fuse multimodal medical images. By adapting the model at test time, our method optimizes its parameters according to the input image data, yielding more accurate fusion results and better detail preservation. Experimental results show that, compared with traditional fusion methods, TTTFusion significantly improves the fusion quality of multimodal images, particularly in fine-grained feature extraction and edge preservation. This approach not only improves the accuracy of image fusion but also provides a novel technical solution for real-time image processing in surgical robots.
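The generic test-time-training loop that such a strategy builds on can be sketched as a brief self-supervised fine-tune on the test sample before fusing; the fusion-specific loss and model are left abstract here and are not the paper's implementation.

```python
# Sketch: adapt model parameters on the test sample itself (no labels),
# then produce the fused image.
import torch

def test_time_adapt(model, x_modalities, self_loss, steps=5, lr=1e-4):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = self_loss(model, x_modalities)   # self-supervised objective
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(*x_modalities)             # fused output after adaptation
```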
|
|
11:00-11:05, Paper WeAT8.7 | |
High Confidence Surgical Instrument Transparency Adopting Miniature Multi-View Endoscope System |
|
Xu, Handing | Tsinghua University |
Nie, Zhenguo | Tsinghua University |
Pan, Huimin | Tsinghua University |
Xu, Yanjie | Tsinghua University |
Liu, Xin-Jun | Tsinghua University |
Keywords: Computer Vision for Medical Robotics, Medical Robots and Systems, AI-Based Methods
Abstract: With the continuous advancement of minimally invasive surgery, incisions have become increasingly smaller, leading to a proportional convergence between the physical dimensions of surgical instruments and the operational space. This phenomenon has exacerbated the issue of visual obstruction caused by overlapping instruments, which now stands as a significant technical impediment to enhancing precision in minimally invasive procedures. Vanilla approaches rely on deep learning-based image inpainting techniques to address this issue, but their results are unreliable for surgeons making decisions. This work rethinks the hardware design, sacrificing resolution to increase the cameras' field of view and effectively filling the instrument-occluded areas through multi-view technology. Subsequently, a super-resolution method is employed to restore the details of the inpainted images. This innovative approach transforms the uncertainty of deep learning from a large-range pixel inference problem into a more controllable pixel interpolation task, thereby significantly enhancing the reliability and accuracy of the repaired results. Taking spinal endoscopic surgery as an application scenario, we designed a miniature multi-view endoscope system tailored to the specific needs of this surgery. Phantom experiments are conducted to verify the feasibility of the proposed instrument transparency method. The results demonstrate the potential of the proposed method for improving minimally invasive surgery.
|
|
11:05-11:10, Paper WeAT8.8 | |
Compact and Foldable Hip Exoskeleton with High Torque Density Actuator for Walking and Stair-Climbing Assistance in Young and Older Adults (I) |
|
Yan, Yuming | North Carolina State University |
Huang, Jin Sen | North Carolina State University |
Zhu, Junxi | North Carolina State University |
Hou, Zhimin | National University of Singapore |
Gao, Weibo | North Carolina State University |
Lopez-Sanchez, Ivan Alonso | North Carolina State University |
Srinivasan, Nitin | North Carolina State University |
Srihari, Advaith | North Carolina State University |
Su, Hao | New York University |
Keywords: Prosthetics and Exoskeletons, Wearable Robotics, Medical Robots and Systems
Abstract: Exoskeletons hold great potential to enhance human locomotion performance, but their development is often hindered by bulky, heavy, and obtrusive actuator and mechanism designs. Here, we present a compact and lightweight hip exoskeleton endowed with a custom high-torque-density actuator and two foldable mechanisms, namely a foldable waist belt with a self-alignment mechanism and a foldable thigh brace with a self-adjusting linear slider mechanism. Our actuator electromagnetic design model considered four design parameters, including end winding length, stator teeth number, rotor pole pair number, and gear ratio, tailored for portable exoskeletons. The two foldable mechanisms enhance exoskeleton adaptability and user comfort. Benchtop experimental results demonstrated that our actuator can provide an 18 Nm peak torque with a packaging factor improvement of 27% over state-of-the-art actuators used in exoskeletons. The volume of our exoskeleton was reduced by 55% and the weight from 3.7 kg to 2.7 kg compared to our prior design. Preliminary human experiments demonstrated the feasibility of our exoskeleton for reducing metabolic rate during walking and stair climbing in young and older adults.
|
|
WeAT9 |
309 |
Object Detection, Segmentation and Categorization 1 |
Regular Session |
|
10:30-10:35, Paper WeAT9.1 | |
CMGFA: A BEV Segmentation Model Based on Cross-Modal Group-Mix Attention Feature Aggregator |
|
Kuang, Xinkai | University of Science and Technology of China |
Niu, Runxin | Hefei Institutes of Physical Science, Chinese Academy of Science |
Hua, Chen | University of Science and Technology of China |
Jiang, Chunmao | University of Science and Technology of China |
Zhu, Hui | Hefei Institutes of Physical Science, Chinese Academy of Science |
Chen, Ziyu | University of Science and Technology of China |
Yu, Biao | Hefei Institutes of Physical Science, Chinese Academy of Science |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning Methods, Sensor Fusion
Abstract: Bird's eye view (BEV) segmentation maps are a recent development in autonomous driving that provide effective environmental information, such as drivable areas and lane dividers. Most existing methods use cameras and LiDAR as inputs for segmentation, and the fusion of the different modalities is accomplished through either concatenation or addition operations, which fails to fully exploit the correlation and complementarity between modalities. This paper presents the CMGFA (Cross-Modal Group-mix attention Feature Aggregator), an end-to-end learning framework that can adapt to multiple modal feature combinations for BEV segmentation. The CMGFA comprises the following components: (i) the camera stream has a dual-branch structure that strengthens the linkage between local and global features; (ii) multi-head deformable cross-attention is applied as a cross-modal feature aggregator to implicitly fuse camera, LiDAR, and radar feature maps in BEV; (iii) Group-Mix attention is used to enrich the attention map feature space and enhance the ability to segment between different categories. We evaluate our proposed method on the nuScenes and Argoverse2 datasets, where the CMGFA significantly outperforms the baseline. The code will be updated at https://github.com/kuangxk2016/CMGFA/tree/master
|
|
10:35-10:40, Paper WeAT9.2 | |
Feature-Aligned Fisheye Object Detection Network for Autonomous Driving |
|
Cao, Hu | Technical University of Munich |
Sun, Dongyi | Technical University of Munich |
Song, Rui | Fraunhofer IVI |
Xia, Yan | Technical University of Munich |
Li, Xinyi | Technical University of Munich |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception
Abstract: Fisheye cameras, renowned for their panoramic 360° field of view (FOV), are crucial for surround-view perception in autonomous driving. However, research on object perception in fisheye images lags behind that of standard images. To address this gap, we propose a feature-aligned fisheye object detection network specifically tailored for autonomous driving. Current fisheye perception algorithms often overlook the misalignment issues that typically arise in object detectors. To tackle these challenges in the feature pyramid network (FPN), we introduce a feature-aligned pyramid module (FaPM), which learns pixel transformation offsets to contextually align feature maps. Additionally, we present a location-aligned detection head (LaDH) to align the spatial distribution of classification and regression localization. Integrating these modules into a detection framework results in a novel feature-aligned fisheye object detector. Our method undergoes extensive evaluation on the WoodScape dataset, achieving a mean average precision (mAP) of 32.2%, surpassing the performance of existing methods.
|
|
10:40-10:45, Paper WeAT9.3 | |
Resource-Efficient Affordance Grounding with Complementary Depth and Semantic Prompts |
|
Huang, Yizhou | Hunan University |
Yang, Fan | Hunan University |
Zhu, Guoliang | Hunan University |
Li, Gen | The University of Edinburgh |
Shi, Hao | Zhejiang University |
Zuo, Yukun | Hunan University |
Chen, Wenrui | Hunan University |
Li, Zhiyong | HUNAN UNIVERSITY |
Yang, Kailun | Hunan University |
Keywords: Object Detection, Segmentation and Categorization, RGB-D Perception, Deep Learning for Visual Perception
Abstract: Affordance refers to the functional properties that an agent perceives and utilizes from its environment, and is key perceptual information required for robots to perform actions. This information is rich and multimodal in nature. Existing multimodal affordance methods face limitations in extracting useful information, mainly due to simple structural designs, basic fusion methods, and large model parameters, making it difficult to meet the performance requirements for practical deployment. To address these issues, this paper proposes the BiT-Align image-depth-text affordance mapping framework. The framework includes a Bypass Prompt Module (BPM) and a Text Feature Guidance (TFG) attention selection mechanism. BPM integrates the auxiliary-modality depth image directly as a prompt to the primary-modality RGB image, embedding it into the primary-modality encoder without introducing additional encoders. This reduces the model's parameter count and effectively improves functional region localization accuracy. The TFG mechanism guides the selection and enhancement of attention heads in the image encoder using textual features, improving the understanding of affordance characteristics. Experimental results demonstrate that the proposed method achieves significant performance improvements on the public AGD20K and HICO-IIF datasets. On the AGD20K dataset, compared with the current state-of-the-art method, we achieve a 6.0% improvement in the KLD metric while reducing model parameters by 88.8%, demonstrating practical application value. The source code will be made publicly available at https://github.com/DAWDSE/BiT-Align.
|
|
10:45-10:50, Paper WeAT9.4 | |
Multistream Network for LiDAR and Camera-Based 3D Object Detection in Outdoor Scenes |
|
Ibrahim, Muhammad | University of Western Australia |
Akhtar, Naveed | University of Melbourne |
Wang, Haitian | University of Western Australia |
Anwar, Saeed | University of Western Australia |
Mian, Ajmal | University of Western Australia |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning Methods, Sensor Fusion
Abstract: Fusion of LiDAR and RGB data has the potential to enhance outdoor 3D object detection accuracy and has started gaining traction for addressing real-world challenges. However, effective integration of these modalities for precise object detection remains a largely open problem. To address it, we propose a MultiStream Detection (MuStD) network, which meticulously extracts task-relevant information from both data modalities. The network follows a three-stream structure. Its LiDAR-PillarNet stream extracts sparse 2D pillar features from the LiDAR input while the LiDAR-Height Compression stream computes Bird's-Eye View features. An additional 3D Multimodal stream combines RGB and LiDAR features using UV mapping and polar coordinate indexing. Eventually, the features containing comprehensive spatial, textural, and geometric information are carefully fused and fed to a detection head for 3D object detection. We evaluate our method on the challenging KITTI Object Detection Benchmark using the official testing server. Our approach achieves strong performance, with an average precision (AP) of 85.39% in 3D detection, 91.34% in Bird's Eye View (BEV) detection, and 96.39% in 2D detection. These results match or surpass existing state-of-the-art methods. In the difficult "Hard" category, our method attains 80.78% AP in 3D detection and 94.04% AP in 2D detection, highlighting its robustness in challenging scenarios. Furthermore, our method runs at 67 ms, demonstrating efficiency and real-time capability. Our code will be released through the MuStD GitHub repository at https://github.com/IbrahimUWA/MuStD.
|
|
10:50-10:55, Paper WeAT9.5 | |
SRCNet: Super-Resolution Networks for Capsule Endoscope Robots |
|
Tan, Menglu | Beihang University |
Zhan, Guangdong | Beihang University |
Zeng, Zijin | Beihang University |
Wang, Ao | Beihang University |
Feng, Lin | Beihang University |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, AI-Based Methods
Abstract: In recent years, capsule robots have gained wide acceptance among doctors and patients for the examination of gastrointestinal diseases due to their non-invasive, safe, and painless advantages. However, the image resolution captured by capsule robots is limited by space and power constraints, which hinders doctors' ability to accurately assess patients' stomach conditions and control the capsule robot in real time. This paper proposes two super-resolution networks for capsule robot videos. The first network, EndoVSR, is a high-performance offline video super-resolution network based on a generative adversarial network, designed to enhance the resolution of captured videos during offline processing. The second network, Bi-RUN, is a real-time video super-resolution network based on recurrent neural networks, designed to enhance video resolution in real time so that doctors have a clearer view of the stomach during the examination. Extensive training and verification of these networks have been conducted on different datasets, with leading results across all performance indicators. Furthermore, simulation experiments were carried out on pig stomachs in vitro to further validate the performance of the proposed networks in practical applications.
|
|
10:55-11:00, Paper WeAT9.6 | |
Synthetica: Large Scale Synthetic Data Generation for Robot Perception |
|
Singh, Ritvik | NVIDIA |
Liu, Jason Jingzhou | Carnegie Mellon University
Van Wyk, Karl | NVIDIA |
Chao, Yu-Wei | NVIDIA |
Lafleche, Jean-Francois | NVIDIA |
Shkurti, Florian | University of Toronto |
Ratliff, Nathan | NVIDIA |
Handa, Ankur | NVIDIA
Keywords: Object Detection, Segmentation and Categorization, Data Sets for Robotic Vision, Simulation and Animation
Abstract: Vision-based object detectors are a crucial basis for robotics applications as they provide valuable information about object localization in the environment. These detectors must ensure high reliability under different lighting conditions, occlusions, and visual artifacts, all while running in real time. Collecting and annotating real-world data for these networks is prohibitively time-consuming and costly, especially for custom assets such as industrial objects, making it untenable for generalization to in-the-wild scenarios. To this end, we present Synthetica, a method for large-scale synthetic data generation for training robust state estimators. This paper focuses on the task of object detection, an important problem which can serve as the front-end for most state estimation problems, such as pose estimation. Leveraging data from a photorealistic ray-tracing renderer, we scale up data generation, producing 2.7 million images to train highly accurate real-time detection transformers. We present a collection of rendering randomization and training-time data augmentation techniques conducive to robust sim-to-real performance for vision tasks. We demonstrate state-of-the-art performance on the task of object detection with detectors that run at 50-100 Hz, 9 times faster than the prior SOTA. We further demonstrate the usefulness of our training methodology for robotics applications by showcasing a pipeline for use in the real world with custom objects for which no prior datasets exist. Our work highlights the importance of scaling synthetic data generation for robust sim-to-real transfer while achieving the fastest real-time inference speeds. Videos and supplementary information can be found at https://sites.google.com/view/synthetica-vision
|
|
11:00-11:05, Paper WeAT9.7 | |
A Multi-Task Learning System for Composites Defect Segmentation and Classification with TacRoller |
|
Li, Xiaolong | University of Liverpool |
Li, Tunwu | University of Bristol |
Lu, Zhenyu | South China University of Technology |
Zeng, Chao | University of Liverpool |
Cheng, Guangliang | University of Liverpool |
Yang, Chenguang | University of Liverpool |
Keywords: Object Detection, Segmentation and Categorization, Force and Tactile Sensing, Deep Learning Methods
Abstract: Because non-destructive testing (NDT) techniques are both expensive and inconvenient in dynamic detection scenarios, innovative alternatives are urgently needed to address cost-efficiency and deployment challenges. We first design TacRoller, a tactile sensor roller for automated characterization of surface defects in composite materials. It collects tactile images of defects on the composite's plies by capturing deformation of the outer elastomer through an internal camera. It reduces the cost of inspection by 80% to 90% compared to NDT equipment such as radiographic testing while maintaining detection efficiency, taking 58.86 seconds to scan a 35 cm x 18 cm x 0.5 mm dry-woven fabric sample. Moreover, we collect a total of 2,744 tactile images of dry-woven fabric and unidirectional prepreg samples with TacRoller to form a dataset covering wrinkles, foreign objects and debris (FODs), broken fibre, voids, and healthy textures. Subsequently, we propose a multi-order gated aggregation (MOGA)-U-Net to tackle the critical challenges of noise sensitivity and multi-scale defect recognition in tactile images, enabling robust segmentation and multi-category classification. The results show that MOGA-U-Net achieves a test Dice coefficient of 76.0% and classification accuracy of 98.9%, outperforming DeepLabV3 and other benchmarks. By providing a scalable and effective NDT substitute, our system realises autonomous defect identification and classification on composite surfaces, thus improving quality control in composite production.
|
|
11:05-11:10, Paper WeAT9.8 | |
Task-Oriented Token Pruning for Efficient Object Detection and Segmentation |
|
Liang, Hao | Institute of Computing Technology, Chinese Academy of Sciences |
Kan, Meina | Institute of Computing Technology, Chinese Academy of Sciences |
Shan, Shiguang | Chinese Academy of Sciences, Institute of Computing Technology |
Chen, Xilin | Institute of Computing Technology, Chinese Academy |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception
Abstract: Robots rely heavily on visual perception to understand and interact with complex environments. To support this capability, modern perception models have become increasingly large and powerful, resulting in high computational costs that hinder their real-time performance in robotic applications. Existing acceleration techniques, such as model pruning and token pruning, focus on reducing architectural or parameter redundancy but still process all object categories, regardless of task requirements. However, in real-world robotic scenarios, different tasks typically require only a subset of object categories. For instance, a service robot may focus on kitchenware while cooking, but shift to furniture and obstacles while cleaning. This task-dependent variation creates opportunities to reduce computational cost by selectively processing relevant information. Existing methods are not designed to exploit this potential for task-specific efficiency. To address this limitation, we propose TaskTP, a task-oriented token pruning method that dynamically adjusts token pruning based on the target category set. A dynamic gating network is introduced between successive Transformer blocks, which evaluates the relevance of each token to the given task. TaskTP allows for more aggressive pruning when fewer categories are required, optimizing computation without sacrificing performance. After a task-agnostic training phase, it can be flexibly configured at deployment time to support any category subset without retraining, making it both efficient and versatile. TaskTP raises the inference speed of Mask R-CNN from 31.4 fps to 38.5 fps on the COCO dataset. Furthermore, on the ScanNet dataset, where an object search task was defined to simulate real-world robotic applications, processing time was reduced from 3197 ms to 2437 ms, demonstrating significant efficiency gains.
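A minimal sketch of task-oriented token gating in the spirit of the abstract above (illustrative assumptions throughout; not the authors' code): a small gating network scores each token against an embedding of the active category set, and low-relevance tokens are dropped between Transformer blocks, with a smaller keep ratio when fewer categories are requested.

```python
import torch
import torch.nn as nn

class TaskTokenGate(nn.Module):
    """Hypothetical gate scoring token relevance to the current category set."""
    def __init__(self, dim=256, task_dim=256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim + task_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, tokens, task_emb, keep_ratio=0.5):
        # tokens: (B, N, C); task_emb: (B, task_dim), e.g. pooled text embeddings
        # of the category names the current task requires.
        B, N, _ = tokens.shape
        t = task_emb.unsqueeze(1).expand(-1, N, -1)
        relevance = self.score(torch.cat([tokens, t], dim=-1)).squeeze(-1)  # (B, N)
        k = max(1, int(N * keep_ratio))  # fewer categories -> smaller keep_ratio
        idx = relevance.topk(k, dim=1).indices
        return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))

gate = TaskTokenGate()
kept = gate(torch.randn(2, 196, 256), torch.randn(2, 256), keep_ratio=0.25)
```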
|
|
WeAT10 |
310 |
Range Sensing 1 |
Regular Session |
|
10:30-10:35, Paper WeAT10.1 | |
Depth Matters: Exploring Deep Interactions of RGB-D for Semantic Segmentation in Traffic Scenes |
|
Chen, Siyu | Jimei University |
Han, Ting | Sun Yat-Sen University |
Zhang, Changshe | Xidian University |
Liu, Weiquan | Jimei University |
Su, Jinhe | Jimei University |
Wang, Zongyue | Jimei University |
Cai, Guorong | Jimei University |
Keywords: RGB-D Perception, Computer Vision for Transportation, Object Detection, Segmentation and Categorization
Abstract: RGB-D has gradually become a crucial data source for understanding complex scenes in assisted driving. However, existing studies have paid insufficient attention to the intrinsic spatial properties of depth maps. This oversight significantly impacts the attention representation, leading to prediction errors caused by attention shift issues. To this end, we propose a novel learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the effectiveness of depth. Firstly, we introduce Depth Spatial-Aware Optimization (Depth SAO) as an offset to represent real-world spatial relationships. Secondly, the similarity in the feature space of RGB-D is learned by Depth Linear Cross-Attention (Depth LCA) to clarify spatial differences at the pixel level. Finally, an MLP decoder is utilized to effectively fuse multi-scale features to meet real-time requirements. Comprehensive experiments demonstrate that the proposed DiPFormer effectively addresses the issue of attention misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% / +1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI (97.57% F-score on KITTI road and 68.74% mIoU on KITTI-360) and Cityscapes (83.4% mIoU) datasets.
|
|
10:35-10:40, Paper WeAT10.2 | |
Simpler Is Better: Revisiting Doppler Velocity for Enhanced Moving Object Tracking with FMCW LiDAR |
|
Zeng, Yubin | National University of Defense Technology |
Wu, Tao | National University of Defense Technology |
Qi, Shouzheng | National University of Defense Technology |
Li, Junxiang | National University of Defense Technology |
Duan, Xingyu | National University of Defense Technology |
Yu, Youjin | National University of Defense Technology |
Keywords: Range Sensing, Autonomous Vehicle Navigation
Abstract: Real-time and accurate perception of dynamic objects is crucial for autonomous driving. To better capture the motion information of objects, some methods now employ 4D Doppler point clouds collected by frequency-modulated continuous-wave (FMCW) LiDAR to enhance the detection and tracking of moving objects. Compared to standard time-of-flight (ToF) LiDAR, FMCW LiDAR can provide the relative radial velocity of each point through the Doppler effect, offering a more detailed understanding of an object's motion state. However, despite the proven efficacy of these methods, ablation studies reveal that the direct contribution of Doppler velocity to tracking is limited, with performance gains often resulting from improved object recognition and labeling accuracy. Revisiting the role of Doppler velocity, this study proposes DopplerTrack, a simple yet effective learning-free tracking method tailored for FMCW LiDAR. DopplerTrack harnesses Doppler velocity for efficient point cloud preprocessing and object detection with O(n) complexity. Furthermore, by exploring the potential motion directions of objects, it reconstructs the full velocity vector, enabling more direct and precise motion prediction. Extensive experiments on four datasets demonstrate that DopplerTrack outperforms existing learning-free and learning-based methods, achieving state-of-the-art tracking performance with strong generalization across diverse scenarios. Moreover, DopplerTrack runs efficiently at 120 Hz on a mobile CPU, making it highly practical for real-world deployment. The code and datasets have been released at https://github.com/12w2/DopplerTrack.
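A minimal sketch of Doppler-based moving-point segmentation in the spirit of the O(n) preprocessing described above (the threshold, sign convention, and ego-velocity source are illustrative assumptions): for a static point, the measured radial velocity should match the radial component induced by ego-motion, so large residuals indicate motion.

```python
import numpy as np

def flag_moving_points(points, doppler, ego_velocity, thresh=0.5):
    """points: (N, 3) in the sensor frame; doppler: (N,) measured radial velocity (m/s);
    ego_velocity: (3,) sensor velocity expressed in the same frame."""
    rays = points / (np.linalg.norm(points, axis=1, keepdims=True) + 1e-9)
    expected = -rays @ ego_velocity        # radial velocity a static point would show
    residual = np.abs(doppler - expected)  # per-point check, O(n) overall
    return residual > thresh               # True = likely moving

pts = 20.0 * np.random.randn(1000, 3)
moving = flag_moving_points(pts, np.random.randn(1000), np.array([10.0, 0.0, 0.0]))
```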
|
|
10:40-10:45, Paper WeAT10.3 | |
Rendering Anywhere You See: Renderability Field-Guided Gaussian Splatting |
|
Jin, Xiaofeng | Politecnico Di Milano
Fang, Yan | University of Chinese Academy of Sciences |
Frosi, Matteo | Politecnico Di Milano |
Ge, Jianfei | Ningbo Institute of Materials, Chinese Academy of Sciences
Xiao, Jiangjian | Ningbo Institute of Industrial Technology, Chinese Academy of Sciences
Matteucci, Matteo | Politecnico Di Milano |
Keywords: Range Sensing, Sensor Fusion, Computer Vision for Automation
Abstract: Scene view synthesis, which generates novel views from limited perspectives, is increasingly vital for applications like virtual reality, augmented reality, and robotics. Unlike object-based tasks, such as generating 360° views of a car, scene view synthesis handles entire environments where non-uniform observations pose unique challenges for stable rendering quality. To address this issue, we propose a novel approach: renderability field-guided Gaussian splatting (RF-GS). This method quantifies input inhomogeneity through a renderability field, guiding pseudo-view sampling to enhance visual consistency. To ensure the quality of wide-baseline pseudo-views, we train an image restoration model to map point projections to visible-light styles. Additionally, our validated hybrid data optimization strategy effectively fuses pseudo-view angle information with source-view textures. Comparative experiments on simulated and real-world data show that our method outperforms existing approaches in rendering stability.
|
|
10:45-10:50, Paper WeAT10.4 | |
A Simple yet Effective Test-Time Adaptation for Zero-Shot Monocular Metric Depth Estimation |
|
Marsal, Remi | ENSTA Paris |
Chapoutot, Alexandre | ENSTA Paris |
Xu, Philippe | ENSTA Paris, Institut Polytechnique De Paris |
Filliat, David | ENSTA ParisTech |
Keywords: Range Sensing, RGB-D Perception, Deep Learning for Visual Perception
Abstract: The recent development of foundation models for monocular depth estimation, such as Depth Anything, has paved the way to zero-shot monocular depth estimation. Since such models return an affine-invariant disparity map, the favored technique to recover metric depth consists in fine-tuning the model. However, this stage is not straightforward; it can be costly and time-consuming because of the training and the creation of the dataset, which must contain images captured by the camera that will be used at test time together with the corresponding ground truth. Moreover, fine-tuning may also degrade the generalizing capacity of the original model. Instead, we propose in this paper a new method to rescale Depth Anything predictions using 3D points provided by sensors or techniques such as low-resolution LiDAR or structure-from-motion with poses given by an IMU. This approach avoids fine-tuning and preserves the generalizing power of the original depth estimation model while being robust to noise in the sparse depth, the camera-LiDAR calibration, and the depth model. Our experiments highlight improvements over zero-shot monocular metric depth estimation methods, competitive results compared to fine-tuned approaches, and better robustness than depth completion approaches. Code available at github.com/ENSTA-U2IS-AI/depth-rescaling.
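A minimal sketch of the rescaling idea (a standard scale-and-shift fit under illustrative assumptions, not necessarily the paper's exact procedure): fit a global scale and shift mapping the affine-invariant disparity to metric disparity at the sparse 3D points, then invert to metric depth. A robust variant could swap the least-squares fit for RANSAC.

```python
import numpy as np

def rescale_to_metric(pred_disparity, sparse_px, sparse_depth_m):
    """pred_disparity: (H, W) affine-invariant disparity from the depth model;
    sparse_px: (K, 2) integer (x, y) pixel coords with known metric depth;
    sparse_depth_m: (K,) metric depths (e.g. from low-resolution LiDAR)."""
    d_pred = pred_disparity[sparse_px[:, 1], sparse_px[:, 0]]
    d_metric = 1.0 / np.clip(sparse_depth_m, 1e-6, None)   # metric disparity
    A = np.stack([d_pred, np.ones_like(d_pred)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, d_metric, rcond=None)
    metric_disp = scale * pred_disparity + shift
    return 1.0 / np.clip(metric_disp, 1e-6, None)          # metric depth map
```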
|
|
10:50-10:55, Paper WeAT10.5 | |
A Patch-Based Transformer Method for Electrical Capacitance Tomography Image Reconstruction |
|
Wang, Yuliang | Beijing University of Posts and Telecommunications |
Shi, Duanpeng | Beijing University of Posts and Telecommunications
Liu, Huaping | Tsinghua University |
Guo, Di | Beijing University of Posts and Telecommunications |
Keywords: AI-Based Methods, Reactive and Sensor-Based Planning, Force and Tactile Sensing
Abstract: Electrical capacitance tomography (ECT) is a contactless and non-invasive imaging technique which visualizes the internal permittivity distribution of a region using boundary capacitance measurements. It has been widely used in object classification, tactile sensing, and multiphase flow monitoring. However, due to the inherent non-linearity and ill-conditioned nature of the ECT inverse problem, its practical implementation remains limited by challenges in image reconstruction accuracy. To tackle these problems, we propose a patch-based transformer method (PT) for accurate reconstruction of ECT images. Specifically, the complex capacitance-to-image mapping is systematically decoupled into capacitance-to-patch feature extraction and patch-to-image reconstruction, enabling more efficient and accurate permittivity distribution recovery through localized feature learning and global context integration. Additionally, a simulated ECT dataset of objects with varying sizes and positions is established.
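A minimal sketch of the capacitance-to-patch decoupling described above (sizes are illustrative assumptions; e.g. a 12-electrode sensor yields 66 independent pair measurements): capacitance measurements are mapped to patch tokens, refined by a Transformer for global context, and each token is decoded into one image patch.

```python
import torch
import torch.nn as nn

class PatchECT(nn.Module):
    """Hypothetical patch-based transformer for ECT reconstruction."""
    def __init__(self, n_meas=66, n_patches=64, dim=128, patch=8):
        super().__init__()
        self.to_tokens = nn.Linear(n_meas, n_patches * dim)   # capacitance -> patch features
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4)
        self.to_patch = nn.Linear(dim, patch * patch)         # patch feature -> pixels
        self.n_patches, self.dim, self.patch = n_patches, dim, patch

    def forward(self, capacitance):                           # (B, n_meas)
        B = capacitance.shape[0]
        tok = self.to_tokens(capacitance).view(B, self.n_patches, self.dim)
        tok = self.encoder(tok)                               # global context integration
        patches = self.to_patch(tok)                          # (B, P, patch*patch)
        side = int(self.n_patches ** 0.5)
        img = patches.view(B, side, side, self.patch, self.patch)
        return img.permute(0, 1, 3, 2, 4).reshape(B, side * self.patch, side * self.patch)

permittivity = PatchECT()(torch.randn(2, 66))   # (2, 64, 64) reconstructed map
```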
|
|
10:55-11:00, Paper WeAT10.6 | |
Event-Based Depth from Focus |
|
Xue, Wenjie | Epson Canada |
Shang, Limin | Epson |
Keywords: Range Sensing, Data Sets for Robotic Vision
Abstract: Depth from focus (DFF) is a well-established method for measuring depth in vision systems. However, its efficacy and accuracy are limited by the slow speed required to capture high-quality focal stacks. We address this limitation by leveraging emerging hardware technologies: the event camera and the liquid lens. In this paper, we introduce an innovative approach called Event-Based Depth from Focus (EDFF). We present a prototype system and propose the Event Cancellation Score (ECS), a novel metric for efficiently detecting focus in event data. To validate the effectiveness of our system, we have curated the first EDFF dataset, which comprises event recordings of focal sweeps performed on 3D-printed test targets. Comparative analysis against existing event focus detection algorithms demonstrates the superior performance of our algorithm on the EDFF task.
|
|
11:00-11:05, Paper WeAT10.7 | |
LSW-Net: A Spatio-Temporal Self-Supervised Framework for 2D LiDAR-Based Environment Perception |
|
Dai, Haojie | Tongji University |
Cui, Yujie | Tongji University |
Shi, Wenbo | Tongji University |
Ji, Mazeyu | UCSD |
Liu, Chengju | Tongji University |
Chen, Qijun | Tongji University |
Keywords: Range Sensing, SLAM, Mapping
Abstract: In the deep learning era, 2D LiDAR perception is often overlooked as research prioritizes 3D point clouds. Yet, 2D LiDAR remains essential for low-cost robotic systems due to its affordability. Despite its simplicity, it faces major challenges, not only in perception but also in learning itself, as data sparsity and limited features hinder effective framework development. Additionally, fundamental structural differences prevent direct adaptation of 3D perception networks. To address these concerns, we propose LSW-Net (Laser Scan Weight-Net), a self-supervised framework for 2D LiDAR perception, enabling end-to-end learning from raw point clouds to adaptive weight extraction. This framework provides a scalable and lightweight solution for 2D LiDAR environmental perception, and its self-supervised nature reduces annotation costs. It includes: (i) a general 2D Laser Encoder (LS-Encoder) that integrates local convolutional perception with global attention perception to extract multi-scale spatiotemporal features; and (ii) an interpretable weight extraction module (Weight Extractor) that dynamically quantifies the contribution of each point in environmental perception tasks through contrastive learning and physical consistency constraints. Evaluated on diverse scenes for point cloud registration and SLAM tasks, LSW-Net outperforms traditional methods in feature discriminability and adaptability. Additionally, we performed thorough ablation experiments to substantiate the rationality of our approach. Our code is available at https://github.com/YukiaCUI/LSW-Net.
|
|
11:05-11:10, Paper WeAT10.8 | |
UE-Extractor: A Grid-To-Point Ground Extraction Framework for Unstructured Environments Using Adaptive Grid Projection
|
Li, Ruoyao | Shanghai Jiao Tong University |
Wang, Yafei | Shanghai Jiao Tong University |
Sun, Shi | Shanghai Jiao Tong University |
Zhang, Yichen | Shanghai Jiao Tong University |
Fei, Ding | Hunan University |
Gao, Hongbo | University of Science and Technology of China |
Keywords: Range Sensing, Mapping, Field Robots
Abstract: Ground point cloud extraction is crucial for route planning of autonomous vehicles in unstructured environments. However, mainstream point cloud extraction methods are susceptible to inaccuracies due to the indistinct obstacle-ground boundary. Furthermore, addressing uneven feature distribution usually necessitates region segmentation, which increases computational demands. To balance efficiency and accuracy in ground extraction, we propose a two-stage framework based on adaptive bin partition and grid projection. Firstly, the point cloud is divided into bins based on its distribution and then projected onto uniform grids, ensuring robust handling of uneven features. Subsequently, grid-based coarse extraction is performed by analyzing grid characteristics, enabling rapid preliminary extraction. Finally, the coarse results are reprojected into point cloud form, and the ground-obstacle boundary regions are refined by incorporating elevation and curvature probabilities. Evaluations on the RELLIS-3D dataset and field tests conducted across typical unstructured scenarios demonstrate that the proposed method achieves promising extraction performance compared to state-of-the-art methods.
|
|
WeAT11 |
311A |
Reinforcement Learning 5 |
Regular Session |
|
10:30-10:35, Paper WeAT11.1 | |
Sample-Efficient Deep Reinforcement Learning of Mobile Manipulation for 6-DOF Trajectory Following (I) |
|
Zhou, Yifan | Shanghai Jiao Tong University |
Feng, Qiyu | Shanghai Jiao Tong University |
Zhou, Yixuan | Shanghai Jiao Tong University |
Lin, Jianghao | Shanghai Jiao Tong University |
Liu, Zhe | Shanghai Jiao Tong University |
Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: Reinforcement Learning, Whole-Body Motion Planning and Control, Deep Learning Methods
Abstract: The whole-body control of mobile manipulators for the 6-DOF trajectory following task is the basis of many continuous tasks. However, traditional control strategies rely on accurate models and expert knowledge for solving the trajectory following task. Deep reinforcement learning (DRL) provides a promising model-free solution, but it is sample-inefficient. To this end, we propose Trajectory Following Hindsight Experience Replay (TF-HER), a sample-efficient DRL algorithm for the whole-body coupled trajectory following task with dense rewards. TF-HER builds a multi-trajectory state space, and relabels the low-reward data to generate informative high-reward experiences. Also, the distributional shift caused by the relabeling is corrected by estimating the density ratio of relabeled experiences. Extensive demonstrations on both nonholonomic and holonomic bases in simulation validate that our algorithm can accelerate the model convergence and significantly improve the sample efficiency. Furthermore, we present real-world experiments to demonstrate the effectiveness of our approach. The code is available: https://github.com/IRMV-Manipulation-Group/TF-HER
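A minimal sketch of hindsight relabeling in the spirit of TF-HER (the data layout, dense reward, and sampling rule are illustrative assumptions, not the authors' exact scheme): a low-reward transition is relabeled with a goal actually achieved later in the episode, turning it into an informative high-reward example.

```python
import numpy as np

def relabel_episode(episode, reward_fn, k_future=4, rng=np.random.default_rng()):
    """episode: list of dicts with keys 'obs', 'action', 'achieved', 'goal'."""
    relabeled = []
    for t, tr in enumerate(episode):
        future_idx = rng.integers(t, len(episode), size=k_future)
        for j in future_idx:
            new_goal = episode[j]['achieved']   # pretend this was the goal all along
            relabeled.append({**tr,
                              'goal': new_goal,
                              'reward': reward_fn(tr['achieved'], new_goal)})
    return relabeled

# Dense reward: negative distance between achieved and desired waypoint.
dense_reward = lambda achieved, goal: -float(np.linalg.norm(achieved - goal))
```

The density-ratio correction mentioned in the abstract would additionally reweight these relabeled samples during updates; that step is omitted from this sketch.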
|
|
10:35-10:40, Paper WeAT11.2 | |
A Robust MLTD3 Path Planning Algorithm in Unknown Environments |
|
Wen, Tianqing | Shandong University of Science and Technology
Wang, Xiaomin | Shandong University of Science and Technology
Yang, Rui | Ocean Univerisity of China |
Sun, Zhendong | Shandong University of Science and Technology
Keywords: Reinforcement Learning, Collision Avoidance, Motion and Path Planning
Abstract: We explore the problem of path planning for mobile robots in unknown environments using deep reinforcement learning (DRL). This paper proposes a multi-layer long short-term memory twin delayed deep deterministic policy gradient (MLTD3) algorithm for unknown environments. First, we introduce multi-layer long short-term memory (LSTM) networks into the actor network of the twin delayed deep deterministic policy gradient (TD3) algorithm to capture rich historical information and alleviate convergence to local optima in local path planning. Second, Poisson coding is employed in the state space to deal with environmental uncertainty, so that the algorithm can discern relationships within long information sequences. Then, novel extrinsic and intrinsic reward functions are designed to avoid dynamic obstacles in the environment. Finally, the performance of the proposed algorithm is verified through simulations in ROS Gazebo and physical experiments in an unknown environment.
|
|
10:40-10:45, Paper WeAT11.3 | |
CueLearner: Bootstrapping and Local Policy Adaptation from Relative Feedback |
|
Schiavi, Giulio | ETH Zürich |
Cramariuc, Andrei | ETH Zurich |
Ott, Lionel | ETH Zurich |
Siegwart, Roland | ETH Zurich |
Keywords: Reinforcement Learning, Imitation Learning, Deep Learning Methods
Abstract: Human guidance has emerged as a powerful tool for enhancing reinforcement learning (RL). However, conventional forms of guidance such as demonstrations or binary scalar feedback can be challenging to collect or have low information content, motivating the exploration of other forms of human input. Among these, relative feedback (i.e., feedback on how to improve an action, such as “more to the left”) offers a good balance between usability and information richness. Previous research has shown that relative feedback can be used to enhance policy search methods. However, these efforts have been limited to specific policy classes and use feedback inefficiently. In this work, we introduce a novel method to learn from relative feedback and combine it with off-policy reinforcement learning. Through evaluations on two sparse-reward tasks, we demonstrate our method can be used to improve the sample efficiency of reinforcement learning by guiding its exploration process. Additionally, we show it can adapt a policy to changes in the environment or the user's preferences. Finally, we demonstrate real-world applicability by employing our approach to learn a navigation policy in a sparse reward setting.
|
|
10:45-10:50, Paper WeAT11.4 | |
From Learning to Mastery: Achieving Safe and Efficient Real-World Autonomous Driving with Human-In-The-Loop Reinforcement Learning |
|
Li, Zeqiao | School of Electrical and Information Engineering, Tianjin University
Wang, Yijing | Tianjin University |
Wang, Haoyu | Tianjin University |
Li, Zheng | Tianjin University |
Li, Peng | Tianjin University |
Liu, Wenfei | Tianjin University |
Zuo, Zhiqiang | Tianjin University |
Keywords: Reinforcement Learning, Learning from Demonstration, Intelligent Transportation Systems
Abstract: Autonomous driving with reinforcement learning (RL) has significant potential. However, applying RL in real-world settings remains challenging due to the need for safe, efficient, and robust learning. Incorporating human expertise into the learning process can help overcome these challenges by reducing risky exploration and improving sample efficiency. In this work, we propose a reward-free, active human-in-the-loop learning method called Human-Guided Distributional Soft Actor-Critic (H-DSAC). Our method combines Proxy Value Propagation (PVP) and Distributional Soft Actor-Critic (DSAC) to enable efficient and safe training in real-world environments. The key innovation is the construction of a distributed proxy value function within the DSAC framework. This function encodes human intent by assigning higher expected returns to expert demonstrations and penalizing actions that require human intervention. By extrapolating these labels to unlabeled states, the policy is effectively guided toward expert-like behavior. With a well-designed state space, our method achieves real-world driving policy learning within practical training times. Results from both simulation and real-world experiments demonstrate that our framework enables safe, robust, and sample-efficient learning for autonomous driving.
|
|
10:50-10:55, Paper WeAT11.5 | |
SimLauncher: Launching Sample-Efficient Real-World Robotic Reinforcement Learning Via Simulation Pre-Training |
|
Mingdong Wu, Aaron | Peking University |
Wu, Lehong | Peking University |
Wu, Yizhuo | Peking University |
Huang, Weiyao | Peking University |
Fan, Hongwei | Peking University |
Hu, Zheyuan | University of California, Berkeley |
Geng, Haoran | University of California, Berkeley |
Li, Jinzhou | Cornell University |
Ying, Jiahe | Peking University |
Yang, Long | Peking University |
Chen, Yuanpei | South China University of Technology |
Dong, Hao | Peking University |
Keywords: Reinforcement Learning, Dexterous Manipulation, Deep Learning in Grasping and Manipulation
Abstract: Autonomous learning of dexterous, long-horizon robotic skills has been a longstanding pursuit of embodied AI. Recent advances in robotic reinforcement learning (RL) have demonstrated remarkable performance and robustness in real-world visuomotor control tasks. However, applying RL in the real world faces challenges such as low sample efficiency, slow exploration, and significant reliance on human intervention. In contrast, simulators offer a safe and efficient environment for extensive exploration and data collection, while the visual sim-to-real gap, often a limiting factor, can be mitigated using real-to-sim techniques. Building on these, we propose SimLauncher, a novel framework that combines the strengths of real-world RL and real-to-sim-to-real approaches to overcome these challenges. Specifically, we first pre-train policies in digital twin simulation environments, which then benefit real-world RL in two ways: (1) bootstrapping target values using real-world demonstrations derived from simulation policy rollouts and extensive simulation demonstrations, and (2) accelerating exploration using the pre-trained policy. We conduct comprehensive experiments across long-horizon, contact-rich, and dexterous hand manipulation tasks. Compared to prior real-world RL approaches, SimLauncher significantly improves training efficiency and achieves near-perfect success rates. We hope this work serves as a proof-of-concept and inspires further research on leveraging large-scale simulation pre-training to benefit real-world robotic learning.
|
|
10:55-11:00, Paper WeAT11.6 | |
GRaD-Nav: Efficiently Learning Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics |
|
Chen, Qianzhong | Stanford University |
Sun, Jiankai | Stanford University |
Gao, Naixiang | Stanford University |
Low, JunEn | Stanford University |
Chen, Timothy | Stanford University |
Schwager, Mac | Stanford University |
Keywords: Reinforcement Learning, Vision-Based Navigation, Aerial Systems: Perception and Autonomy
Abstract: Autonomous visual navigation is an essential element of robot autonomy. Reinforcement learning (RL) offers a promising policy training paradigm. However, existing RL methods suffer from high sample complexity, poor sim-to-real transfer, and limited runtime adaptability. These problems are particularly challenging for drones, with complex nonlinear and unstable dynamics and strong dynamic coupling between control and perception. In this paper, we propose a novel framework that integrates 3D Gaussian Splatting (3DGS) with differentiable deep reinforcement learning (DDRL) to train vision-based drone navigation policies. By leveraging high-fidelity 3D scene representations and differentiable simulation, our method improves sample efficiency and sim-to-real transfer. Additionally, we incorporate a Context-aided Estimator Network (CENet) to adapt to environmental variations at runtime. Moreover, by curriculum training in a mixture of different surrounding environments, we achieve in-task generalization, the ability to solve new instances of a task not seen during training. Drone hardware experiments demonstrate our method's high training efficiency compared to state-of-the-art RL methods, zero-shot sim-to-real transfer for real robot deployment without fine-tuning, and the ability to adapt to new instances within the same task class (e.g., to fly through a gate at different locations with different distractors in the environment). Our simulator and training framework are open-sourced at: https://github.com/Qianzhong-Chen/grad_nav.
|
|
11:00-11:05, Paper WeAT11.7 | |
Multitask Reinforcement Learning for Quadcopter Attitude Stabilization and Tracking Using Graph Policy |
|
Liu, Yu Tang | Max Planck Institute for Intelligent Systems
Vale, Afonso | Instituto Superior Técnico |
Ahmad, Aamir | University of Stuttgart |
Ventura, Rodrigo | Instituto Superior Técnico |
Basiri, Meysam | Instituto Superior Técnico |
Keywords: Reinforcement Learning, Aerial Systems: Mechanics and Control
Abstract: Quadcopter attitude control involves two tasks: smooth attitude tracking and aggressive stabilization from arbitrary states. Although both can be formulated as tracking problems, their distinct state spaces and control strategies complicate a unified reward function. We propose a multitask deep reinforcement learning framework that leverages parallel simulation with IsaacGym and a Graph Convolutional Network (GCN) policy to address both tasks effectively. Our multitask Soft Actor-Critic (SAC) approach achieves faster, more reliable learning and higher sample efficiency than single-task methods. We validate its real-world applicability by deploying the learned policy, a compact two-layer network with 24 neurons per layer, on a Pixhawk flight controller, achieving 400 Hz control without extra computational resources. We provide our code at https://github.com/robot-perception-group/GraphMTSAC_UAV/.
|
|
11:05-11:10, Paper WeAT11.8 | |
Reinforcement Learning of Flexible Policies for Symbolic Instructions with Adjustable Mapping Specifications |
|
Hatanaka, Wataru | Ricoh Company, Ltd |
Yamashina, Ryota | Ricoh Company, Ltd |
Matsubara, Takamitsu | Nara Institute of Science and Technology |
Keywords: Reinforcement Learning, Integrated Planning and Learning
Abstract: Symbolic task representation is a powerful tool for encoding human instructions and domain knowledge. Such instructions guide robots to accomplish diverse objectives and meet constraints through reinforcement learning (RL). Most existing methods are based on fixed mappings from environmental states to symbols. However, in inspection tasks, where equipment conditions must be evaluated from multiple perspectives to avoid errors of oversight, robots must fulfill the same symbol from different states. To help robots respond to flexible symbol mappings, we propose representing symbols and their mapping specifications separately within an RL policy. This approach requires the RL policy to learn combinations of symbolic instructions and mapping specifications, demanding an efficient learning framework. To cope with this issue, we introduce an approach for learning flexible policies called Symbolic Instructions with Adjustable Mapping Specifications (SIAMS). This paper represents symbolic instructions using linear temporal logic (LTL), a formal language that can be easily integrated into RL. Our method addresses the diversified completion patterns of instructions by (1) a specification-aware state modulation, which embeds differences in mapping specifications into state features, and (2) a symbol-number-based task curriculum, which gradually provides tasks according to learning progress. Evaluations in 3D simulations with discrete and continuous action spaces demonstrate that our method outperforms context-aware multitask RL comparisons.
|
|
WeAT12 |
311B |
Robotic Imitation Learning 1 |
Regular Session |
|
10:30-10:35, Paper WeAT12.1 | |
Gaze-Guided Task Decomposition for Imitation Learning in Robotic Manipulation |
|
Takizawa, Ryo | The University of Tokyo |
Ohmura, Yoshiyuki | The University of Tokyo |
Kuniyoshi, Yasuo | The University of Tokyo |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation, Learning from Demonstration
Abstract: In imitation learning for robotic manipulation, decomposing object manipulation tasks into sub-tasks enables the reuse of learned skills and the combination of learned behaviors to perform novel tasks, rather than simply replicating demonstrated motions. Human gaze is closely linked to hand movements during object manipulation. We hypothesize that an imitating agent's gaze control, fixating on specific landmarks and transitioning between them, simultaneously segments demonstrated manipulations into sub-tasks. This study proposes a simple yet robust task decomposition method based on gaze transitions. We use teleoperation, a common modality for collecting demonstrations in robotic manipulation, in which a human operator's gaze is measured and used for task decomposition as a substitute for the imitating agent's gaze. Our approach ensures consistent task decomposition across all demonstrations for each task, which is desirable in contexts such as machine learning. We evaluated the method across demonstrations of various tasks, assessing the characteristics and consistency of the resulting sub-tasks. Furthermore, extensive testing across different hyperparameter settings confirmed its robustness, making it adaptable to diverse robotic systems. Our code is available at https://github.com/crumbyRobotics/GazeTaskDecomp.
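A minimal sketch of gaze-transition segmentation as the abstract describes (the thresholds and fixation criterion are illustrative assumptions): a gaze trajectory is split into sub-tasks wherever the fixation point jumps by more than a threshold.

```python
import numpy as np

def segment_by_gaze(gaze, jump_thresh=0.15, min_len=10):
    """gaze: (T, 2) normalized gaze coordinates over one demonstration.
    Returns a list of (start, end) index pairs, one per sub-task."""
    jumps = np.linalg.norm(np.diff(gaze, axis=0), axis=1) > jump_thresh
    boundaries = [0]
    for t in np.flatnonzero(jumps):
        if t + 1 - boundaries[-1] >= min_len:   # ignore spuriously short segments
            boundaries.append(t + 1)
    boundaries.append(len(gaze))
    return list(zip(boundaries[:-1], boundaries[1:]))

segments = segment_by_gaze(np.cumsum(0.01 * np.random.randn(500, 2), axis=0))
```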
|
|
10:35-10:40, Paper WeAT12.2 | |
Robust Instant Policy: Leveraging Student's T-Regression Model for Robust In-Context Imitation Learning of Robot Manipulation |
|
Oh, Hanbit | National Institute of Advanced Industrial Science and Technology |
Salcedo Vazquez, Andrea Marisol | National Institute of Advanced Industrial Science and Technology |
Ramirez-Alpizar, Ixchel Georgina | National Institute of Advanced Industrial Science and Technology |
Domae, Yukiyasu | National Institute of Advanced Industrial Science and Technology
Keywords: Imitation Learning, Learning from Demonstration
Abstract: Imitation learning (IL) aims to enable robots to perform tasks autonomously by observing a few human demonstrations. Recently, a variant of IL, called In-Context IL, has utilized off-the-shelf large language models (LLMs) as instant policies that understand the context from a few given demonstrations to perform a new task, rather than explicitly updating network models with large-scale demonstrations. However, its reliability in the robotics domain is undermined by hallucination: the LLM-based instant policy occasionally generates poor trajectories that deviate from the given demonstrations. To alleviate this problem, we propose a new robust in-context imitation learning algorithm called the robust instant policy (RIP), which utilizes a Student's t-regression model that is robust against hallucinated trajectories, allowing reliable trajectory generation. Specifically, RIP generates several candidate robot trajectories for a given task from an LLM and aggregates them using the Student's t-distribution, which is effective at ignoring outliers (i.e., hallucinations); thereby, a trajectory robust to hallucinations is generated. Our experiments, conducted in both simulated and real-world environments, show that RIP significantly outperforms state-of-the-art IL methods, with at least a 26% improvement in task success rates, particularly in low-data scenarios for everyday tasks. Video results are available at https://sites.google.com/view/robustinstantpolicy.
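A minimal sketch of robust trajectory aggregation with a Student's t model, in the spirit of RIP (the degrees of freedom, per-waypoint scale proxy, and iteration count are illustrative assumptions, not the authors' formulation): candidate trajectories are combined with an iteratively reweighted mean, the EM location estimate of a t-distribution, which automatically down-weights hallucinated outliers.

```python
import numpy as np

def t_robust_mean(candidates, dof=3.0, iters=20):
    """candidates: (M, T, D), i.e. M candidate trajectories of T waypoints in D dims.
    Returns a (T, D) aggregated trajectory."""
    mu = candidates.mean(axis=0)
    for _ in range(iters):
        sq = ((candidates - mu) ** 2).sum(axis=-1)      # (M, T) squared residuals
        scale = sq.mean(axis=0, keepdims=True) + 1e-9   # crude per-waypoint scale
        w = (dof + 1.0) / (dof + sq / scale)            # E-step: outliers get low weight
        mu = (w[..., None] * candidates).sum(axis=0) / w.sum(axis=0)[..., None]
    return mu

trajs = np.random.randn(8, 50, 7)    # e.g. 8 LLM-sampled candidates, 50 steps, 7 DoF
robust_traj = t_robust_mean(trajs)
```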
|
|
10:40-10:45, Paper WeAT12.3 | |
Towards Surgical Task Automation: Actor-Critic Models Meet Self-Supervised Imitation Learning |
|
Liu, Jingshuai | University of Edinburgh |
Andres, Alain | TECNALIA, Basque Research and Technology Alliance (BRTA) |
Jiang, Yonghang | University of Strathclyde |
Du, Yuning | University of Edinburgh |
Luo, Xichun | University of Strathclyde |
Shu, Wenmiao | University of Strathclyde |
Pu, Can | Chongqing University |
Tsaftaris, Sotirios | The University of Edinburgh |
Keywords: Imitation Learning, Learning from Demonstration, Reinforcement Learning
Abstract: Surgical robot task automation has recently attracted great attention due to its potential to benefit both surgeons and patients. Reinforcement learning (RL) based approaches have demonstrated promising ability to perform automated surgical manipulations on various tasks. To address the exploration challenge, expert demonstrations can be utilized to improve learning efficiency via imitation learning (IL) approaches. However, the success of such methods normally relies on both states and action labels. Unfortunately, action labels can be hard to capture, or their manual annotation is prohibitively expensive owing to the requirement for expert knowledge. Emulating expert behaviour using noisy or inaccurate labels poses significant risks, including unintended surgical errors that may result in patient discomfort or, in more severe cases, tissue damage. It therefore remains an appealing and open problem to leverage expert data composed purely of states in RL. In this work, we present an actor-critic RL framework, termed AC-SSIL, to overcome the challenge of improving the learning process with state-only demonstrations collected by an unknown expert policy. It adopts a self-supervised IL method, dubbed SSIL, to effectively incorporate expert states into RL paradigms by retrieving from the demonstrations the nearest neighbours of the query state and utilizing the bootstrapping of actor networks. It applies similarity-based regularization and improves its prediction capacity jointly with the actor network. We showcase through experiments on an open-source surgical simulation platform that our method delivers remarkable improvements over the RL baseline and exhibits performance comparable to action-based IL methods, which implies the efficacy and potential of our method for expert demonstration-guided learning scenarios. Code will be made publicly available at https://github.com/Jingshuai-cqu/AC-SSIL.
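A minimal sketch of the nearest-neighbour retrieval step the abstract mentions (the kNN backend and distance are illustrative assumptions): given a query state, the closest states in the state-only expert demonstrations are retrieved and can then serve as targets for a similarity-based regularizer on the actor.

```python
import numpy as np
from scipy.spatial import cKDTree

class DemoRetriever:
    """Hypothetical retriever over state-only expert demonstrations."""
    def __init__(self, demo_states):
        self.states = np.asarray(demo_states)   # (N, state_dim), no action labels
        self.tree = cKDTree(self.states)

    def neighbours(self, query, k=5):
        dist, idx = self.tree.query(query, k=k)
        return self.states[idx], dist           # (k, state_dim), (k,)

retriever = DemoRetriever(np.random.randn(5000, 16))
nbrs, dists = retriever.neighbours(np.random.randn(16), k=5)
```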
|
|
10:45-10:50, Paper WeAT12.4 | |
Symmetry-Guided Multi-Agent Inverse Reinforcement Learning |
|
Tian, Yongkai | Beihang University |
Qi, Yirong | Beihang University |
Yu, Xin | Beihang University |
Wu, Wenjun | Beihang University |
Luo, Jie | Beihang University |
Keywords: Imitation Learning, Multi-Robot Systems, Learning from Demonstration
Abstract: In robotic systems, the performance of reinforcement learning depends on the rationality of predefined reward functions. However, manually designed reward functions often lead to policy failures due to inaccuracies. Inverse Reinforcement Learning (IRL) addresses this problem by inferring implicit reward functions from expert demonstrations. Nevertheless, existing methods rely heavily on large amounts of expert demonstrations to accurately recover the reward function. The high cost of collecting expert demonstrations in robotic applications, particularly in multi-robot systems, severely hinders the practical deployment of IRL. Consequently, improving sample efficiency has emerged as a critical challenge in multi-agent inverse reinforcement learning (MIRL). Inspired by the symmetry inherent in multi-agent systems, this work theoretically demonstrates that leveraging symmetry enables the recovery of more accurate reward functions. Building upon this insight, we propose a universal framework that integrates symmetry into existing multi-agent adversarial IRL algorithms, thereby significantly enhancing sample efficiency. Experimental results from multiple challenging tasks have demonstrated the effectiveness of this framework. Further validation in physical multi-robot systems has shown the practicality of our method.
|
|
10:50-10:55, Paper WeAT12.5 | |
RECON: Reducing Causal Confusion with Human-Placed Markers |
|
Ramirez Sanchez, Robert | Virginia Tech |
Nemlekar, Heramb | Virginia Tech |
Sagheb, Shahabedin | Virginia Tech |
Nunez, Cara M. | Cornell University |
Losey, Dylan | Virginia Tech |
Keywords: Learning from Demonstration, Imitation Learning, Human-Robot Collaboration
Abstract: Imitation learning enables robots to learn new tasks from human examples. One fundamental limitation while learning from humans is causal confusion. Causal confusion occurs when the robot's observations include both task-relevant and extraneous information: for instance, a robot's camera might see not only the intended goal, but also clutter and changes in lighting within its environment. Because the robot does not know which aspects of its observations are important a priori, it often misinterprets the human's examples and fails to learn the desired task. To address this issue, we highlight that, while the robot learner may not know what to focus on, the human teacher does. In this paper we propose that the human proactively marks key parts of their task with small, lightweight beacons. Under our framework (RECON) the human attaches these beacons to task-relevant objects before providing demonstrations: as the human shows examples of the task, beacons track the position of marked objects. We then harness this offline beacon data to train a task-relevant state embedding. Specifically, we embed the robot's observations into a latent state that is correlated with the measured beacon readings: in practice, this causes the robot to autonomously filter out extraneous observations and make decisions based on features learned from the beacon data. Our simulations and a real robot experiment suggest that this framework for human-placed beacons mitigates causal confusion. Indeed, we find that using RECON significantly reduces the number of demonstrations needed to convey the task, lowering the overall time required for human teaching.
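A minimal sketch of the beacon-correlated embedding idea (the architecture and loss are illustrative assumptions, not the authors' training setup): an image encoder is trained so its latent state predicts the measured beacon positions, pushing the embedding toward task-relevant features and away from distractors.

```python
import torch
import torch.nn as nn

class BeaconEmbedding(nn.Module):
    """Hypothetical encoder whose latent regresses offline beacon readings."""
    def __init__(self, latent_dim=32, n_beacons=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, latent_dim))
        self.beacon_head = nn.Linear(latent_dim, 3 * n_beacons)  # xyz per beacon

    def loss(self, image, beacon_xyz):
        z = self.encoder(image)                 # latent later consumed by the policy
        return nn.functional.mse_loss(self.beacon_head(z), beacon_xyz.flatten(1))

model = BeaconEmbedding()
l = model.loss(torch.randn(4, 3, 96, 96), torch.randn(4, 2, 3))
```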
|
|
10:55-11:00, Paper WeAT12.6 | |
UF-RNN: Real-Time Adaptive Motion Generation Using Uncertainty-Driven Foresight Prediction |
|
Hiruma, Hyogo | Waseda University / Hitachi, Ltd |
Ito, Hiroshi | Hitachi, Ltd. / Waseda University |
Ogata, Tetsuya | Waseda University |
Keywords: Imitation Learning, Bioinspired Robot Learning, Representation Learning
Abstract: Training robots to operate effectively in environments with uncertain states—such as ambiguous object properties or unpredictable interactions—remains a longstanding challenge in robotics. Imitation learning methods typically rely on successful examples and often neglect failure scenarios where uncertainty is most pronounced. To address this limitation, we propose the Uncertainty-driven Foresight Recurrent Neural Network (UF-RNN), a model that combines standard time-series prediction with an active “Foresight” module. This module performs internal simulations of multiple future trajectories and refines the hidden state to minimize predicted variance, enabling the model to selectively explore actions under high uncertainty. We evaluate UF-RNN on a door-opening task in both simulation and a real-robot setting, demonstrating that, despite the absence of explicit failure demonstrations, the model exhibits robust adaptation by leveraging self-induced chaotic dynamics in its latent space. When guided by the Foresight module, these chaotic properties stimulate exploratory behaviors precisely when the environment is ambiguous, yielding improved success rates compared to conventional stochastic RNN baselines. These findings suggest that integrating uncertainty-driven foresight into imitation learning pipelines can significantly enhance a robot’s ability to handle unpredictable real-world conditions.
|
|
11:00-11:05, Paper WeAT12.7 | |
Learning Manipulation Skills through Robot Chain-Of-Thought with Sparse Failure Guidance |
|
Zhang, Kaifeng | Shanghai Qi Zhi Institute |
Yin, Zhao-Heng | University of California, Berkeley |
Ye, Weirui | Tsinghua University |
Gao, Yang | Tsinghua University |
Keywords: Imitation Learning, Reinforcement Learning, Manipulation Planning
Abstract: Reward engineering for policy learning has been a long-standing challenge in robotics. Recently, to avoid manual reward design, vision-language models (VLMs) have shown promise in defining rewards for teaching robots manipulation skills. However, existing work often provides reward guidance that is too coarse, leading to insufficient learning. In this paper, we address this issue by providing more fine-grained reward guidance. Specifically, we decompose tasks into simpler sub-tasks, using this decomposition to offer more informative reward guidance with VLMs. We also propose a VLM-based self-imitation learning process to speed up learning. Empirical evidence demonstrates that our algorithm consistently outperforms baselines such as CLIP, LIV, and RoboCLIP. Specifically, our algorithm achieves 5.4 times higher average success rates compared to the best baseline, RoboCLIP, across a series of manipulation tasks.
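A minimal sketch of a CLIP-style sub-task reward in the spirit of the fine-grained guidance described above (using OpenAI's clip package; the prompts and scoring are illustrative assumptions, not the authors' exact formulation): each sub-task gets a text prompt, and the reward for the active sub-task is the similarity between the current camera frame and that prompt.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical sub-task decomposition of a drawer-opening task.
subtasks = ["a robot gripper reaching toward a drawer handle",
            "a robot gripper grasping a drawer handle",
            "a robot pulling a drawer open"]
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(subtasks).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def subtask_reward(frame_pil, subtask_idx):
    """frame_pil: current camera frame as a PIL image."""
    img = preprocess(frame_pil).unsqueeze(0).to(device)
    with torch.no_grad():
        emb = model.encode_image(img)
        emb = emb / emb.norm(dim=-1, keepdim=True)
    return (emb @ text_emb[subtask_idx:subtask_idx + 1].T).item()
```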
|
|
11:05-11:10, Paper WeAT12.8 | |
Deep Predictive Learning with Proprioceptive and Visual Attention for Humanoid Robot Repositioning Assistance |
|
Miyake, Tamon | Waseda University |
Saito, Namiko | Microsoft Research Asia - Tokyo |
Ogata, Tetsuya | Waseda University |
Wang, Yushi | Waseda University |
Sugano, Shigeki | Waseda University |
Keywords: Dual Arm Manipulation, Imitation Learning, Physically Assistive Devices
Abstract: Caregiving is a vital role for domestic robots, and repositioning care in particular has immense societal value, critically improving the health and quality of life of individuals with limited mobility. However, the repositioning task is a challenging area of research, as it requires robots to adapt their motions while interacting flexibly with patients. The task involves several key challenges: (1) applying appropriate force to specific target areas; (2) performing multiple actions seamlessly, each requiring a different force application policy; and (3) adapting under uncertain positional conditions. To address these, we propose a deep neural network (DNN)-based architecture utilizing proprioceptive and visual attention mechanisms, along with impedance control to regulate the robot's movements. Using the dual-arm humanoid robot Dry-AIREC, the proposed model successfully generated motions to insert the robot's hand between the bed and a mannequin's back without applying excessive force, and it supported the transition from a supine to a lifted-up position.
|
|
WeAT13 |
311C |
Deep Learning for Visual Perception 5 |
Regular Session |
|
10:30-10:35, Paper WeAT13.1 | |
Distilling 3D Distinctive Local Descriptors for 6D Pose Estimation |
|
Hamza, Amir | Fondazione Bruno Kessler (FBK) |
Caraffa, Andrea | Fondazione Bruno Kessler |
Boscaini, Davide | Fondazione Bruno Kessler |
Poiesi, Fabio | Fondazione Bruno Kessler |
Keywords: Deep Learning for Visual Perception, Deep Learning Methods, RGB-D Perception
Abstract: Three-dimensional local descriptors are crucial for encoding geometric surface properties, making them essential for various point cloud understanding tasks. Among these descriptors, GeDi has demonstrated strong zero-shot 6D pose estimation capabilities but remains computationally impractical for real-world applications due to its expensive inference process. Can we retain GeDi's effectiveness while significantly improving its efficiency? In this paper, we explore this question by introducing a knowledge distillation framework that trains an efficient student model to regress local descriptors from a GeDi teacher. Our key contributions include: an efficient large-scale training procedure that ensures robustness to occlusions and partial observations while operating under compute and storage constraints, and a novel loss formulation that handles weak supervision from non-distinctive teacher descriptors. We validate our approach on five BOP Benchmark datasets and demonstrate a significant reduction in inference time while maintaining competitive performance with existing methods, bringing zero-shot 6D pose estimation closer to real-time feasibility.
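A minimal sketch of descriptor distillation in this spirit (the distinctiveness weighting is an illustrative assumption, not the paper's exact loss): a student regresses teacher descriptors per point, and points whose teacher descriptor is non-distinctive, here proxied by its similarity to the other descriptors in the batch, are down-weighted.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_desc, teacher_desc):
    """student_desc, teacher_desc: (N, D) local descriptors for N points."""
    s = F.normalize(student_desc, dim=-1)
    t = F.normalize(teacher_desc, dim=-1)
    # Distinctiveness proxy: 1 minus mean similarity to the other teacher descriptors.
    sim = t @ t.T
    sim.fill_diagonal_(0)
    distinct = (1.0 - sim.mean(dim=1)).clamp(min=0)
    per_point = 1.0 - (s * t).sum(dim=-1)        # cosine distance to the teacher
    return (distinct * per_point).sum() / (distinct.sum() + 1e-9)

loss = distill_loss(torch.randn(1024, 64), torch.randn(1024, 64))
```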
|
|
10:35-10:40, Paper WeAT13.2 | |
Anomaly Knowledge Learning for Patch-Agnostic Defense against Adversarial Patches |
|
Mu, Hongmin | Beijing University of Chemical Technology |
Cao, Zhengcai | Harbin Institute of Technology |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization, Semantic Scene Understanding
Abstract: Adversarial patch defense has made significant progress recently, but defending against natural-looking patches remains a challenge due to their content-agnostic nature. We hypothesize that these patches exhibit position-related anomalies, which inspires us to draw on anomaly detection techniques. However, directly applying existing anomaly detection methods to patch detection performs poorly, because patches are only a subset of anomalies and using general anomalous features may introduce irrelevant anomalies. Additionally, anomaly detection datasets are primarily derived from industrial scenes, leading to out-of-distribution issues. To address these challenges, we propose a patch-agnostic defense method based on anomaly knowledge learning. It fine-tunes the Segment Anything Model in a self-supervised manner using an anomaly dataset, enabling the model's image encoder to generate embeddings with enhanced activation in anomalous regions. It also designs a Cross Attention Patch Decoder based on cross-modal attention mechanisms to compute the mutual information between patch prediction probability maps and anomaly activation maps for patch localization. Experimental results on public datasets demonstrate the effectiveness of the proposed method.
|
|
10:40-10:45, Paper WeAT13.3 | |
GroupLane: End-To-End 3D Lane Detection with Channel-Wise Grouping |
|
Li, Zhuoling | The University of Hong Kong |
Han, Chunrui | MEGVII Technolegy |
Ge, Zheng | Waseda University |
Yang, Jinrong | Huazhong University of Science and Technology |
Yu, En | Huazhong University of Science and Technology |
Wang, Haoqian | Tsinghua University |
Zhang, Xiangyu | Megvii Technology |
Zhao, Hengshuang | The University of Hong Kong |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization, Intelligent Transportation Systems
Abstract: Efficiency is quite important for 3D lane detection, yet previous detectors are either computationally expensive or difficult to optimize. To bridge this gap, we propose a fully convolutional detector named GroupLane, which is simple, fast, and still maintains high detection precision. Specifically, we first propose to split the extracted feature into multiple groups along the channel dimension and employ every group to represent a prediction. In this way, GroupLane realizes end-to-end detection like DETR based on pure convolutional neural networks. Then, we propose to represent lanes by performing row-wise classification in bird's eye view and devise a set of detection heads. Compared with existing row-wise classification implementations that only support recognizing vertical lanes, ours can detect both vertical and horizontal ones. Additionally, a matching algorithm named single-win one-to-one matching is developed to associate predictions with labels during training. Extensive experiments are conducted to verify the effectiveness of the proposed strategies, and the results suggest that GroupLane achieves the best performance with high inference speed on both the popular OpenLane and Once-3DLanes benchmarks. In addition, GroupLane is the first fully convolutional 3D lane detector that achieves end-to-end detection without post-processing.
|
|
10:45-10:50, Paper WeAT13.4 | |
PosePilot: Steering Camera Pose for Generative World Models with Self-Supervised Depth |
|
Jin, Bu | Institute of Automation, Chinese Academy of Sciences |
Li, Weize | Tsinghua University |
Yang, Baihan | Beijing Jiaotong University |
Zhu, Zhenxin | Beihang University |
Jiang, Junpeng | Harbin Institute of Technology (Shenzhen) |
Gao, Huan-ang | Tsinghua University |
Sun, Haiyang | LiAuto Inc |
Zhan, Kun | LiAuto |
Hu, Hengtong | LiAuto.com |
Zhang, Xueyang | Li Auto Inc |
Jia, Peng | Li Auto |
Zhao, Hao | Tsinghua University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation
Abstract: Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and ego-motion readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and auto-regressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.
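A minimal sketch of a pose-aware photometric warping loss of the kind the abstract describes, following standard self-supervised depth practice (the intrinsics handling and plain L1 photometric term are illustrative assumptions): source pixels are warped into the target view using predicted depth and relative pose, and compared photometrically.

```python
import torch
import torch.nn.functional as F

def photometric_warp_loss(src, tgt, depth_tgt, T_tgt_to_src, K):
    """src, tgt: (B, 3, H, W) frames; depth_tgt: (B, 1, H, W) predicted depth;
    T_tgt_to_src: (B, 4, 4) predicted relative pose; K: (B, 3, 3) intrinsics."""
    B, _, H, W = tgt.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()       # (3, H, W)
    pix = pix.reshape(1, 3, -1).expand(B, -1, -1)
    cam = torch.linalg.inv(K) @ pix * depth_tgt.reshape(B, 1, -1)     # backproject
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], 1)              # homogeneous
    src_cam = (T_tgt_to_src @ cam_h)[:, :3]                           # into source frame
    src_pix = K @ src_cam
    src_pix = src_pix[:, :2] / src_pix[:, 2:3].clamp(min=1e-6)        # perspective divide
    grid = torch.stack([src_pix[:, 0] / (W - 1) * 2 - 1,              # normalize to [-1, 1]
                        src_pix[:, 1] / (H - 1) * 2 - 1], dim=-1).reshape(B, H, W, 2)
    warped = F.grid_sample(src, grid, align_corners=True)
    return (warped - tgt).abs().mean()                                # L1 photometric error
```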
|
|
10:50-10:55, Paper WeAT13.5 | |
LiteFat: Lightweight Spatio-Temporal Graph Learning for Real-Time Driver Fatigue Detection |
|
Ren, Jing | RMIT University |
Ma, Suyu | CSIRO's Data61 |
Jia, Hong | University of Auckland |
Xu, Xiwei | CSIRO's Data61 |
Lee, Ivan | University of South Australia |
Fayek, Haytham | RMIT University |
Li, Xiaodong | RMIT University |
Xia, Feng | RMIT University |
Keywords: Deep Learning for Visual Perception, Surveillance Robotic Systems, Computer Vision for Transportation
Abstract: Detecting driver fatigue is critical for road safety, as drowsy driving remains a leading cause of traffic accidents. Many existing solutions rely on computationally demanding deep learning models, which result in high latency and are unsuitable for resource-limited embedded robotic devices (such as intelligent vehicles) where rapid detection is necessary to prevent accidents. This paper introduces LiteFat, a lightweight spatio-temporal graph learning model designed to detect driver fatigue efficiently while maintaining high accuracy and low computational demands. LiteFat converts streaming video data into spatio-temporal graphs (STG) using facial landmark detection, which focuses on key motion patterns and reduces unnecessary data processing. It uses MobileNet to extract facial features and create a feature matrix for the STG. A lightweight spatio-temporal graph neural network is then employed to identify signs of fatigue with minimal processing and low latency. Experimental results on benchmark datasets show that LiteFat performs competitively while significantly reducing computational complexity and latency compared to state-of-the-art methods. This work advances the development of real-time, resource-efficient human fatigue detection systems that can be deployed on embedded robotic devices.
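As a rough illustration of the STG construction step (the landmark count and edge list below are assumptions, not the paper's exact graph), landmarks per frame become nodes, with intra-frame spatial edges and frame-to-frame temporal edges:

    import numpy as np

    def build_stg(landmarks, spatial_edges):
        # landmarks: (T, L, 2) facial landmark coordinates over T frames.
        T, L, _ = landmarks.shape
        N = T * L
        A = np.zeros((N, N), dtype=np.float32)
        for t in range(T):
            base = t * L
            for i, j in spatial_edges:            # intra-frame spatial edges
                A[base + i, base + j] = A[base + j, base + i] = 1.0
            if t + 1 < T:                         # temporal edges: same landmark
                for i in range(L):                # in consecutive frames
                    A[base + i, base + L + i] = A[base + L + i, base + i] = 1.0
        np.fill_diagonal(A, 1.0)                  # self-loops for GNN stability
        X = landmarks.reshape(N, -1)              # node features (MobileNet
        return X, A                               # features would go here)

    X, A = build_stg(np.random.rand(16, 68, 2), [(36, 39), (42, 45), (48, 54)])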
|
|
10:55-11:00, Paper WeAT13.6 | |
M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception |
|
Udugama Vithanage, Bavantha Lakshan Udugama | University of Twente |
Vosselman, George | University of Twente |
Nex, Francesco | University of Twente |
Keywords: Deep Learning for Visual Perception, Deep Learning Methods, Semantic Scene Understanding
Abstract: Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. In this paper, we introduce Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the backbone for monocular spatial perception systems, including frameworks for 3D scene graph construction in dynamic environments. Comprehensive evaluations demonstrate that M2H outperforms state-of-the-art (SOTA) multi-task models on NYUDv2, exceeds single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond curated benchmarks, we validate M2H on real-world data, demonstrating its practicality in spatial perception tasks. We provide our implementation and pretrained models at https://github.com/UAV-Centre-ITC/M2H.git.
|
|
11:00-11:05, Paper WeAT13.7 | |
Spectral-Temporal Attention for Robust Change Detection |
|
Thakur, Mayank | Indian Institute of Technology, Mandi |
Sharma, Radhe Shyam | IIT Mandi |
Kurmi, Vinod | IISER Bhopal |
Samant, Raj | Nvidia Graphics Private Limited |
Patro, BADRI Narayana | Microsoft |
Keywords: Deep Learning for Visual Perception, AI-Based Methods, Data Sets for Robotic Vision
Abstract: Change detection has long been used for a wide range of tasks. With advancements in robotic systems and computer vision, change detection techniques can be further explored for diverse applications. Current state-of-the-art methods primarily use either satellite images or street-level images to detect changes. However, the techniques used for these two types of images differ substantially, even though their core objective remains identical. We introduce a spectral-temporal attention network capable of detecting changes in both satellite and street-level images across multiple temporal instances. Additionally, we present an indoor environment dataset featuring significantly more frequent changes. We analyze the impact of temporal and spatial domain shifts on the performance of various methods and demonstrate that performing attention in the spectral domain not only enhances overall performance but also increases robustness against spatial domain shifts.
|
|
11:05-11:10, Paper WeAT13.8 | |
VAPO: Visibility-Aware Keypoint Localization for Efficient 6DoF Object Pose Estimation |
|
Lian, Ruyi | Stony Brook University |
Lin, Yuewei | Brookhaven National Laboratory |
Latecki, Longin Jan | Temple University |
Ling, Haibin | Stony Brook University |
Keywords: Deep Learning for Visual Perception, Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation
Abstract: Localizing predefined 3D keypoints in a 2D image is an effective way to establish 3D-2D correspondences for instance-level 6DoF object pose estimation. However, unreliable localization results for invisible keypoints degrade the quality of correspondences. In this paper, we address this issue by weighting keypoints according to their visibility. Since keypoint visibility information is currently missing in the dataset collection process, we propose an efficient way to generate binary visibility labels from available object-level annotations, for keypoints of both asymmetric and symmetric objects. We further derive real-valued visibility-aware importance from the binary labels based on the PageRank algorithm. Taking advantage of the flexibility of our visibility-aware importance, we construct VAPO (Visibility-Aware POse estimator) by integrating the visibility-aware importance into a state-of-the-art pose estimation algorithm, along with additional positional encoding. VAPO works in both CAD-based and CAD-free settings. Extensive experiments on popular pose estimation benchmarks, including Linemod, Linemod-Occlusion, and YCB-V, demonstrate that VAPO clearly achieves state-of-the-art performance. Project page: https://github.com/RuyiLian/VAPO.
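One plausible reading of the PageRank step is sketched below, under the assumption that keypoints are linked by how often they are co-visible across images (the paper's exact graph construction may differ):

    import numpy as np

    def pagerank_importance(vis, d=0.85, iters=100):
        V = vis.astype(float)              # (num_images, num_keypoints) binary labels
        W = V.T @ V                        # co-visibility counts between keypoints
        np.fill_diagonal(W, 0.0)
        K = W.shape[0]
        col = W.sum(axis=0, keepdims=True)
        P = np.divide(W, col, out=np.full_like(W, 1.0 / K), where=col > 0)
        r = np.full(K, 1.0 / K)
        for _ in range(iters):             # power iteration
            r = (1 - d) / K + d * (P @ r)
        return r / r.sum()                 # real-valued importance per keypoint

    importance = pagerank_importance(np.random.rand(500, 8) < 0.7)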
|
|
WeAT14 |
311D |
Learning from Demonstration 1 |
Regular Session |
|
10:30-10:35, Paper WeAT14.1 | |
Versatile Demonstration Interface: Toward More Flexible Robot Demonstration Collection |
|
Hagenow, Michael | Massachusetts Institute of Technology |
Kontogiorgos, Dimosthenis | Massachusetts Institute of Technology |
Wang, Yanwei | MIT |
Shah, Julie A. | MIT |
Keywords: Human Factors and Human-in-the-Loop, Learning from Demonstration
Abstract: Previous methods for Learning from Demonstration leverage several approaches for a human to teach motions to a robot, including teleoperation, kinesthetic teaching, and natural demonstrations. However, little previous work has explored more general interfaces that allow for multiple demonstration types. Given the varied preferences of human demonstrators and task characteristics, a flexible tool that enables multiple demonstration types could be crucial for broader robot skill training. In this work, we propose the Versatile Demonstration Interface (VDI), an attachment for collaborative robots that simplifies the collection of three common types of demonstrations. Designed for flexible deployment in industrial settings, our tool requires no additional instrumentation of the environment. Our prototype interface captures human demonstrations through a combination of vision, force sensing, and state tracking (e.g., via robot proprioception or AprilTag tracking). In a user study with manufacturing experts at a local manufacturing innovation center, we demonstrated VDI in representative industrial tasks. Interactions from our study highlight the practical value of VDI's varied demonstration types, expose a range of industrial use cases for VDI, and provide insights for future tool design.
|
|
10:35-10:40, Paper WeAT14.2 | |
Look before You Leap: Using Serialized State Machine for Language Conditioned Robotic Manipulation |
|
Mu, Tong | Johns Hopkins University |
Liu, Yihao | Johns Hopkins University |
Armand, Mehran | Johns Hopkins University |
Keywords: Imitation Learning, AI-Enabled Robotics, Learning from Demonstration
Abstract: Imitation learning frameworks for robotic manipulation have drawn attention in the recent development of language-model-grounded robotics. However, the success of these frameworks largely depends on the coverage of the demonstration cases: when the demonstration set does not include examples of how to act in all possible situations, actions may fail and can result in cascading errors. To solve this problem, we propose a framework that uses a serialized Finite State Machine (FSM) to generate demonstrations and improve the success rate in manipulation tasks requiring a long sequence of precise interactions. To validate its effectiveness, we use environmentally evolving and long-horizon puzzles that require long sequential actions. Experimental results show that our approach achieves a success rate of up to 98% in these tasks, compared to existing approaches, which reached at most 60% and, in some tasks, failed almost completely. The source code for this project can be accessed at https://imitate.finite-state.com/.
|
|
10:40-10:45, Paper WeAT14.3 | |
Distilling Realizable Students from Unrealizable Teachers |
|
Kim, Yujin | Cornell University |
Choudhury, Sanjiban | Cornell University |
Chin, Nathaniel | Cornell University |
Vasudev, Arnav | Cornell University |
Keywords: Imitation Learning, Learning from Demonstration, Reinforcement Learning
Abstract: We study policy distillation under privileged information, where a student policy with only partial observations must learn from a teacher with full-state access. A key challenge is information asymmetry: the student cannot directly access the teacher's state space, leading to distributional shifts and policy degradation. Existing approaches either modify the teacher to produce realizable but sub-optimal demonstrations or rely on the student to explore missing information independently, both of which are inefficient. Our key insight is that the student should strategically interact with the teacher, querying only when necessary and resetting from recovery states, to stay on a recoverable path within its own observation space. We introduce two methods: (i) an imitation learning approach that adaptively determines when the student should query the teacher for corrections, and (ii) a reinforcement learning approach that selects where to initialize training for efficient exploration. We validate our methods in both simulated and real-world robotic tasks, demonstrating significant improvements over standard teacher-student baselines in training efficiency and final performance.
|
|
10:45-10:50, Paper WeAT14.4 | |
Arc-Length-Based Warping for Robot Skill Synthesis from Multiple Demonstrations |
|
Braglia, Giovanni | Istituto Italiano Di Tecnologia |
Tebaldi, Davide | University of Modena and Reggio Emilia |
Lazzaretti, Andre | Federal University of Technology of Parana |
Biagiotti, Luigi | University of Modena and Reggio Emilia |
Keywords: Learning from Demonstration, Motion and Path Planning, Datasets for Human Motion
Abstract: In robotics, Learning from Demonstration (LfD) aims to transfer skills to robots by using multiple demonstrations of the same task. These demonstrations are recorded and processed to extract a consistent skill representation. This process typically requires temporal alignment through techniques such as Dynamic Time Warping (DTW). In this paper, we present a novel algorithm, named Spatial Sampling (SS), specifically designed for robot trajectories, which enables time-independent alignment of the trajectories by providing an arc-length parametrization of the signals. This approach eliminates the need for temporal alignment, enhancing the accuracy and robustness of skill representation, especially when recorded movements are subject to intermittent motion or extremely variable speeds, a common characteristic of operations based on kinesthetic teaching, where the operator may encounter difficulties in guiding the end-effector smoothly. To prove this, we built a custom, publicly available dataset of robot recordings of real-world movements, where the user tracks the same geometric path multiple times with motion laws that vary greatly and are subject to starting and stopping. SS demonstrates better performance than state-of-the-art algorithms in terms of (i) trajectory synchronization and (ii) the quality of the extracted skill.
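The core of an arc-length parametrization is easy to sketch (a minimal version, not the SS algorithm itself): resample the recorded path at uniform arc-length increments so the motion law drops out:

    import numpy as np

    def arclength_resample(traj, num_samples=200):
        # traj: (T, D) end-effector positions recorded at arbitrary speeds.
        seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)
        s = np.concatenate([[0.0], np.cumsum(seg)])       # cumulative arc length
        s_uniform = np.linspace(0.0, s[-1], num_samples)  # equal spacing in space
        cols = [np.interp(s_uniform, s, traj[:, d]) for d in range(traj.shape[1])]
        return np.stack(cols, axis=1)   # same geometric path, motion law removed

Two demonstrations of the same path at very different speeds then resample to nearly identical sequences, so skill extraction can proceed without temporal alignment.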
|
|
10:50-10:55, Paper WeAT14.5 | |
Learning Generalizable 3D Manipulation with 10 Demonstrations |
|
Ren, Yu | Shenyang Institute of Automation, Chinese Academy of Sciences |
Cong, Yang | Chinese Academy of Sciences, China |
Huang, BoHao | South China University of Technology |
Long, JiaHao | South China University of Technology |
Chen, Ronghan | Shenyang Institute of Automation, Chinese Academy of Sciences |
Li, HongBo | Tsinghua University |
Fan, Huijie | Shenyang Institute of Automation |
Keywords: Learning from Demonstration, Imitation Learning, Deep Learning in Grasping and Manipulation
Abstract: Learning robust and generalizable manipulation skills from few demonstrations remains a key challenge in robotics, with broad applications in industrial automation and service robotics. While recent imitation learning methods have achieved impressive results, they often require large amounts of demonstration data and struggle to generalize across spatial variations. In this work, we propose a framework that learns 3D manipulation policies from only 10 demonstrations while achieving robust generalization to unseen spatial configurations. Our framework consists of two key modules: a Semantic Guided Perception module that extracts task-aware 3D representations from RGB-D inputs using semantic priors, and a Spatial Generalized Decision module implementing a diffusion-based policy that preserves spatial equivariance through denoising. Central to our framework is a spatially equivariant training strategy, which adapts 2D data augmentation principles to 3D manipulation by maintaining gripper-object spatial relationships during trajectory augmentation. We validate our framework through extensive experiments on both simulation benchmarks and real-world robotic systems. Our method demonstrates a 60–70% improvement in success rates over state-of-the-art approaches on a series of challenging tasks, particularly under significant object pose variations. This work shows significant potential for advancing efficient, generalizable manipulation skill learning in real-world applications.
|
|
10:55-11:00, Paper WeAT14.6 | |
One-Shot Robust Imitation Learning for Long-Horizon Visuomotor Tasks from Unsegmented Demonstrations |
|
Wu, Shaokang | University of Leeds |
Wang, Yijin | University of Leeds |
Huang, Yanlong | University of Leeds |
Keywords: Learning from Demonstration, Imitation Learning
Abstract: In contrast to single-skill tasks, long-horizon tasks play a crucial role in our daily life, e.g., a pouring task requires a proper concatenation of reaching, grasping and pouring subtasks. As an efficient solution for transferring human skills to robots, imitation learning has achieved great progress over the last two decades. However, when learning long-horizon visuomotor skills, imitation learning often demands a large amount of semantically segmented demonstrations. Moreover, the performance of imitation learning can be susceptible to external perturbations and visual occlusion. In this paper, we exploit dynamical movement primitives and meta-learning to provide a new framework for imitation learning, called Meta-Imitation Learning with Adaptive Dynamical Primitives (MiLa). MiLa allows for learning unsegmented long-horizon demonstrations and adapting to unseen tasks with a single demonstration. MiLa can also resist external disturbances and visual occlusion during task execution. Real-world robotic experiments demonstrate the superiority of MiLa under visual occlusion and random perturbations applied to the robot.
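For context, a textbook discrete dynamical-movement-primitive rollout looks like the sketch below (standard formulation with illustrative gains; MiLa's adaptive primitives and meta-learning build on top of this):

    import numpy as np

    def dmp_rollout(y0, g, weights, centers, widths,
                    tau=1.0, dt=0.01, alpha=25.0, beta=6.25, alpha_x=8.0):
        y, dy, x = float(y0), 0.0, 1.0
        traj = []
        for _ in range(int(1.0 / dt)):
            psi = np.exp(-widths * (x - centers) ** 2)       # RBF basis
            f = (psi @ weights) / (psi.sum() + 1e-10) * x * (g - y0)
            ddy = (alpha * (beta * (g - y) - tau * dy) + f) / tau ** 2
            dy += ddy * dt                                   # transformation system
            y += dy * dt
            x += -alpha_x * x / tau * dt                     # canonical system
            traj.append(y)
        return np.array(traj)

    centers = np.exp(-8.0 * np.linspace(0, 1, 20))   # spaced along canonical x
    traj = dmp_rollout(0.0, 1.0, np.zeros(20), centers, np.full(20, 50.0))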
|
|
WeAT15 |
206 |
Computer Vision 1 |
Regular Session |
|
10:30-10:35, Paper WeAT15.1 | |
Lanes Are Not Enough: Enhancing Trajectory Prediction in Intralogistics through Detailed Environmental Context |
|
Prutsch, Alexander | Graz University of Technology |
Wess, Matthias | TU Wien |
Possegger, Horst | Graz University of Technology |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception, Logistics
Abstract: Trajectory prediction is an essential component of the perception stack in autonomous mobile robots (AMRs). AMRs operate in complex environments where their movements are influenced by various environment elements, such as racks and storage locations. Therefore, accurate and efficient trajectory prediction for intralogistics requires detailed environment modeling that goes beyond the lane-based context mainly used in road traffic methods. We propose the addition of a new environment context encoder module that can be seamlessly integrated into state-of-the-art autonomous driving systems. Our approach, tailored to the specific challenges of intralogistics, achieves highly accurate predictions using compact and efficient baseline networks.
|
|
10:35-10:40, Paper WeAT15.2 | |
Pseudo Depth Meets Gaussian: A Feed-Forward RGB SLAM Baseline |
|
Zhao, Linqing | Tsinghua University |
Xu, Xiuwei | Tsinghua University |
Wang, Yirui | Tsinghua University |
Wang, Hao | Beijing University of Posts and Telecommunications |
Zheng, Wenzhao | Tsinghua University |
Tang, Yansong | Tsinghua University |
Yan, Haibin | Beijing University of Posts and Telecommunications |
Lu, Jiwen | Tsinghua University |
Keywords: Computer Vision for Automation, SLAM
Abstract: Incrementally recovering real-sized 3D geometry from a pose-free RGB stream is a challenging task in 3D reconstruction, requiring minimal assumptions on input data. Existing methods can be broadly categorized into end-to-end and visual SLAM-based approaches, both of which either struggle with long sequences or depend on slow test-time optimization and depth sensors. To address this, we first integrate a depth estimator into an RGB-D SLAM system, but this approach is hindered by inaccurate geometric details in predicted depth. Through further investigation, we find that 3D Gaussian mapping can effectively solve this problem. Building on this, we propose an online 3D reconstruction method using 3D Gaussian-based SLAM, combined with a feed-forward recurrent prediction module to directly infer camera pose from optical flow. This approach replaces slow test-time optimization with fast network inference, significantly improving tracking speed. Additionally, we introduce a local graph rendering technique to enhance robustness in feed-forward pose prediction. Experimental results on the Replica and TUM-RGBD datasets, along with a real-world deployment demonstration, show that our method achieves performance on par with the state-of-the-art SplaTAM, while reducing tracking time by more than 90%. Code is available at https://github.com/wangyr22/DepthGS
|
|
10:40-10:45, Paper WeAT15.3 | |
Bridging Text and Vision: A Multi-View Text-Vision Registration Approach for Cross-Modal Place Recognition |
|
Shang, Tianyi | Fuzhou University |
Zhenyu, Li | Qilu University of Technology (Shandong Academy of Sciences) |
Xu, Pengjie | Shanghai Jiaotong University |
Qiao, Jinwei | Qilu University of Technology |
Chen, Gang | Xiamen University |
Ruan, Zihan | FUZHOU University |
Hu, Weijun | Fuzhou University |
Keywords: Computer Vision for Automation, Computer Vision for Transportation, Vision-Based Navigation
Abstract: Mobile robots require advanced natural language understanding capabilities to accurately identify locations and perform tasks such as package delivery. However, traditional visual place recognition (VPR) methods rely solely on single-view visual information and cannot interpret human language descriptions. To overcome this challenge, we bridge text and vision with a multi-view (360° views of the surroundings) text-vision registration approach called Text4VPR for the place recognition task, the first method that matches textual descriptions exclusively against a database of images. Text4VPR employs the frozen T5 language model to extract global textual embeddings. Additionally, it utilizes the Sinkhorn algorithm with a temperature coefficient to assign local tokens to their respective clusters, thereby aggregating visual descriptors from images. During the training stage, Text4VPR emphasizes the alignment between individual text-image pairs for precise textual description. In the inference stage, Text4VPR uses Cascaded Cross-Attention Cosine Alignment (CCCA) to address the internal mismatch between text and image groups, and then performs precise place matching based on the descriptions of text-image groups. On Street360Loc, the first text-to-image VPR dataset, which we created, Text4VPR establishes a robust baseline, achieving a leading top-1 accuracy of 57% and a leading top-10 accuracy of 92% within a 5-meter radius on the test set. This indicates that localization from textual descriptions to images is not only feasible but also holds significant potential for further advancement, as shown in Figure 1. Our code is available at https://github.com/nuozimiaowu/Text4VPR.
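The Sinkhorn-with-temperature assignment used for aggregation can be sketched as follows (a minimal version with assumed shapes, not the released implementation):

    import torch

    def sinkhorn_assign(scores, temperature=0.1, iters=3):
        # scores: (num_tokens, num_clusters) token-to-cluster similarities.
        Q = torch.exp(scores / temperature)
        for _ in range(iters):                  # alternate normalizations
            Q = Q / Q.sum(dim=1, keepdim=True)  # each token spreads mass 1
            Q = Q / Q.sum(dim=0, keepdim=True)  # balance cluster occupancy
        return Q / Q.sum(dim=1, keepdim=True)   # row-normalized soft assignment

    tokens = torch.randn(196, 64)               # e.g., local visual tokens
    centroids = torch.randn(8, 64)              # learned cluster centers
    A = sinkhorn_assign(tokens @ centroids.T)   # (196, 8) soft assignment
    descriptors = A.T @ tokens                  # aggregated visual descriptors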
|
|
10:45-10:50, Paper WeAT15.4 | |
Automated 3D-GS Registration and Fusion Via Skeleton Alignment and Gaussian-Adaptive Features |
|
Shiyang, Liu | Beijing Institute of Technology |
Yang, Dianyi | Beijing Institute of Technology |
Gao, Yu | Beijing Institute of Technology |
Ren, Bohan | Beijing Institute of Technology |
Yang, Yi | Beijing Institute of Technology |
Fu, Mengyin | Beijing Institute of Technology |
Keywords: Computer Vision for Automation, RGB-D Perception, Vision-Based Navigation
Abstract: In recent years, 3D Gaussian Splatting (3D-GS)-based scene representation demonstrates significant potential in real-time rendering and training efficiency. However, most existing methods primarily focus on single-map reconstruction, while the registration and fusion of multiple 3D-GS sub-maps remain underexplored. Existing methods typically rely on manual intervention to select a reference sub-map as a template and use point cloud matching for registration. Moreover, hard-threshold filtering of 3D-GS primitives often degrades rendering quality after fusion. In this paper, we present a novel approach for automated 3D-GS sub-map alignment and fusion, eliminating the need for manual intervention while enhancing registration accuracy and fusion quality. First, we extract geometric skeletons across multiple scenes and leverage ellipsoid-aware convolution to capture 3D-GS attributes, facilitating robust scene registration. Second, we introduce a multi-factor Gaussian fusion strategy to mitigate the scene element loss caused by rigid thresholding. Experiments on the ScanNet-GSReg and our Coord datasets demonstrate the effectiveness of the proposed method in registration and fusion. For registration, it achieves a 41.9% reduction in RRE on complex scenes, ensuring more precise pose estimation. For fusion, it improves PSNR by 10.11 dB, highlighting superior structural preservation. These results confirm its ability to enhance scene alignment and reconstruction fidelity, ensuring more consistent and accurate 3D scene representation for robotic perception and autonomous navigation.
|
|
10:50-10:55, Paper WeAT15.5 | |
Robust Maritime Object Detection under Adverse Conditions Via Joint Semantic Learning without Extra Computational Overhead |
|
Lee, Junseok | GIST(Gwangju Institute of Science and Technology) |
Lee, Seongju | Gwangju Institue of Science and Technology (GIST) |
Kim, Jong-Won | GIST(Gwangju Institute of Science and Technology) |
Park, Jumi | Gwangju Institute of Science and Technology |
Lee, Kyoobin | Gwangju Institute of Science and Technology |
Keywords: Computer Vision for Automation, Computer Vision for Transportation, Recognition
Abstract: This study addresses the challenge of robust object detection in maritime environments, where dynamic conditions such as fog, brightness variations, and motion blur can degrade accuracy. We propose a novel framework, Joint Semantic Learning (JSL), which combines ocean scene segmentation and object detection to improve both performance and robustness. JSL incorporates the ocean scene segmentation module into the detection network during training and removes it during inference, ensuring no additional computational overhead. Through ocean scene segmentation, the feature extractor learns to understand the overall context of the image and extract detailed information about objects. Extensive experiments show that JSL, applied to various convolutional neural network-based detectors, achieves significant performance improvements on maritime datasets SMD and SeaShips. Notably, the proposed method shows substantial performance gains on the SMD-C and SeaShips-C datasets, which include adverse conditions, demonstrating the robustness of the proposed method. Furthermore, experiments comparing our method with existing state-of-the-art multi-task methods on the Cityscapes dataset validate its effectiveness in generalizing to urban environments. The efficient integration of spatial and semantic information of JSL ensures accurate and reliable object detection across diverse applications. Our code is available at: https://github.com/gist-ailab/JSL.
|
|
10:55-11:00, Paper WeAT15.6 | |
GeoScene: Temporal 3D Semantic Scene Completion with Geometric Correlation between Images |
|
Zhu, Xiaoyu | Hunan University |
Zhang, Xiaogang | Hunan University |
Chen, Hua | Hunan University |
Wang, Yaonan | Hunan University |
Miao, Zhiqiang | Hunan University |
Liu, Kangcheng | Hunan University (HNU); Previously with the California Institute |
Keywords: Computer Vision for Automation, Visual Learning
Abstract: Semantic Scene Completion (SSC) aims to reconstruct the entire 3D scene in terms of both occupancy and semantics, serving as a fundamental task for autonomous driving and robotic systems. Camera-based methods have seen significant advancements due to their low cost and rich visual cues. However, previous approaches have predominantly focused on semantic recovery. This can lead to inaccurate occupancy predictions and, consequently, the failure of downstream tasks such as trajectory planning. To address this limitation, we propose a novel multi-frame matching framework, GeoScene, which reconstructs spatial structures through inter-frame geometric correlations of temporal images and subsequently infers scene semantic information. Specifically, we extract features from distinct frames in the depth dimension and derive depth features by constructing a cost volume. Following this, dot product and voxelization operations are applied between the extracted features and depth features to correct assignment errors. Furthermore, we introduce a surface normal-based regression loss to preserve fine-grained surface structures. Extensive experiments on the SemanticKITTI dataset demonstrate that GeoScene outperforms existing state-of-the-art methods.
|
|
11:00-11:05, Paper WeAT15.7 | |
TopoLiDM: Topology-Aware LiDAR Diffusion Models for Interpretable and Realistic LiDAR Point Cloud Generation |
|
Liu, Jiuming | Shanghai Jiao Tong University |
Huang, Zheng | Shanghai Jiao Tong University |
Liu, Mengmeng | University of Twente |
Deng, Tianchen | Shanghai Jiao Tong University |
Nex, Francesco | University of Twente |
Cheng, Hao | University of Twente |
Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception, AI-Based Methods
Abstract: LiDAR scene generation is critical for mitigating real-world LiDAR data collection costs and enhancing the robustness of downstream perception tasks in autonomous driving. However, existing methods commonly struggle to capture geometric realism and global topological consistency. Recent LiDAR Diffusion Models (LiDMs) predominantly embed LiDAR points into the latent space for improved generation efficiency, which limits their ability to model detailed geometric structures and preserve global topological consistency in an interpretable way. To address these challenges, we propose TopoLiDM, a novel framework that integrates graph neural networks (GNNs) with diffusion models under topological regularization for high-fidelity LiDAR generation. Our approach first trains a topology-preserving VAE to extract latent graph representations through graph construction and multiple graph convolutional layers. Then we freeze the VAE and generate novel latent topological graphs through latent diffusion models. We also introduce 0-dimensional persistent homology (PH) constraints, ensuring the generated LiDAR scenes adhere to real-world global topological structures. Extensive experiments on the KITTI-360 dataset demonstrate TopoLiDM's superiority over state-of-the-art methods, achieving improvements of 22.6% lower Fréchet Range Image Distance (FRID) and 9.2% lower Minimum Matching Distance (MMD). Notably, our model also enables fast generation, with an average throughput of 1.68 samples/s, showcasing its scalability for real-world applications. We will release the related code at https://github.com/IRMVLab/TopoLiDM.
|
|
11:05-11:10, Paper WeAT15.8 | |
UAV-MaLO: Mamba-Augmented YOLO Hybrid Architecture for UAV Micro-Object Detection in Autonomous Robotics |
|
Wei, Lennox | Harbin Institute of Technology |
Sun, Shixin | Harbin Institute of Technology |
Yao, Jiaqi | Harbin Institute of Technology |
Mi, Yachun | Harbin Institute of Technology |
Sui, Xiangyu | Meituan |
Chen, Heng | Kuaishou Technology |
Liu, Shaohui | Harbin Institute of Technology |
Keywords: Computer Vision for Automation, Object Detection, Segmentation and Categorization, Deep Learning Methods
Abstract: The rapid advancement of drone technology has led to the widespread application of micro-object detection in Unmanned Aerial Vehicle (UAV) systems. However, under real-time computation constraints, critical challenges remain in addressing extreme scale variations, low-resolution signatures, and dense occlusions. For the object detection task, although YOLO-based detectors outperform transformer models in the efficiency-accuracy trade-off, their limited capacity for global context modeling and feature discriminability in complex aerial environments hinders optimal performance. To overcome these limitations, we introduce UAV-MaLO, a novel framework that incorporates state space modeling principles into YOLO's architecture. By introducing long-range dependency modeling and adaptive spatial-frequency fusion, the proposed approach dynamically optimizes receptive fields while suppressing background interference, achieving robust micro-object localization in cluttered scenarios. Furthermore, the parallelized attention mechanism and hierarchical feature refinement ensure real-time processing capabilities without compromising detection precision, establishing a new paradigm for UAV deployment. Our experimental results on the VisDrone-2019-DET dataset show significant improvements across average precision (AP) variants, demonstrating the strong performance of UAV-MaLO.
|
|
WeAT16 |
207 |
Prosthetics and Exoskeletons 1 |
Regular Session |
Chair: Wensing, Patrick M. | University of Notre Dame |
|
10:30-10:35, Paper WeAT16.1 | |
Online Trajectory Generation with Variable Geometry Stair Ascent for Powered Exoskeleton |
|
Zhang, Fan | HKUST |
Shi, Ling | The Hong Kong University of Science and Technology |
Li, Shilei | Beijing Institute of Technology |
Keywords: Prosthetics and Exoskeletons, Task and Motion Planning, Body Balancing
Abstract: Powered exoskeletons offer a promising solution for individuals who require assistance with daily activities. In addition to walking on flat terrains, they are evolving to handle advanced scenarios such as stairs, allowing for a wider range of activities. To facilitate this advancement, an effective and safe tool is required for the planning and generation of gait trajectories. This study proposes a novel online trajectory generation method for an exoskeleton to aid in stair ascent. Initially, a finite state machine model is designed to assist the user in shifting their center of gravity. Subsequently, a Bézier curve-based path interpolation approach is introduced to generate the path between each state. Finally, a time scaling and path reparameterization method is employed to avoid the singularity problem. Two kinds of numerical simulations and real exoskeleton experiments demonstrate that the proposed method can effectively and safely generate the trajectories to assist users in ascending stairs. Additionally, the effectiveness of the exoskeleton's assistance is verified through a comparison of electromyography (EMG) signals from seven muscles. The results show that there was a reduction in muscle activation ranging from 22% to 81% for the different muscles analyzed.
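The Bézier-based interpolation between gait states reduces to repeated linear interpolation (De Casteljau); a minimal sketch with made-up control points, where in the paper's setting a time-scaled phase variable, rather than raw time, would drive the curve parameter:

    import numpy as np

    def bezier(control_points, u):
        # De Casteljau evaluation: repeated linear interpolation; u in [0, 1].
        pts = np.asarray(control_points, dtype=float)
        while len(pts) > 1:
            pts = (1 - u) * pts[:-1] + u * pts[1:]
        return pts[0]

    ctrl = np.array([[0.0, 0.0], [0.2, 0.8], [0.8, 0.9], [1.0, 0.4]])
    path = np.array([bezier(ctrl, u) for u in np.linspace(0, 1, 50)])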
|
|
10:35-10:40, Paper WeAT16.2 | |
Robust Gait Phase Estimation with Discrete Wavelet Transform for Walking Assistance on Multiple Terrains |
|
Zhou, Libo | Zhejiang University of Technology |
Jiang, Feiyu | Zhejiang University of Technology |
Bai, Shaoping | Aalborg University |
Feng, Yuanjing | Zhejiang University of Technology |
Ou, Linlin | Zhejiang University of Technology |
Yu, Xinyi | Zhejiang University of Technology |
Keywords: Prosthetics and Exoskeletons, Physically Assistive Devices, Intention Recognition
Abstract: Gait phase detection is crucial to realize personalized assistive functions of lower limb exoskeletons. A common method in gait phase estimation is the adaptive oscillator, which performs well in periodic gaits. However, these types of methods fail in gait phase estimation under aperiodic gait cycles. Although some modified methods have been proposed for gait phase estimation under multiple terrains, they usually require dataset training, and their estimation accuracy is highly dependent on the collected dataset. To realize accurate and stable gait phase recognition, a novel method is proposed that estimates the gait phase without dataset training. This method, by incorporating the discrete wavelet transform (DWT) with adaptive oscillators (AOs), can identify non-periodic mutations online and reset the oscillator at an appropriate time to avoid the divergence that occurs when an adaptive oscillator is subjected to mutation signals. In the proposed method, the hip angle is measured by an inertial measurement unit (IMU) and the measured data is processed using the discrete wavelet transform to detect the maximum hip flexion angle (MFA) and non-periodic mutations. The gait phase is finally estimated by a modified adaptive oscillator. Ground walking tests at a variety of speeds were conducted by six subjects under different walking conditions, and the results show that the new method performs well in gait phase estimation across multiple terrains. The method is demonstrated on a hip exoskeleton during walking assistance.
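A minimal sketch of the mutation-detection idea using PyWavelets (the wavelet choice and threshold are illustrative assumptions): sharp, non-periodic changes in the hip angle appear as large detail coefficients, which can trigger an oscillator reset:

    import numpy as np
    import pywt

    def detect_mutations(hip_angle, wavelet="db4", k=4.0):
        cA, cD = pywt.dwt(hip_angle, wavelet)          # approximation / detail
        thresh = k * np.median(np.abs(cD)) / 0.6745    # robust noise scale
        return 2 * np.nonzero(np.abs(cD) > thresh)[0]  # approx. sample positions

    angle = np.sin(np.linspace(0, 12 * np.pi, 1200))   # periodic hip angle
    angle[700:] = angle[700]                           # abrupt, aperiodic stop
    print(detect_mutations(angle))   # flags samples near 700: reset oscillator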
|
|
10:40-10:45, Paper WeAT16.3 | |
Mirror Adaptive Impedance Control of Multi-Mode Soft Exoskeleton with Reinforcement Learning (I) |
|
Xu, Jiajun | Nanjing University of Aeronautics and Astronautics |
Huang, Kaizhen | Nanjing University of Aeronautics and Astronautics |
Zhang, Tianyi | Nanjing University of Aeronautics and Astronaut |
Zhao, Mengcheng | Nanjing University of Aeronautics and Astronautics |
Ji, Aihong | Nanjing University of Aeronautics and Astronautics |
Li, You-Fu | City University of Hong Kong |
Keywords: Prosthetics and Exoskeletons, Human-Centered Robotics, Reinforcement Learning
Abstract: Soft exoskeleton robots (exosuits) have exhibited promising potential for walking assistance with a comfortable wearing experience. In this paper, a twisted string actuator (TSA) is developed and equipped on the exosuit to provide powerful driving force and variable assistance intensity for hemiplegic patients, offering human-domain and robot-domain training modes for subjects with different movement capabilities. Since the human-exosuit coupling dynamics are difficult to model due to the soft structure of the exosuit and incomplete knowledge of the wearer's performance, accurate control and efficient assistance cannot be guaranteed in current exosuits. By taking advantage of the motion characteristics of hemiplegic patients, a mirror adaptive impedance control is proposed, where the robotic actuation is modulated based on the motion and physiological reference of the healthy limb (HL) as well as the performance of the impaired limb (IL). A linear quadratic regulation (LQR) problem is formulated to minimize the bilateral trajectory tracking errors and human effort, and adaptation between the human-domain and robot-domain modes can be realized. A reinforcement learning (RL) algorithm is designed to solve the given LQR problem, optimizing the impedance parameters with little information about the human or robot model. The proposed robotic system is validated through experiments that demonstrate its effectiveness and superiority.
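For reference, the model-based counterpart of the LQR that the RL algorithm solves model-free can be sketched in a few lines (the double-integrator error dynamics below are a stand-in, not the human-exosuit model):

    import numpy as np
    from scipy.linalg import solve_continuous_are

    # Double-integrator tracking-error dynamics as a stand-in plant.
    A = np.array([[0.0, 1.0],
                  [0.0, 0.0]])
    B = np.array([[0.0],
                  [1.0]])
    Q = np.diag([10.0, 1.0])    # penalize bilateral tracking error and its rate
    R = np.array([[0.1]])       # penalize assistive effort

    P = solve_continuous_are(A, B, Q, R)   # algebraic Riccati equation
    K = np.linalg.solve(R, B.T @ P)        # optimal state feedback u = -K x
    print("LQR gain:", K)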
|
|
10:45-10:50, Paper WeAT16.4 | |
Using Upper Limb Carrying Exoskeleton with Dual-Model Torque Control Strategy to Reduce Load Impact |
|
Liu, Daming | Harbin Institute of Technology |
Li, Ye | Harbin Institute of Technology |
Liu, Junchen | Harbin Institute of Technology |
Wang, Ziqi | Harbin Institute of Technology |
Zhao, Jie | Harbin Institute of Technology |
Zhu, Yanhe | Harbin Institute of Technology |
Keywords: Prosthetics and Exoskeletons, Model Learning for Control, Multi-Modal Perception for HRI
Abstract: Exoskeleton technology holds significant promise within the human-centric paradigm of Industry 5.0 for mitigating work-related musculoskeletal disorders (WMSDs). However, existing systems often struggle with mismatched assistive torque and inefficient human-machine collaboration under dynamic loading conditions, largely due to insufficient motion intent recognition accuracy. This study proposes a dual-model multimodal fusion control strategy that integrates a bidirectional LSTM neural network (Bi-LSTM) with a transformer-based multi-task learning model (MTL) to enable real-time torque compensation and accurate prediction of dynamic load mass under varying conditions. We developed a lightweight elbow-joint exoskeleton prototype that leverages multi-modal information to enhance assistive torque prediction accuracy. Experimental results show an 83.7% reduction in agonist muscle activation under a 3.5 kg load compared to conditions without the exoskeleton, underscoring its potential for industrial material handling scenarios.
|
|
10:50-10:55, Paper WeAT16.5 | |
Simultaneous Locomotion Mode Classification and Continuous Gait Phase Estimation for Transtibial Prostheses |
|
Posh, Ryan | University of Michigan |
Li, Shenggao | University of Notre Dame |
Wensing, Patrick M. | University of Notre Dame |
Keywords: Prosthetics and Exoskeletons, Rehabilitation Robotics, Intention Recognition
Abstract: Recognizing and identifying human locomotion is a critical step to ensuring fluent control of wearable robots, such as transtibial prostheses. In particular, classifying the intended locomotion mode and estimating the gait phase are key. In this work, a novel, interpretable, and computationally efficient algorithm is presented for simultaneously predicting locomotion mode and gait phase. Using able-bodied (AB) and transtibial prosthesis (PR) data, seven locomotion modes are tested, including slow, medium, and fast level walking (0.6, 0.8, and 1.0 m/s), ramp ascent/descent (5 degrees), and stair ascent/descent (20 cm height). Overall classification accuracy was 99.1% and 99.3% for the AB and PR conditions, respectively. The average gait phase error across all data was less than 4%. Exploiting the structure of the data, computation took only 2.91 μs per time step. The time complexity of this algorithm scales as O(N·M) with the number of locomotion modes M and samples per gait cycle N. This efficiency and high accuracy could accommodate a much larger set of locomotion modes (∼700 on the Open-Source Leg Prosthesis) to handle the wide range of activities pursued by individuals during daily living.
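The O(N·M) structure suggests a phase-indexed template lookup; a hedged sketch of that idea (the features and templates below are placeholders, not the paper's variables):

    import numpy as np

    def classify_and_phase(feature, templates):
        # templates: (M, N, D) mean feature per mode and per phase sample.
        d = np.linalg.norm(templates - feature, axis=-1)   # (M, N), O(N*M)
        mode = int(np.argmin(d.min(axis=1)))               # best-matching mode
        phase = float(np.argmin(d[mode])) / d.shape[1]     # phase in [0, 1)
        return mode, phase

    M, N, D = 7, 100, 6            # 7 modes, 100 phase samples, 6-D feature
    templates = np.random.randn(M, N, D)
    mode, phase = classify_and_phase(np.random.randn(D), templates)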
|
|
10:55-11:00, Paper WeAT16.6 | |
Enhancing Multi-Task Motion Planning Based on Improved DMPs for Lower Limb Prostheses |
|
An, Hong-Lei | National University of Defense Technology |
Huang, Yongshan | National University of Defense Technology |
Nie, Yiming | National Innovation Institute of Defense Technology |
Ma, Hongxu | National University of Defense Technology |
Keywords: Prosthetics and Exoskeletons, Human-Aware Motion Planning
Abstract: Achieving natural locomotion across diverse environments with prosthetic limbs remains a significant challenge for amputees. Intelligent prosthetics leverage motion planning techniques using phase variables to emulate natural gait aligned with human movement intentions. However, traditional phase variable-based planning, which utilizes geometric human motion models, often lacks robustness when encountering external disturbances. Additionally, models derived from human walking data can only approximate a limited set of discrete tasks, hindering the construction of a comprehensive model. In this study, we present an advanced prosthetic motion planning approach that integrates Dynamic Motion Primitives (DMPs) to ensure robust performance across multiple tasks. We demonstrate that DMPs with a human in the loop effectively reproduce human joint movement trajectories under various task conditions. Furthermore, we introduce a novel Multi-Task Dynamic Motion Primitives with Singular Value Decomposition (DMPs-SVD) method, which incorporates multiple feature trajectory learning. This approach constructs a coherent task model from a limited dataset of typical human walking patterns, enabling joint motion planning across diverse task scenarios. Experimental results validate the viability and efficacy of the proposed human-in-the-loop DMPs and DMPs-SVD techniques in prosthetic applications.
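A hedged sketch of the SVD step (the exact weight stacking is an assumption): gather DMP forcing-term weights from a few typical walking tasks and keep the leading right singular vectors as a shared task basis, so new tasks become coordinates in that basis:

    import numpy as np

    demo_weights = np.random.randn(6, 50)   # K=6 typical tasks x 50 DMP weights
    U, S, Vt = np.linalg.svd(demo_weights, full_matrices=False)
    basis = Vt[:3]                          # leading right singular vectors

    def synthesize(task_coords):
        # task_coords: (3,) coordinates in the shared task basis.
        return task_coords @ basis          # DMP weights for an unseen task

    w_new = synthesize(np.array([1.0, -0.3, 0.2]))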
|
|
11:00-11:05, Paper WeAT16.7 | |
IMU-Based Motion Mode Recognition in Soft Underwater Exosuit |
|
Luan, Mengbo | Shenzhen Institute of Advanced Technology, Chinese Academy |
Wang, Xiangyang | Shenzhen Institute of Advanced Technology, Chinese Academy of Sc |
Wang, Xufei | Shenzhen International Graduate School, Tsinghua University |
Hong, Yongxuan | Southern University of Science and Technology |
Ma, Yue | Shenzhen Institutes of Advanced Technology, Chinese Academy of S |
Chen, Chunjie | Shenzhen Institutes of Advanced Technology, Chinese Academ |
Wu, Xinyu | CAS |
Keywords: Prosthetics and Exoskeletons, AI-Based Methods, Marine Robotics
Abstract: By accurately recognizing the wearer’s motion, the underwater exoskeleton enables more efficient human-machine collaboration and provides enhanced assistance in complex and dynamic underwater environments. In this study, we propose a soft underwater exosuit motion mode recognizer based on a long short-term memory network and convolutional neural networks, referred to as LSTM-CNN. This model is designed to perform two tasks: motion mode classification and state transition label recognition. First, the LSTM network extracts features from the time-series data, followed by further feature extraction and classification using the convolutional and fully connected networks. The recognition of motion modes relies on three IMU sensors placed on the left and right legs and the back of the torso of the soft underwater exosuit. On the dataset containing four classes, including non-assist, breaststroke, flutter kick, and underwater walking, LSTM-CNN achieved an overall accuracy of 99.943±0.006% in motion mode classification and 92.101±0.054% in state transition label recognition. The experimental results indicate that the LSTM-CNN achieves better accuracy and performs optimally across various evaluation metrics compared to the other methods.
|
|
11:05-11:10, Paper WeAT16.8 | |
Exploring the Virtual Pivot Point in Unilateral Transfemoral Amputee Locomotion: Implications for Prosthetic Development |
|
Mohseni, Omid | Technische Universität Darmstadt |
Ebrahimian, Serajeddin | University of Eastern Finland |
Firouzi, Vahid | Technical University of Darmstadt |
Khosrotabar, Morteza | Technical University of Darmstadt, Institute of Sport Science, L |
Kupnik, Mario | Technische Universität Darmstadt |
Sharbafi, Maziar | Technische Universität Darmstadt |
Seyfarth, Andre | TU Darmstadt |
Keywords: Prosthetics and Exoskeletons, Human and Humanoid Motion Analysis and Synthesis
Abstract: The virtual pivot point (VPP), a theoretical convergence point of ground reaction forces during gait, has gained attention for its potential to uncover underlying locomotor control strategies. Here, we present the first investigation of the VPP in individuals with unilateral above-knee amputation, using a publicly available dataset of 18 participants. Subjects were categorized into K2 (walking speeds 0.4–0.8 m/s) and K3 (0.6–1.4 m/s) functional levels. Our findings show that both groups demonstrate high sagittal-plane VPP quality, comparable to that of healthy individuals, with R² > 95%, indicating a strong relationship between VPP formation and sagittal-plane dynamics. Conversely, in the frontal plane, VPP analysis reveals greater variability and lower quality, indicating the absence of a well-defined pivot during gait. Notably, frontal-plane VPP quality deteriorates with increasing walking speed, particularly in K3 ambulators. While this speed dependency is observed in healthy individuals as well, the rate of decline is significantly steeper in amputees. Additionally, spatial analysis of VPP positions reveals a consistent elevation of the amputated leg's VPP compared to the intact leg. These findings emphasize the importance of frontal-plane dynamics in amputee gait and suggest improvements in prosthetic design to enhance control and promote a more symmetrical, natural gait.
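The VPP itself is typically computed as the least-squares intersection of the ground-reaction-force lines of action; a minimal sagittal-plane sketch with synthetic data (the body-fixed frame convention is an assumption):

    import numpy as np

    def virtual_pivot_point(cop, grf):
        # cop: (T,2) centers of pressure; grf: (T,2) ground reaction forces,
        # both expressed in a body-fixed (e.g., CoM-centered) sagittal frame.
        d = grf / np.linalg.norm(grf, axis=1, keepdims=True)   # unit directions
        P = np.eye(2)[None] - d[:, :, None] * d[:, None, :]    # I - d d^T per sample
        A = P.sum(axis=0)
        b = np.einsum("tij,tj->i", P, cop)
        return np.linalg.solve(A, b)    # point closest to all force lines

    cop = np.column_stack([np.linspace(-0.1, 0.1, 50), np.zeros(50)])
    grf = np.column_stack([-np.linspace(-0.1, 0.1, 50), np.ones(50)])
    print(virtual_pivot_point(cop, grf))   # ~ (0, 1): forces converge 1 m above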
|
|
WeAT17 |
210A |
Intelligent Transportation Systems 1 |
Regular Session |
Chair: Adouane, Lounis | Université De Technologie De Compiègne (France) |
|
10:30-10:35, Paper WeAT17.1 | |
Heterogeneous Mixed Traffic Control and Coordination |
|
Islam, Md Iftekharul | University of Tennessee, Knoxville |
Li, Weizi | University of Tennessee, Knoxville |
Wang, Xuan | George Mason University |
Li, Shuai | University of Florida |
Heaslip, Kevin | University of Tennessee Knoxville |
Keywords: Multi-Robot Systems, Intelligent Transportation Systems, Human-Robot Teaming
Abstract: Urban intersections with diverse vehicle types, from small cars to large semi-trailers, pose significant challenges for traffic control. This study explores how robot vehicles (RVs) can enhance heterogeneous traffic flow, particularly at unsignalized intersections where traditional methods fail during power outages. Using reinforcement learning (RL) and real-world data, we simulate mixed traffic at complex intersections with RV penetration rates ranging from 10% to 90%. Results show that average waiting times drop by up to 86% and 91% compared to signalized and unsignalized intersections, respectively. We observe a "rarity advantage," where less frequent vehicles benefit the most (up to 87%). Although CO2 emissions and fuel consumption increase with RV penetration, they remain well below those of traditional signalized traffic. Decreased space headways also indicate more efficient road usage. These findings highlight RVs' potential to improve traffic efficiency and reduce environmental impact in complex, heterogeneous settings.
|
|
10:35-10:40, Paper WeAT17.2 | |
Beacon: A Naturalistic Driving Dataset During Blackouts for Benchmarking Traffic Reconstruction and Control |
|
Sarker, Supriya | University of Tennessee, Knoxville |
Islam, Md Iftekharul | University of Tennessee, Knoxville |
Poudel, Bibek | University of Tennessee Knoxville |
Li, Weizi | University of Tennessee, Knoxville |
Keywords: Intelligent Transportation Systems, Datasets for Human Motion, Simulation and Animation
Abstract: Extreme weather and infrastructure vulnerabilities pose significant challenges to urban mobility, particularly at intersections where signals become inoperative. To address this growing concern, we introduce Beacon, a naturalistic driving dataset capturing traffic dynamics during blackouts at two major intersections in Memphis, TN, USA. The dataset provides detailed traffic movements, including timesteps, origin, and destination lanes for each vehicle over four hours of peak periods. We analyze traffic demand, vehicle trajectories, and density across different scenarios, demonstrating high-fidelity reconstruction under unsignalized, signalized, and mixed traffic conditions. We find that integrating robot vehicles (RVs) into traffic flow can substantially reduce intersection delays, with wait time improvements of up to 82.6%. However, this enhanced traffic efficiency comes with varying environmental impacts, as decreased vehicle idling may lead to higher overall CO2 emissions. To the best of our knowledge, Beacon is the first publicly available traffic dataset for naturalistic driving behaviors during blackouts at intersections.
|
|
10:40-10:45, Paper WeAT17.3 | |
Reliable Multi-Level Optimization for Safe Predictive Control of Autonomous Vehicles to Avoid Uncertain Multimodal PLEVs |
|
Alao, Emmanuel | UTC Compiegne and INRIA Sophia Antipolis |
Adouane, Lounis | Université De Technologie De Compiègne (France) |
Martinet, Philippe | INRIA |
Keywords: Intelligent Transportation Systems, Motion and Path Planning, Human-Aware Motion Planning
Abstract: Safety assurance using all perceptual information to predict the motion of dynamic agents is critical in urban environments and remains an open challenge. For Autonomous Vehicles (AV) operating around vulnerable road users, the risk assessment strategy often needs to address stochastic uncertainties in the multiple possible trajectories (or multimodal motion) of the surrounding traffic agents. However, this increases the complexity of the navigation problem using the existing planners. To address this issue, this paper presents a multi-level optimization strategy that combines sampling-based and direct optimization methods for decision-making and control with improved safety and trajectory smoothness. In the primary stage, a sampling-based optimization framework systematically identifies safe candidate trajectories by employing the Fusion of stochastic Predictive Inter-Distance Profile (F-sPIDP). F-sPIDP encapsulates the multimodal dynamics of traffic agents and explicitly computes the uncertainties in their estimated or tracked states. From the set of trajectories, a reference optimal trajectory and its F-sPIDP setpoints are selected, adhering to stringent safety constraints and motion smoothness. Subsequently, a secondary local control optimization refines the optimal trajectory to ensure compliance with the AV’s kinematic and dynamic constraints while accounting for the quantified uncertainty within the F-sPIDP framework. The performance of the proposed method was assessed through simulations and statistical analyses, evaluating its robustness to diverse levels of uncertainty.
|
|
10:45-10:50, Paper WeAT17.4 | |
Multi-Agent Reinforcement Learning with Transformer-Based Spatio-Temporal Fusion for Autonomous Driving in Mixed Traffic |
|
Li, Rixin | Southern University of Science and Technology |
Liu, Jia | Shenzhen Institutes of Advanced Technology, Chinese Academy of S |
Sun, Tianfu | Shenzhen Institutes of Advanced Technology, Chinese Academy of S |
Xu, Tiantian | Chinese Academy of Sciences |
Keywords: Intelligent Transportation Systems, Reinforcement Learning, Autonomous Vehicle Navigation
Abstract: Driving decision-making in mixed traffic, characterized by high-dynamic interactions and stochastic behaviors of human-driven vehicles, poses significant challenges for autonomous driving systems. To address these issues, we propose a novel Transformer-based Spatial Temporal Fusion (TSTF) module integrated with an auxiliary contrastive learning task within a multi-agent reinforcement learning (MARL) framework. The TSTF module captures interaction-aware behaviors and long-term temporal dependencies that tackle mixed cooperative driving scenarios, while the auxiliary contrastive learning task refines feature representations to enhance exploration efficiency and decision stability. Experimental evaluations on the MetaDrive platform demonstrate that the proposed approach outperforms baseline algorithms in safety, adaptability and robustness to dynamic traffic scenarios. The results highlight the effectiveness of the TSTF module in enabling robust and context-aware collaborative driving behaviors, offering a scalable solution for real-world mixed traffic. This work advances MARL by addressing key challenges in interaction modeling and driving decision-making under uncertainty, with significant implications for the development of intelligent transportation systems.
|
|
10:50-10:55, Paper WeAT17.5 | |
CG-Net: Urban Trajectory Forecasting with Bipartite Graphs for Agents, Scene Context and Candidate Centerlines |
|
Bhowmik, Kaushik | Inria at Univ. Grenoble Alpes, 38000 Grenoble, France |
Spalanzani, Anne | INRIA / Univ. Grenoble Alpes |
Martinet, Philippe | INRIA |
Keywords: Intelligent Transportation Systems, Deep Learning for Visual Perception, Semantic Scene Understanding
Abstract: Trajectory forecasting in urban environments is a critical task for the safety of autonomous vehicles, particularly in urban road intersection scenarios, where agents exhibit diverse behaviors due to complex interactions between agents and the environment and the diversity of paths available to the agents. Current state-of-the-art methods do not perform well in such scenarios. To address this issue, the proposed framework, CandidateGraph-Net (CG-Net), improves trajectory prediction at intersections by encoding the candidate centerlines available at the current location of the target agent. The interaction encoder in CG-Net is inspired by human behavior and is modeled using a bipartite graph attention network: it estimates the trajectory of the target agent in the same way humans anticipate the trajectories of other vehicles and pedestrians in dynamic environments. At each time step, the agent embeddings in the interaction encoder attend to nearby agents and surrounding scene elements simultaneously, enabling the model to learn how to prioritize interactions between nearby agents and the environment map. CG-Net's performance is evaluated on the Argoverse 2 motion forecasting dataset. The results demonstrate its effectiveness at urban road intersections, with overall improvements in key metrics such as minFDE and minADE compared to baseline methods, highlighting CG-Net's stronger motion forecasting in these scenarios.
|
|
10:55-11:00, Paper WeAT17.6 | |
ParkDiffusion: Heterogeneous Multi-Agent Multi-Modal Trajectory Prediction for Automated Parking Using Diffusion Models |
|
Wei, Jiarong | University of Freiburg |
Vödisch, Niclas | University of Freiburg |
Rehr, Anna | Cariad SE |
Feist, Christian | CARIAD SE |
Valada, Abhinav | University of Freiburg |
Keywords: Intelligent Transportation Systems, AI-Based Methods, Behavior-Based Systems
Abstract: Automated parking is a critical feature of Advanced Driver Assistance Systems (ADAS), where accurate trajectory prediction is essential to bridge perception and planning modules. Despite its significance, research in this domain remains relatively limited, with most existing studies concentrating on single-modal trajectory prediction of vehicles. In this work, we propose ParkDiffusion, a novel approach that predicts the trajectories of both vehicles and pedestrians in automated parking scenarios. ParkDiffusion employs diffusion models to capture the inherent uncertainty and multi-modality of future trajectories, incorporating several key innovations. First, we propose a dual map encoder that processes soft semantic cues and hard geometric constraints using a two-step cross-attention mechanism. Second, we introduce an adaptive agent type embedding module, which dynamically conditions the prediction process on the distinct characteristics of vehicles and pedestrians. Third, to ensure kinematic feasibility, our model outputs control signals that are subsequently used within a kinematic framework to generate physically feasible trajectories. We evaluate ParkDiffusion on the Dragon Lake Parking (DLP) dataset and the Intersections Drone (inD) dataset. Our work establishes a new baseline for heterogeneous trajectory prediction in parking scenarios, outperforming existing methods by a considerable margin.
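The "control signals through a kinematic framework" step can be illustrated with a kinematic bicycle rollout (a common choice for this purpose; wheelbase and inputs here are illustrative, not the paper's exact model):

    import numpy as np

    def rollout_bicycle(x0, controls, wheelbase=2.7, dt=0.1):
        # x0 = (x, y, heading, speed); controls: (T, 2) = (accel, steering angle).
        x, y, th, v = x0
        traj = []
        for a, delta in controls:
            x += v * np.cos(th) * dt
            y += v * np.sin(th) * dt
            th += v / wheelbase * np.tan(delta) * dt
            v += a * dt
            traj.append((x, y, th, v))
        return np.array(traj)   # every sample obeys the vehicle kinematics

    traj = rollout_bicycle((0.0, 0.0, 0.0, 1.0), np.tile([0.2, 0.05], (30, 1)))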
|
|
11:00-11:05, Paper WeAT17.7 | |
Adaptive Prediction Ensemble: Improving Out-Of-Distribution Generalization of Motion Forecasting |
|
Li, Jinning | University of California, Berkeley |
Li, Jiachen | University of California, Riverside |
Bae, Sangjae | Honda Research Institute, USA |
Isele, David | University of Pennsylvania, Honda Research Institute USA |
Keywords: Intelligent Transportation Systems, Deep Learning Methods
Abstract: Deep learning-based trajectory prediction models for autonomous driving often struggle with generalization to out-of-distribution (OOD) scenarios, sometimes performing worse than simple rule-based models. To address this limitation, we propose a novel framework, Adaptive Prediction Ensemble (APE), which integrates deep learning and rule-based prediction experts. A learned routing function, trained concurrently with the deep learning model, dynamically selects the most reliable prediction based on the input scenario. Our experiments on large-scale datasets, including Waymo Open Motion Dataset (WOMD) and Argoverse, demonstrate improvement in zero-shot generalization across datasets. We show that our method outperforms individual prediction models and other variants, particularly in long-horizon prediction and scenarios with a high proportion of OOD data. This work highlights the potential of hybrid approaches for robust and generalizable motion prediction in autonomous driving.
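For orientation only, a minimal sketch of such a routing function is given below; the architecture, names, and hard-selection rule are hypothetical, not taken from the paper.

import torch
import torch.nn as nn

class PredictionRouter(nn.Module):
    # Hypothetical sketch: scores scene features and picks whichever
    # expert (learned vs. rule-based) is judged more reliable.
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # logits: [deep_model, rule_based]
        )

    def forward(self, scene_feat, deep_pred, rule_pred):
        # deep_pred, rule_pred: (B, T, 2) candidate trajectories
        weights = torch.softmax(self.net(scene_feat), dim=-1)
        choice = weights.argmax(dim=-1)  # hard selection at inference
        return torch.where(choice.view(-1, 1, 1).bool(), rule_pred, deep_pred)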
|
|
11:05-11:10, Paper WeAT17.8 | |
When Is It Likely to Fail? Performance Monitor for Black-Box Trajectory Prediction Model (I) |
|
Shao, Wenbo | Tsinghua University |
Li, Boqi | University of Michigan, Ann Arbor |
Yu, Wenhao | Tsinghua University |
Xu, Jiahui | The University of Hong Kong |
Wang, Hong | Tsinghua University |
Keywords: AI-Based Methods, Acceptability and Trust, Intelligent Transportation Systems
Abstract: Accurate trajectory prediction is vital for various applications, including autonomous vehicles. However, the complexity and limited transparency of many prediction algorithms often result in black-box models, making it challenging to understand their limitations and anticipate potential failures. This further raises potential risks for systems based on these prediction models. This study introduces the performance monitor for the black-box trajectory prediction model (PMBP) to address this challenge. The PMBP estimates the performance of black-box trajectory prediction models online, enabling informed decision-making. The study explores the applicability of various methods to the PMBP, including anomaly detection, machine learning, deep learning, and ensemble methods, with specific monitors designed for each technique to provide an online output representing prediction performance. Comprehensive experiments validate the PMBP's effectiveness, comparing different monitoring methods. Results show that the PMBP achieves promising monitoring performance, excelling particularly in deep learning-based monitoring. It achieves improvement scores of 0.81 and 0.79 for average prediction error and final prediction error monitoring, respectively, outperforming previous white-box and gray-box methods. Furthermore, the PMBP's applicability is validated on different datasets and prediction models, while ablation studies confirm the effectiveness of the proposed mechanism. Hybrid prediction and autonomous driving planning experiments further show the PMBP's value from an application perspective.
|
|
WeAT18 |
210B |
Multi-Robot Systems 5 |
Regular Session |
|
10:30-10:35, Paper WeAT18.1 | |
Scalable Plug-And-Play Robotic Fabrics Based on Kilobot Modules |
|
Obilikpa, Stanley Chukwuebuka | University of Sheffield |
Talamali, Mohamed S. | University of Sheffield |
Miyauchi, Genki | The University of Sheffield |
Oyekan, John Oluwagbemiga | University of York |
Gross, Roderich | Technical University of Darmstadt |
Keywords: Cellular and Modular Robots, Swarm Robotics, Multi-Robot Systems
Abstract: This paper presents a framework for producing robotic fabrics using square lattice formations of interlinked Kilobot modules. The framework supports: (i) fabrics of arbitrary size and shape; (ii) different types of deformable links, namely springs and rods; (iii) easy plug-and-play reconfigurability. Two decentralized straight motion controllers are tested with robotic fabrics comprising up to 81 physical modules: an open-loop controller and a controller from the literature that responds to deformations within the fabric. For spring-based robotic fabrics, the deformation-correcting controller performs best overall, whereas for rod-based robotic fabrics, it is outperformed by the open-loop controller. A decentralized turning motion controller is formally derived and examined for either type of fabric, revealing the ability of the robotic fabrics to move along a curved trajectory using open-loop control. Finally, robotic fabrics are shown to perform basic object manipulation tasks. Robotic fabrics that deploy themselves based on distributed, embodied intelligence could pave the way for novel applications, from patching broken pipes to medical uses within the human body.
|
|
10:35-10:40, Paper WeAT18.2 | |
Swift Pursuer: A Topology-Accelerated and Robust Approach for Pursuing an Evader in Obstacle Environments with State Measurement Uncertainty |
|
Rao, Kai | East China University of Science and Technology |
Yan, Huaicheng | East China University of Science and Technology |
Huang, Zhihao | East China University of Science and Technology |
Yang, Penghui | East China University of Science and Technology |
Lv, Yunkai | East China University of Science and Technology |
Wang, Meng | East China University of Science and Technology |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents, Task and Motion Planning
Abstract: This letter presents a topology-accelerated and robust pursuit framework for environments with obstacles, considering state measurement uncertainty. Our framework consists of three primary components: the selection of virtual target points using a topological heuristic method, the computation of safe pursuit regions based on the Voronoi cell (VC), and the solution of an adaptive robust path controller based on Control Barrier Functions (CBFs). Topological heuristics broadly capture the topological structure of the environment and guide the selection of target points for each pursuer. Then the chance-constrained obstacle-aware Voronoi cell (CCOVC) for each pursuer is constructed by calculating separating hyperplanes and buffer terms. Finally, we formulate chance CBF and chance Control Lyapunov Function (CLF) constraints, using convex approximation to determine their upper bounds. We then find the adaptive robust path controller by solving a Quadratically Constrained Quadratic Program (QCQP). Benchmark simulations and experimental results demonstrate the efficiency and robustness of the proposed framework.
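For readers unfamiliar with CBF-CLF controllers, the deterministic (non-chance) quadratic program that this framework generalizes can be sketched in a few lines with cvxpy; the chance constraints, CCOVC hyperplanes, and QCQP form of the paper are omitted, and all names and gains are illustrative.

import cvxpy as cp
import numpy as np

def cbf_clf_controller(grad_h, h_val, grad_v, v_val,
                       alpha=1.0, beta=1.0, slack_weight=100.0):
    """Deterministic CBF-CLF quadratic program (single-integrator sketch).

    h_val, grad_h: barrier value/gradient (h >= 0 keeps the pursuer safe)
    v_val, grad_v: Lyapunov value/gradient (drives the pursuer to its target)
    """
    u = cp.Variable(2)
    delta = cp.Variable(nonneg=True)          # CLF slack keeps the QP feasible
    constraints = [
        grad_h @ u + alpha * h_val >= 0,      # safety (CBF)
        grad_v @ u + beta * v_val <= delta,   # convergence (CLF, relaxed)
        cp.norm(u, 2) <= 2.0,                 # input limit
    ]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(u) + slack_weight * delta),
                      constraints)
    prob.solve()
    return u.value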
|
|
10:40-10:45, Paper WeAT18.3 | |
ArenaSim: A High-Performance Simulation Platform for Multi-Robot Self-Play Learning |
|
Ke, Yuxin | Tsinghua University |
Li, Shaohui | Tsinghua University |
Li, Zhi | Tsinghua University |
Li, Haoran | Institute of Automation, Chinese Academy of Sciences |
Liu, Yu | Tsinghua University |
He, You | Tsinghua University |
Keywords: Multi-Robot Systems, Reinforcement Learning, Cooperating Robots
Abstract: In this letter, we introduce ArenaSim, a novel simulation platform designed for realistic and efficient self-play learning in multi-robot cooperative-competitive games. Compared to previous simulation platforms designed for the same task, we achieve fine-grained simulation of the robots with rotatable gimbals and roller-independent mecanum wheels in ArenaSim and validate its fidelity through real-world experiments. To inspire further exploration of this simulation platform, we design a hierarchical structure to address the cooperative-competitive game. The hierarchical structure is composed of a high-level strategy that generates macro actions such as move and shoot, and a low-level controller that translates these macro actions into precise motion control. Furthermore, we evaluate several self-play algorithms in ArenaSim and present a benchmark. The experiments show that the multi-robot cooperative-competitive game is still challenging for self-play learning. We hope that ArenaSim can further inspire research on self-play learning and multi-robot cooperative-competitive games.
|
|
10:45-10:50, Paper WeAT18.4 | |
Hierarchical Deep Reinforcement Learning for Computation Offloading in Autonomous Multi-Robot Systems |
|
Gao, Wen | Northwestern Polytechnical University |
Yu, Zhiwen | Northwestern Polytechnical University |
Wang, Liang | Northwestern Polytechnical University |
Cui, Helei | Northwestern Polytechnical University |
Guo, Bin | Northwestern Polytechnical University |
Xiong, Hui | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Multi-Robot Systems, Reinforcement Learning
Abstract: To ensure system responsiveness, some compute-intensive tasks are usually offloaded to cloud or edge computing devices. In environments where connection to external computing facilities is unavailable, computation offloading among members of an autonomous multi-robot system (AMRS) becomes a solution. The challenge lies in how to maximize the use of other members' idle resources without disrupting their local computation tasks. Therefore, this study proposes HRL-AMRS, a hierarchical deep reinforcement learning framework designed to distribute computational loads and reduce the processing time of computational tasks within an AMRS. In this framework, the high-level policy must consider the impact of data loading scales, determined by the low-level policy under varying computational device states, on the actual processing times. In addition, the low-level policy employs Long Short-Term Memory (LSTM) networks to enhance the understanding of the time-series states of computing devices. Experimental results show that, across various task sizes and numbers of robots, the framework reduces processing times by an average of 4.32% compared to baseline methods.
|
|
10:50-10:55, Paper WeAT18.5 | |
Distributed Algorithms Via Saddle-Point Dynamics for Multi-Robot Task Assignment |
|
Huang, Yi | Beijing Institute of Technology |
Kuai, Jiacheng | Beijing Institute of Technology |
Cui, Shisheng | Beijing Institute of Technology |
Meng, Ziyang | Tsinghua University |
Jian, Sun | Beijing Institute of Technology |
Keywords: Multi-Robot Systems, Task Planning, Distributed Robot Systems
Abstract: This paper develops two distributed algorithms to solve multi-robot task assignment problems (MTAP). We first describe MTAP as an integer linear programming (ILP) problem and then reformulate it as a relaxed convex optimization problem. Based on saddle-point dynamics, we propose two distributed optimization algorithms using the optimistic gradient descent ascent (OGDA) and extra-gradient (EG) methods, which achieve exact convergence to an optimal solution of the relaxed problem. In most cases, such a solution attains the optimum of the original ILP problem. For some special ILP problems, we provide a perturbation-based distributed method to avoid the inconsistency phenomenon, such that an optimal solution to any ILP problem is obtained. Compared with some decentralized algorithms requiring a central robot that communicates with the other robots, our developed algorithms are fully distributed, in which each robot only communicates with its nearest neighbors over an arbitrary connected graph. We evaluate the developed algorithms in terms of computation, communication, and data storage complexities, and compare them with some typical algorithms. It is shown that the developed algorithms have low computational and communication complexities. We also verify the effectiveness of our algorithms via numerical examples.
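As background, a minimal centralized, unconstrained sketch of the OGDA saddle-point update is shown below; the paper's algorithms are distributed over a communication graph, which this illustration does not capture.

import numpy as np

def ogda_saddle(grad_x, grad_y, x0, y0, lr=0.05, iters=500):
    """Optimistic gradient descent ascent on a convex-concave L(x, y).

    Uses the 'optimistic' extrapolation 2*g_t - g_{t-1}, which is what
    distinguishes OGDA from plain gradient descent ascent and yields
    convergence to a saddle point instead of cycling around it.
    """
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    gx_prev, gy_prev = grad_x(x, y), grad_y(x, y)
    for _ in range(iters):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x = x - lr * (2 * gx - gx_prev)   # descent on x
        y = y + lr * (2 * gy - gy_prev)   # ascent on y
        gx_prev, gy_prev = gx, gy
    return x, y

# Example: bilinear game L(x, y) = x * y, saddle point at (0, 0).
x_star, y_star = ogda_saddle(lambda x, y: y, lambda x, y: x, 1.0, 1.0)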
|
|
10:55-11:00, Paper WeAT18.6 | |
LOMORO: Long-Term Monitoring of Dynamic Targets with Minimum Robotic Fleet under Resource Constraints |
|
Lu, Mingke | Peking University, College of Engineering |
Wang, Shuaikang | Peking University |
Guo, Meng | Peking University |
Keywords: Multi-Robot Systems
Abstract: Long-term monitoring of numerous dynamic targets can be tedious for a human operator and infeasible for a single robot, e.g., monitoring wild flocks, detecting intruders, and search and rescue. Fleets of autonomous robots can be effective by acting collaboratively and concurrently. However, online coordination is challenging due to the unknown behaviors of the targets and the limited perception of each robot. Existing work often deploys all available robots without minimizing the fleet size, or neglects constraints on their resources such as battery and memory. This work proposes an online coordination scheme called LOMORO for collaborative target monitoring, path routing, and resource charging. It includes three core components: (I) the modeling of the multi-robot task assignment problem under constraints on resources and monitoring intervals; (II) a resource-aware task coordination algorithm that iterates between the high-level assignment of dynamic targets and the low-level multi-objective routing via Martin's algorithm; (III) an online adaptation algorithm for unpredictable target behaviors and robot failures. The scheme ensures explicitly upper-bounded monitoring intervals for all targets and lower-bounded resource levels for all robots, while minimizing the average number of active robots. The proposed methods are validated extensively via large-scale simulations against several baselines, under different road networks, robot velocities, charging rates, and monitoring intervals.
|
|
11:00-11:05, Paper WeAT18.7 | |
Distributed Neural Fixed-Time Consensus Control of Uncertain Multiple Euler-Lagrange Systems with Event-Triggered Mechanism (I) |
|
Wang, Chen | University of Electronic Science and Technology of China |
Zhan, Haoran | University of Electronic Science and Technology of China |
Guo, Qing | University of Electronic Science and Technology of China |
Li, Tieshan | University of Electronic Science and Technology of China |
Keywords: Multi-Robot Systems
Abstract: Euler-Lagrange systems are often used to describe practical plants with strong coupling nonlinearities, such as robotic manipulators, autonomous surface vehicles, and wearable exoskeletons. Since a single Euler-Lagrange system has limited capabilities, it is imperative to construct multiple Euler-Lagrange systems (MELSs) to collaborate on complex operational missions. However, most existing controllers for MELSs only achieve asymptotic stabilization, which cannot guarantee fast stabilization of the system states. Furthermore, the sampling and transmission of data at high frequencies within the MELSs can lead to network congestion, thus affecting the stability of the system. Hence, a distributed neural fixed-time consensus controller with an event-triggered mechanism is presented to not only improve the leader-following consensus speed under a directed communication graph, but also save the communication resources of the MELSs. Ultimately, the effectiveness of the proposed controller is verified through simulations and experimental results on a multiple-manipulator system.
|
|
11:05-11:10, Paper WeAT18.8 | |
The Magnetized Capacitance (C_M), First Resonant Frequency, and Electromagnetic Analysis of Inductors with Ferrite Cores (I)
|
Zhang, Rongrong | Fudan University |
Zhao, Hui | Fudan University |
Keywords: Multi-Robot Systems
Abstract: The parasitic capacitance of magnetic components, i.e., inductors and transformers, is crucial because it dominates the high-frequency impedance, causes voltage/current spikes, and gives rise to electromagnetic interference (EMI) issues. Existing methods characterize the capacitance from all five parts (turn-to-turn, layer-to-layer, winding-to-magnetic core, winding-to-electrostatic screen, and interwinding). However, the calculated results are always less than the measured capacitance, especially for inductors with few winding turns. This article discovers another mechanism of parasitic capacitance caused by dB/dt. It points out that a time-varying magnetic field also generates an electric field, which leads to parasitic capacitance, and proves that this capacitance dominates the first resonant frequency (denoted by f_R) of inductors with few winding turns. The factors related to this capacitance and f_R are analyzed. The proposed theoretical analysis is validated by numerous simulations and experimental results. All the simulations are included in the multimedia folder.
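The abstract does not state the resonance relation explicitly; assuming the dB/dt-induced magnetized capacitance C_M resonates with the winding inductance L, the first resonant frequency would follow the standard LC form

f_R ≈ 1 / (2π · sqrt(L · C_M)),

so a larger C_M pushes f_R down, consistent with the claim that this capacitance dominates f_R for inductors with few winding turns.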
|
|
WeAT19 |
210C |
Biologically-Inspired Robots 1 |
Regular Session |
Chair: Hughes, Josie | EPFL |
Co-Chair: Ta, Tung D. | The University of Tokyo |
|
10:30-10:35, Paper WeAT19.1 | |
Camera-Tracked Soft Underwater Robot Enabling Robust Orientation Control for Maneuverability |
|
Bianchi, Gabriele | EPFL |
Obayashi, Nana | EPFL |
Petitti, Alessandro | ETH |
Hughes, Josie | EPFL |
Keywords: Modeling, Control, and Learning for Soft Robots, Biologically-Inspired Robots, Soft Robot Applications
Abstract: Maneuverability in soft bio-inspired underwater robots, particularly for following complex trajectories, remains an unsolved challenge. In this work, we present a control approach based on a PD controller integrated with real-time camera feedback, enabling continuous and reliable free-swimming control. The system was able to maintain precise waypoint tracking for 60 minutes or more, with a minimum turning radius of 27 cm, demonstrating the robot's high maneuverability relative to its body size. Our contributions include the development of a robust camera-based tracking system, the tuning of a PD controller to enhance trajectory following, and the exploration of the limits of maneuverability in soft swimming robots. This work paves the way for future integration with onboard sensing systems to improve state estimation in soft swimmers and reduce reliance on external camera systems.
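A camera-driven PD heading controller of the kind described can be sketched compactly; the fragment below is illustrative only (gains, frames, and names assumed), not the authors' code.

import numpy as np

def pd_heading_control(pos, heading, waypoint, prev_err, dt,
                       kp=2.0, kd=0.5):
    """PD steering toward a waypoint from camera-tracked pose.

    pos, waypoint: 2D positions in the camera frame; heading in radians.
    Returns (turn_command, new_err); the swimming gait supplies thrust.
    """
    desired = np.arctan2(waypoint[1] - pos[1], waypoint[0] - pos[0])
    err = np.arctan2(np.sin(desired - heading),
                     np.cos(desired - heading))  # wrap to [-pi, pi]
    d_err = (err - prev_err) / dt
    return kp * err + kd * d_err, err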
|
|
10:35-10:40, Paper WeAT19.2 | |
Neural-Link: Non-Overlapping MPC Fusion and Passive Inertial Sensing on Soft Platforms |
|
Dai, Zijia | Shanghaitech University |
Xiao, Jinxi | ShanghaiTech University |
Zhang, Xinyue | Shanghaitech University |
Kneip, Laurent | ShanghaiTech University |
Keywords: Modeling, Control, and Learning for Soft Robots, Sensor Fusion, AI-Based Methods
Abstract: Soft, elastic platforms may pose an intricate challenge for sensor fusion, as forces acting on the structure render extrinsic transformations variable over time. The present paper tackles this problem by introducing an elastic deformation model and embedding it into a sensor fusion scheme. The core of our method is a neural representation mapping temporal deformation sequences onto mass-normalized restoring forces. By using continuous-time trajectory models as well as Newton's second law, the sensor fusion problem becomes solvable by enforcing consistency between second-order trajectory differentials and network outputs. The approach is validated on a loosely-coupled, real-world fusion scenario: an elastically connected, non-overlapping stereo camera system. As demonstrated, our approach permits relative camera alignment, absolute scale recovery, as well as inertial alignment from individual visual odometry results.
|
|
10:40-10:45, Paper WeAT19.3 | |
Hierarchical Collision-Free Configuration Planning for a Soft Manipulator |
|
Shen, Yi | Huazhong University of Science and Technology |
Tai, Ruochen | Nanyang Technological University |
Hu, Feiyu | Huazhong University of Science and Technology |
Liu, Zhe | Shanghai Jiao Tong University |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Applications
Abstract: Soft manipulators (SMs) have shown great potential for interactive tasks in confined environments. However, obstacle avoidance for SMs may conflict with the manipulator's configuration, the planned trajectory for tracking control, and the target position for grasping. To coordinate configuration planning, tracking control, and target grasping during obstacle avoidance, this study proposes a hierarchical configuration planning framework with three levels: behavior planning, configuration planning, and shape/position control. At the behavior planning level, a Discrete Event System (DES)-based planner is designed to orchestrate mode transitions among obstacle avoidance, tracking control, and target grasping. The configuration planning level adopts the Bézier curve to model the SM backbone curve and constructs a repulsive potential field to quantify obstacle effects on the entire manipulator configuration. Under the constraints of grasping distance and material physical limits, the control points of the Bézier curve corresponding to the optimal configuration that minimizes the repulsive potential energy are computed. Experiments demonstrate the effectiveness of the proposed framework in achieving collision-free configuration planning for object grasping and placement in confined operational scenarios.
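As a minimal illustration of the configuration-planning level, the sketch below samples a Bézier backbone from its control points and evaluates a classic repulsive potential against a point obstacle; the potential form and all parameters are assumptions, not the paper's exact formulation.

import numpy as np
from math import comb

def bezier(control_pts, n_samples=50):
    """Sample a Bezier curve modeling the soft-manipulator backbone."""
    P = np.asarray(control_pts, dtype=float)   # (n+1, dim) control points
    n = len(P) - 1
    t = np.linspace(0.0, 1.0, n_samples)
    B = np.stack([comb(n, i) * t**i * (1 - t)**(n - i) for i in range(n + 1)])
    return B.T @ P                              # (n_samples, dim) backbone points

def repulsive_energy(curve_pts, obstacle, influence=0.15, gain=1.0):
    """Sum a classic repulsive potential over the sampled backbone."""
    d = np.linalg.norm(curve_pts - obstacle, axis=1)
    d = np.clip(d, 1e-6, None)
    mask = d < influence                        # only points near the obstacle
    return gain * np.sum(0.5 * (1.0 / d[mask] - 1.0 / influence) ** 2)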
|
|
10:45-10:50, Paper WeAT19.4 | |
Numerical Optimization-Based Kinematics with Pose Tracking Control for Continuum Robots |
|
Peng, Rui | The University of Hong Kong |
Deng, Ping | The University of Hong Kong |
Tang, Duo | The University of Hong Kong |
Lu, Peng | The University of Hong Kong |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Applications, Soft Robot Materials and Design
Abstract: In this paper, we integrate multiple IMUs into a triple-section continuum manipulator to precisely capture the attitude data of each section's end disk. Leveraging the sensory and mechanical hardware system, we construct a sophisticated coordinate transformation scheme to accurately identify the detailed configuration states of the manipulator. Additionally, we introduce a numerical optimization strategy to develop a unified forward and inverse kinematic modeling framework, ensuring both iterative efficiency and accuracy. Through the IMUs' real-time attitude feedback, we implement a closed-loop controller, enhancing the manipulator's operational robustness and agility. In our experimental evaluations, we assess the convergence performance of both forward and inverse kinematics within a simulated environment and validate the precision of these kinematic models through real-time experiments on an actual continuum manipulator. Moreover, we evaluate the performance of the proposed controller by examining its accuracy during the manipulator's continuous motions and analyzing its response characteristics. In contrast to previous research on continuum robots, we pioneer a fully integrated kinematic control architecture that is successfully implemented on a physical continuum robotic system.
|
|
10:50-10:55, Paper WeAT19.5 | |
Contrastive Autoencoder for Robust State Modelling of Soft Robots in Incomplete and Noisy Environments |
|
Sapai, Shageenderan | Monash University |
Baskaran, Vishnu Monn | Monash University Malaysia |
Nurzaman, Surya G. | Monash University |
Loo, Junn Yong | Monash Malaysia |
Tan, Chee Pin | Monash University |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Applications, AI-Based Methods
Abstract: Soft robotic systems heavily depend on accurate sensor data for perception and control; however, this data is often corrupted by missing observations (due to partial sensor coverage, communication failures, or occlusions) and by noisy measurements stemming from hardware imperfections, environmental disturbances, and the intrinsic compliance of soft materials. Such corruption can obscure critical state information, causing unreliable modeling of soft robots and degrading control accuracy. To address these challenges, we propose a Contrastive Dual-Latent Autoencoder (CDLAE) that jointly handles missing and noisy data in a single end-to-end framework. Our approach leverages an attention-based autoencoder architecture with dual latent pathways, where one focuses on capturing the underlying clean signals while the other isolates noise-related components. A contrastive loss encourages strong separation between these pathways, enhancing the model's ability to filter noise while reconstructing missing values. Additionally, the autoencoder is trained jointly with a downstream predictive network, ensuring that signal imputation is optimized with respect to the ultimate control task. Experimental evaluations on a pneumatic soft robot platform and multiple public time-series datasets demonstrate that CDLAE consistently outperforms existing methods in handling corrupted data, offering robust, high-fidelity reconstructions that significantly improve soft robot perception and control in real-world conditions.
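The abstract does not specify the contrastive loss; as one assumption-labeled possibility, the sketch below uses an InfoNCE-style objective to push the clean-signal and noise pathways apart (requires batch size > 1).

import torch
import torch.nn.functional as F

def dual_latent_contrastive_loss(z_clean, z_noise, temperature=0.1):
    """Push the 'clean-signal' and 'noise' latent pathways apart.

    Latents from the same pathway across the batch are treated as
    positives, cross-pathway pairs as negatives (hypothetical sketch).
    """
    z_clean = F.normalize(z_clean, dim=-1)
    z_noise = F.normalize(z_noise, dim=-1)
    # Similarity of each clean latent to every noise latent (negatives).
    neg = torch.exp(z_clean @ z_noise.t() / temperature).sum(dim=-1)
    # Similarity to the other clean latents in the batch (positives).
    sim = z_clean @ z_clean.t() / temperature
    pos = torch.exp(sim).sum(dim=-1) - torch.exp(torch.diagonal(sim))
    return -torch.log(pos / (pos + neg)).mean()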
|
|
10:55-11:00, Paper WeAT19.6 | |
Adaptive Neural Control with Online Learning and Short-Term Memory for Adaptive Soft Crawling Robots |
|
Asawalertsak, Naris | Vidyasirimedhi Institute of Science and Technology (VISTEC) |
Manoonpong, Poramate | Vidyasirimedhi Institute of Science and Technology (VISTEC) |
Keywords: Modeling, Control, and Learning for Soft Robots, Bioinspired Robot Learning, Biologically-Inspired Robots
Abstract: Soft-bodied crawling animals exhibit efficient and adaptive behaviors resulting from the synergy between morphological computation (e.g., a flexible soft body and anisotropic skin) and neural computation (e.g., neural control with plasticity and short-term memory (STM)). However, applying these principles to soft crawling robots remains challenging. To address this, our study proposes an adaptive neural control system that incorporates online learning and STM to generate adaptive behaviors in soft crawling robots. This control system was implemented in a robot with a flexible soft body, anisotropic abdominal denticles or skin, and embodied laser and flex sensors. The robot demonstrated a multilevel adaptation to various perturbations. Perturbations, such as rough terrain, can be managed through passive (body) adaptation via micro-deformation of the denticles and macro-deformation of the body. Larger perturbations, including being lifted or pressed, navigating confined spaces, and traversing slopes, are handled by active (neural control) adaptation. The robot can learn new behaviors, such as navigating confined spaces, and store sensory information to maintain the learned behavior robustly, even in the temporary absence of sensory feedback. In addition, it can estimate its state through sensory feedback prediction, detect abnormal states through prediction errors, and adapt its behavior to address these errors.
|
|
11:00-11:05, Paper WeAT19.7 | |
Modeling and Neural-Network-Based Tail Oscillation Control of a Fish-Like Bionic Soft Actuation Mechanism |
|
Meng, Qingxin | China University of Geosciences |
Sun, Xuefeng | China University of Geosciences |
Wang, Yawu | China University of Geosciences |
Wu, Jundong | China University of Geosciences |
Su, Chun-Yi | Concordia University |
Keywords: Modeling, Control, and Learning for Soft Robots, Biologically-Inspired Robots, Soft Sensors and Actuators
Abstract: With the progress in ocean exploration, bionic soft robotic fish have garnered significant attention, with their key feature being the actuation mechanism made from soft materials. However, the complex properties of these materials pose challenges in modeling and control. In this letter, we design and fabricate a Fish-like Bionic Soft Actuation Mechanism (FBSAM) and aim to achieve its tail oscillation control. First, we construct an experimental platform to collect data on FBSAM's motion characteristics, revealing complex nonlinear hysteresis influenced by varying liquid environments. Next, we develop a phenomenological model for FBSAM based on the Hammerstein architecture and identify its parameters via a nonlinear least-squares algorithm. Subsequently, we propose an integral sliding mode hybrid control strategy, introducing an inverse hysteresis compensator to address the hysteresis issue and using a neural network to estimate uncertain disturbances caused by liquid environments. Finally, experimental results demonstrate that the designed FBSAM can oscillate in water like a real fish, and the proposed control strategy adapts to various external environments, maintaining excellent performance even in dynamic flow conditions, showcasing its effectiveness and superiority.
|
|
11:05-11:10, Paper WeAT19.8 | |
Optimization and Deployment of Gait Control for Soft Robotic Fish Based on a Simulation Environment |
|
Liu, Sijia | Harbin Engineering University |
Liu, Chunbao | Jilin University |
Wei, Guowu | Salford University |
Ren, Luquan | Jilin University |
Ren, Lei | Jilin University |
Keywords: Modeling, Control, and Learning for Soft Robots, Biologically-Inspired Robots, Hydraulic/Pneumatic Actuators, Marine Robotics
Abstract: This paper explores a hydraulically powered double-joint soft robotic fish called HyperTuna and a set of locomotion optimization methods. HyperTuna has an innovative, highly efficient actuation structure that includes a four-cylinder piston pump and a double-joint soft actuator with self-sensing. We conducted deformation analysis on the actuator and established a finite element model to predict its performance. A closed-loop strategy combining a central pattern generator controller and a proportional–integral–derivative controller was developed to control the swimming posture accurately. Next, a dynamic model for the robotic fish was established considering the soft actuator, and the model parameters were identified via data-driven methods. Then, a particle swarm optimization algorithm was adopted to optimize the control parameters and improve the locomotion performance. Experimental results showed that the maximum speed increased by 3.6% and the cost of transport (COT) decreased by up to 13.9% at 0.4 m/s after optimization. The proposed robotic fish achieved a maximum speed of 1.12 BL/s and a minimum COT of 12.1 J/(kg·m), which are outstanding relative to those of similar soft robotic fish. Lastly, HyperTuna completed turning and diving–floating movements and long-distance continuous swimming in open water, which confirmed its potential for practical application.
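A minimal sketch of the CPG-plus-PID combination described here might look as follows; the oscillator form, gains, and names are assumptions, not HyperTuna's actual controller.

import numpy as np

def cpg_joint_targets(t, freq=1.0, amp=(0.3, 0.5), phase_lag=0.6):
    """Central pattern generator for a double-joint tail: two coupled
    sine outputs with a fixed phase lag produce a traveling-wave gait."""
    base = 2.0 * np.pi * freq * t
    return np.array([amp[0] * np.sin(base),
                     amp[1] * np.sin(base - phase_lag)])

class PID:
    """Track the CPG joint targets with measured joint angles."""
    def __init__(self, kp=3.0, ki=0.2, kd=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, target, measured, dt):
        err = target - measured
        self.integral += err * dt
        deriv = (err - self.prev_err) / dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv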
|
|
WeAT20 |
210D |
Grasping & Manipulation 1 |
Regular Session |
Co-Chair: Xu, Wei | Shanghai Jiao Tong University |
|
10:30-10:35, Paper WeAT20.1 | |
Hierarchical Reinforcement Learning for Articulated Tool Manipulation with Multifingered Hand |
|
Xu, Wei | Shanghai Jiao Tong University |
Zhao, Yanchao | Shanghai Jiao Tong Universtiy |
Guo, Weichao | Shanghai Jiao Tong University |
Sheng, Xinjun | Shanghai Jiao Tong University |
Keywords: Multifingered Hands, In-Hand Manipulation, Reinforcement Learning
Abstract: Manipulating articulated tools, such as tweezers or scissors, has rarely been explored in previous research. Unlike rigid tools, articulated tools change their shape dynamically, creating unique challenges for dexterous robotic hands. In this work, we present a hierarchical, goal-conditioned reinforcement learning (GCRL) framework to improve the manipulation capabilities of anthropomorphic robotic hands using articulated tools. Our framework comprises two policy layers: (1) a low-level policy that enables the dexterous hand to manipulate the tool into various configurations for objects of different sizes, and (2) a high-level policy that defines the tool's goal state and controls the robotic arm for object-picking tasks. We employ an encoder, trained on synthetic point clouds, to estimate the tool's affordance states from input point clouds, specifically, how different tool configurations (e.g., tweezer opening angles) enable grasping of objects of varying sizes, thereby enabling precise tool manipulation. We also utilize a privilege-informed heuristic policy to generate the replay buffer, improving the training efficiency of the high-level policy. We validate our approach through real-world experiments, showing that the robot can effectively manipulate a tweezer-like tool to grasp objects of diverse shapes and sizes with a 70.8% success rate. This study highlights the potential of RL to advance dexterous robotic manipulation of articulated tools.
|
|
10:35-10:40, Paper WeAT20.2 | |
CATCH-FORM-3D: Compliance-Aware Tactile Control and Hybrid Deformation Regulation for 3D Viscoelastic Object Manipulation |
|
Ma, Hongjun | South China University of Technology |
Li, Weichang | South China University of Technology |
Keywords: Multifingered Hands, Model Learning for Control, Manipulation Planning
Abstract: This paper investigates a framework (CATCH-FORM-3D) for the precise contact force control and surface deformation regulation in viscoelastic material manipulation. A partial differential equation (PDE) is proposed to model the spatiotemporal stress-strain dynamics, integrating 3D Kelvin–Voigt (stiffness-damping) and Maxwell (diffusion) effects to capture the material's viscoelastic behavior. Key mechanical parameters (stiffness, damping, diffusion coefficients) are estimated in real time via a PDE-driven observer. This observer fuses visual-tactile sensor data and experimentally validated forces to generate rich regressor signals. Then, an inner-outer loop control structure is built up. In the outer loop, the reference deformation is updated by a novel admittance control law, i.e., a proportional-derivative (PD) feedback law with contact force measurements, ensuring that the system responds adaptively to external interactions. In the inner loop, a reaction-diffusion PDE for the deformation tracking error is formulated and then exponentially stabilized by conforming the contact surface to analytical geometric configurations (i.e., defining Dirichlet boundary conditions). This dual-loop architecture enables effective deformation regulation in dynamic contact environments. Experiments using a PaXini robotic hand demonstrate sub-millimeter deformation accuracy and stable force tracking (±5% deviation). The framework advances compliant robotic interactions in applications like industrial assembly, polymer shaping, surgical treatment, and household service.
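The PDE itself is not reproduced in the abstract; for orientation, the classic one-dimensional Kelvin–Voigt constitutive relation that the stiffness-damping terms generalize is

σ(t) = E·ε(t) + η·dε/dt,

with σ the stress, ε the strain, E the stiffness, and η the damping coefficient; the paper's model additionally incorporates Maxwell-type diffusion effects in 3D.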
|
|
10:40-10:45, Paper WeAT20.3 | |
The KIT Robotic Hands - a Scalable Humanoid Hand Platform with Multi-Modal Sensing and In-Hand Embedded Processing |
|
Starke, Julia | University of Lübeck |
Hundhausen, Felix | Karlsruhe Institute of Technology |
Weiner, Pascal | Karlsruhe Institute of Technology |
Rader, Samuel | Karlsruhe Institute of Technology (KIT) |
Hyseni, Engjell | Karlsruhe Institute of Technology (KIT) |
Asfour, Tamim | Karlsruhe Institute of Technology (KIT) |
Keywords: Multifingered Hands, Grasping, Mechanism Design
Abstract: Humanoid robotic hands need to be versatile and capable of providing environmental information in order to serve as a platform for intelligent grasp control. To facilitate the design process of such hands, we present the KIT Robotic Hands. They have been designed to meet diverse application requirements through their scalability in size, actuation, sensorization and computing resources. The hands integrate a multi-modal sensor system, in-hand embedded processing capabilities, an adaptive underactuated mechanism and a continuously controllable thumb rotation to enhance dexterity. The flexibility of the design is demonstrated through two application-specific hand implementations: one is the ARMAR-7 hand, which has human hand dimensions for grasping daily objects in household tasks, the other is the ARMAR-DE hand, a larger hand designed for grasping bigger objects in decontamination tasks. We describe the design and mechatronics of the hands as well as an evaluation of the grasp success and image segmentation based on an in-hand integrated camera and onboard processing of visual data.
|
|
10:45-10:50, Paper WeAT20.4 | |
Design of an Affordable, Fully-Actuated Biomimetic Hand for Dexterous Teleoperation Systems |
|
Wan, Zhaoliang | Sun Yat-Sen University |
Zhou, Zida | Sun Yat-Sen University |
Bi, Zetong | Sun Yat-Sen University |
Yang, Zehui | Sun Yat-Sen University |
Ding, Hao | Sun Yat-Sen University |
Cheng, Hui | Sun Yat-Sen University |
Keywords: Multifingered Hands, Telerobotics and Teleoperation, Dexterous Manipulation
Abstract: This paper addresses the scarcity of affordable, fully-actuated five-fingered hands for dexterous teleoperation, which is crucial for collecting large-scale real-robot data within the "Learning from Demonstrations" paradigm. We introduce the prototype version of the RAPID Hand, the first low-cost, 20-degree-of-actuation (DoA) dexterous hand that integrates a novel anthropomorphic actuation and transmission scheme with an optimized motor layout and structural design to enhance dexterity. Specifically, the RAPID Hand features a universal phalangeal transmission scheme for the non-thumb fingers and an omnidirectional thumb actuation mechanism. Prioritizing affordability, the hand employs 3D-printed parts combined with custom gears for easier replacement and repair. We assess the RAPID Hand's performance through quantitative metrics and qualitative testing in a dexterous teleoperation system, which is evaluated on three challenging tasks: multi-finger retrieval, ladle handling, and human-like piano playing. The results indicate that the RAPID Hand's fully actuated 20-DoF design holds significant promise for dexterous teleoperation. The project will be open-sourced.
|
|
10:50-10:55, Paper WeAT20.5 | |
ContactDexNet: Multi-Fingered Robotic Hand Grasping in Cluttered Environments through Hand-Object Contact Semantic Mapping |
|
Zhang, Lei | University of Hamburg |
Bai, Kaixin | University of Hamburg |
Huang, Guowen | Technical University of Munich |
Bing, Zhenshan | Technical University of Munich |
Chen, Zhaopeng | University of Hamburg |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Zhang, Jianwei | University of Hamburg |
Keywords: Multifingered Hands, Perception for Grasping and Manipulation, Grasping
Abstract: Deep learning models have significantly advanced dexterous manipulation techniques for multi-fingered hand grasping. However, contact information-guided grasping in cluttered environments remains largely underexplored. To address this gap, we have developed a method for generating multi-fingered hand grasp samples in cluttered settings through contact semantic maps. We introduce a contact semantic conditional variational autoencoder network (CoSe-CVAE) for creating comprehensive contact semantic maps from object point clouds. We utilize a grasp detection method to estimate hand grasp poses from the contact semantic map. Finally, a unified grasp evaluation model, PointNetGPD++, is designed to assess grasp quality and collision probability, substantially improving the reliability of identifying optimal grasps in cluttered scenarios. Our grasp generation method has demonstrated remarkable success, outperforming state-of-the-art (SOTA) methods by at least 4.7%, with an 81.0% average grasping success rate in real-world single-object grasping using a known hand, and by at least 9.0% when using an unknown hand. Moreover, in cluttered scenes, our method attains a 76.7% success rate, outperforming the SOTA method by 6.3%. We also propose a multi-modal multi-fingered grasping dataset generation method. Our multi-fingered hand grasping dataset outperforms previous datasets in scene and modality diversity. More details and supplementary materials can be found at https://sites.google.com/view/contact-dexnet.
|
|
10:55-11:00, Paper WeAT20.6 | |
ORCA: An Open-Source, Reliable, Cost-Effective, Anthropomorphic Robotic Hand for Uninterrupted Dexterous Task Learning |
|
Christoph, Clemens Claudio | ETH Zürich |
Eberlein, Maximilian | ETH Zurich |
Katsimalis, Filippos | ETH Zurich |
Roberti, Arturo Jonathan | ETH Zürich |
Sympetheros, Aristotelis | ETH Zurich |
Vogt, Michel Ryan | ETH Zürich |
Liconti, Davide | ETH Zurich |
Yang, Chenyu | ETH Zurich |
Cangan, Barnabas Gavin | ETH Zurich |
Hinchet, Ronan | ETH Zurich |
Katzschmann, Robert Kevin | ETH Zurich |
Keywords: Multifingered Hands, Dexterous Manipulation, Tendon/Wire Mechanism
Abstract: General-purpose robots should possess human-like dexterity and agility to perform tasks with the same versatility as us. A human-like form factor further enables the use of vast datasets of human-hand interactions. However, the primary bottleneck in dexterous manipulation lies not only in software but arguably even more in hardware. Robotic hands that approach human capabilities are often prohibitively expensive, bulky, or require enterprise-level maintenance, limiting their accessibility for broader research and practical applications. What if the research community could get started with reliable dexterous hands within a day? We present the open-source ORCA hand, a reliable and anthropomorphic 17-DoF tendon-driven robotic hand with integrated tactile sensors, fully assembled in less than eight hours and built for a material cost below 2,000 CHF. We showcase ORCA's key design features such as popping joints, auto-calibration, and tensioning systems that significantly reduce complexity while increasing reliability, accuracy, and robustness. We benchmark the ORCA hand across a variety of tasks, ranging from teleoperation and imitation learning to zero-shot sim-to-real reinforcement learning. Furthermore, we demonstrate its durability, withstanding more than 10,000 continuous operation cycles---equivalent to approximately 20 hours---without hardware failure, the only constraint being the duration of the experiment itself. Video is here: youtu.be/kUbPSYMmOds. All design files, source code, and documentation are available at orca.ethz.ch.
|
|
11:00-11:05, Paper WeAT20.7 | |
B4P: Simultaneous Grasp and Motion Planning for Object Placement Via Parallelized Bidirectional Forests and Path Repair |
|
Leebron, Benjamin H. | Rice University |
Ren, Kejia | Rice University |
Chen, Yiting | Rice University |
Hang, Kaiyu | Rice University |
Keywords: Manipulation Planning, Motion and Path Planning, Grasping
Abstract: Robot pick-and-place systems have traditionally decoupled grasp, placement, and motion planning to build sequential optimization pipelines, assuming that the individual components will work together. However, this separation introduces sub-optimality, as grasp choices may limit, or even prohibit, feasible motions for a robot to reach the target placement pose, particularly in cluttered environments with narrow passages. To this end, we propose a forest-based planning framework to simultaneously find grasp configurations and feasible robot motions that explicitly satisfy downstream placement configurations paired with the selected grasps. Our proposed framework leverages a bidirectional sampling-based approach to build a start forest, rooted at the feasible grasp regions, and a goal forest, rooted at the feasible placement regions, to facilitate the search through randomly explored motions that connect valid pairs of grasp and placement trees. We demonstrate that the framework's inherent parallelism enables superlinear speedup, making it scalable for redundant robot arms, e.g., 7 DoF, working efficiently in highly cluttered environments. Extensive experiments in simulation demonstrate the robustness and efficiency of the proposed framework in comparison with multiple baselines under diverse scenarios.
|
|
11:05-11:10, Paper WeAT20.8 | |
SeGMan: Sequential and Guided Manipulation Planner for Robust Planning in 2D Constrained Environments |
|
Tuncer, Cankut Bora | Bilkent University |
Haliloglu, Dilruba Sultan | Bilkent University |
Oguz, Ozgur S. | Bilkent University |
Keywords: Manipulation Planning, Mobile Manipulation, Autonomous Agents
Abstract: In this paper, we present SeGMan, a hybrid motion planning framework that integrates sampling-based and optimization-based techniques with a guided forward search to address complex, constrained sequential manipulation challenges, such as pick-and-place puzzles. SeGMan incorporates an adaptive subgoal selection method that adjusts the granularity of subgoals, enhancing overall efficiency. Furthermore, proposed generalizable heuristics guide the forward search in a more targeted manner. Extensive evaluations in maze-like tasks populated with numerous objects and obstacles demonstrate that SeGMan not only generates consistent and computationally efficient manipulation plans but also outperforms state-of-the-art approaches.
|
|
WeAT21 |
101 |
Force and Tactile Sensing 1 |
Regular Session |
|
10:30-10:35, Paper WeAT21.1 | |
Dual-Modal Soft Magnetic Skin with Anti-Magnetic Interference Structure for Tactile Perception |
|
Xiong, Pengwen | Nanchang University |
Peng, Huan | Nanchang University |
Zhang, Yu | Nanchang University |
Song, Aiguo | Southeast University |
Liu, Peter X. | Carleton University |
Keywords: Force and Tactile Sensing, Recognition, Cyborgs
Abstract: Traditional magnetic tactile sensors are often affected by external magnetic field interference. To overcome this problem, this work proposes and designs a dual-modal soft magnetic skin with the ability to mitigate magnetic field interference. Inspired by the sensory mechanism of human skin, the designed magnetic skin can capture dual-modal information simultaneously, namely magnetic and force tactile information across spatial and temporal domains. A convolutional neural network-convolutional neural network-multilayer perceptron (CNN-CNN-MLP) architecture is constructed to fuse the dual-modal tactile information. To reduce the influence of external magnetic field interference, a novel dynamic weighted coefficient layer (DWCL) is proposed, which dynamically assigns weights to each modality based on real-time input features. Specifically, by analyzing the modalities during pre-contact sensing and quantifying the magnetic field strength of the target object, the DWCL automatically adjusts the fusion ratio to prioritize the modality with higher reliability under different interference conditions. Extensive experimental results show that the DWCL achieves significantly improved interference resistance compared with traditional fusion strategies.
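As a rough, hypothetical illustration of the DWCL idea (the abstract does not give the exact design), the sketch below infers per-modality reliability weights from the input features and fuses the two tactile streams:

import torch
import torch.nn as nn

class DynamicWeightedCoefficientLayer(nn.Module):
    # Hypothetical sketch: infer per-modality reliability weights from
    # real-time input features (e.g., pre-contact magnetic field strength)
    # and fuse the magnetic and force tactile feature streams.
    def __init__(self, feat_dim):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * feat_dim, 2), nn.Softmax(dim=-1))

    def forward(self, magnetic_feat, force_feat):
        w = self.gate(torch.cat([magnetic_feat, force_feat], dim=-1))
        return w[..., :1] * magnetic_feat + w[..., 1:] * force_feat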
|
|
10:35-10:40, Paper WeAT21.2 | |
A Natural Human-Robot Interaction System for Teleoperation Based on Noncontact Haptic Feedback |
|
Wei, Letian | Nanchang University |
Xiong, Pengwen | Nanchang University |
Wei, Qi | Nanchang University |
Song, Aiguo | Southeast University |
Zhou, MengChu | New Jersey Institute of Technology |
Keywords: Force and Tactile Sensing, Touch in HRI, Haptics and Haptic Interfaces
Abstract: To provide a natural and immersive interaction experience for teleoperation in human-robot collaboration, this work innovatively constructs a natural human-robot interaction system for teleoperation based on ultrasonic haptic feedback. Specifically, the system accurately captures the operator's hand movements and replicates them on the remote robot with low latency and high fidelity, and it leverages an ultrasonic phased array to achieve noncontact haptic feedback. We propose a dynamic ultrasonic-array acoustic field customization method based on interaction feature information images. This method dynamically adjusts the acoustic field according to the characteristics of the operator's hand, focuses on multiple target points in real time, and projects them onto the operator's fingertips, thereby providing force-controllable noncontact haptic feedback to the operator. The operator is integrated into the feedback loop of our system and controls the system through multimodal feedback, forming a high-quality closed-loop human-robot interaction control system. Finally, the performance of the system is verified in two classic robotic tasks: block pick-and-place and nut tightening. The experimental results show that the system exhibits excellent accuracy and dexterity, completes tasks efficiently and with high precision, and provides an excellent interaction experience for the operator.
|
|
10:40-10:45, Paper WeAT21.3 | |
Design of Scalable Orthogonal Digital Encoding Architecture for Large-Area Flexible Tactile Sensing in Robotics |
|
Liu, Weijie | Zhejiang University |
Qiu, Ziyi | Zhejiang University |
Wang, Shihang | Zhejiang University |
Mei, Deqing | Zhejiang University |
Wang, Yancheng | Zhejiang University |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators, Sensor Networks
Abstract: Human-like embodied tactile perception is crucial for next-generation intelligent robotics. Achieving large-area, full-body soft coverage with high sensitivity and rapid response, akin to human skin, remains a formidable challenge due to critical bottlenecks in encoding efficiency and wiring complexity in existing flexible tactile sensors, which significantly hinder the scalability and real-time performance required for human skin-level tactile perception. Herein, we present a paradigm-shifting architecture employing code division multiple access (CDMA)-inspired orthogonal digital encoding to overcome these challenges. Our decentralized encoding strategy transforms conventional serial signal transmission by enabling parallel superposition of energy-orthogonal base codes from distributed sensing nodes, drastically reducing wiring requirements and increasing data throughput. We implemented and validated this strategy with an off-the-shelf 16-node sensing array to reconstruct the pressure distribution, achieving a temporal resolution of 12.8 ms using only a single transmission wire. Crucially, the architecture maintains sub-20 ms latency across orders-of-magnitude variations in node count (up to thousands of nodes). By fundamentally redefining signal encoding paradigms in soft electronics, this work opens new frontiers in developing scalable embodied intelligent systems with human-like sensory capabilities.
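The CDMA-inspired encoding can be illustrated with Hadamard (Walsh) codes: each node modulates its reading onto an orthogonal code, the coded signals superpose on a single wire, and the receiver decodes by correlation. The self-contained sketch below shows the principle only (node count and names illustrative), not the authors' hardware implementation.

import numpy as np
from scipy.linalg import hadamard

# 16 sensing nodes, each assigned one row of a 16x16 Hadamard matrix.
# Rows are mutually orthogonal, so the superposed bus signal can be
# decoded back into per-node pressure values by correlation.
codes = hadamard(16).astype(float)        # energy-orthogonal base codes
pressures = np.random.rand(16)            # per-node pressure readings

bus_signal = pressures @ codes            # parallel superposition on one wire
decoded = (bus_signal @ codes.T) / 16.0   # correlate against each code

assert np.allclose(decoded, pressures)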
|
|
10:45-10:50, Paper WeAT21.4 | |
ConViTac: Aligning Visual-Tactile Fusion with Contrastive Representations |
|
Wu, Zhiyuan | King's College London |
Zhao, Yongqiang | King's College London |
Luo, Shan | King's College London |
Keywords: Force and Tactile Sensing
Abstract: Vision and touch are two fundamental sensory modalities for robots, offering complementary information that enhances perception and manipulation tasks. Previous research has attempted to jointly learn visual-tactile representations to extract more meaningful information. However, these approaches often rely on direct combination, such as feature addition and concatenation, for modality fusion, which tends to result in poor feature integration. In this paper, we propose ConViTac, a visual-tactile representation learning network designed to enhance the alignment of features during fusion using contrastive representations. Our key contribution is a Contrastive Embedding Conditioning (CEC) mechanism that leverages a contrastive encoder pretrained through self-supervised contrastive learning to project visual and tactile inputs into unified latent embeddings. These embeddings are used to couple visual-tactile feature fusion through cross-modal attention, aiming at aligning the unified representations and enhancing performance on downstream tasks. We conduct extensive experiments to demonstrate the superiority of ConViTac in real-world settings over current state-of-the-art methods and the effectiveness of our proposed CEC mechanism, which improves accuracy by up to 12.0% in material classification and grasping prediction tasks.
|
|
10:50-10:55, Paper WeAT21.5 | |
Learning Force Distribution Estimation for the GelSight Mini Optical Tactile Sensor Based on Finite Element Analysis |
|
Helmut, Erik | Technische Universität Darmstadt |
Dziarski, Luca Vitus | TU Darmstadt |
Funk, Niklas Wilhelm | TU Darmstadt |
Belousov, Boris | German Research Center for Artificial Intelligence - DFKI |
Peters, Jan | Technische Universität Darmstadt |
Keywords: Force and Tactile Sensing, Deep Learning Methods, Representation Learning
Abstract: Contact-rich manipulation remains a major challenge in robotics. Optical tactile sensors like GelSight Mini offer a low-cost solution for contact sensing by capturing soft-body deformations of the silicone gel. However, accurately inferring shear and normal force distributions from these gel deformations has yet to be fully addressed. In this work, we propose a machine learning approach using a U-net architecture to predict force distributions directly from the sensor's raw images. Our model, trained on force distributions inferred from Finite Element Analysis (FEA), demonstrates promising accuracy in predicting normal and shear force distributions for the commercially available GelSight Mini sensor. It also shows potential for generalization across indenters and sensors of the same type, and for enabling real-time application. The codebase, dataset and models are open-sourced and available at https://feats-ai.github.io.
|
|
10:55-11:00, Paper WeAT21.6 | |
SuperMag: Vision-Based Tactile Data Guided High-Resolution Tactile Shape Reconstruction for Magnetic Tactile Sensors |
|
Hou, Peiyao | Beihang University |
Sun, Danning | Beihang University |
Wang, Meng | Beijing Institute for General Artificial Intelligence |
Huang, Yuzhe | Beijing University of Aeronautics and Astronautics |
Zhang, Zeyu | Beijing Institute for General Artificial Intelligence |
Liu, Hangxin | Beijing Institute for General Artificial Intelligence (BIGAI) |
Li, Wanlin | Beijing Institute for General Artificial Intelligence (BIGAI) |
Jiao, Ziyuan | Beijing Institute for General Artificial Intelligence |
Keywords: Force and Tactile Sensing
Abstract: Magnetic-based tactile sensors (MBTS) combine the advantages of compact design and high-frequency operation but suffer from limited spatial resolution due to their sparse taxel arrays. This paper proposes SuperMag, a tactile shape reconstruction method that addresses this limitation by leveraging high-resolution vision-based tactile sensor (VBTS) data to supervise MBTS super-resolution. Co-designed, open-source VBTS and MBTS with identical contact modules enable synchronized data collection of high-resolution shapes and magnetic signals via a symmetric calibration setup. We frame tactile shape reconstruction as a conditional generative problem, employing a conditional variational auto-encoder to infer high-resolution shapes from low-resolution MBTS inputs. The MBTS achieves a sampling frequency of 125 Hz, whereas the shape reconstruction sustains an inference time within 2.5 ms. This cross-modality synergy advances tactile perception of the MBTS, potentially unlocking its new capabilities in high-precision robotic tasks.
|
|
11:00-11:05, Paper WeAT21.7 | |
A Neuromorphic Tactile System for Reliable Braille Reading in Noisy Environments |
|
Xu, Xingchen | University of Bristol |
Lepora, Nathan | University of Bristol |
Ward-Cherrier, Benjamin | University of Bristol |
Keywords: Force and Tactile Sensing, Neurorobotics
Abstract: Neuromorphic sensors are a promising technology in artificial touch due to their low latency and low computational and power requirements, particularly when paired with spiking neural networks (SNNs). Here, we explore the ability of these systems to adapt to and generalize across varying sources of uncertainty in tactile tasks. We choose Braille reading as an application task and collect event-based data for 27 braille characters with a neuromorphic tactile sensor (NeuroTac) under varying conditions of tapping speed, center position and indentation depth using a 6-DOF robot arm. We initially analyze the effect of spatial location and speed on classification performance with spiking convolutional neural networks (SCNNs). We then show that SCNNs are able to generalize across each dimension. The final general SCNN model reaches 95.33% accuracy with uncertainty in all 4 dimensions. This research demonstrates the robustness of SCNNs to noise degradation in a tactile task, and outlines the potential of a single SCNN to generalize across several dimensions of uncertainty.
|
|
WeAT22 |
102A |
Mechanism and Control |
Regular Session |
Co-Chair: Xin, Xin | Southeast University |
|
10:30-10:35, Paper WeAT22.1 | |
NMM-HRI: Natural Multi-Modal Human-Robot Interaction with Voice and Deictic Posture Via Large Language Model |
|
Lai, Yuzhi | University of Tübingen |
Yuan, Shenghai | Nanyang Technological University |
Nassar, Youssef | Reutlingen University |
Mingyu, Fan | Donghua University |
Gopal, Atmaraaj | Neura Robotics GmbH |
Arihiro, Yorita | Kwansei Gakuin University |
Kubota, Naoyuki | Tokyo Metropolitan University |
Matthias, Ratsch | University Reutlingen |
Keywords: Virtual Reality and Interfaces, Social HRI, Human-Robot Collaboration
Abstract: Translating human intent into robot commands is crucial for the future of service robots in an aging society. Existing Human-Robot Interaction (HRI) systems often rely on hand gestures or verbal commands, which can be restrictive and less practical for the elderly, making it challenging for them to remember complex word syntax or sign language. This paper introduces a multi-modal interaction framework that combines voice and deictic posture information to create a more natural HRI system. The visual cues are first processed by the object detection model to gain a global understanding of the environment, and then bounding boxes are estimated based on depth information. Using a large language model (LLM) and voice-to-text commands coupled with temporally aligned selected bounding boxes, sets of robot action sequences are generated. The action sequence is further constrained to key control syntax to prevent potential LLM hallucination issues. The system is evaluated on real-world tasks with varying levels of complexity using a Universal Robots UR3e 6-DoF robot arm. Compared to gesture control, utilizing parallel multi-modal sequences for robot control demonstrated a time-saving of up
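Constraining LLM output to a key control syntax, as described, is typically a matter of whitelisting action primitives and rejecting everything else; the sketch below is a hypothetical illustration (action names and grammar assumed), not the paper's implementation.

import re

# Hypothetical key control syntax: whitelisted action primitives only.
ALLOWED = {"pick", "place", "move_to", "open_gripper", "close_gripper"}

def constrain_to_syntax(llm_output):
    """Keep only action lines matching the control grammar, e.g.
    'pick(box_2)'; hallucinated verbs and free-form prose are dropped."""
    plan = []
    for line in llm_output.splitlines():
        m = re.fullmatch(r"(\w+)\(([\w ,]*)\)", line.strip())
        if m and m.group(1) in ALLOWED:
            args = [a.strip() for a in m.group(2).split(",") if a.strip()]
            plan.append((m.group(1), args))
    return plan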
|
|
10:35-10:40, Paper WeAT22.2 | |
Data-Driven Kinematic Modeling and Control of a Cable-Driven Parallel Mechanism Allowing Cables to Wrap on Rigid Bodies (I) |
|
Xiong, Hao | Harbin Institute of Technology, Shenzhen |
Xu, Yuchen | Harbin Institute of Technology(Shenzhen) |
Zeng, Weifeng | Harbin Institute of Technology, Shenzhen |
Zou, Yongwei | Harbin Institute of Technology, Shenzhen |
Lou, Yunjiang | Harbin Institute of Technology, Shenzhen |
Keywords: Tendon/Wire Mechanism, Model Learning for Control
Abstract: Cable-Driven Parallel Mechanisms (CDPMs) have been subject to collision-free constraints since their inception. These constraints confine the workspace of CDPMs. To expand the workspace, scholars have in recent years suggested allowing cables to wrap on rigid bodies (e.g., the end-effector), opening new perspectives for CDPMs. However, the modeling and control of a CDPM with cables wrapped on general rigid bodies remain challenging. To this end, this study investigates the necessary conditions for the path of a cable of a CDPM wrapped on a smooth and frictionless rigid body. The solvability of the necessary conditions and of the kinematics of the CDPM is then explored. It is shown that the kinematics of CDPMs with cables wrapped on rigid bodies, except for certain simple rigid bodies such as cylinders and spheres, is usually analytically unsolvable. To address the analytically unsolvable kinematics, the study develops a data-driven kinematic modeling and control strategy. The study applies the strategy to control the orientation of spatial rotational CDPM prototypes with wrapped cables and compares the strategy to a classical Jacobian-based kinematic control strategy. Experimental results indicate that for a CDPM allowing cables to wrap on a cylinder, the data-driven kinematic modeling and control strategy outperforms the Jacobian-based kinematic control strategy. For a CDPM allowing cables to wrap on a deformed cylinder, the data-driven kinematic modeling and control strategy can effectively control the CDPM.
|
|
10:40-10:45, Paper WeAT22.3 | |
On-Line Shape Estimation for Hysteresis Compensation in Tendon-Sheath Mechanisms Using Endoscopic Camera |
|
Hong, Junho | Korea University |
Hong, Daehie | Korea University |
Kim, Chanwoo | Korea University |
Won, SeongHyeon | Korea University |
Keywords: Tendon/Wire Mechanism, Medical Robots and Systems, Flexible Robotics
Abstract: The tendon-sheath mechanism (TSM) has significantly advanced both robotic systems and minimally invasive surgery (MIS) by enabling flexible and precise movement through narrow and tortuous paths. However, the inherent flexibility of the TSM introduces nonlinear behaviors that depend on its geometrical shape and the applied forces, making accurate control challenging. Furthermore, this shape dependency becomes critical in endoscopic robots, where the geometrical shape varies and is not directly visible, limiting the applicability of existing distal sensorless compensation methods. To address the geometry identification problem of the TSM, this paper proposes an approach that utilizes real-time visual input from an endoscopic camera for on-line calibration of the TSM's physical model. By introducing the concept of the 'Equivalent Circle,' complex shapes of TSMs are simplified, enabling the estimation of their equivalent geometry without direct observation or measurement. Simulation results validate the equivalent circle model, demonstrating minimal deadband percentage errors despite larger discrepancies in equivalent radii across varied configurations. On-line calibration experiments achieved a percent error of 1.38% (±2.92%) for accumulated curve angles and 2.32% (±3.08%) for equivalent radii, demonstrating the method's reliability in shape estimation across varying conditions. In prediction and feedforward experiments, leveraging the equivalent circle to compensate for deadband in arbitrarily shaped TSMs resulted in a maximum trajectory error of 0.25 mm and an RMSE of 0.09 mm. This approach advances distal sensorless control, improving the operational accuracy and feasibility of endoscopic surgical robots under varying geometrical and force conditions.
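A minimal sketch of shape-dependent deadband feedforward for a TSM, assuming a capstan-like friction model in which lost motion grows with the accumulated curve angle of the equivalent circle; the constants and the model form are illustrative assumptions, not from the paper.

import math

MU = 0.2          # cable/sheath friction coefficient (assumed)
K_DEADBAND = 1.5  # mm of lost motion per unit tension attenuation (assumed)

def deadband(theta_rad):
    """Lost proximal motion before the distal end moves."""
    attenuation = 1.0 - math.exp(-MU * theta_rad)   # capstan tension loss
    return K_DEADBAND * attenuation

def feedforward(command_mm, direction, theta_rad):
    """Pre-add the deadband whenever the drive direction reverses."""
    return command_mm + direction * deadband(theta_rad)

print(feedforward(10.0, +1, math.pi))  # command for a half-turn accumulated wrap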
|
|
10:45-10:50, Paper WeAT22.4 | |
Performance Optimization of a Fish-Like Propeller Based on Continuum Driving |
|
Wang, Qixin | South China University of Technology |
Wang, Kangzheng | South China University of Technology |
Gao, Fei | Chinese Academy of Sciences |
Zhong, Yong | South China University of Technology |
Keywords: Tendon/Wire Mechanism, Biomimetics, Flexible Robotics
Abstract: The excellent swimming ability of fish provides a new solution for the design of underwater propellers. However, there is a lack of optimization schemes for the structure and performance of continuum-driving fish-like propellers. This paper introduces a continuum-driving fish-like propeller capable of online adjustment of the ratio of its active part to its passive part (RAP). A hydrodynamic model of the designed platform is then established based on the being gradually softened approach (BGSA) to analyze the active and passive bending of the caudal fin. Finally, the coupled effect of RAP and the passive-part stiffness (PS) on the propulsion performance of the fish-like propeller is explored through experiments. Results indicate that matching RAP and PS with the kinematic characteristics is essential for optimal propulsion performance: as the flapping frequency increases, RAP and PS need to be increased. This work provides valuable guidance and insights for the design and optimization of continuum-driving fish-like propellers.
|
|
10:50-10:55, Paper WeAT22.5 | |
Equilibrium Postural Control of a Spatial Underactuated Robot Based on Angular Momentum |
|
Hu, Jiangyong | Southeast University |
Xin, Xin | Southeast University |
Keywords: Underactuated Robots, Body Balancing, Motion Control
Abstract: The spatial underactuated robot, which consists of two links moving in 3D space, serves as a simplified model for human postural control. The robot is supported at a single point, with 2 degrees of freedom (DoF) underactuated at the ankle joint and 2-DoF actuated at the hip joint. This paper proposes an angular momentum-based feedback controller for the robot. The controller can locally stabilize the robot to a given equilibrium point and enable transitions between different equilibrium points. Firstly, this paper extends the angular momentum model from planar to 3D space, revealing that the robot's balance behavior is determined by eight key parameters closely related to the choice of state variables. This paper analyzes the impact of these parameters on the controllability of the angular momentum model. Secondly, this paper provides analytical expressions for the control gain parameters using the pole placement method. Finally, this paper identifies the equilibrium points where the controller may encounter singularities and demonstrates through simulations that the controller can effectively stabilize the robot to a given equilibrium point and enable transitions between different equilibrium points while maintaining balance.
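A worked sketch of the gain-design step named above: pole placement for a linearized balance model x_dot = A x + B u. The matrices form a generic unstable second-order example, not the paper's eight-parameter angular momentum model.

import numpy as np
from scipy.signal import place_poles

A = np.array([[0.0, 1.0],
              [9.8, 0.0]])      # inverted-pendulum-like linearization (assumed)
B = np.array([[0.0],
              [1.0]])

K = place_poles(A, B, [-3.0, -4.0]).gain_matrix
print("state-feedback gain K =", K)
print("closed-loop poles:", np.linalg.eigvals(A - B @ K))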
|
|
10:55-11:00, Paper WeAT22.6 | |
Bi-Level Optimization for Closed-Loop Model Reference Adaptive Vibration Control in Wheeled-Legged Multi-Mode Vehicles (I) |
|
Qin, Yechen | Beijing Institute of Technology |
Zhu, Zhewei | Beijing Institute of Technology |
Zhou, Yunping | Beijing Institute of Technology |
Bai, Guangyu | Beijing Institute of Technology |
Wang, Kui | Beijing Institute of Technology |
Xu, Tao | Beijing Institute of Technology |
Keywords: Wheeled Robots, Engineering for Robotic Systems, Optimization and Optimal Control
Abstract: The wheeled-legged multi-mode vehicle (WLMV) combines the merits of both wheeled and quadruped platforms and is gradually becoming a major development direction for high-mobility transportation in complex scenarios. To enhance the mobility of the WLMV on challenging terrains, this paper proposes a bi-level-optimization closed-loop model reference adaptive vibration control method (BO-CMRAC). The parameters and control weights of the reference model are determined through a bi-level optimization approach, which includes offline optimization and online weight selection, to coordinate mobility and vibration suppression capabilities across varying scenarios. The dynamic characteristics of the actuator are modeled and incorporated into the closed-loop reference model to further improve the robustness and performance of the controller. A WLMV experimental platform is established to validate the proposed vibration control strategy on different roads. The results indicate that the proposed BO-CMRAC can effectively suppress the vibration of the WLMV compared to current algorithms.
|
|
11:00-11:05, Paper WeAT22.7 | |
Constrained Visual Predictive Control of a Robotic Flexible Endoscope with Visibility and Joint Limits Constraints |
|
Deng, Zhen | Fuzhou University |
Liu, Weiwei | Fuzhou University |
Li, Guotao | Institute of Automation, Chinese Academy of Sciences
Zhang, Jianwei | Hamburg University |
Keywords: Tendon/Wire Mechanism, Visual Servoing, Surgical Robotics: Steerable Catheters/Needles
Abstract: In this letter, a constrained visual predictive control strategy (C-VPC) is developed for a robotic flexible endoscope to precisely track target features in narrow environments while adhering to visibility and joint-limit constraints. The visibility constraint, crucial for keeping the target feature within the camera's field of view, is explicitly designed using zeroing control barrier functions to uphold the forward invariance of a visible set. To automate the robotic endoscope, kinematic modeling for image-based visual servoing (IBVS) is conducted, resulting in a state-space model that facilitates the prediction of the future evolution of the endoscopic state. The C-VPC method calculates the optimal control input by optimizing the model-based predictions of the future state under visibility and joint-limit constraints. Both simulation and experimental results demonstrate the effectiveness of the proposed method in achieving autonomous target tracking while addressing the visibility constraint, with a reduction of 12.3% in Mean Absolute Error (MAE) and 56.0% in variance (VA) compared to classic IBVS.
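A minimal sketch of a zeroing control barrier function (ZCBF) keeping a feature inside the visible set for one control step, assuming single-integrator feature dynamics; all numbers are illustrative, and this is not the paper's C-VPC formulation.

import cvxpy as cp
import numpy as np

u_des = np.array([0.3, -0.1])      # nominal visual-servoing command
s = np.array([0.2, 0.1])           # feature position in normalized image plane
s_max = 0.5                        # half-width of the visible set
alpha = 2.0                        # class-K gain of the ZCBF

u = cp.Variable(2)
h = s_max**2 - float(s @ s)        # h(s) >= 0  <=>  feature visible
# With s_dot = u: h_dot = -2 s^T u; enforce h_dot >= -alpha * h.
constraints = [-2 * s @ u >= -alpha * h,
               cp.norm(u, "inf") <= 1.0]          # stand-in joint/velocity limits
prob = cp.Problem(cp.Minimize(cp.sum_squares(u - u_des)), constraints)
prob.solve()
print("filtered command:", u.value)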
|
|
11:05-11:10, Paper WeAT22.8 | |
Time-Scaling Modeling and Control of Robotic Sewing System (I) |
|
Tang, Kai | The University of Hong Kong |
Tokuda, Fuyuki | Centre for Transformative Garment Production |
Seino, Akira | Centre for Transformative Garment Production |
Kobayashi, Akinari | Centre for Transformative Garment Production |
Tien, Norman | University of Hong Kong |
Kosuge, Kazuhiro | The University of Hong Kong |
Keywords: Control Architectures and Programming, Nonholonomic Mechanisms and Systems, Foundations of Automation
Abstract: Automating the sewing process presents significant challenges due to the inherent softness of fabrics and the limited control capabilities of sewing systems. To realize sewing automation, we propose a time-scaling modeling and control architecture for a robotic sewing system. Through time-scaling modeling, the nonholonomic kinematics of the industrial sewing machine's sewing process is precisely linearized. Based on this model, a two-layer real-time control architecture is proposed: the upper layer controls the sewn seam-line trajectory using model-based feedback control implemented in the time-scaling domain, while the lower layer controls the manipulator and the sewing machine using geometry-based trajectory generation and coordinated motion control of the robot and the sewing system in the time domain. The experimental results demonstrate that the sewing trajectories exponentially converge to the desired trajectories without overshooting under different initial conditions and sewing speeds. Moreover, the same sewing trajectories are obtained under different sewing speeds for a given stitch size. The sewing results show the good performance and application potential of the proposed robotic sewing system.
|
|
WeAT23 |
102B |
Path Planning for Multiple Mobile Robots or Agents 1 |
Regular Session |
Chair: Chen, Liangming | Southern University of Science and Technology |
|
10:30-10:35, Paper WeAT23.1 | |
RAILGUN: A Unified Convolutional Policy for Multi-Agent Path Finding across Different Environments and Tasks |
|
Tang, Yimin | University of Southern California |
Xiong, Xiao | University of Cambridge |
Xi, Jingyi | Zhejiang University |
Li, Jiaoyang | Carnegie Mellon University |
Bıyık, Erdem | University of Southern California |
Koenig, Sven | University of Southern California |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning, Imitation Learning
Abstract: Multi-Agent Path Finding (MAPF), which focuses on finding collision-free paths for multiple robots, is crucial for applications ranging from aerial swarms to warehouse automation. Solving MAPF is NP-hard, so learning-based approaches, particularly those leveraging deep neural networks, have gained attention. Nonetheless, despite the community's continued efforts, all learning-based MAPF planners still rely on decentralized planning due to variability in the number of agents and map sizes. We have developed the first centralized learning-based policy for the MAPF problem, called RAILGUN. RAILGUN is not an agent-based policy but a map-based policy: by leveraging a CNN-based architecture, it can generalize across different maps and handle any number of agents. We collect trajectories from rule-based methods to train our model in a supervised way. In experiments, RAILGUN outperforms most baseline methods and demonstrates great zero-shot generalization capabilities on various tasks, maps, and agent numbers not seen in the training dataset.
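A minimal sketch of a map-based (rather than agent-based) policy: a fully convolutional network maps stacked map channels to per-cell action logits, so the same weights handle any map size or agent count. The channel layout and sizes are assumptions, not RAILGUN's actual architecture.

import torch
import torch.nn as nn

class MapPolicy(nn.Module):
    def __init__(self, in_ch=3, n_actions=5):   # up/down/left/right/wait
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, n_actions, 1),         # logits per grid cell
        )

    def forward(self, maps):                     # (B, 3, H, W): obstacles, agents, goals
        return self.net(maps)                    # (B, 5, H, W)

logits = MapPolicy()(torch.zeros(1, 3, 20, 17))
print(logits.shape)   # torch.Size([1, 5, 20, 17]) -- works for any H, W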
|
|
10:35-10:40, Paper WeAT23.2 | |
Multi-Robot Ergodic Trajectory Optimization with Relaxed Periodic Connectivity |
|
Liu, Yongce | Shanghai Jiao Tong University |
Ren, Zhongqiang | Shanghai Jiao Tong University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Optimization and Optimal Control
Abstract: This paper considers a multi-robot trajectory planning problem with inter-robot connectivity maintenance for information gathering. Given an information map in the form of a distribution over the workspace, ergodic search plans trajectories along which the time spent in any region is proportional to the amount of information in that region, balancing exploration and exploitation. Existing ergodic search rarely considers the limited communication range among robots or connectivity maintenance, and this paper takes a step toward filling this gap. Multi-robot connectivity maintenance, by contrast, has been studied extensively, including continual, periodic, and intermittent connectivity. Naively combining these methods with ergodic search may prevent the planner from finding high-quality ergodic trajectories or lead to poor connectivity among the robots. To handle this challenge, this paper adapts an intermittent connectivity maintenance strategy to the ergodic search framework and develops a two-phase trajectory planning approach utilizing the augmented Lagrangian method. Our simulation and real drone experiments show that, under the same connectivity maintenance requirement, our approach plans trajectories that are about 10 times better than the baselines in terms of the ergodic metric.
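A worked sketch of the ergodic metric on a 1D workspace: compare Fourier coefficients of the trajectory's time-average distribution with those of the information map under a standard weight; all values and the cosine basis are illustrative.

import numpy as np

K, T = 10, 500
xs = np.linspace(0, 1, 200)
phi = np.exp(-((xs - 0.7) ** 2) / 0.01); phi /= phi.sum()    # information map
traj = 0.5 + 0.3 * np.sin(np.linspace(0, 6 * np.pi, T))      # robot trajectory

def fk(k, x):                        # cosine basis on [0, 1]
    return np.cos(np.pi * k * x) * (1.0 if k == 0 else np.sqrt(2))

metric = 0.0
for k in range(K):
    phi_k = np.sum(phi * fk(k, xs))            # map coefficient
    c_k = np.mean(fk(k, traj))                 # time average along the trajectory
    metric += (1 + k**2) ** -1 * (c_k - phi_k) ** 2   # Lambda_k weight
print("ergodic metric:", metric)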
|
|
10:40-10:45, Paper WeAT23.3 | |
Self-Assembly Planning for Modular Robots Via Multi-Agent Path Finding on Time-Expanded Networks |
|
Huang, Zhen | Beijing Institute of Technology |
Cheng, Yajie | Beijing Institute of Technology |
Shi, Lingling | Beijing Institute of Technology |
Shan, Minghe | Beijing Institute of Technology |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Cellular and Modular Robots, Assembly
Abstract: Self-assembly planning for modular robots is critical for constructing functional structures, yet existing methods often suffer from inefficiency, poor scalability, or collision risks. This paper presents an innovative framework that formulates modular robot self-assembly as a time-varying online Multi-Agent Path Finding (MAPF) problem and resolves it through an enhanced Time-Expanded Network (TEN). Key modifications are introduced to handle the dynamic nature of the self-assembly process, including the varying number of agents and evolving target configurations. Simulations conducted with hexagonal modular robots demonstrate that the proposed algorithm significantly outperforms the benchmark A*-based approach in terms of both assembly efficiency and success rate across various target configurations. The proposed approach thus establishes a scalable planning framework for modular robot self-assembly, with future extensions toward real-world validation.
|
|
10:45-10:50, Paper WeAT23.4 | |
Proactive Conflict Area Prediction for Boosting Search-Based Multi-Agent Pathfinding |
|
Kwon, Youngjoon | Chung-Ang University |
Lee, Kyungjae | Korea University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Deep Learning Methods, Planning, Scheduling and Coordination
Abstract: Multi-agent pathfinding aims to compute conflict-free paths for multiple agents in shared environments. Traditional methods, such as conflict-based search (CBS), guarantee optimality but suffer from high computational costs due to constraint tree expansion. Learning-based approaches improve efficiency but often compromise solution quality. We propose proactive conflict-aware prediction (PCAP), which improves CBS by predicting conflict-prone areas based on constraint data. This enables more informed constraint application, reducing unnecessary expansions while preserving optimality. Experimental results show that PCAP reduces computation time by 40% compared to CBS while maintaining solution quality.
|
|
10:50-10:55, Paper WeAT23.5 | |
Collaborative Task Assignment, Sequencing and Multi-Agent Path-Finding |
|
Bai, Yifan | Luleå University of Technology |
Kotpalliwar, Shruti | Luleå Tekniska Universitet |
Kanellakis, Christoforos | LTU |
Nikolakopoulos, George | Luleå University of Technology |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents, Task Planning
Abstract: In this article, we address the problem of collaborative task assignment, sequencing, and multi-agent pathfinding (TSPF), where a team of agents must visit a set of task locations without collisions while minimizing flowtime. TSPF incorporates agent-task compatibility constraints and ensures that all tasks are completed. We propose a Conflict-Based Search with Task Sequencing (CBS-TS), an optimal and complete algorithm that alternates between finding new task sequences and resolving conflicts in the paths of current sequences. CBS-TS uses a mixed integer linear program (MILP) to optimize task sequencing and employs Conflict-Based Search (CBS) with Multi-Label A* (MLA*) for collision-free path planning within a search forest. By invoking MILP for the next-best sequence only when needed, CBS-TS efficiently limits the search space, enhancing computational efficiency while maintaining optimality. We compare the performance of our CBS-TS against Conflict-Based Steiner Search (CBSS), a baseline method that, with minor modifications, can address the TSPF problem. Experimental results demonstrate that CBS-TS outperforms CBSS in most testing scenarios, achieving higher success rates and consistently optimal solutions, whereas CBSS achieves near-optimal solutions in some cases.
|
|
10:55-11:00, Paper WeAT23.6 | |
Space-Time Graphs of Convex Sets for Multi-Robot Motion Planning |
|
Tang, Jingtao | Simon Fraser University |
Mao, Zining | Simon Fraser University |
Yang, Lufan | Simon Fraser University |
Ma, Hang | Simon Fraser University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Motion and Path Planning
Abstract: We address the Multi-Robot Motion Planning (MRMP) problem of computing collision-free trajectories for multiple robots in shared continuous environments. While existing frameworks effectively decompose MRMP into single-robot subproblems, spatiotemporal motion planning with dynamic obstacles remains challenging, particularly in cluttered or narrow-corridor settings. We propose Space-Time Graphs of Convex Sets (ST-GCS), a novel planner that systematically covers the collision-free space-time domain with convex sets instead of relying on random sampling. By extending Graphs of Convex Sets (GCS) into the time dimension, ST-GCS formulates time-optimal trajectories in a unified convex optimization that naturally accommodates velocity bounds and flexible arrival times. We also propose Exact Convex Decomposition (ECD) to “reserve” trajectories as spatiotemporal obstacles, maintaining a collision-free space-time graph of convex sets for subsequent planning. Integrated into two prioritized-planning frameworks, ST-GCS consistently achieves higher success rates and better solution quality than state-of-the-art sampling-based planners, often at orders-of-magnitude faster runtimes, underscoring its benefits for MRMP in challenging settings. Project page: https://sites.google.com/view/stgcs.
|
|
11:00-11:05, Paper WeAT23.7 | |
Online Concurrent Multi-Robot Coverage Path Planning |
|
Mitra, Ratijit | IIT Kanpur |
Saha, Indranil | IIT Kanpur |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning
Abstract: Recently, centralized receding-horizon online multi-robot coverage path planning algorithms have shown remarkable scalability in thoroughly exploring large, complex, unknown workspaces with many robots. Within a horizon, path planning and path execution interleave: while paths are planned for the robots without paths, the robots with outstanding paths do not execute, and while the robots with new or outstanding paths execute to reach their goals, no planning occurs for the robots yet to receive new paths, wasting both robotic and computational resources. As a remedy, we propose a centralized algorithm that is not horizon-based. It plans paths at any time for a subset of robots with no paths, i.e., those that have reached their previously assigned goals, while the rest execute their outstanding paths, thereby enabling concurrent planning and execution. We formally prove that the proposed algorithm ensures complete coverage of an unknown workspace and analyze its time complexity. To demonstrate scalability, we evaluate our algorithm on eight large 2D grid benchmark workspaces with up to 512 aerial and ground robots. A comparison with two state-of-the-art horizon-based algorithms shows its superiority in completing the coverage with up to 1.6x speedup. For validation, we perform ROS + Gazebo simulations in six 2D grid benchmark workspaces with 10 quadcopters and TurtleBots. We also successfully conducted one outdoor experiment with three quadcopters and one indoor experiment with two TurtleBots.
|
|
11:05-11:10, Paper WeAT23.8 | |
A Priority-Based Multi-Robot Search Algorithm for Indoor Source Searching (I) |
|
Wang, Miao | Beijing Institute of Technology |
Xin, Bin | Beijing Institute of Technology |
Jing, Mengjie | Beijing Institute of Technology |
Qu, Yun | Beijing Institute of Technology |
Keywords: Search and Rescue Robots, Multi-Robot Systems, Task Planning
Abstract: It is extremely important to quickly locate the source of a hazardous substance leak in order to reduce damage to life and property. Multi-robot source localization faces challenges in unknown indoor environments, such as navigating through dense environments, encountering large areas without airflow or concentration clues, experiencing frequent changes in robot measurements, and managing clusters of robots in confined spaces. This study proposes a priority-based multi-robot search algorithm to tackle these challenges. The algorithm consists of a priority-based search strategy, an exploration method based on frontiers and Voronoi diagrams, an airflow tracking method based on Rapidly-exploring Random Trees Star (RRT*), and a multi-robot collaboration method. The algorithm was compared with three other state-of-the-art algorithms in simulated environments, across varying team sizes, airflow speeds, and diverse scenarios. The algorithm was also evaluated in real-robot experiments. The evaluation results demonstrate that it exhibits outstanding performance in both simulated and real-robot experiments.
|
|
WeAT24 |
102C |
Sensor Fusion 5 |
Regular Session |
|
10:30-10:35, Paper WeAT24.1 | |
An Easy Method for Extrinsic Calibration of Camera and Time-Of-Flight Sensor |
|
Zhang, Tianyou | University of Nottingham |
Liu, Jing | University of Nottingham |
Axinte, Dragos | University of Nottingham |
Dong, Xin | University of Nottingham |
Keywords: Calibration and Identification, Sensor Fusion, Range Sensing
Abstract: A multi-zone (typically 8×8) time-of-flight (ToF) sensor offers a low-cost, low-power, and compact solution for range measurement, making it ideal for specialized robotic applications. However, its low resolution limits its usability. Pairing a ToF sensor with a camera enhances depth perception and can solve the unscaled-metric problem in monocular depth estimation. Advances in deep learning further enable high-quality depth map reconstruction from ToF-camera data, providing a cost-effective alternative. However, accurate ToF-camera calibration remains a challenge due to the ToF sensor's coarse depth output. This work presents a simple yet effective method for the extrinsic calibration of a ToF sensor with an RGB camera using only a chessboard and two whiteboards. A tailored two-plane fitting algorithm is proposed specifically for the ToF sensor. Moreover, our approach leverages parallel lines with vanishing points and geometric constraints from plane intersections. This eliminates the need for robotic arm movements or SLAM-based sensor pose reconstruction, significantly reducing complexity while maintaining high accuracy. Experimental results demonstrate that our method lowers the root mean square (RMS) depth difference from 96.59 mm to 67.89 mm, underscoring its effectiveness in practical applications. Code is publicly available at https://github.com/Tianyou-Nottingham/ToF-Camera-Calibration.
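A minimal sketch of least-squares plane fitting, the building block of a two-plane calibration step like the one above: fit a plane to the 8×8 ToF zone points via SVD (synthetic data for illustration).

import numpy as np

def fit_plane(points):
    """points: (N, 3). Returns unit normal n and offset d with n . p = d."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    n = vt[-1]                        # direction of least variance
    return n, float(n @ centroid)

pts = np.random.default_rng(0).normal(size=(64, 3))          # 8x8 zone grid
pts[:, 2] = 0.5 * pts[:, 0] - 0.2 * pts[:, 1] + 3.0          # synthetic plane
n, d = fit_plane(pts)
print("normal:", n, "offset:", d)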
|
|
10:35-10:40, Paper WeAT24.2 | |
Direct, Targetless and Automatic Joint Calibration of LiDAR-Camera Intrinsic and Extrinsic |
|
Shen, Yishu | Shanghai Jiao Tong University |
Hong, Sheng | Hong Kong University of Science and Technology |
Shen, Shaojie | Hong Kong University of Science and Technology |
Qin, Tong | Shanghai Jiao Tong University |
Keywords: Calibration and Identification, Sensor Fusion, SLAM
Abstract: This paper presents a direct, targetless, and automatic LiDAR-Camera joint calibration method that effectively overcomes the intrinsic precision limitations. We propose an iterative two-stage optimization methodology that leverages 3D LiDAR measurements to simultaneously refine both intrinsic and extrinsic. In the first stage, the intrinsic is optimized using a normalized information distance (NID) metric, an information-theoretic measure that quantifies the statistical alignment between LiDAR and image intensities, while initial extrinsic parameters derived from CAD specifications facilitate the projection of LiDAR point clouds onto the camera image plane. In the second stage, the refined intrinsic guides further optimization of extrinsic using the same NID-based evaluation metrics. This alternating process iteratively enhances both intrinsic and extrinsic through their mutual interdependence. Experiments across multiple datasets demonstrate that our method achieves sub-pixel intrinsic accuracy and extrinsic parameters that closely align with CAD specifications, validating the superior performance of our methodology for sensor fusion applications.
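A worked sketch of the normalized information distance (NID) between projected LiDAR intensities and image intensities, NID = (H(X,Y) - I(X;Y)) / H(X,Y), computed from a joint histogram; the bin count and data are illustrative.

import numpy as np

def nid(x, y, bins=32):
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(1), pxy.sum(0)
    nz = pxy > 0
    h_xy = -np.sum(pxy[nz] * np.log(pxy[nz]))                # joint entropy
    h_x = -np.sum(px[px > 0] * np.log(px[px > 0]))
    h_y = -np.sum(py[py > 0] * np.log(py[py > 0]))
    mi = h_x + h_y - h_xy                                    # mutual information
    return (h_xy - mi) / h_xy                                # 0 = aligned, 1 = independent

rng = np.random.default_rng(0)
a = rng.uniform(0, 1, 5000)
print(nid(a, a), nid(a, rng.uniform(0, 1, 5000)))            # low vs. high distance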
|
|
10:40-10:45, Paper WeAT24.3 | |
SPLiCE: Single-Point LiDAR and Camera Calibration & Estimation Leveraging Manhattan World |
|
Kim, Minji | Gwangju Institute of Science and Technology |
Han, Jeahn | Gwangju Institute of Science and Technology |
Ham, Jungil | Gwangju Institute of Science and Technology |
Kim, Pyojin | Gwangju Institute of Science and Technology (GIST) |
Keywords: Calibration and Identification, Sensor Fusion, Range Sensing
Abstract: We present a novel calibration method between single-point LiDAR and camera sensors utilizing an easy-to-build customized calibration board satisfying the Manhattan world (MW) assumption. Previous methods for LiDAR-camera (LC) calibration focus on line and plane correspondences. However, they require dense 3D point clouds from heavy and expensive LiDAR to simplify alignment; otherwise, these approaches fail for extremely sparse LiDAR. Compact, lightweight, and sparse LiDAR and camera sensors are indispensable for micro drones like the Crazyflie with a maximum payload of 15 g, but there are no explicit calibration methods for them. To address these issues, we propose a new extrinsic calibration method with a new calibration board, which rotates like a door to capture geometric features and align them with images. Once we find an initial estimate, we refine the relative rotation by minimizing the angle difference between the grid orientation of the checkerboard and the MW axes. We demonstrate the effectiveness of the proposed method across various LC configurations, achieving high accuracy compared to other state-of-the-art approaches.
|
|
10:45-10:50, Paper WeAT24.4 | |
A 4D Radar Camera Extrinsic Calibration Tool Based on 3D Uncertainty Perspective N Points |
|
Cao, Chuan | Shanghai Jiao Tong University |
Wang, Xiaoning | Ruijin Hospital, Shanghai Jiao Tong University School of Medicine
Xi, Wenqian | Renji Hospital, Shanghai Jiao Tong University School of Medicine
Zhang, Han | Shanghai Jiao Tong University |
Chen, Weidong | Shanghai Jiao Tong University |
Wang, Jingchuan | Shanghai Jiao Tong University |
Keywords: Calibration and Identification, Sensor Fusion
Abstract: 4D imaging radar is a type of low-cost millimeter-wave radar (costing merely 10-20% of lidar systems) capable of providing range, azimuth, elevation, and Doppler velocity information. Accurate extrinsic calibration between millimeter-wave radar and camera systems is critical for robust multimodal perception in robotics, yet remains challenging due to inherent sensor noise characteristics and complex error propagation. This paper presents a systematic calibration framework that addresses these challenges through a spatial 3D uncertainty-aware PnP algorithm (3DUPnP), which explicitly models spherical-coordinate noise propagation in radar measurements and compensates for non-zero error expectations during coordinate transformations. Experimental validation demonstrates significant performance improvements over the state-of-the-art CPnP baseline, including improved consistency in simulations and enhanced precision in physical experiments. This study provides a robust calibration solution for robotic systems equipped with millimeter-wave radar and cameras, tailored specifically for autonomous driving and robotic perception applications.
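A minimal sketch of first-order propagation of spherical radar noise (range, azimuth, elevation) into a Cartesian covariance via the Jacobian of the transform; the noise sigmas are illustrative, and the paper's non-zero error-expectation compensation is not reproduced here.

import numpy as np

def spherical_to_cartesian_cov(r, a, e, sig_r, sig_a, sig_e):
    ca, sa, ce, se = np.cos(a), np.sin(a), np.cos(e), np.sin(e)
    # p = [r ce ca, r ce sa, r se]; J = dp / d(r, a, e)
    J = np.array([[ce * ca, -r * ce * sa, -r * se * ca],
                  [ce * sa,  r * ce * ca, -r * se * sa],
                  [se,       0.0,          r * ce     ]])
    S = np.diag([sig_r**2, sig_a**2, sig_e**2])
    return J @ S @ J.T                       # first-order Cartesian covariance

print(spherical_to_cartesian_cov(10.0, 0.3, 0.1, 0.1, 0.02, 0.02))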
|
|
10:50-10:55, Paper WeAT24.5 | |
Auto-Calibration of Camera Intrinsics and Extrinsics Using Lidar and Motion |
|
Obdrzalek, Stepan | Czech Technical University |
Matas, Jiri | Czech Technical University |
Keywords: Sensor Fusion, Calibration and Identification, Omnidirectional Vision
Abstract: A novel camera autocalibration method is presented. Any camera model can be calibrated, and no calibration targets like checkerboards are used. The method requires the camera to be mounted on a lidar-equipped moving platform travelling through a structured environment along a known path. The primary reason for cross-modal camera calibration is not to solve the sensor fusion problem, but to tap the huge amount of accurate metric data points available from the lidar. The amount of measurements is easily four orders of magnitude higher than in checkerboard based methods. This leads to improved estimation accuracy, especially of higher-order distortion coefficients. In a multi-camera setup, the lidar additionally defines a common reference coordinate system for all cameras. Compared to the majority of published methods on camera-lidar autocalibration, (i) our calibration procedure relies on motion features, (ii) the hard-to-obtain-accurately lidar-lidar and lidar-image feature correspondences are not required, and (iii) both camera extrinsics and intrinsics, including complex distortion models, are autocalibrated. Qualitative experiments show that the calibration accuracy reaches or exceeds the accuracy of methods relying on calibration targets.
|
|
10:55-11:00, Paper WeAT24.6 | |
EventSync: Joint Recovery of Temporal Offsets and Relative Orientations for Wide-Baseline Event Cameras |
|
Xing, Wanli | The University of Hong Kong |
Lin, Shijie | The University of Hong Kong |
Zheng, Guangze | The University of Hong Kong |
Yang, Linhan | The University of Hong Kong
Du, Yanjun | The Chinese University of Hong Kong |
Pan, Jia | University of Hong Kong |
Keywords: Calibration and Identification, Sensor Fusion, Visual Tracking
Abstract: Event-based cameras offer significant advantages due to their high temporal resolution and low power consumption. However, when deploying multiple such cameras, a critical challenge emerges: each camera operates on an independent time system, resulting in temporal misalignment that severely degrades performance in multi-event camera applications. Traditional hardware-based synchronization methods face significant limitations in compatibility and are impractical for wide-baseline configurations. We introduce EventSync, a software-based algorithm that achieves millisecond-level synchronization by exploiting the motion of objects in the cameras' shared field of view, while simultaneously estimating the relative orientation between cameras. Our approach eliminates the need for physical connections, making it particularly valuable for wide-baseline deployments. Through comprehensive evaluation in both simulated environments and real-world indoor/outdoor scenarios, we demonstrate robust synchronization accuracy and precise extrinsic calibration across varying camera configurations, significantly outperforming existing methods.
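A minimal sketch of recovering a temporal offset between two event cameras by cross-correlating their event-rate signals from the shared field of view; the bin width and synthetic signals are assumptions, not EventSync's estimator.

import numpy as np

BIN_MS = 1.0
rng = np.random.default_rng(0)
rate_a = rng.poisson(5.0, 4000).astype(float)      # camera A event counts per ms
true_shift = 123                                   # ms, to be recovered
rate_b = np.roll(rate_a, true_shift) + rng.normal(0, 1, 4000)

a = rate_a - rate_a.mean()
b = rate_b - rate_b.mean()
corr = np.correlate(b, a, mode="full")             # scan all relative lags
offset = (np.argmax(corr) - (len(a) - 1)) * BIN_MS
print("estimated offset: %.0f ms" % offset)        # ~123 ms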
|
|
11:00-11:05, Paper WeAT24.7 | |
LiMo-Calib: On-Site Fast LiDAR-Motor Calibration for Quadruped Robot-Based Panoramic 3D Sensing System |
|
Li, Jianping | Nanyang Technological University |
Liu, Zhongyuan | Nanyang Technological University |
Xu, Xinhang | Nanyang Technological University |
Qin, Xiong | Nanyang Technological University |
Liu, Jinxin | Nanyang Technological University |
Yuan, Shenghai | Nanyang Technological University |
Fang, Xu | Dalian University of Technology |
Xie, Lihua | Nanyang Technological University
Keywords: Calibration and Identification, Mapping, Legged Robots
Abstract: Conventional single LiDAR systems are inherently constrained by their limited field of view (FoV), leading to blind spots and incomplete environmental awareness, particularly on robotic platforms with strict payload limitations. Integrating a motorized LiDAR offers a practical solution by significantly expanding the sensor’s FoV and enabling adaptive panoramic 3D sensing. However, the high-frequency vibrations of the quadruped robot introduce calibration challenges, causing variations in the LiDAR-motor transformation that degrade sensing accuracy. Existing calibration methods that use artificial targets or dense feature extraction lack feasibility for on-site applications and real-time implementation. To overcome these limitations, we propose LiMo-Calib, an efficient on-site calibration method that eliminates the need for external targets by leveraging geometric features directly from raw LiDAR scans. LiMo-Calib optimizes feature selection based on normal distribution to accelerate convergence while maintaining accuracy and incorporates a reweighting mechanism that evaluates local plane fitting quality to enhance robustness. We integrate and validate the proposed method on a motorized LiDAR system mounted on a quadruped robot, demonstrating significant improvements in calibration efficiency and 3D sensing accuracy, making LiMo-Calib well-suited for real-world robotic applications. We further demonstrate the accuracy improvements of the LIO on the panoramic 3D sensing system using the calibrated parameters. The code will be available at: https://github.com/kafeiyin00/LiMo-Calib
|
|
11:05-11:10, Paper WeAT24.8 | |
A Stable Learning-Based Method for Robotic Assembly with Motion and Force Measurements (I) |
|
Sheng, Juyi | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Tang, Yifeng | Chinese Academy of Sciences |
Tan, Fangning | University of Chinese Academy of Sciences (UCAS) |
Hou, Ruiming | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Xu, Sheng | Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Xu, Tiantian | Chinese Academy of Sciences |
Keywords: Assembly, Force Control, Learning from Demonstration
Abstract: In this article, a learning-based controller is proposed to realize motion policy learning based on intuitive human demonstrations. The position, velocity, and force data during the demonstration are collected as input features without any physical contact with the human demonstrator, and an algorithm is designed to automatically label the data by combining motion and force data. After the learning process, the robot can complete the assembly according to the human demonstrations, with the proposed controller generating angular acceleration commands as control inputs to help complete the manipulation. Finally, a comprehensive analysis, including Lyapunov stability and a Lipschitz constraint, is provided to guarantee the stability and security of this learning-based controller. Extensive experiments on a real robot system verify the effectiveness of the proposed method.
|
|
WeAT25 |
103A |
Legged Robots 1 - Locomotion |
Regular Session |
|
10:30-10:35, Paper WeAT25.1 | |
DecARt Leg: Design and Evaluation of a Novel Humanoid Robot Leg with Decoupled Actuation for Agile Locomotion |
|
Davydenko, Egor | Moscow Institute of Physics and Technology |
Volchenkov, Andrei | Moscow Institute of Physics and Technology |
Gerasimov, Vladimir | Moscow Institute of Physics and Technology |
Gorbachev, Roman | Moscow Institute of Physics and Technology |
Keywords: Legged Robots, Whole-Body Motion Planning and Control, Humanoid and Bipedal Locomotion
Abstract: In this work, we present a new design of an electrically actuated robotic leg, the "DecARt Leg", which implements a decoupled actuation scheme inherited from pantograph-like designs while maintaining an anthropomorphic appearance. We quantitatively evaluate its agile locomotion capabilities through modeling. To compare the proposed kinematic structure with other structures at design time, we propose a new descriptive metric, called the "Fastest Achievable Swing Time" (FAST). Compared to other considered coupled and decoupled leg designs, the DecARt Leg's measured FAST value demonstrates a high capability to deliver fast and agile locomotion, to be validated by hardware experiments in further research. Applying the proposed metric to various robot designs revealed meaningful insights, such as the direct performance difference between coupled and decoupled versions of the same base leg design and the importance of actuator placement. The evaluation of a preliminary simulation model of the DecARt Leg-based robot across various tasks demonstrated significant potential for both locomotion and loco-manipulation.
|
|
10:35-10:40, Paper WeAT25.2 | |
KLEIYN : A Quadruped Robot with an Active Waist for Both Locomotion and Wall Climbing |
|
Yoneda, Keita | The University of Tokyo |
Kawaharazuka, Kento | The University of Tokyo |
Suzuki, Temma | The University of Tokyo |
Hattori, Takahiro | The University of Tokyo |
Okada, Kei | The University of Tokyo |
Keywords: Legged Robots, Reinforcement Learning, Climbing Robots
Abstract: In recent years, advancements in hardware have enabled quadruped robots to operate with high power and speed, while robust locomotion control using reinforcement learning (RL) has also been realized. As a result, expectations are rising for the automation of tasks such as material transport and exploration in unknown environments. However, autonomous locomotion in rough terrains with significant height variations requires vertical movement, and robots capable of performing such movements stably, along with their control methods, have not yet been fully established. In this study, we developed the quadruped robot KLEIYN, which features a waist joint, and aimed to expand quadruped locomotion by enabling chimney climbing through RL. To facilitate the learning of vertical motion, we introduced Contact-Guided Curriculum Learning (CGCL). As a result, KLEIYN successfully climbed walls ranging from 800 mm to 1000 mm in width at an average speed of 150 mm/s, 50 times faster than conventional robots. Furthermore, we demonstrated that the introduction of a waist joint improves climbing performance, particularly enhancing tracking ability on narrow walls.
|
|
10:40-10:45, Paper WeAT25.3 | |
Gait in Eight: Efficient On-Robot Learning for Omnidirectional Quadruped Locomotion |
|
Bohlinger, Nico | TU Darmstadt |
Kinzel, Jonathan Frederik | TU Darmstadt |
Palenicek, Daniel | TU Darmstadt |
Antczak, Łukasz | MAB Robotics Sp. Z O.o |
Peters, Jan | Technische Universität Darmstadt |
Keywords: Legged Robots, Reinforcement Learning, Autonomous Agents
Abstract: On-robot Reinforcement Learning is a promising approach to train embodiment-aware policies for legged robots. However, the computational constraints of real-time learning on robots pose a significant challenge. We present a framework for efficiently learning quadruped locomotion in just 8 minutes of raw real-time training, utilizing the sample efficiency and minimal computational overhead of the new off-policy algorithm CrossQ. We investigate two control architectures: predicting joint target positions for agile, high-speed locomotion, and Central Pattern Generators for stable, natural gaits. While prior work focused on learning simple forward gaits, our framework extends on-robot learning to omnidirectional locomotion. We demonstrate the robustness of our approach in different indoor and outdoor environments and provide the videos and code for our experiments at: https://nico-bohlinger.github.io/gait_in_eight_website
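A minimal sketch of the Central Pattern Generator control architecture mentioned above: coupled phase oscillators, one per leg, whose phases drive joint targets. The gains, frequency, and trot offsets are illustrative values, not the paper's tuned controller.

import numpy as np

N_LEGS, DT, FREQ = 4, 0.002, 2.0                   # 2 Hz gait, 500 Hz control
OFFSETS = [0.0, np.pi, np.pi, 0.0]                 # trot phase offsets
COUPLING = 1.0
phase = np.array(OFFSETS)

def step(phase):
    for i in range(N_LEGS):
        # Pull each oscillator toward its desired offset from leg 0.
        err = (phase[0] + OFFSETS[i]) - phase[i]
        phase[i] += DT * (2 * np.pi * FREQ + COUPLING * np.sin(err))
    return phase

for _ in range(500):                               # simulate 1 s of gait
    phase = step(phase)
hip_targets = 0.3 * np.sin(phase)                  # rad, per-leg hip command
print(hip_targets)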
|
|
10:45-10:50, Paper WeAT25.4 | |
A2I-Calib: An Anti-Noise Active Multi-IMU Spatial-Temporal Calibration Framework for Legged Robots |
|
Xiong, Chaoran | Shanghai Jiao Tong University |
Jiang, Fangyu | Shanghai Jiao Tong University |
Ma, Kehui | Shanghai Jiao Tong University |
Sun, Zhen | Shanghai Jiao Tong University |
Zhang, Zeyu | Shanghai Jiao Tong University |
Pei, Ling | Shanghai Jiao Tong University |
Keywords: Legged Robots, Calibration and Identification, Reinforcement Learning
Abstract: Recently, multi-node inertial measurement unit (IMU)-based odometry for legged robots has gained attention due to its cost-effectiveness, power efficiency, and high accuracy. However, the spatial and temporal misalignment between foot-end motion derived from forward kinematics and foot IMU measurements can introduce inconsistent constraints, resulting in odometry drift. Therefore, accurate spatial-temporal calibration is crucial for multi-IMU systems. Although existing multi-IMU calibration methods have addressed passive single-rigid-body sensor calibration, they are inadequate for legged systems, owing to the insufficient excitation provided by traditional gaits and the enlarged sensitivity to IMU noise during kinematic chain transformations. To address these challenges, we propose A2I-Calib, an anti-noise active multi-IMU calibration framework enabling autonomous spatial-temporal calibration for arbitrary foot-mounted IMUs. Our A2I-Calib includes: 1) an anti-noise trajectory generator leveraging a proposed basis function selection theorem to minimize the condition number in correlation analysis, thus reducing noise sensitivity, and 2) a reinforcement learning (RL)-based controller that ensures robust execution of calibration motions. Furthermore, A2I-Calib is validated on simulation and real-world quadruped robot platforms with various multi-IMU settings, demonstrating a significant reduction in noise sensitivity and calibration errors and thereby improving the overall multi-IMU odometry performance.
|
|
10:50-10:55, Paper WeAT25.5 | |
Playful DoggyBot: Learning Agile and Precise Quadrupedal Locomotion |
|
Duan, Xin | Shanghaitech University, Shanghai Qi Zhi Institute |
Zhuang, Ziwen | Shanghai Qizhi Institute |
Zhao, Hang | Tsinghua University |
Schwertfeger, Sören | ShanghaiTech University |
Keywords: Legged Robots, Reinforcement Learning, Mobile Manipulation
Abstract: Quadrupedal animals can perform agile and playful tasks while interacting with real-world objects. For instance, a trained dog can track and catch a flying frisbee before it touches the ground, while a cat left alone at home may leap to grasp the door handle. Successfully grasping an object during high-dynamic locomotion requires highly precise perception and control. However, due to hardware limitations, agility and precision are usually a trade-off in robotics problems. In this work, we employ a perception-control decoupled system based on Reinforcement Learning (RL), aiming to explore the level of precision a quadrupedal robot can achieve while interacting with objects during high-dynamic locomotion. Our experiments show that our quadrupedal robot, mounted with a passive gripper in front of the robot's chassis, can perform both tracking and catching tasks similar to a real trained dog. The robot can follow a mid-air ball moving at speeds of up to 3 m/s, and it can leap and successfully catch a small object hanging above it at a height of 1.05 m in simulation and 0.8 m in the real world.
|
|
10:55-11:00, Paper WeAT25.6 | |
An Insect-Scale Multimodal Amphibious Piezoelectric Robot |
|
Wang, Le | Huzhou Vocational and Technical College |
Wang, Xin | Huzhou Institute of Zhejiang University |
Wang, Hanlin | State Key Laboratory of Industrial Control Technology and the In |
Xiqing, Zuo | Huzhou Institute of Zhejiang University |
Xu, Chao | Zhejiang University |
Keywords: Legged Robots, Biologically-Inspired Robots, Mechanism Design
Abstract: Miniature amphibious robots are capable of performing various tasks in complex terrestrial and aquatic environments due to their superior adaptability. However, the mobility of existing miniature amphibious robots in such environments is limited by their complex locomotion systems and single mode of motion. This work presents a novel insect-scale amphibious robot powered by a single piezoelectric actuator. A prototype of the robot is fabricated and preliminarily tested. By exploiting different vibration modes of the piezoelectric actuator, the robot achieves movement in an amphibious environment: it employs the acoustic flow generated by a higher-order mode to achieve rapid motion at the water surface, and it attains forward and backward motion on the ground by means of the friction force between the driving feet and the ground. The findings of this study offer significant insights into the development of amphibious robots with enhanced flexibility and adaptability, laying the foundation for future applications of such robots in narrow amphibious settings.
|
|
11:00-11:05, Paper WeAT25.7 | |
Design of Q8bot: A Miniature, Low-Cost, Dynamic Quadruped Built with Zero Wires |
|
Wu, Yufeng | University of California Los Angeles |
Hong, Dennis | UCLA |
Keywords: Legged Robots, Education Robotics, Mechanism Design
Abstract: This paper introduces Q8bot, an open-source, miniature quadruped designed for robotics research and education. We present the robot's novel, zero-wire design methodology, which leads to its superior form factor, robustness, replicability, and high performance. With a size and weight similar to a modern smartphone, this standalone robot can walk for an hour on a single battery charge and can survive meter-high drops with simple repairs. Its $300 bill of materials contains minimal off-the-shelf components, readily available custom electronics from online vendors, and structural parts that can be manufactured on hobbyist 3D printers. A preliminary user assembly study confirms that Q8bot can be easily replicated, with an average assembly time of under one hour by a single person. With rudimentary open-loop control, Q8bot achieves a stable walking speed of 5.4 body lengths per second and a turning speed of 5 radians per second, along with other dynamic movements such as jumping and climbing moderate slopes.
|
|
11:05-11:10, Paper WeAT25.8 | |
Helpful DoggyBot: Open-World Object Fetching Using Legged Robots and Vision-Language Models |
|
Wu, Qi | Stanford University |
Fu, Zipeng | Stanford University |
Cheng, Xuxin | University of California, San Diego |
Wang, Xiaolong | UC San Diego |
Finn, Chelsea | Stanford University |
Keywords: Sensorimotor Learning, Legged Robots
Abstract: Learning-based methods have achieved strong performance for quadrupedal locomotion. However, several challenges prevent quadrupeds from learning helpful indoor skills that require interaction with environments and humans: lack of end-effectors for manipulation, limited semantic understanding using only simulation data, and low traversability and reachability in indoor environments. We present a system for quadrupedal mobile manipulation in indoor environments. It uses a front-mounted gripper for object manipulation, a low-level controller trained in simulation using egocentric depth for agile skills like climbing and whole-body tilting, and pre-trained vision-language models (VLMs) with a third-person fisheye camera and an egocentric RGB camera for semantic understanding and command generation. We evaluate our system in two unseen environments without any real-world data collection or training. Our system can zero-shot generalize to these environments and complete tasks, like following a user's commands to fetch a randomly placed stuffed toy after climbing over a queen-sized bed, with a 60% success rate.
|
|
WeAT26 |
103B |
Localization 1 |
Regular Session |
Chair: Tasaki, Tsuyoshi | Meijo University |
|
10:30-10:35, Paper WeAT26.1 | |
One-Shot Global Localization through Semantic Distribution Feature Retrieval and Semantic Topological Histogram Registration |
|
Huang, Feixuan | Southeast University |
Liu, Hong | Southeast University |
Gao, Wang | Southeast University |
Pan, Shuguo | Southeast University |
Zhao, Heng | Southeast University |
Keywords: Localization, SLAM
Abstract: One-shot global localization is crucial in many robotic applications, providing significant advantages during initialization and relocalization. However, LiDAR-based one-shot global localization methods encounter challenges, including local feature matching errors, sensitivity to dynamic objects, and computational complexity in the absence of an initial pose. To address these issues, we propose a one-shot LiDAR-semantic-graph-based global localization method. To mitigate the interference of dynamic objects on localization, we extract stable semantic objects from LiDAR point clouds using dynamic curved-voxel clustering and subsequently construct a semantic graph. Furthermore, we leverage the distribution characteristics of the semantic objects to quickly filter candidate retrievals and construct a cost matrix for the Hungarian algorithm, utilizing a semantic topological histogram to solve vertex matching. This yields a coarse pose estimate, which is subsequently refined using Fast-GICP. We demonstrate superior localization performance compared to existing state-of-the-art methods on multiple large-scale outdoor datasets, including MulRan, MCD, and Apollo. Our method will be open-sourced and accessible at: https://github.com/Hfx-J/SGGL.
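A minimal sketch of the vertex-matching step: build a cost matrix from distances between per-vertex semantic topological histograms and solve the assignment with the Hungarian algorithm. The histograms are random stand-ins for the descriptors extracted from the semantic graphs.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
query_hist = rng.random((6, 16))   # 6 query vertices, 16-bin histograms
map_hist = query_hist[rng.permutation(6)] + 0.01 * rng.random((6, 16))

cost = cdist(query_hist, map_hist, metric="cityblock")  # histogram distance
rows, cols = linear_sum_assignment(cost)                # Hungarian algorithm
print(list(zip(rows, cols)))       # recovered vertex correspondences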
|
|
10:35-10:40, Paper WeAT26.2 | |
Spatial Graph Attentional Network Based Place Recognition with Visual Mamba Embedding |
|
Li, Kunmo | Dalian University of Technology |
Ou, Yongsheng | Dalian University of Technology |
Cai, Haiyang | Dalian University of Technology |
Ning, Jian | Wuhan University |
Qi, Man | Northeastern University |
Keywords: Localization, SLAM
Abstract: Visual Place Recognition (VPR) plays a vital role in mobile robotics and autonomous navigation by retrieving reference images from a pre-established database. However, VPR systems frequently encounter performance degradation due to environmental variations. To overcome these challenges, we propose a re-ranking based VPR framework incorporating two key components: (1) A Visual Mamba Embedding (VME) module that optimizes spatial-channel feature interactions to generate discriminative global descriptors; and (2) A Spatial Graph Attentional Network (SGAN) that replaces conventional RANSAC-based verification with an efficient graph attention mechanism, improving matching accuracy while reducing computation. Comprehensive evaluations across multiple benchmark datasets demonstrate that the proposed method achieves superior performance compared to existing state-of-the-art methods, while maintaining advantages in computational efficiency and storage requirements.
|
|
10:40-10:45, Paper WeAT26.3 | |
NeuroLoc: Encoding Navigation Cells for 6-DOF Camera Localization |
|
Li, Xun | East China Normal University |
Yang, Jian | Information Engineering University |
Jia, Fenli | Information Engineering University |
Wang, Muyu | East China Normal University |
Wu, Jun | East China Normal University |
Mi, Jinpeng | USST |
Hu, Jilin | East China Normal University |
Peidong, Liang | Harbin Institute of Technology |
Tang, Xuan | East China Normal University |
Li, Ke | Information Engineering University |
You, Xiong | Information Engineering University |
Wei, Xian | East China Normal University |
Keywords: Localization, Bioinspired Robot Learning, Deep Learning for Visual Perception
Abstract: Recently, camera localization has been widely adopted in autonomous robotic navigation due to its efficiency and convenience. However, autonomous navigation in unknown environments often suffers from scene ambiguity, environmental disturbances, and dynamic object transformation in camera localization. To address this problem, inspired by the brain's cognitive navigation mechanisms (such as grid cells, place cells, and head direction cells), we propose a novel neurobiological camera localization method, namely NeuroLoc. Firstly, we design a Hebbian learning module driven by place cells to save and replay historical information, aiming to restore the details of historical representations and alleviate scene ambiguity. Secondly, we utilize head-direction-cell-inspired internal direction learning as a multi-head attention embedding to help restore the true orientation in similar scenes. Finally, we add 3D grid center prediction in the pose regression module to reduce erroneous final predictions. We evaluate the proposed NeuroLoc on commonly used benchmark indoor and outdoor datasets. The experimental results show that NeuroLoc can enhance robustness in complex environments and improve the performance of pose regression using only a single image.
|
|
10:45-10:50, Paper WeAT26.4 | |
DogLegs: Robust Proprioceptive State Estimation for Legged Robots Using Multiple Leg-Mounted IMUs |
|
Wu, Yibin | University of Bonn |
Kuang, Jian | Wuhan University |
Khorshidi, Shahram | University of Bonn |
Niu, Xiaoji | Wuhan University |
Klingbeil, Lasse | University of Bonn |
Bennewitz, Maren | University of Bonn |
Kuhlmann, Heiner | University of Bonn |
Keywords: Localization, Legged Robots, Sensor Fusion
Abstract: Robust and accurate proprioceptive state estimation of the main body is crucial for legged robots to execute tasks in extreme environments where exteroceptive sensors, such as LiDARs and cameras, may become unreliable. In this paper, we propose DogLegs, a state estimation system for legged robots that fuses the measurements from a body-mounted inertial measurement unit (Body-IMU), joint encoders, and multiple leg-mounted IMUs (Leg-IMU) using an extended Kalman filter (EKF). The filter system contains the error states of all IMU frames. The Leg-IMUs are used to detect foot contact, thereby providing zero-velocity measurements to update the state of the Leg-IMU frames. Additionally, we compute the relative position constraints between the Body-IMU and Leg-IMUs by the leg kinematics and use them to update the main body state and reduce the error drift of the individual IMU frames. Field experimental results have shown that our proposed DogLegs system achieves better state estimation accuracy compared to the traditional leg odometry method (using only Body-IMU and joint encoders) across various terrains. We make our datasets publicly available to benefit the research community (https://github.com/YibinWu/leg-odometry).
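A minimal sketch of the zero-velocity update applied when a Leg-IMU detects foot contact: a standard EKF measurement update with the pseudo-measurement v = 0 on that IMU's velocity states. The state layout and noise values are assumptions, not the paper's filter design.

import numpy as np

def zupt_update(x, P, vel_idx, sigma_v=0.01):
    """x: state vector, P: covariance, vel_idx: indices of the foot velocity."""
    H = np.zeros((3, len(x)))
    H[np.arange(3), vel_idx] = 1.0           # measurement picks out velocity states
    R = (sigma_v**2) * np.eye(3)
    z = np.zeros(3)                          # stance foot: velocity is zero
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

x0, P0 = np.ones(9) * 0.1, np.eye(9) * 0.5   # toy 9-state filter
x1, P1 = zupt_update(x0, P0, vel_idx=[3, 4, 5])
print(x1[3:6])                               # velocity pulled toward zero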
|
|
10:50-10:55, Paper WeAT26.5 | |
GSPR: Multimodal Place Recognition Using 3D Gaussian Splatting for Autonomous Driving |
|
Qi, Zhangshuo | Beijing Institute of Technology |
Ma, Junyi | Beijing Institute of Technology |
Xu, Jingyi | Shanghai Jiao Tong University |
Zhou, Zijie | Beijing Institute of Technology |
Cheng, Luqi | Beijing Institute of Technology |
Xiong, Guangming | Beijing Institute of Technology |
Keywords: Localization, Sensor Fusion, SLAM
Abstract: Place recognition is a crucial component that enables autonomous vehicles to obtain localization results in GPS-denied environments. In recent years, multimodal place recognition methods have gained increasing attention. They overcome the weaknesses of unimodal sensor systems by leveraging complementary information from different modalities. Most existing methods explore cross-modality correlations through feature-level or descriptor-level fusion. Conversely, the recently proposed 3D Gaussian Splatting provides a new perspective on multimodal spatio-temporal fusion by harmonizing temporally continuous multimodal data into an explicit scene representation. In this paper, we propose a 3D Gaussian Splatting-based multimodal place recognition network dubbed GSPR. It explicitly combines multi-view RGB images and LiDAR point clouds into a spatio-temporally unified scene representation with the proposed Multimodal Gaussian Splatting. A network composed of 3D graph convolution and transformer is designed to extract global descriptors from the Gaussian scenes for place recognition. Extensive evaluations on three datasets demonstrate that our method can effectively leverage complementary strengths of both multi-view cameras and LiDAR, achieving SOTA place recognition performance while maintaining solid generalization ability. Our open-source code will be released at https://github.com/QiZS-BIT/GSPR.
|
|
10:55-11:00, Paper WeAT26.6 | |
SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation |
|
Xu, Beining | Shanghai Jiao Tong University |
Zhu, Siting | Shanghai Jiao Tong University |
Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: Localization
Abstract: We propose SGLoc, a novel localization system that directly regresses camera poses from a 3D Gaussian Splatting (3DGS) representation by leveraging semantic information. Our method utilizes the semantic relationship between the 2D image and the 3D scene representation to estimate the 6DoF pose without prior pose information. In this system, we introduce a multi-level pose regression strategy that progressively estimates and refines the pose of the query image from the global 3DGS map, without requiring initial pose priors. Moreover, we introduce a semantic-based global retrieval algorithm that establishes correspondences between the 2D image and the 3D 3DGS map. By matching the extracted scene semantic descriptors of the 2D query image and the 3DGS semantic representation, we align the image with the local region of the global 3DGS map, thereby obtaining a coarse pose estimate. Subsequently, we refine the coarse pose by iteratively optimizing the difference between the query image and the image rendered from the 3DGS. Our SGLoc demonstrates superior performance over baselines on the 12scenes and 7scenes datasets, showing excellent capabilities in global localization without an initial pose prior. Code will be available at https://github.com/IRMVLab/SGLoc.
|
|
11:00-11:05, Paper WeAT26.7 | |
Underwater Target 6D State Estimation Via UUV Attitude Enhance Observability |
|
Liu, Fen | Nanyang Technological University |
Jia, Chengfeng | Nanyang Technological University |
Zhang, Na | Nanyang Technological University |
Yuan, Shenghai | Nanyang Technological University |
Su, Rong | Nanyang Technological University |
Keywords: Localization, Range Sensing, Distributed Robot Systems
Abstract: Accurate relative state observation of Unmanned Underwater Vehicles (UUVs) for tracking uncooperative targets remains a significant challenge due to the absence of GPS, complex underwater dynamics, and sensor limitations. Existing localization approaches rely on either global positioning infrastructure or multi-UUV collaboration, both of which are impractical for a single UUV operating in large or unknown environments. To address this, we propose a novel persistent relative 6D state estimation framework that enables a single UUV to estimate its relative motion to a non-cooperative target using only successive noisy range measurements from two monostatic sonar sensors. Our key contribution is an observability-enhanced attitude control strategy, which optimally adjusts the UUV’s orientation to improve the observability of relative position estimation using a Kalman filter, effectively mitigating the impact of sensor noise and drift accumulation. Additionally, we introduce a rigorously proven Lyapunov-based tracking control strategy that guarantees long-term stability by ensuring that the UUV maintains an optimal measurement range, preventing localization errors from diverging over time. Through theoretical analysis and simulations, we demonstrate that our method significantly improves 6D relative state estimation accuracy and robustness compared to conventional approaches. This work provides a scalable, infrastructure-free solution for UUVs tracking uncooperative targets underwater.
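The heart of such range-only tracking is a Kalman measurement update whose observability depends on where the range sensors sit, which is exactly what the attitude strategy manipulates. Below is a minimal EKF range-update sketch; the state layout, sonar offsets, and noise values are illustrative assumptions, not the paper's formulation.

```python
# Minimal sketch of an EKF update with a single noisy range measurement;
# fusing ranges from two sonar heads at different body positions improves
# the observability of the relative target position.
import numpy as np

def range_update(x, P, z, s, sigma_r=0.1):
    """x: relative target position estimate (3,), s: sonar position (3,)."""
    d = x - s
    r_pred = np.linalg.norm(d)
    H = (d / r_pred).reshape(1, 3)           # Jacobian of range w.r.t. x
    S = H @ P @ H.T + sigma_r ** 2
    K = (P @ H.T) / S                        # Kalman gain, shape (3, 1)
    x = x + (K * (z - r_pred)).ravel()
    P = (np.eye(3) - K @ H) @ P
    return x, P

x, P = np.array([4.0, 1.0, -2.0]), np.eye(3)
# Fuse ranges from two sonar heads mounted at different body locations.
for s, z in [(np.array([0.3, 0.0, 0.0]), 4.6), (np.array([-0.3, 0.0, 0.0]), 4.9)]:
    x, P = range_update(x, P, z, s)
print(x, np.trace(P))  # updated estimate, reduced uncertainty
```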
|
|
11:05-11:10, Paper WeAT26.8 | |
Heterogeneous Graph Network-Based UWB Localization for Complex Indoor Environments |
|
Yang, Bo | Nanjing University of Information Science & Technology |
Li, Luyang | Nanjing University of Information Science & Technology |
He, Sizhen | Nanjing University of Information Science and Technology |
Chen, Weinan | Guangdong University of Technology |
Zhang, Hong | SUSTech |
Keywords: Localization
Abstract: Accurate indoor location-based services are important for mobile robots, especially in complex indoor environments. In this paper, we propose a heterogeneous graph network-based ultra-wideband (UWB) localization method to provide accurate and robust localization results for mobile robots in complex indoor scenarios. The core of our approach lies in constructing the anchors, ranging measurements, and tags into a heterogeneous graph structure according to the topological structure of the UWB localization system, and then designing a spatial-temporal heterogeneous graph attention neural network to extract high-level features and estimate the tag locations from the graph. In this way, the geometric relationships contained in the UWB localization system are comprehensively established, while the spatial and temporal information contained in the ranging measurements can also be extracted. We validate the proposed method through real-world experiments. The results demonstrate that, compared to existing deep learning-based methods, the constructed heterogeneous graph better represents the geometric structure of the UWB localization system, and the designed heterogeneous graph neural network effectively extracts the spatial-temporal and geometric features. Consequently, the accuracy and robustness of UWB localization are significantly improved.
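To make the graph construction concrete, the sketch below assembles anchors, one tag, and ranging measurements into typed node and edge sets of the kind a heterogeneous graph attention network would consume. The schema and values are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a heterogeneous UWB localization graph: anchor and tag
# node types, with typed "ranging" edges carrying the range as an edge feature.
import numpy as np

anchors = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 8.0], [0.0, 8.0]])  # known
ranges = np.array([5.2, 6.1, 7.9, 6.8])   # one tag's ranges to each anchor

# Node feature matrices per node type (tag position is the unknown -> zeros).
node_features = {
    "anchor": anchors,
    "tag": np.zeros((1, 2)),
}
# Typed edges: one ranging edge per measurement; a GNN would propagate
# messages along these edges, attending over the range features.
edge_index = {("anchor", "ranging", "tag"): np.array([[0, 1, 2, 3],
                                                      [0, 0, 0, 0]])}
edge_attr = {("anchor", "ranging", "tag"): ranges.reshape(-1, 1)}

print(edge_index[("anchor", "ranging", "tag")].shape)  # (2, 4): 4 ranging edges
```

Stacking such graphs over a short time window is one way the temporal dimension described above could enter the model.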
|
|
WeAT27 |
103C |
Performance Evaluation and Benchmarking 1 |
Regular Session |
|
10:30-10:35, Paper WeAT27.1 | |
Towards Robust Sensor-Fusion Ground SLAM: A Comprehensive Benchmark and a Resilient Framework |
|
Zhang, Deteng | Northwestern Polytechnical University |
Zhang, Junjie | Chongqing University |
Sun, Yan | Nankai University |
Li, Tao | Zhejiang University of Technology |
Yin, Hao | Shanghai Jiao Tong University |
Xie, Hongzhao | Beijing Institute for General Artificial Intelligence (BIGAI) |
Yin, Jie | Shanghai Jiao Tong University |
Keywords: Performance Evaluation and Benchmarking, Data Sets for SLAM, Sensor Fusion
Abstract: Significant progress has been made in SLAM algorithms for structured environments, yet their robustness under challenging corner cases remains a critical limitation. Although multi-sensor fusion approaches integrating diverse sensors have shown promising performance improvements, the research community faces two key barriers: on the one hand, the lack of standardized and configurable benchmarks that systematically evaluate SLAM algorithms under diverse degradation scenarios hinders comprehensive performance assessment; on the other hand, existing SLAM frameworks primarily focus on fusing a limited set of sensor types, without effectively addressing adaptive sensor selection strategies for varying environmental conditions. To bridge these gaps, we make three key contributions: First, we introduce the M3DGR dataset, a sensor-rich benchmark with systematically induced degradation patterns including visual challenges, LiDAR degeneracy, wheel slippage, and GNSS denial. Second, we conduct a comprehensive evaluation of forty SLAM systems on M3DGR, providing critical insights into their robustness and limitations under challenging real-world conditions. Third, we develop a resilient modular multi-sensor fusion framework named Ground-Fusion++, which demonstrates robust performance by coupling GNSS, RGB-D, LiDAR, IMU (Inertial Measurement Unit), and wheel odometry. All code and datasets are publicly available.
|
|
10:35-10:40, Paper WeAT27.2 | |
The Foundation for Tactile Robots: Approaching the Holistic Analysis of a Robot’s Force Sensing Capabilities |
|
Kirschner, Robin Jeanne | TU Munich, Institute for Robotics and Systems Intelligence |
Siegner, Sebastian Julian | TU Munich |
Karacan, Kübra | Technical University of Munich |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Performance Evaluation and Benchmarking, Intelligent and Flexible Manufacturing, Safety in HRI
Abstract: Contact estimation and force sensing are fundamental requirements for sensitive manipulation and safe physical human-robot interaction. The robot controllers that enable these functions rely on accurate and precise sensing. The performance of external force estimation is influenced by the design of the robot's sensory system, and, similar to how humans prefer specific arm configurations for performing precise and delicate tasks, e.g., drawing a thin, straight line, robots also have "sweet spots" that allow for the most accurate performance of tasks based on their sensing capabilities. To fully exploit a robot's proprioceptive force sensing, it is essential to provide robot integrators, designers, and simulations with knowledge about these optimal settings, including factors such as joint configurations, temperatures, and more. This paper first investigates which of these factors are most relevant and how they can best be measured, and based on that introduces force sensing error maps as a tool for structured research on robot force sensing performance and future developments of tactile robot applications. We first investigate the factors influencing the force sensing performance of 7-degree-of-freedom robots using a Kinova Gen3 as an example and then derive 2-dimensional Cartesian force sensing error maps for this robot, an LWR iiwa 14, and a Franka Emika robot. These maps enable comparison of robot sensing capabilities, revealing patterns and weak spots to guide application design toward more tactile areas.
|
|
10:40-10:45, Paper WeAT27.3 | |
Bench4Merge: A Comprehensive Benchmark for Merging in Realistic Dense Traffic with Micro-Interactive Vehicles |
|
Wang, Zhengming | Zhejiang University |
Wang, Junli | Institute of Automation, Chinese Academy of Sciences |
Li, Pengfei | Institute for AI Industry Research (AIR), Tsinghua University |
Li, Zhaohan | Wilbraham Manson Academy |
Liu, Chunyang | DiDi Chuxing |
Zhang, Bo | DiDi Inc |
Li, Peng | Tsinghua University |
Chen, Yilun | Tsinghua University |
Keywords: Performance Evaluation and Benchmarking, Motion and Path Planning, Imitation Learning
Abstract: While the capabilities of autonomous driving have advanced rapidly, merging into dense traffic remains a significant challenge; many motion planning methods for this scenario have been proposed, but they are hard to evaluate. Most existing closed-loop simulators rely on rule-based controls for other vehicles, which results in a lack of diversity and randomness, thus failing to accurately assess motion planning capabilities in highly interactive scenarios. Moreover, traditional evaluation metrics are insufficient for comprehensively evaluating the performance of merging in dense traffic. In response, we propose a closed-loop evaluation benchmark for assessing motion planning capabilities in merging scenarios. Our approach involves other vehicles trained on large-scale datasets with micro-behavioral characteristics that significantly enhance the complexity and diversity. Additionally, we have restructured the evaluation mechanism by leveraging Large Language Models (LLMs) to assess each autonomous vehicle merging onto the main lane. Extensive experiments and test-vehicle deployment demonstrate the advances offered by this benchmark. Through this benchmark, we have obtained an evaluation of existing methods and identified common issues. The simulation environment and evaluation process can be accessed at https://github.com/WZM5853/Bench4Merge.
|
|
10:45-10:50, Paper WeAT27.4 | |
Evaluating Robot Program Performance Based on Power Consumption |
|
Heredia, Juan | University of Southern Denmark |
Stubbe Kolvig-Raun, Emil | PhD Fellow, University of Southern Denmark, Universal Robots |
Sørensen, Sune Lundø | University of Southern Denmark |
Mikkel, Kjærgaard | University of Southern Denmark |
Keywords: Performance Evaluation and Benchmarking, Sustainable Production and Service Automation, Industrial Robots
Abstract: The code performance of industrial robots is typically analyzed through CPU metrics, which overlook the physical impact of code on robot behavior. This study introduces a novel framework for assessing robot program performance from an embodiment perspective by analyzing the robot’s electrical power profile. Our approach diverges from conventional CPU-based evaluations and instead leverages a suite of normalized metrics, namely, the energy utilization coefficient, the energy conversion metric, and the reliability coefficient, to capture how efficiently and reliably energy is used during task execution. Complementing these metrics, the established robot wear metric provides further insight into long-term reliability. Our approach is demonstrated through an experimental case study in machine tending, comparing four programs with diverse strategies using a UR5e robot. The proposed metrics directly compare and categorize different robot programs, regardless of the specific task, by linking code performance to its physical manifestation through power consumption patterns. Our results reveal the strengths and weaknesses of each strategy, offering actionable insights for optimizing robot programming practices. Enhancing energy efficiency and reliability through this embodiment-centric approach not only improves individual robot performance but also supports broader industrial objectives such as sustainable manufacturing and cost reduction.
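As a rough illustration of such power-profile analysis, the sketch below integrates a sampled power trace into energy and forms a normalized utilization ratio. The specific formula (energy above idle over total energy) is an assumption for illustration, not the paper's exact definition of the energy utilization coefficient.

```python
# Illustrative sketch of power-trace metrics for a robot program run.
import numpy as np

def energy_joules(power_w, dt_s):
    """Trapezoidal integration of a sampled power trace [W] into energy [J]."""
    p = np.asarray(power_w)
    return float(np.sum((p[1:] + p[:-1]) * 0.5 * dt_s))

dt = 0.01
t = np.arange(0, 10, dt)
power = 90 + 40 * np.clip(np.sin(2 * np.pi * 0.2 * t), 0, None)  # synthetic trace
idle_power = 90.0                                                # baseline draw

e_total = energy_joules(power, dt)
e_motion = energy_joules(power - idle_power, dt)  # energy spent above idle
utilization = e_motion / e_total                  # hypothetical normalized coefficient
print(f"total {e_total:.0f} J, utilization {utilization:.2f}")
```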
|
|
10:50-10:55, Paper WeAT27.5 | |
Can Real-To-Sim Approaches Capture Dynamic Fabric Behavior for Robotic Fabric Manipulation? |
|
Ru, Yingdong | University of Glasgow |
Zhuang, Lipeng | University of Glasgow |
He, Zhuo | University of Glasgow |
Audonnet, Florent P. | University of Glasgow |
Aragon-Camarasa, Gerardo | University of Glasgow |
Keywords: Performance Evaluation and Benchmarking, Data Sets for Robotic Vision, Simulation and Animation
Abstract: This paper presents a rigorous evaluation of Real-to-Sim parameter estimation approaches for fabric manipulation in robotics. The study systematically assesses three state-of-the-art approaches, namely two differentiable pipelines and a data-driven approach. We also devise a novel physics-informed neural network (PINN) approach for physics parameter estimation. These approaches are interfaced with two simulators across multiple Real-to-Sim scenarios (lifting, wind blowing, and stretching) for five different fabric types and evaluated on three unseen scenarios (folding, fling, and shaking). We found that the simulation engines and the choice of Real-to-Sim approach significantly impact fabric manipulation performance in our evaluation scenarios. Moreover, the PINN achieves superior performance in quasi-static tasks but shows limitations in dynamic scenarios. Videos and source code are available at cvas-ug.github.io/real2sim-study.
|
|
10:55-11:00, Paper WeAT27.6 | |
Drive&Gen: Co-Evaluating End-To-End Driving and Video Generation Models |
|
Wang, Jiahao | Johns Hopkins University |
Yang, Zhenpei | Waymo |
Bai, Yijing | Waymo |
Li, Yingwei | Waymo |
Zou, Yuliang | Waymo |
Sun, Bo | Waymo |
Kundu, Abhijit | Georgia Tech |
Lezama, Jose | Google DeepMind |
Huang, Yue | Waymo LLC |
Zhu, Zehao | Waymo |
Hwang, Jyh-Jing | Waymo |
Anguelov, Dragomir | Waymo |
Tan, Mingxing | Waymo Research |
Jiang, Chiyu | Waymo LLC |
Keywords: Performance Evaluation and Benchmarking, Autonomous Agents, Computer Vision for Automation
Abstract: Recent advances in generative models have sparked exciting new possibilities in the field of autonomous vehicles. Specifically, video generation models are now being explored as controllable virtual testing environments. Simultaneously, end-to-end (E2E) driving models have emerged as a streamlined alternative to conventional modular autonomous driving systems, gaining popularity for their simplicity and scalability. However, the application of these techniques to simulation and planning raises important questions. First, while video generation models can generate increasingly realistic videos, can these videos faithfully adhere to the specified conditions and be realistic enough for E2E autonomous planner evaluation? Second, given that data is crucial for understanding and controlling E2E planners, how can we gain deeper insights into their biases and improve their ability to generalize to out-of-distribution scenarios? In this work, we bridge the gap between the driving models and generative world models (Drive&Gen) to address these questions. We propose novel statistical measures leveraging E2E drivers to evaluate the realism of generated videos. By exploiting the controllability of the video generation model, we conduct targeted experiments to investigate distribution gaps affecting E2E planner performance. Finally, we show that synthetic data produced by the video generation model offers a cost-effective alternative to real-world data collection. This synthetic data effectively improves E2E model generalization beyond existing Operational Design Domains, facilitating the expansion of autonomous vehicle services into new operational contexts.
|
|
11:00-11:05, Paper WeAT27.7 | |
λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics |
|
Jaafar, Ahmed | Brown University |
Sundara Raman, Shreyas | Brown University |
Harithas, Sudarshan S | Brown University |
Wei, Yichen | Brown University |
Juliani, Sofia | Rutgers University |
Wernerfelt, Anneke | University of Pennsylvania |
Quartey, Benedict | Brown University |
Idrees, Ifrah | Brown University |
Liu, Jason Xinyu | Brown University |
Tellex, Stefanie | Brown |
Keywords: Performance Evaluation and Benchmarking, Data Sets for Robot Learning, Mobile Manipulation
Abstract: Learning to execute long-horizon mobile manipulation tasks is crucial for advancing robotics in household and workplace settings. However, current approaches are typically data-inefficient, underscoring the need for improved models that require realistically sized benchmarks to evaluate their efficiency. To address this, we introduce the LAMBDA (λ) benchmark (Long-horizon Actions for Mobile-manipulation Benchmarking of Directed Activities), which evaluates the data efficiency of models on language-conditioned, long-horizon, multi-room, multi-floor, pick-and-place tasks using a dataset of manageable size that is more feasible for collection. Our benchmark includes 571 human-collected demonstrations that provide realism and diversity in simulated and real-world settings. Unlike planner-generated data, these trajectories offer natural variability and replay-verifiability, ensuring robust learning and evaluation. We leverage λ to benchmark current end-to-end learning methods and a modular neuro-symbolic approach that combines foundation models with task and motion planning. We find that learning methods, even when pretrained, yield lower success rates, while a neuro-symbolic method performs significantly better and requires less data.
|
|
WeAT28 |
104 |
Marine Robotics 5 |
Regular Session |
Chair: Wang, Wei | University of Wisconsin-Madison |
|
10:30-10:35, Paper WeAT28.1 | |
Dynamic Modeling and Efficient Data-Driven Optimal Control for Micro Autonomous Surface Vehicles |
|
Chen, Zhiheng | Cornell University |
Wang, Wei | University of Wisconsin-Madison |
Keywords: Marine Robotics, Model Learning for Control, Optimization and Optimal Control
Abstract: Micro Autonomous Surface Vehicles (MicroASVs) offer significant potential for operations in confined or shallow waters and swarm robotics applications. However, achieving precise and robust control at such small scales remains highly challenging, mainly due to the complexity of modeling nonlinear hydrodynamic forces and the increased sensitivity to self-motion effects and environmental disturbances, including waves and boundary effects in confined spaces. This paper presents a physics-driven dynamics model for an over-actuated MicroASV and introduces a data-driven optimal control framework that leverages a weak formulation-based online model learning method. Our approach continuously refines the physics-driven model in real time, enabling adaptive control that adjusts to changing system parameters. Simulation results demonstrate that the proposed method substantially enhances trajectory tracking accuracy and robustness, even under unknown payloads and external disturbances. These findings highlight the potential of data-driven online learning-based optimal control to improve MicroASV performance, paving the way for more reliable and precise autonomous surface vehicle operations.
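Online model refinement of this kind can be illustrated with recursive least squares, a simple stand-in for the paper's weak-formulation online learner; the drag-model regressor below is an assumed example, not the MicroASV dynamics.

```python
# Minimal sketch of online model refinement via recursive least squares (RLS).
import numpy as np

class RecursiveLeastSquares:
    def __init__(self, n_params, forgetting=0.99):
        self.theta = np.zeros(n_params)          # estimated model parameters
        self.P = np.eye(n_params) * 1e3          # parameter covariance
        self.lam = forgetting                    # forgetting factor

    def update(self, phi, y):
        """phi: regressor vector, y: measured output (e.g., acceleration)."""
        Pphi = self.P @ phi
        k = Pphi / (self.lam + phi @ Pphi)       # gain
        self.theta += k * (y - phi @ self.theta)
        self.P = (self.P - np.outer(k, Pphi)) / self.lam
        return self.theta

# Example: identify linear + quadratic drag, y = a*v + b*v|v|, online.
rls = RecursiveLeastSquares(2)
rng = np.random.default_rng(0)
for _ in range(500):
    v = rng.uniform(-1, 1)
    y = -0.5 * v - 2.0 * v * abs(v) + rng.normal(0, 0.01)
    theta = rls.update(np.array([v, v * abs(v)]), y)
print(theta)  # approaches [-0.5, -2.0]
```

The refined parameters would then feed the optimal controller at each control step, which is the adaptive loop the abstract describes.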
|
|
10:35-10:40, Paper WeAT28.2 | |
Aqua Slide: An Underwater Leveling Motion Scheme for M-UAAV Utilizing Singularity (I) |
|
Huang, Dongyue | The Chinese University of Hong Kong |
Dou, Minghao | The Chinese University of Hong Kong |
Liu, Xuchen | The Chinese University of Hong Kong |
Wang, Xinyi | University of Michigan |
Wang, Chenggang | Shanghai Jiao Tong University |
Chen, Ben M. | Chinese University of Hong Kong |
Keywords: Marine Robotics, Motion Control, Field Robots
Abstract: The underwater leveling motion control of a morphable unmanned aerial-aquatic vehicle (M-UAAV) is essential but lacks an elegant solution. In this article, a scheme named Aqua Slide is proposed for Mirs-Alioth, our predeveloped M-UAAV prototype, to achieve underwater leveling motions by utilizing singularity. For the first time, the Singular Thrust Tilt Angle (STTA), i.e., the tilt angle that causes a vehicle to reach singularity, is characterized and defined. For underwater leveling motion generation, the STTA is utilized as a key tool. Crucial factors of these motions, namely the existence of control direction uncertainty and coupling effects, are identified. Aqua Slide, consisting of an STTA controller, a primary-auxiliary switch, and an auxiliary controller incorporating the saturated Nussbaum function, is proposed to tackle these factors. A simulation environment, enriched with a detailed model, is established using a Gazebo and Hardware-In-The-Loop (HITL) configuration. Under similar control parameter settings, full-cycle Gazebo-HITL simulations and experiments reveal the effectiveness of the approach. The vehicle maintains its attitude within a maximum of 5 degrees during the whole procedure. Meanwhile, the ablation studies of Aqua Slide highlight the indispensability of the sub-modules in the approach: the absence of any module results in undesirable or even divergent behavior of Mirs-Alioth during underwater leveling motions. The preliminary exploration in this article solves the issue of generating underwater leveling motion for an M-UAAV and provides a referential solution for transitioning from simulation to real-world applications.
|
|
10:40-10:45, Paper WeAT28.3 | |
Control Marine Vehicles with Azimuth Thrusters Using Convex Constrained Quadratic Programming |
|
Zhou, Mingxi | University of Rhode Island |
Naderi, Farhang | University of Rhode Island |
Yuan, Chengzhi | University of Rhode Island |
Keywords: Marine Robotics, Motion Control
Abstract: Azimuth thrusters are widely used for controlling marine vehicles, especially for dynamic positioning and hovering purposes. However, including azimuth thrusters makes control allocation a nonlinear, non-convex problem, which is commonly solved using nonlinear programming methods, simplified by pairing azimuth thrusters (e.g., two thrusters always move at the same angle), or locally linearized using approximation equations such as Taylor series expansions and polynomial functions. In this paper, a new approach is presented that recasts the azimuth thruster control allocation problem as a convex quadratic problem with a new force decomposition and linear first-order inequality constraints. As a result, the complexity of the control allocation increases linearly with the number of azimuth thrusters, allowing it to be implemented on marine vehicles with larger numbers of independently controlled azimuth thrusters and solved using constrained quadratic programming solvers. Case studies are presented to validate the proposed method on simulated Autonomous Underwater Vehicles (AUVs) with two and four azimuth thrusters configured with different azimuth angle limits (±45, ±90, and ±135 degrees). The results show excellent control performance of the proposed approach in controlling multiple states (surge, pitch, yaw, depth, and sway) simultaneously, even when experiencing a cross-track ocean current. Recommendations on hardware implementation are also discussed for real-world platform integration.
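The key idea, decomposing each azimuth thruster's force into Cartesian components so that azimuth limits become linear inequalities, can be sketched as a small convex program. Thruster geometry, limits, and weights below are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of convex control allocation with decomposed thruster forces.
import numpy as np
from scipy.optimize import minimize

tau_des = np.array([10.0, 2.0, 1.5])         # desired [surge X, sway Y, yaw N]
pos = np.array([[1.0, 0.5], [1.0, -0.5]])    # thruster (x, y) positions [m]

def wrench(u):
    """u = [Fx1, Fy1, Fx2, Fy2] -> generalized force [X, Y, N]."""
    F = u.reshape(2, 2)
    X, Y = F[:, 0].sum(), F[:, 1].sum()
    N = float(np.sum(pos[:, 0] * F[:, 1] - pos[:, 1] * F[:, 0]))
    return np.array([X, Y, N])

def cost(u):
    e = wrench(u) - tau_des
    return e @ e + 1e-3 * (u @ u)            # track the wrench, penalize effort

# A +/-45 degree azimuth limit about +x is exactly two linear inequalities
# per thruster: Fx - Fy >= 0 and Fx + Fy >= 0 (i.e., |Fy| <= Fx).
cons = []
for i in range(2):
    cons.append({"type": "ineq", "fun": lambda u, i=i: u[2 * i] - u[2 * i + 1]})
    cons.append({"type": "ineq", "fun": lambda u, i=i: u[2 * i] + u[2 * i + 1]})

res = minimize(cost, np.ones(4), constraints=cons, method="SLSQP")
print(res.x.reshape(2, 2))                   # per-thruster (Fx, Fy) commands
```

Each commanded (Fx, Fy) pair is then converted back to a thrust magnitude and azimuth angle for the physical thruster.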
|
|
10:45-10:50, Paper WeAT28.4 | |
Wave-Aware Control of Workspace-Constrained Shipboard Robots for Motion Compensation in Rough Seas |
|
Kong, Lingda | Shanghai Jiao Tong University |
Gao, Zhen | Shanghai Jiao Tong University |
Keywords: Marine Robotics, Parallel Robots, Motion Control
Abstract: In this work, we introduce a predictive control framework enabling shipboard robots to execute highly dynamic maneuvers for motion compensation in rough sea conditions. Such offshore operations pose significant challenges to traditional feedback controllers in maintaining workspace constraints under extreme wave disturbances, while real-time feasibility remains a challenge for model-based planning methods, given the limited predictability of future dynamics and the computational demands of extended planning horizons. To address these challenges, we propose a hierarchical planner and a model predictive controller that integrate real-time deterministic wave forecasting with ship motion prediction to enable anticipatory maneuver planning and execution in dynamic offshore environments. We apply this framework to a Stewart-Gangway system onboard a service operation vessel, featuring a 6-DoF parallel mechanism designed for precise motion control in offshore operations. Numerical experiments demonstrate that our approach significantly outperforms traditional reactive methods in stabilizing shipboard platforms under mild sea states. Most importantly, it effectively extends the operational capabilities of the platform across a broader spectrum of sea conditions. Our study demonstrates how wave forecasting can be leveraged to enhance the operational capabilities of shipboard robotic platforms through predictive control and by exploiting their inherent agility.
|
|
10:50-10:55, Paper WeAT28.5 | |
From 2D Underwater Imaging Sonar Data to 3D Plane Extraction |
|
Oliveira, António J. | INESC TEC |
Ferreira, Bruno | INESC TEC |
Cruz, Nuno | University of Porto |
Keywords: Marine Robotics, Range Sensing, SLAM
Abstract: Planar surfaces are commonly found in man-made underwater environments and can be employed to support underwater SLAM. This work focuses on 3D plane extraction, building on two-dimensional acoustic scans collected from an imaging sonar. The novel contribution of our algorithm exploits the sonar's wider beamwidth and ability to collect secondary echoes from these structures to extract a three-dimensional surface from the acquired acoustic image. Building on a Hough Transform-based algorithm adapted to polar acoustic imagery, line feature detection supports plane representation segmentation. An inverse sensor model is subsequently employed to estimate additional plane parameters: inclination, length, and height. Experimental assessment in a confined, controlled environment is presented, validating the accuracy of the algorithm. Additional results from a dam shaft scenario are also presented to assess the potential of the developed tool.
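For reference, the classic Hough-transform voting that underlies such line detection looks as follows on a binary image. This is the textbook Cartesian version, not the paper's polar-adapted variant.

```python
# Classic Hough-transform line voting over a binary (thresholded) image.
import numpy as np

def hough_lines(binary_img, n_theta=180):
    ys, xs = np.nonzero(binary_img)
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    diag = int(np.ceil(np.hypot(*binary_img.shape)))
    acc = np.zeros((2 * diag, n_theta), dtype=int)       # (rho, theta) bins
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1        # one vote per theta
    r, t = np.unravel_index(acc.argmax(), acc.shape)
    return r - diag, thetas[t]                           # strongest line

img = np.zeros((64, 64), dtype=bool)
img[np.arange(64), np.arange(64)] = True                 # diagonal line
rho, theta = hough_lines(img)
print(rho, np.degrees(theta))                            # ~0, ~135 degrees
```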
|
|
10:55-11:00, Paper WeAT28.6 | |
Never Too Prim to Swim: An LLM-Enhanced RL-Based Adaptive S-Surface Controller for AUVs under Extreme Sea Conditions |
|
Xie, Guanwen | Tsinghua University |
Xu, Jingzehua | Tsinghua University |
Ding, Yimian | Tsinghua University |
Zhang, Zhi | Tsinghua University |
Zhang, Shuai | New Jersey Institute of Technology |
Li, Yi | Tsinghua University |
Keywords: Marine Robotics, Reinforcement Learning, Machine Learning for Robot Control
Abstract: The adaptivity and maneuvering capabilities of Autonomous Underwater Vehicles (AUVs) have drawn significant attention in oceanic research, due to the unpredictable disturbances and strong coupling among the AUV's degrees of freedom. In this paper, we develop a large language model (LLM)-enhanced, reinforcement learning (RL)-based adaptive S-surface controller for AUVs. Specifically, LLMs are introduced for the joint optimization of controller parameters and reward functions in RL training. Using multi-modal and structured explicit task feedback, LLMs enable joint adjustments, balance multiple objectives, and enhance task-oriented performance and adaptability. In the proposed controller, the RL policy focuses on upper-level tasks, outputting task-oriented high-level commands that the S-surface controller then converts into control signals, ensuring cancellation of nonlinear effects and unpredictable external disturbances in extreme sea conditions. Under extreme sea conditions involving complex terrain, waves, and currents, the proposed controller demonstrates superior performance and adaptability in high-level tasks such as underwater target tracking and data collection, outperforming traditional PID and SMC controllers.
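The S-surface law referenced here is, in its classic form, a sigmoid of the weighted error and error rate saturating in [-1, 1]. The sketch below shows that form; the gains k1 and k2 are exactly the kind of parameters an LLM/RL loop could tune online (values are illustrative).

```python
# Minimal sketch of the classic S-surface control law used for AUVs.
import numpy as np

def s_surface(e, e_dot, k1=3.0, k2=1.0):
    """Return a normalized control command in [-1, 1] from error and error rate."""
    return 2.0 / (1.0 + np.exp(-k1 * e - k2 * e_dot)) - 1.0

# Small errors give a near-linear response; large errors saturate smoothly.
for e in [0.05, 0.5, 2.0]:
    print(e, round(float(s_surface(e, 0.0)), 3))
```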
|
|
11:00-11:05, Paper WeAT28.7 | |
Make Your AUV Adaptive: An Environment-Aware Reinforcement Learning Framework for Underwater Tasks |
|
Ding, Yimian | Tsinghua University |
Xu, Jingzehua | Tsinghua University |
Xie, Guanwen | Tsinghua University |
Zhang, Shuai | New Jersey Institute of Technology |
Li, Yi | Tsinghua University |
Keywords: Marine Robotics, Reinforcement Learning, Machine Learning for Robot Control
Abstract: This study presents a novel environment-aware reinforcement learning (RL) framework designed to augment the operational capabilities of autonomous underwater vehicles (AUVs) in underwater environments. Departing from traditional RL architectures, the proposed framework integrates an environment-aware network module that dynamically captures flow field data, effectively embedding this critical environmental information into the state space. This integration facilitates real-time environmental adaptation, significantly enhancing the AUV's situational awareness and decision-making capabilities. Furthermore, the framework incorporates AUV structure characteristics into the optimization process, employing a large language model (LLM)-based iterative refinement mechanism that leverages both environmental conditions and training outcomes to optimize task performance. Comprehensive experimental evaluations demonstrate the framework's superior performance, robustness and adaptability.
|
|
11:05-11:10, Paper WeAT28.8 | |
Learning Agile Swimming: An End-To-End Approach without CPGs |
|
Lin, Xiaozhu | ShanghaiTech University |
Liu, Xiaopei | ShanghaiTech University |
Wang, Yang | ShanghaiTech University |
Keywords: Marine Robotics, Reinforcement Learning, Motion Control
Abstract: The pursuit of agile and efficient underwater robots, especially bio-mimetic robotic fish, has been impeded by challenges in creating motion controllers that are able to fully exploit their hydrodynamic capabilities. This paper addresses these challenges by introducing a novel, model-free, end-to-end control framework that leverages Deep Reinforcement Learning (DRL) to enable agile and energy-efficient swimming of robotic fish. Unlike existing methods that rely on predefined trigonometric swimming patterns like Central Pattern Generators (CPG), our approach directly outputs low-level actuator commands without strong constraints, enabling the robotic fish to learn agile swimming behaviors. In addition, by integrating a high-performance Computational Fluid Dynamics (CFD) simulator with innovative sim-to-real strategies, such as normalized density calibration and servo response calibration, the proposed framework significantly mitigates the sim-to-real gap, facilitating direct transfer of control policies to real-world environments without fine-tuning. Comparative experiments demonstrate that our method achieves faster swimming speeds, smaller turnaround radii, and reduced energy consumption compared to the state-of-the-art swimming controllers. Furthermore, the proposed framework shows promise in addressing complex tasks, paving the way for more effective deployment of robotic fish in real aquatic environments.
|
|
WeAT29 |
105 |
SLAM 5 |
Regular Session |
|
10:30-10:35, Paper WeAT29.1 | |
VSLAM-LAB: A Comprehensive Framework for Visual SLAM Methods and Datasets |
|
Fontan, Alejandro | Queensland University of Technology |
Civera, Javier | Universidad De Zaragoza |
Fischer, Tobias | Queensland University of Technology |
Milford, Michael J | Queensland University of Technology |
Keywords: SLAM, Localization
Abstract: Visual Simultaneous Localization and Mapping (VSLAM) research faces significant challenges due to fragmented toolchains, complex system configurations, and inconsistent evaluation methodologies. To address these issues, we present VSLAM-LAB, a unified framework designed to streamline the development, evaluation, and deployment of VSLAM systems. VSLAM-LAB simplifies the entire workflow by enabling seamless compilation and configuration of VSLAM algorithms, automated dataset downloading and preprocessing, and standardized experiment design, execution, and evaluation—all accessible through a single command-line interface. The framework supports a wide range of VSLAM systems and datasets, offering broad compatibility while promoting reproducibility through consistent evaluation metrics and analysis tools. By reducing implementation complexity and minimizing configuration overhead, VSLAM-LAB empowers researchers to focus on advancing VSLAM methodologies and accelerates progress toward scalable, real-world solutions.
|
|
10:35-10:40, Paper WeAT29.2 | |
LOG-SLAM: Large-Scale Outdoor Gaussian SLAM for Dense Mapping and Loop Closure in Kilometer-Scale Scene Reconstruction |
|
Wang, Long | Beijing University of Posts and Telecommunications |
Liu, Haosong | Beijing University of Posts and Telecommunications |
Luo, Haiyong | Institute of Computing Technology, Chinese Academy of Sciences |
Zhao, Fang | Beijing University of Posts and Telecommunications |
Chen, Runze | Beijing University of Posts and Telecommunications |
Chen, Yushi | Beijing University of Posts and Telecommunications |
Yan, Jiaquan | Beijing University of Posts and Telecommunications |
Luo, Dan | Beijing Forestry University |
Keywords: SLAM, Visual Tracking, Mapping
Abstract: The success of 3D Gaussian splatting in 3D reconstruction has recently led to efforts to integrate it with SLAM systems. However, most existing research has focused on indoor tracking and mapping, while outdoor Gaussian SLAM methods still heavily rely on expensive LiDAR sensors. To address these challenges, we propose LOG-SLAM, a novel method for large-scale outdoor tracking and mapping using Gaussian Splatting. Our approach supports tracking through monocular or visual-inertial input, progressively constructing the 3D Gaussian map from depth and pose estimates obtained during the tracking process. Additionally, we introduce a submap-based strategy for managing large-scale maps, enabling the reconstruction of kilometer-scale environments. A loop closure detection module is also incorporated to reduce accumulated errors. Furthermore, we present a novel dynamic object removal method based on rendering loss that mitigates the interference of dynamic objects on the reconstruction. Our experiments on KITTI and KITTI-360 demonstrate that our method achieves localization performance comparable to traditional SLAM systems, while outperforming recent GS/NeRF-based SLAM approaches in terms of mapping and rendering quality.
|
|
10:40-10:45, Paper WeAT29.3 | |
CSVO: Complementary-Pathway Spatial-Enhanced Visual Odometry for Extreme Environments with Brain-Inspired Vision Sensors |
|
Lin, Yihan | Tsinghua University |
Zhang, Zhaoxi | Beijing Institute of Technology |
Chen, Yuguo | Tsinghua University |
Wang, Taoyi | Tsinghua University |
Zhao, Rong | Tsinghua University |
Keywords: SLAM, Visual-Inertial SLAM, Data Sets for SLAM
Abstract: Visual Odometry (VO) estimates the pose and motion trajectory of the camera based on visual input, serving as a fundamental technique for robotic positioning and navigation. However, existing VO methods face challenges from visual degradation in extreme environments, e.g., high dynamic range or fast-motion conditions. Although event-based sensing schemes offer partial solutions to this problem, they are limited by unstable features and noise. Recently, a novel brain-inspired vision sensor, Tianmouc, has been reported, incorporating two complementary pathways: a cognition-oriented pathway (COP) for precise color intensity and an action-oriented pathway (AOP) for fast spatiotemporal sensing, considered a promising visual input for VO tasks. Here, we develop Complementary-Pathway Spatial-Enhanced Visual Odometry (CSVO) to cope with extreme scenarios by fusing the COP and AOP information of Tianmouc. To leverage the dynamic range expansion brought about by dual-pathway fusion, as well as the low-latency spatial difference data in the AOP to address high-speed motion, we design an asynchronous dual-pathway feature encoder considering synchronous multimodal fusion and asynchronous cross-modal feature matching. To train and evaluate CSVO, we transform two conventional VO datasets, TartanAir and Apollo, into the Tianmouc modality through simulation and collect a real-world Tianmouc-VO dataset in challenging scenes. Our results demonstrate state-of-the-art performance over existing methods on these datasets. Our work sheds light on the generalizability of agents working in extreme scenarios. The code and datasets are available at https://github.com/Tianmouc/CSVO.
|
|
10:45-10:50, Paper WeAT29.4 | |
Semantic Enhancement for Object SLAM with Heterogeneous Multimodal Large Language Model Agents |
|
Hong, Jungseok | MIT |
Choi, Ran | Massachusetts Institute of Technology |
Leonard, John | MIT |
Keywords: SLAM, Semantic Scene Understanding, Agent-Based Systems
Abstract: Object Simultaneous Localization and Mapping (SLAM) systems struggle to correctly associate semantically similar objects in close proximity, especially in cluttered indoor environments and when scenes change. We present Semantic Enhancement for Object SLAM (SEO-SLAM), a novel framework that enhances semantic mapping by integrating heterogeneous multimodal large language model (MLLM) agents. Our method enables scene adaptation while maintaining a semantically rich map. To improve computational efficiency, we propose an asynchronous processing scheme that significantly reduces the agents' inference time without compromising semantic accuracy or SLAM performance. Additionally, we introduce a multi-data association strategy using a cost matrix that combines semantic and Mahalanobis distances, formulating the problem as a Linear Assignment Problem (LAP) to alleviate perceptual aliasing. Experimental results demonstrate that SEO-SLAM consistently achieves higher semantic accuracy and reduces false positives compared to baselines, while our asynchronous MLLM agents significantly improve processing efficiency over synchronous setups. We also demonstrate that SEO-SLAM has the potential to improve downstream tasks such as robotic assistance. Our dataset is publicly available at: jungseokhong.com/SEO-SLAM.
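The association step described above maps directly onto standard tooling: build a cost matrix that mixes semantic and Mahalanobis distances, then solve the Linear Assignment Problem, as in this sketch. The 0.5 weighting, embedding sizes, and toy covariance are illustrative assumptions, not SEO-SLAM's tuned values.

```python
# Minimal sketch of multi-data association via a blended cost matrix and LAP.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
det_emb = rng.random((3, 16))      # detection semantic embeddings
lm_emb = rng.random((4, 16))       # landmark semantic embeddings
det_pos = rng.random((3, 2))       # detected object positions
lm_pos = rng.random((4, 2))        # landmark position means
S_inv = np.linalg.inv(0.05 * np.eye(2))   # toy shared innovation covariance

semantic = cdist(det_emb, lm_emb, metric="cosine")             # (3, 4)
maha = cdist(det_pos, lm_pos, metric="mahalanobis", VI=S_inv)  # (3, 4)

alpha = 0.5                                   # semantic vs. geometric weighting
cost = alpha * semantic + (1 - alpha) * maha / maha.max()      # rescale scales
rows, cols = linear_sum_assignment(cost)      # optimal detection -> landmark match
print(list(zip(rows.tolist(), cols.tolist())))
```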
|
|
10:50-10:55, Paper WeAT29.5 | |
VSG-SLAM: A Dense Visual Semantic SLAM with Gaussian Splatting |
|
Tong, Wenyuan | Zhejiang University |
Dai, Kai | TEEMO Technology Co., Ltd |
Zeng, Limin | Zhejiang University |
Keywords: SLAM, Vision-Based Navigation
Abstract: Simultaneous Localization and Mapping (SLAM) is critical for real-time robotic applications, enabling precise localization and comprehensive scene reconstruction. Recent advances in 3D Gaussian Splatting (3DGS) enable high-quality view synthesis and rapid rendering, yet robust and consistent semantic scene representation remains under-explored. In this work, we introduce a dense semantic SLAM framework that integrates high-dimensional semantic features with an explicit 3D Gaussian-based scene representation to address these challenges. Our approach employs a lightweight projection layer that maps low-dimensional semantic features to high-dimensional embeddings, a coarse-to-fine and semantically informed camera tracking strategy that robustly estimates camera poses, a mapping module that incrementally refines the Gaussian map by simultaneously leveraging geometric and photometric cues alongside semantic information, and a covisibility-based local bundle adjustment module for joint optimization of camera poses and Gaussian parameters. Extensive experiments on synthetic and real-world indoor datasets demonstrate that our framework achieves superior reconstruction quality, enhanced semantic segmentation accuracy, and competitive camera pose estimation.
|
|
10:55-11:00, Paper WeAT29.6 | |
Reducing Redundancy in VSLAM: VLMs-Driven Keyframe Selection Using Multi-Dimensional Semantic Information |
|
Huo, Xiang | Guangdong University of Technology |
Chen, Shilang | Guangdong University of Technology |
Zhu, Lei | University of Macau |
Zhu, Haifei | Guangdong University of Technology |
Guan, Yisheng | Guangdong University of Technology |
Zhang, Hong | SUSTech |
Chen, Weinan | Guangdong University of Technology |
Keywords: SLAM, Visual Learning, AI-Based Methods
Abstract: Keyframe selection plays a crucial role in balancing computational efficiency and localization accuracy in Visual Simultaneous Localization and Mapping (VSLAM) systems. Existing keyframe selection methods often struggle to capture high-level semantic information in environments where multiple semantic dimensions interact. In this paper, we propose the multi-dimensional semantic analysis (MSA) module based on Visual-Language Models (VLMs). By leveraging the capability of VLMs to extract rich semantic features, we compute the similarity between each image frame and a set of textual descriptions, generating a scene descriptor that quantifies the semantic distance between frames across multiple dimensions (e.g., object count, texture, and lighting). We then introduce the scene change assessment (SCA) module based on Bayesian Online Changepoint Detection (BOCD), which identifies keyframes with significant semantic information gain, thereby reducing the total number of keyframes. Extensive experiments on an open dataset demonstrate that our method not only significantly reduces the number of keyframes but also maintains high localization accuracy. Furthermore, the inference speed of the MSA module satisfies the real-time requirements of VSLAM. These results underscore the potential of our approach to enhance the efficiency of keyframe selection.
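A minimal sketch of the descriptor idea: score each frame against a bank of text prompts and trigger a keyframe when the descriptor shifts by more than a threshold. The random vectors stand in for real VLM outputs (e.g., CLIP image/text embeddings), and the prompts and threshold are assumptions, not the paper's settings.

```python
# Minimal sketch of a multi-dimensional scene descriptor and keyframe trigger.
import numpy as np

prompts = ["many objects", "few objects", "rich texture",
           "plain walls", "bright lighting", "dim lighting"]

def scene_descriptor(img_emb, text_embs):
    """Cosine similarity of the frame against each semantic dimension."""
    sims = text_embs @ img_emb
    sims /= (np.linalg.norm(text_embs, axis=1) * np.linalg.norm(img_emb))
    return sims

def is_keyframe(desc, last_desc, thresh=0.15):
    """Trigger when the semantic descriptor moves more than the threshold."""
    return np.linalg.norm(desc - last_desc) > thresh

# Toy usage with random embeddings in place of real VLM outputs.
rng = np.random.default_rng(1)
text_embs = rng.normal(size=(len(prompts), 64))
d0 = scene_descriptor(rng.normal(size=64), text_embs)
d1 = scene_descriptor(rng.normal(size=64), text_embs)
print(is_keyframe(d1, d0))
```

The paper's BOCD module replaces the fixed threshold here with a changepoint test over the descriptor stream.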
|
|
11:00-11:05, Paper WeAT29.7 | |
SplatPose: Geometry-Aware 6-DoF Pose Estimation from Single RGB Image Via 3D Gaussian Splatting |
|
Yang, Linqi | State Key Laboratory of Robotics and System, Harbin Institute Of |
Zhao, Xiongwei | Harbin Institute of Technology |
Sun, Qihao | Harbin Institute of Technology |
Wang, Ke | Harbin Institute of Technology |
Chen, Ao | Harbin Institute of Technology |
Kang, Peng | Harbin Institute of Technology(Shen Zhen) |
Keywords: SLAM, Visual Learning, Mapping
Abstract: 6-DoF pose estimation is a fundamental task in computer vision with wide-ranging applications in augmented reality and robotics. Existing single RGB-based methods often compromise accuracy due to their reliance on initial pose estimates and susceptibility to rotational ambiguity, while approaches requiring depth sensors or multi-view setups incur significant deployment costs. To address these limitations, we introduce SplatPose, a novel framework that synergizes 3D Gaussian Splatting (3DGS) with a dual-branch neural architecture to achieve high-precision pose estimation using only a single RGB image. Central to our approach is the Dual-Attention Ray Scoring Network (DARS-Net), which innovatively decouples positional and angular alignment through geometry-domain attention mechanisms, explicitly modeling directional dependencies to mitigate rotational ambiguity. Additionally, a coarse-to-fine optimization pipeline progressively refines pose estimates by aligning dense 2D features between query images and 3DGS-synthesized views, effectively correcting feature misalignment and depth errors from sparse ray sampling. Experiments on three benchmark datasets demonstrate that SplatPose achieves state-of-the-art 6-DoF pose estimation accuracy in single RGB settings, rivaling approaches that depend on depth or multi-view images.
|
|
11:05-11:10, Paper WeAT29.8 | |
MSPA-LIO: LiDAR-Inertial Odometry with Multi-Scale Plane Adjustment |
|
Yan, Su | Northeastern University |
Zhao, Shuying | Northeastern University |
Zhang, Yunzhou | Northeastern University |
Ding, Hengwang | Northeastern University |
Li, Wu | Northeastern University |
Wang, Sizhan | Northeastern University |
Wu, Song | Northeastern University |
Keywords: SLAM, Mapping
Abstract: Most current LiDAR-based odometry methods use point-to-local-plane registration to constrain poses, ignoring the explicit plane structure in the environment. Due to noise interference and uneven point cloud distribution, local planes are prone to tilt, resulting in registration errors. Therefore, we propose MSPA-LIO, a LiDAR-inertial odometry with multi-scale plane adjustment, which uses geometric constraints and plane adjustment at both the local voxel plane scale and the large plane scale to improve odometry accuracy and enhance map consistency. To make full use of the planar structure in the environment, we propose an explicit large plane extraction method built on the voxel-based map representation. We use large planes to correct the direction of the associated voxel planes, thereby overcoming the misregistration problem caused by local plane tilt. To further improve odometry accuracy, we perform plane adjustments at the voxel plane scale and the large plane scale to make the pose and map more consistent. Experiments conducted on the VECtor Dataset and the Newer College Dataset demonstrate that our proposed algorithm outperforms four state-of-the-art algorithms.
|
|
WeAT30 |
106 |
Aerial Systems: Mechanics and Control 1 |
Regular Session |
|
10:30-10:35, Paper WeAT30.1 | |
Manipulation of Elasto-Flexible Cables with Single or Multiple UAVs |
|
Gabellieri, Chiara | University of Twente |
Teeuwen, Lars | University of Twente |
Shen, Yaolei | University of Twente |
Franchi, Antonio | University of Twente / Sapienza University of Rome |
Keywords: Aerial Systems: Mechanics and Control
Abstract: This work considers a large class of systems composed of multiple quadrotors manipulating deformable and extensible cables. The cable is described via a discretized representation, which decomposes it into linear springs interconnected through lumped-mass passive spherical joints. Sets of flat outputs are found for the systems. Numerical simulations support the findings by showing cable manipulation relying on flatness-based trajectories. Finally, we present an experimental validation of the effectiveness of the proposed discretized cable model for a two-robot example. Moreover, a closed-loop controller based on the identified model and using cable-output feedback is experimentally tested.
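The discretized cable model has a compact numerical form: lumped masses joined by linear springs through passive spherical joints. The sketch below computes the internal spring forces for one configuration; stiffness, rest length, and node positions are chosen for illustration, not taken from the paper.

```python
# Minimal sketch of a lumped-mass, linear-spring cable discretization.
import numpy as np

def spring_forces(nodes, k=200.0, l0=0.2):
    """nodes: (N, 3) lumped-mass positions -> (N, 3) net spring forces."""
    f = np.zeros_like(nodes)
    for i in range(len(nodes) - 1):
        d = nodes[i + 1] - nodes[i]
        length = np.linalg.norm(d)
        fs = k * (length - l0) * d / length   # extensible linear spring
        f[i] += fs
        f[i + 1] -= fs
    return f

# A slightly stretched 5-node cable sagging between two quadrotors.
nodes = np.stack([np.linspace(0, 1.0, 5),
                  np.zeros(5),
                  -0.1 * np.sin(np.linspace(0, np.pi, 5))], axis=1)
print(spring_forces(nodes))
```

Adding gravity on each lumped mass and integrating these forces forward in time yields the cable dynamics used for simulation.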
|
|
10:35-10:40, Paper WeAT30.2 | |
TACO: General Acrobatic Flight Control Via Target-And-Command-Oriented Reinforcement Learning |
|
Yin, Zikang | Westlake University |
Zheng, Canlun | Westlake University |
Guo, Shiliang | Westlake University |
Wang, Zhikun | Westlake University |
Zhao, Shiyu | Westlake University |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications
Abstract: Although acrobatic flight control has been studied extensively, one key limitation of the existing methods is that they are usually restricted to specific maneuver tasks and cannot change flight pattern parameters online. In this work, we propose a target-and-command-oriented reinforcement learning (TACO) framework, which can handle different maneuver tasks in a unified way and allows online parameter changes. We also propose a spectral normalization method with input-output rescaling to enhance the policy's temporal and spatial smoothness, independence, and symmetry, thereby overcoming the sim-to-real gap. We validate the TACO approach through extensive simulation and real-world experiments, demonstrating its ability to achieve high-speed, high-accuracy circular flights and continuous multi-flips.
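Spectral normalization itself is easy to sketch: estimate each weight matrix's largest singular value by power iteration and rescale so it stays below a bound. Layer size and bound here are illustrative, and TACO's accompanying input-output rescaling is not reproduced.

```python
# Minimal sketch of spectral normalization via power iteration.
import numpy as np

def spectral_normalize(W, n_iters=20, target=1.0):
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_iters):                  # power iteration for sigma_max
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v                         # estimated largest singular value
    return W * (target / max(sigma, target))  # only shrink, never amplify

W = np.random.default_rng(1).normal(size=(8, 8))
Wn = spectral_normalize(W)
print(np.linalg.svd(Wn, compute_uv=False)[0])  # bounded near 1.0
```

Bounding each layer's Lipschitz constant this way is one standard route to the temporal and spatial smoothness the abstract attributes to the policy.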
|
|
10:40-10:45, Paper WeAT30.3 | |
Design and Implement of Large-Scale Tail-Sitter VTOL UAV |
|
Xu, Zhixiong | Zhejiang University |
Hu, Yixin | Zhejiang University, College of Control Science and Engineering |
Wen, Guangwei | Huzhou Institute of Zhejiang University |
Xu, Chao | Zhejiang University |
Fan, Li | Huzhou Institute of Zhejiang University, Zhejiang University |
Keywords: Aerial Systems: Mechanics and Control
Abstract: We present the design and implementation of a large-scale tail-sitter aircraft with vertical takeoff and landing (VTOL) capabilities. The aircraft was designed with an H-shaped configuration and incorporated canards to enhance longitudinal stability. The structural design was optimized to balance mechanical strength and lightweight construction. Furthermore, the actuation system comprised four motors and eight servos, providing the aircraft with high maneuverability. Computational fluid dynamics (CFD) simulations and wind tunnel tests were conducted to characterize the full envelope aerodynamics of the aircraft. To address the engineering challenges associated with increased scale, we developed a control framework applicable to the entire flight envelope. The performance and reliability of the proposed system were validated by extensive simulation studies and outdoor flight experiments.
|
|
10:45-10:50, Paper WeAT30.4 | |
Design and Control of a 6-DOF Fully Actuated Aerial-Aquatic Robot with Thrust Vectoring |
|
Tian, Bocheng | Beihang University |
Liu, Yuchen | Beihang University |
Ren, Xiangyu | Beihang University |
Chen, Donghe | Beihang University |
Zuo, Zonghao | Beihang University |
Wen, Li | Beihang University |
Keywords: Aerial Systems: Mechanics and Control, Marine Robotics, Aerial Systems: Applications
Abstract: Single-medium, multi-degree-of-freedom robots often face limitations in aerial-aquatic tasks due to structural and weight constraints, which compromise their mobility in both air and water. To address this, we introduce a 6-degree-of-freedom fully actuated aerial-aquatic robot that employs thrust vectoring for enhanced performance. This innovative design incorporates four servos and four motors to facilitate coordinated operation. In air mode, the robot achieves decoupled control of attitude and position through servo angle feedforward compensation combined with dual-loop control. In underwater mode, it ensures high maneuverability by utilizing a dynamic model similar to a “weightless” state, employing single-loop control. Experimental results demonstrate that the robot can perform fully actuated movements in both air and water, successfully navigate the air-water boundary, and deploy sensors on inclined surfaces. These capabilities highlight the robot’s significant future application prospects.
|
|
10:50-10:55, Paper WeAT30.5 | |
SAFE-TAXI: A Hierarchical Multi-UAS Safe Auto-Taxiing Framework with Runtime Safety Assurance and Conflict Resolution |
|
Pant, Kartik Anand | Purdue University |
Lin, Li-Yu | Purdue University |
Sribunma, Worawis | Purdue University |
Brunswicker, Sabine | Purdue University |
Goppert, James | Purdue University |
Hwang, Inseok | Purdue University |
Keywords: Aerial Systems: Mechanics and Control, Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems
Abstract: We present a hierarchical safe auto-taxiing framework to enhance the automated ground operations of multiple unmanned aircraft systems (multi-UAS). The auto-taxiing problem becomes particularly challenging due to (i) unknown disturbances, such as crosswind affecting the aircraft dynamics, (ii) taxiway incursions due to unplanned obstacles, and (iii) spatiotemporal conflicts at the intersections between multiple entry points in the taxiway. To address these issues, we propose a hierarchical framework, i.e., SAFE-TAXI, combining centralized spatiotemporal planning with decentralized MPC-CBF-based control to safely navigate the aircraft through the taxiway while avoiding intersection conflicts and unplanned obstacles (e.g., other aircraft or ground vehicles). Our proposed framework decouples the auto-taxiing problem temporally into conflict resolution and motion planning, respectively. Conflict resolution is handled in a centralized manner by computing conflict-aware reference trajectories for each aircraft. In contrast, safety assurance from unplanned obstacles is handled by an MPC-CBF-based controller implemented in a decentralized manner. We demonstrate the effectiveness of our proposed framework through numerical simulations and experimentally validate it using Night Vapor, a small-scale fixed-wing test platform.
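A one-step sketch of the MPC-CBF ingredient: pick the control closest to the nominal taxi command subject to a discrete-time control barrier function constraint h(x_next) >= (1 - gamma) h(x) that keeps the aircraft outside an obstacle disk. Dynamics, geometry, and gamma are toy assumptions, not the SAFE-TAXI models.

```python
# Minimal one-step CBF-QP for taxiway obstacle avoidance (toy integrator).
import numpy as np
from scipy.optimize import minimize

dt, gamma, r_safe = 0.1, 0.3, 2.0
x = np.array([0.0, 0.0])            # aircraft position on the taxiway
obs = np.array([3.0, 0.5])          # unplanned obstacle position
u_nom = np.array([5.0, 0.0])        # nominal taxi velocity command

def h(p):                           # barrier: positive outside the unsafe disk
    return np.dot(p - obs, p - obs) - r_safe ** 2

def cbf_constraint(u):              # h(x + dt*u) - (1 - gamma) * h(x) >= 0
    return h(x + dt * u) - (1 - gamma) * h(x)

res = minimize(lambda u: np.sum((u - u_nom) ** 2), u_nom,
               constraints=[{"type": "ineq", "fun": cbf_constraint}],
               method="SLSQP")
print(res.x)                        # safe command, deflected around the obstacle
```

Stacking this constraint over a prediction horizon, with the conflict-aware reference from the centralized planner as u_nom, gives the decentralized MPC-CBF layer described above.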
|
|
10:55-11:00, Paper WeAT30.6 | |
Design of a Six-Bar Linkage-Inspired Reversible Wing for Stopped-Rotor Vehicles |
|
Hilby, Kristan | Massachusetts Institute of Technology |
Hughes, Max | Northwestern University |
Hunter, Ian | MIT |
Keywords: Aerial Systems: Mechanics and Control, Mechanism Design, Aerial Systems: Applications
Abstract: Reversible morphing wings, capable of exchanging the direction of the leading and trailing edge, improve the feasibility of stopped-rotor aerial vehicles and thereby expand aerial robotics capabilities. Despite the potential benefits of reversible morphing wings, few designs exist and most do not consider the coupled aerodynamic and structural effects of the introduced morphing mechanisms. We report a six-bar linkage-inspired reversible morphing wing design that is compliant, yet structurally rigid against aerodynamic loading. Compared to alternative methods for reversing airfoil direction, the developed wing increases the maximum aerodynamic performance by 50%, while also yielding a non-zero efficiency at 0 degree angle of attack. The proposed linkage system derives rigidity by strategically eliminating degrees of freedom upon actuation. To validate the linkage theory of rigidity, the full wing is studied under one-way fluid-structural interaction simulations. Furthermore, we demonstrate how constrained optimization can be applied to improve the aerodynamic efficiency of these wings subject to the constraints of a six-bar linkage system.
|
|
11:00-11:05, Paper WeAT30.7 | |
Learning Robust Agile Flight Control with Stability Guarantees |
|
Pries, Lukas | Technical University of Munich (TUM) |
Ryll, Markus | Technical University Munich |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications, Machine Learning for Robot Control
Abstract: In the evolving landscape of high-speed agile quadrotor flight, achieving precise trajectory tracking at the platform's operational limits is paramount. Controllers must handle actuator constraints, exhibit robustness to disturbances, and remain computationally efficient for safety-critical applications. In this work, we present a novel neural-augmented feedback controller for agile flight control. The controller addresses individual limitations of existing state-of-the-art control paradigms and unifies their strengths. We demonstrate the controller's capabilities, including the accurate tracking of highly aggressive trajectories that surpass the feasibility of traditional actuators. Notably, the controller provides universal stability guarantees, enhancing its robustness and tracking performance even in exceedingly disturbance-prone settings. Its nonlinear feedback structure is highly efficient enabling fast computation at high update rates. Moreover, the learning process in simulation is both fast and stable, and the controller's inherent robustness allows direct deployment to real-world platforms without the need for training augmentations or fine-tuning.
|
|
11:05-11:10, Paper WeAT30.8 | |
WLuav: An Air-Ground Robot with High Ground Adaptability and Trajectory Tracking Performance |
|
Huang, Shijie | Hunan University |
Miao, Zhiqiang | Hunan University |
Niu, Chuanpeng | Hunan University |
Liu, Kangcheng | Hunan University (HNU); Previously with the California Institute |
Wang, Yaonan | Hunan University |
Keywords: Aerial Systems: Mechanics and Control, Motion Control, Legged Robots
Abstract: Air-ground robots have received increasing attention and application due to their air-ground motion capabilities and excellent energy efficiency. However, air-ground robots still face significant gaps, including complex structural mechanisms, low terrain adaptability, and low-precision controllers, which limit their practical application. In this paper, an air-ground robot, the wheel-leg unmanned aerial vehicle (WLuav), is proposed based on a five-link wheel-leg structure to obtain excellent ground-adaptive and aerial maneuvering capabilities. Based on the improved structural mechanism, a hierarchical adaptive agile controller is proposed to improve trajectory tracking accuracy and ground adaptability. In addition, a mode switching strategy based on a support force solver is proposed to provide smooth and rapid mode switching. Finally, comprehensive experiments and a benchmark comparison are carried out to validate the performance of the proposed system: the WLuav system shows excellent ground-adaptive and trajectory tracking performance, and its energy efficiency can reach 79.46%.
|
|
WeBT1 |
401 |
Sensor Fusion & SLAM 1 |
Regular Session |
|
13:20-13:25, Paper WeBT1.1 | |
A Comprehensive Evaluation of LiDAR Odometry Techniques |
|
Potokar, Easton | Carnegie Mellon University
Kaess, Michael | Carnegie Mellon University |
Keywords: Localization, Range Sensing, Sensor Fusion
Abstract: Light Detection and Ranging (LiDAR) sensors have become the sensor of choice for many robotic state estimation tasks. Because of this, significant work has been done in recent years to find the most accurate method to perform state estimation using these sensors. Across these prior works, an explosion of possible technique combinations has occurred, with each work comparing full LiDAR Odometry (LO) “pipelines” to prior “pipelines”. Unfortunately, little work to date has performed the extensive ablation studies needed to compare the various building blocks of an LO pipeline. In this work, we summarize the various techniques that go into defining an LO pipeline and empirically evaluate these LO components on an expansive number of datasets across environments, LiDAR types, and vehicle motions. Finally, we make empirically-backed recommendations for the design of future LO pipelines to provide the most accurate and reliable performance.
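As a rough illustration of what such a "pipeline" decomposition looks like, the sketch below composes interchangeable building blocks behind a single interface; the block names and stubs are generic placeholders, not the taxonomy used in the paper:

from dataclasses import dataclass
from typing import Callable

@dataclass
class LOPipeline:
    # Each field is one ablation axis: swap implementations independently.
    deskew: Callable      # motion compensation of the raw scan
    downsample: Callable  # e.g. voxel grid vs. random sampling
    register: Callable    # e.g. point-to-point vs. point-to-plane ICP

    def step(self, scan, prev_pose, local_map):
        cloud = self.downsample(self.deskew(scan))
        return self.register(cloud, local_map, prev_pose)

# Stub blocks, for illustration only:
lo = LOPipeline(deskew=lambda s: s,
                downsample=lambda c: c[::2],
                register=lambda c, m, pose: pose)
print(lo.step(list(range(10)), (0.0, 0.0, 0.0), []))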
|
|
13:25-13:30, Paper WeBT1.2 | |
LHMM: A Tightly-Coupled LiDAR-Inertial Hybrid-Map Matching Approach for Robust and Efficient Global Localization |
|
Lu, Junyuan | Zhejiang University
Wu, Qishu | Zhejiang University |
Zhang, Yu | Zhejiang University |
Keywords: Localization, SLAM, Sensor Fusion
Abstract: LiDAR map matching (LMM) faces two key challenges: the enormous number of point clouds imposes constraints on storage and computation, and traditional two-stage frameworks suffer from initial guess errors during degeneration. This paper presents LHMM, a hybrid-map framework that first compresses the prior map and then performs tightly coupled pose estimation within a Maximum A Posteriori (MAP) estimation formulation. First, a skeletonization-based prior map compression method is proposed, which retains only stable structural features, reducing the map storage while enabling fast runtime association through a dual-mode map representation. Second, constraints from IMU, skeleton-feature prior map, and local voxel map are jointly optimized within a unified MAP formulation, recovering the full system state in a single step and preventing error cascades. The local map benefits from a hole-aware keyframe mechanism, focusing on regions with environmental changes or areas with partial map coverage, thereby reducing computation compared to full mapping. Extensive evaluations across multiple datasets demonstrate that LHMM not only reduces storage and computational overhead but also outperforms state-of-the-art methods in terms of localization accuracy and robustness. We will open-source the code.
|
|
13:30-13:35, Paper WeBT1.3 | |
Real-Time Initialization of Unknown Anchors for UWB-Aided Navigation |
|
Delama, Giulio | University of Klagenfurt |
Borowski, Igor Józef | Universität Klagenfurt |
Jung, Roland | University of Klagenfurt |
Weiss, Stephan | Universität Klagenfurt |
Keywords: Localization, Sensor Fusion, Autonomous Vehicle Navigation
Abstract: This paper presents a framework for the real-time initialization of unknown Ultra-Wideband (UWB) anchors in UWB-aided navigation systems. The method is designed for localization solutions where UWB modules act as supplementary sensors. Our approach enables the automatic detection and calibration of previously unknown anchors during operation, removing the need for manual setup. By combining an online Positional Dilution of Precision (PDOP) estimation, a lightweight outlier detection method, and an adaptive robust kernel for non-linear optimization, our approach significantly improves robustness and suitability for real-world applications compared to the state of the art. In particular, we show that our metric for triggering an initialization decision is more conservative than current ones, which are commonly based on initial linear or non-linear initialization guesses. This allows for better initialization geometry and subsequently lower initialization errors. We demonstrate the proposed approach on two different mobile robots: an autonomous forklift and a quadcopter equipped with a UWB-aided Visual-Inertial Odometry (VIO) framework. The results highlight the effectiveness of the proposed method with robust initialization and low positioning error. We open-source our code in a C++ library including a ROS wrapper.
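The PDOP that gates initialization can be computed from unit line-of-sight vectors between the candidate anchor and the poses where ranges were collected; a minimal sketch with made-up geometry (the paper's online estimator, outlier rejection, and thresholds are not reproduced here):

import numpy as np

def pdop(anchor, ranging_positions):
    # Geometry matrix of unit directions from the anchor to the ranging poses.
    diffs = ranging_positions - anchor
    H = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    return float(np.sqrt(np.trace(np.linalg.inv(H.T @ H))))

anchor = np.array([0.0, 0.0, 1.5])
poses = np.array([[1, 0, 0], [0, 1, 0], [-1, 0, 0.5], [0, -1, 1.0]], float)
print(pdop(anchor, poses))  # lower PDOP -> better-conditioned initialization

A conservative trigger would defer initialization until this value drops below a chosen threshold, rather than trusting an early linear or non-linear guess.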
|
|
13:35-13:40, Paper WeBT1.4 | |
Energy-Constrained Multi-Robot Exploration for Autonomous Map Building |
|
Karumanchi, Sambhu Harimanas | University of Illinois, Urbana-Champaign |
Rokaha, Bhagawan | Mitsubishi Electric Corporation |
Schperberg, Alexander | Mitsubishi Electric Research Laboratories |
P. Vinod, Abraham | Mitsubishi Electric Research Laboratories |
Keywords: Multi-Robot SLAM, Mapping, Environment Monitoring and Management
Abstract: We consider the problem of building the map of an unknown environment using multiple mobile robots that have physical limitations arising from dynamics and a limited onboard battery. We consider the setting where the unknown environment has a set of charging stations that the robots must discover and visit often to recharge their battery during the map building process. We propose an iterative approach to solve the resulting energy-constrained multi-robot exploration problem. Our approach uses a combination of frontier-based exploration, graph-based path planning, and multi-robot task assignment. We show that our algorithm admits a computationally inexpensive implementation that enables rapid replanning, and propose sufficient conditions for recursive feasibility and finite-time termination. We validate our approach in several Gazebo-based realistic simulations.
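A toy version of the energy-constrained assignment step is sketched below: each robot is greedily matched to its nearest frontier only if it could still reach a charging station afterwards, a simplistic stand-in for the paper's recursive-feasibility conditions (all distances, names, and the battery model are hypothetical):

import numpy as np

def assign(robots, frontiers, chargers, battery, cost_per_m):
    tasks, free = {}, list(range(len(frontiers)))
    for r, pos in enumerate(robots):
        best = None
        for f in free:
            go = np.linalg.norm(frontiers[f] - pos)
            back = min(np.linalg.norm(frontiers[f] - c) for c in chargers)
            if (go + back) * cost_per_m <= battery[r]:  # can still recharge
                if best is None or go < np.linalg.norm(frontiers[best] - pos):
                    best = f
        if best is not None:
            tasks[r] = best
            free.remove(best)
    return tasks

robots = np.array([[0.0, 0.0], [5.0, 5.0]])
frontiers = np.array([[2.0, 0.0], [5.0, 8.0], [9.0, 9.0]])
chargers = [np.array([0.0, 0.0])]
print(assign(robots, frontiers, chargers, battery=[10.0, 20.0], cost_per_m=1.0))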
|
|
13:40-13:45, Paper WeBT1.5 | |
TWC-SLAM: Multi-Agent Cooperative SLAM with Text Semantics and WiFi Features Integration for Similar Indoor Environments |
|
Li, Chunyu | Sun Yat-Sen University |
Chen, Shoubin | Guangdong Laboratory of Artificial Intelligence and Digital Economy
Li, Dong | University of Macau |
Weixing, Xue | South China Agricultural University |
Li, Qingquan | Shenzhen University |
Keywords: Multi-Robot SLAM, Semantic Scene Understanding, Sensor Fusion
Abstract: Multi-agent cooperative SLAM often encounters challenges in similar indoor environments characterized by repetitive structures, such as corridors and rooms. These challenges can lead to significant inaccuracies in shared location identification when employing point cloud-based techniques. To mitigate these issues, we introduce TWC-SLAM, a multi-agent cooperative SLAM framework that integrates text semantics and WiFi signal features to enhance location identification and loop closure detection. TWC-SLAM comprises a single-agent front-end odometry module based on FAST-LIO2, a location identification and loop closure detection module that leverages text semantics and WiFi features, and a global mapping module. The agents are equipped with sensors capable of capturing textual information and detecting WiFi signals. By correlating these data sources, TWC-SLAM establishes a common location, facilitating point cloud alignment across different agents’ maps. Furthermore, the system employs loop closure detection and optimization modules to achieve global optimization and cohesive mapping. We evaluated our approach using an indoor dataset featuring similar corridors, rooms, and text signs. The results demonstrate that TWC-SLAM significantly improves the performance of cooperative SLAM systems in complex environments with repetitive architectural features.
|
|
13:45-13:50, Paper WeBT1.6 | |
SaWa-ML: Structure-Aware Pose Correction and Weight Adaptation-Based Robust Multi-Robot Localization |
|
Choi, Junho | KAIST |
Ryoo, Kihwan | Korea Advanced Institute of Science and Technology |
Kim, Jeewon | School of Electrical Engineering, KAIST |
Kim, Taeyun | KAIST |
Lee, Eungchang Mason | Korea Advanced Institute of Science and Technology |
Jeong, Myeongwoo | KAIST |
Marsim, Kevin Christiansen | KAIST |
Lim, Hyungtae | Massachusetts Institute of Technology |
Myung, Hyun | KAIST (Korea Advanced Institute of Science and Technology) |
Keywords: Multi-Robot SLAM, Multi-Robot Systems, Localization
Abstract: Multi-robot localization is a crucial task for implementing multi-robot systems. Numerous researchers have proposed optimization-based multi-robot localization methods that use camera, IMU, and UWB sensors. Nevertheless, characteristics of individual robot odometry estimates and distance measurements between robots used in the optimization are not sufficiently considered. In addition, previous research was heavily influenced by the odometry accuracy estimated from individual robots. Consequently, long-term drift error caused by error accumulation is potentially inevitable. In this paper, we propose a novel visual-inertial-range-based multi-robot localization method, named SaWa-ML, which enables geometric structure-aware pose correction and weight adaptation-based robust multi-robot localization. Our contributions are twofold: (i) we leverage UWB sensor data, whose range error does not accumulate over time, to first estimate the relative positions between robots and then correct the positions of each robot, thus reducing long-term drift errors, (ii) we design adaptive weights for robot pose correction by considering the characteristics of the sensor data and visual-inertial odometry estimates. The proposed method has been validated in real-world experiments, showing a substantial performance increase compared with state-of-the-art algorithms.
|
|
13:50-13:55, Paper WeBT1.7 | |
ActiveGS: Active Scene Reconstruction Using Gaussian Splatting |
|
Jin, Liren | University of Bonn |
Zhong, Xingguang | University of Bonn |
Pan, Yue | University of Bonn |
Behley, Jens | University of Bonn |
Stachniss, Cyrill | University of Bonn |
Popovic, Marija | TU Delft |
Keywords: Mapping, View Planning for SLAM, RGB-D Perception
Abstract: Robotics applications often rely on scene reconstructions to enable downstream tasks. In this work, we tackle the challenge of actively building an accurate map of an unknown scene using an on-board RGB-D camera on a mobile platform. We propose a hybrid map representation that combines a Gaussian splatting map with a coarse voxel map, leveraging the strengths of both representations: the high-fidelity scene reconstruction capabilities of Gaussian splatting and the spatial modelling strengths of the voxel map. At the core of our framework is an effective confidence modelling technique for the Gaussian splatting map to identify under-reconstructed areas, while utilising spatial information from the voxel map to target unexplored areas and assist in collision-free path planning. By actively collecting scene information in under-reconstructed and unexplored areas for map updates, our approach achieves superior Gaussian splatting reconstruction results compared to state-of-the-art approaches. Additionally, we demonstrate the real-world applicability of our active scene reconstruction framework using an unmanned aerial vehicle.
|
|
13:55-14:00, Paper WeBT1.8 | |
Category-Level Meta-Learned NeRF Priors for Efficient Object Mapping |
|
Ejaz, Saad | University of Luxembourg |
Bavle, Hriday | University of Luxembourg |
Ribeiro, Laura | University of Luxembourg |
Voos, Holger | University of Luxembourg |
Sanchez-Lopez, Jose Luis | University of Luxembourg |
Keywords: Mapping, Object Detection, Segmentation and Categorization, RGB-D Perception
Abstract: In 3D object mapping, category-level priors enable efficient object reconstruction and canonical pose estimation, requiring only a single prior per semantic category (e.g., chair, book, laptop, etc.). DeepSDF has been used predominantly as a category-level shape prior, but it struggles to reconstruct sharp geometry and is computationally expensive. In contrast, NeRFs capture fine details but have yet to be effectively integrated with category-level priors in a real-time multi-object mapping framework. To bridge this gap, we introduce PRENOM, a Prior-based Efficient Neural Object Mapper that integrates category-level priors with object-level NeRFs to enhance reconstruction efficiency and enable canonical object pose estimation. PRENOM gets to know objects on a first-name basis by meta-learning on synthetic reconstruction tasks generated from open-source shape datasets. To account for object category variations, it employs a multi-objective genetic algorithm to optimize the NeRF architecture for each category, balancing reconstruction quality and training time. Additionally, prior-based probabilistic ray sampling directs sampling toward expected object regions, accelerating convergence and improving reconstruction quality under constrained resources. Experimental results highlight the ability of PRENOM to achieve high-quality reconstructions while maintaining computational feasibility. Specifically, comparisons with prior-free NeRF-based approaches on a synthetic dataset show a 21% lower Chamfer distance. Furthermore, evaluations against other approaches using shape priors on a noisy real-world dataset indicate a 13% improvement averaged across all reconstruction metrics, and comparable pose and size estimation accuracy, while being trained for 5× less time. Code available at: https://github.com/snt-arg/PRENOM
|
|
WeBT2 |
402 |
Social HRI |
Regular Session |
|
13:20-13:25, Paper WeBT2.1 | |
AutoMisty: A Multi-Agent LLM Framework for Automated Code Generation in the Misty Social Robot |
|
Wang, Xiao | University at Buffalo, SUNY |
Dong, Lu | University at Buffalo, SUNY |
Rangasrinivasan, Sahana | University at Buffalo, SUNY |
Nwogu, Ifeoma | University at Buffalo, SUNY
Setlur, Srirangaraj | University at Buffalo, SUNY |
Govindaraju, Venu | University at Buffalo, SUNY |
Keywords: Social HRI, AI-Enabled Robotics, Human-Centered Robotics
Abstract: The social robot's open API allows users to customize open-domain interactions. However, it remains inaccessible to those without programming experience. We introduce AutoMisty, the first LLM-powered multi-agent framework that converts natural-language commands into executable Misty robot code by decomposing high-level instructions, generating sub-task code, and integrating everything into a deployable program. Each agent employs a two-layer optimization mechanism: first, a self-reflective loop that instantly validates and automatically executes the generated code, regenerating whenever errors emerge; second, human review for refinement and final approval, ensuring alignment with user preferences and preventing error propagation. To evaluate AutoMisty's effectiveness, we designed a benchmark task set spanning four levels of complexity and conducted experiments in a real Misty robot environment. Extensive evaluations demonstrate that AutoMisty not only consistently generates high-quality code but also enables precise code control, significantly outperforming direct reasoning with ChatGPT-4o and ChatGPT-o1. All code, optimized APIs, and experimental videos will be publicly released.
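The two-layer optimization loop described above can be sketched generically; llm, execute, and review below are hypothetical callables standing in for the LLM agent, the on-robot execution check, and the human reviewer, not AutoMisty's actual interfaces:

def self_reflective_codegen(task, llm, execute, review, max_rounds=5):
    feedback = None
    for _ in range(max_rounds):
        code = llm(task, feedback)       # generate candidate robot code
        ok, err = execute(code)          # layer 1: validate by running it
        if not ok:
            feedback = err               # regenerate whenever errors emerge
            continue
        accepted, comments = review(code)
        if accepted:                     # layer 2: human final approval
            return code
        feedback = comments              # refine toward user preferences
    raise RuntimeError("no acceptable program within the round budget")

# Toy usage with stub callables:
print(self_reflective_codegen(
    "greet the user",
    llm=lambda t, fb: "print('hello from Misty')",
    execute=lambda c: (True, ""),
    review=lambda c: (True, ""),
))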
|
|
13:25-13:30, Paper WeBT2.2 | |
Arena-Bench 2.0: A Comprehensive Benchmark of Social Navigation Approaches in Collaborative Environments |
|
Shcherbyna, Volodymyr | Technical University Berlin |
Kästner, Linh | National University Singapore |
Nguyen, Huu Giang | Technical University Berlin |
Wang, Jiaming | National University of Singapore |
Do, Duc Anh | Technical University Berlin |
Seeger, Tim | Technical University Berlin |
Zeng, Huajian | Technical University Munich |
Shen, Zhengcheng | TU Berlin |
Martban, Ahmed | Technical University Berlin |
Trinh, Nhan | Technical University Berlin |
Wiese, Eva | Berlin Institute of Technology |
Keywords: Software Tools for Benchmarking and Reproducibility, Performance Evaluation and Benchmarking, Social HRI
Abstract: Social navigation has become increasingly important for robots operating in human environments, yet many newly proposed navigation methods remain narrowly tailored or exist only as proof-of-concept prototypes. Building on our previous work with Arena, a social navigation development platform, we now propose Arena-Bench 2.0, a comprehensive social navigation benchmark of state-of-the-art planners, fully integrated into the Arena framework. To achieve this, we developed a novel plugin structure, implemented on ROS2, to streamline the integration process and ensure straightforward, efficient workflows. As a demonstration, we integrated various learning-based and model-based navigation approaches and constructed a diverse set of social navigation scenarios to rigorously evaluate each planner. Specifically, we introduce a scenario generation node that allows users to construct complex, realistic social contexts through a web-based interface. We subsequently perform an extensive benchmark of all integrated planners, assessing both navigational and social metrics. Our evaluation also considers factors such as sensor input, reaction time, and latency, enabling insights into which planner may be most appropriate under different circumstances. The findings offer valuable guidance for selecting suitable planners for specific scenarios. The code is publicly available on GitHub.
|
|
13:30-13:35, Paper WeBT2.3 | |
A Rubber-Sheet Transformation Model for Personalized Human-Robot Proxemics |
|
Camara, Fanta | University of York |
El Jabaoui, Adam | Chalmers University of Technology |
Mukundan, Ramakrishnan | University of Canterbury |
Obaid, Mohammad | Chalmers University of Technology |
Keywords: Social HRI, Human-Centered Robotics, Safety in HRI
Abstract: The deployment of autonomous robots in human environments requires an understanding of social interactions and the factors that influence them. Human-robot proxemics is an important factor that impacts interactions, and modeling personalized proxemic behavior has always been a challenge, as it depends on multiple user attributes, including gender, age, and height. In this paper, we propose a novel approach that uses rubber-sheet transformation models to represent human-robot proxemics. We do this by collecting human-robot interpersonal distance data from 20 users and modeling it with respect to their height, age, gender, and the angle at which the robot approaches. We present an evaluation of the model; the results show a promising approximation of proxemic distances based on different user attributes. Finally, we provide a coefficient table for rubber-sheet models to lay the foundation for personalized human-robot proxemics and outline future research directions.
|
|
13:35-13:40, Paper WeBT2.4 | |
Adaptive Gaze Modulation in Social Robots: A Reinforcement Learning Approach to Attention Regulation |
|
Wijesinghe, Nipuni | University of Canberra |
Jayasuriya, Maleen | University of Canberra |
Hinwood, David Ryan | University of Canberra |
Grant, Janie Busby | University of Canberra |
Herath, Damith | University of Canberra |
Keywords: Social HRI, Reinforcement Learning, Human-Centered Robotics
Abstract: Attention serves as a critical antecedent to social presence, which fundamentally influences acceptance, trust, and overall interaction quality in human-robot interaction (HRI). This paper investigates the development of a gaze modulation framework that enables robots to strategically influence human attention through two complementary Q-learning-based modules: Gaze-Garnering Modulation (GGM) and Gaze-Avoidance Modulation (GAM). To measure gaze feedback, we introduce a novel metric—the Dynamic Gaze Engagement Index (DGEI)—that integrates attention ratio with stationary gaze entropy (SGE) to evaluate not just the quantity but also the quality of visual attention. This feedback allows the system to continuously adapt to each individual's unique attentional patterns and thresholds, providing personalised interaction. In two experiments, 20 participants interacted with a Pepper robot that dynamically adjusted its behaviours (lights, movements, and voice volume) based on real-time gaze feedback. Results demonstrated that GGM significantly enhanced gaze engagement, fostering strong mutual interaction, while GAM effectively redirected attention when appropriate, with participants reporting lower perceived gaze engagement in this condition. Post-experiment questionnaires using the "Psycho-behavioural Interaction - Perceived Attentional Engagement" section of the Network Minds Social Presence Scale revealed significant differences between conditions (t(18)=2.47, p=0.0238), validating the attention modulation by each module and corroborating the behavioural observations. These findings underscore the importance of adaptive robotic behaviours in facilitating dynamic and unobtrusive interactions.
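The abstract does not give the exact DGEI formula, so the sketch below only computes its two stated ingredients, attention ratio and stationary gaze entropy, and fuses them with a made-up weighting; treat everything here as a placeholder:

import numpy as np
from collections import Counter

def attention_ratio(gaze_on_robot):
    # Fraction of gaze samples that land on the robot.
    return float(np.mean(gaze_on_robot))

def stationary_gaze_entropy(fixation_regions):
    # Shannon entropy of the distribution of fixations over regions.
    counts = np.array(list(Counter(fixation_regions).values()), float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def dgei(gaze_on_robot, fixation_regions, alpha=0.5):
    # Hypothetical fusion; the paper's actual DGEI may combine these differently.
    return (alpha * attention_ratio(gaze_on_robot)
            - (1 - alpha) * stationary_gaze_entropy(fixation_regions))

print(dgei([1, 1, 0, 1], ["robot", "robot", "door", "robot"]))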
|
|
13:40-13:45, Paper WeBT2.5 | |
Towards Emotion Co-Regulation with LLM-Powered Socially Assistive Robots: Integrating LLM Prompts and Robotic Behaviors to Support Parent-Neurodivergent Child Dyads |
|
Li, Jing | Eindhoven University of Technology |
Schijve, Felix | Eindhoven University of Technology |
Li, Sheng | Institute of Science Tokyo |
Yang, Yuye | Utrecht University |
Hu, Jun | Eindhoven University of Technology |
Barakova, Emilia I. | Eindhoven University of Technology |
Keywords: Social HRI, AI-Enabled Robotics, Embodied Cognitive Science
Abstract: Socially Assistive Robotics (SAR) has shown promise in supporting emotion regulation for neurodivergent children. Recently, there has been increasing interest in leveraging advanced technologies to assist parents in co-regulating emotions with their children. However, limited research has explored the integration of large language models (LLMs) with SAR to facilitate emotion co-regulation between parents and children with neurodevelopmental disorders. To address this gap, we developed an LLM-powered social robot by deploying a speech communication module on the MiRo-E robotic platform. This supervised autonomous system integrates LLM prompts and robotic behaviors to deliver tailored interventions for both parents and neurodivergent children. Pilot tests were conducted with two parent-child dyads, followed by a qualitative analysis. The findings reveal MiRo-E's positive impacts on interaction dynamics and its potential to facilitate emotion regulation, along with identified design and technical challenges. Based on these insights, we provide design implications to advance the future development of LLM-powered SAR for mental health applications.
|
|
13:45-13:50, Paper WeBT2.6 | |
In-Situ Value-Aligned Human-Robot Interactions with Physical Constraints |
|
Li, Hongtao | Hohai University |
Jiao, Ziyuan | Beijing Institute for General Artificial Intelligence |
Liu, Xiaofeng | Hohai University |
Liu, Hangxin | Beijing Institute for General Artificial Intelligence (BIGAI) |
Zheng, Zilong | BIGAI |
Keywords: Task Planning, Human Factors and Human-in-the-Loop, AI-Enabled Robotics
Abstract: Equipped with Large Language Models (LLMs), human-centered robots are now capable of performing a wide range of tasks that were previously deemed challenging or unattainable. However, merely completing tasks is insufficient for cognitive robots, which should learn and apply human preferences to future scenarios. In this work, we propose a framework that combines human preferences with physical constraints, requiring robots to complete tasks while considering both. First, we developed a benchmark of everyday household activities, which are often evaluated based on specific preferences. We then introduced In-Context Learning from Human Feedback (ICLHF), where human feedback comes from direct instructions and adjustments made intentionally or unintentionally in daily life. Extensive experiments, testing ICLHF's ability to generate task plans and balance physical constraints with preferences, demonstrate the efficiency of our approach.
|
|
13:50-13:55, Paper WeBT2.7 | |
Building Knowledge from Interactions: An LLM-Based Architecture for Adaptive Tutoring and Social Reasoning |
|
Garello, Luca | Italian Institute of Technology and University of Genoa |
Belgiovine, Giulia | Istituto Italiano Di Tecnologia |
Russo, Gabriele | University of Genoa |
Rea, Francesco | Istituto Italiano Di Tecnologia |
Sciutti, Alessandra | Italian Institute of Technology |
Keywords: Social HRI, Human-Centered Robotics, AI-Enabled Robotics
Abstract: Integrating robotics into everyday scenarios like tutoring or physical training requires robots capable of adaptive, socially engaging, and goal-oriented interactions. While Large Language Models show promise in human-like communication, their standalone use is hindered by memory constraints and contextual incoherence. This work presents a multimodal, cognitively inspired framework that enhances LLM-based autonomous decision-making in social and task-oriented Human-Robot Interaction. Specifically, we develop an LLM-based agent for a robot trainer, balancing social conversation with task guidance and goal-driven motivation. To further enhance autonomy and personalization, we introduce a memory system for selecting, storing and retrieving experiences, facilitating generalized reasoning based on knowledge built across different interactions. A preliminary HRI user study and offline experiments with a synthetic dataset validate our approach, demonstrating the system’s ability to manage complex interactions, autonomously drive training tasks, and build and retrieve contextual memories, advancing socially intelligent robotics.
|
|
WeBT3 |
403 |
Soft Sensors and Actuators 2 |
Regular Session |
|
13:20-13:25, Paper WeBT3.1 | |
Self-Sensing Liquid Crystal Elastomer Actuator with Magnetic-Thermal Synergy |
|
Gao, Shen | Shanghai University |
Tang, Mingjun | Shanghai University |
Lu, Xiao | Shanghai University |
Zhou, Chenghao | ShangHai University |
Zhang, Yuyin | Shanghai University |
Yue, Tao | Shanghai University |
Wang, Yue | Shanghai University |
Keywords: Soft Sensors and Actuators, Soft Robot Materials and Design, Force and Tactile Sensing
Abstract: Fueled by the rapid evolution of robotics, the demand for intelligent and lightweight robotic systems continues to grow across industries. However, conventional designs often separate sensing and actuation, resulting in structural complexity and diminished reliability. While integrated sensor-actuator systems offer a promising solution, they face significant challenges in manufacturing and scalability. Liquid crystal elastomers (LCEs) are widely utilized in actuators for their thermally responsive deformation and programmability, while Neodymium-Iron-Boron (NdFeB) nanoparticles provide exceptional magnetic properties for sensing. This paper introduces a novel Self-Sensing LCE (SS-LCE) actuator, seamlessly combining LCE and NdFeB to enable simultaneous actuation and self-sensing capabilities. Under thermal stimulation, the actuator executes complex motions while delivering real-time feedback through magnetic field variations. Its programmability and adaptable fabrication process support diverse motion modes, unlocking broad application potential. By enhancing integration, reliability, and flexibility, this self-sensing actuator represents a pivotal advancement in the development of lightweight, intelligent robotic systems with significant research and industrial implications.
|
|
13:25-13:30, Paper WeBT3.2 | |
Integrating Software-Less Reflex Mechanisms into Soft Robots and a Versatile Gripper |
|
Wang, Zhanwei | Vrije Universiteit Brussel |
Huaijin, Chen | Vub |
Cools, Hendrik | Vrije Universiteit Brussel (VUB) |
Vanderborght, Bram | Vrije Universiteit Brussel |
Terryn, Seppe | Vrije Universiteit Brussel (VUB) |
Keywords: Soft Sensors and Actuators, Mechanism Design, Grippers and Other End-Effectors
Abstract: While most soft pneumatic grippers that operate with a single control parameter (such as pressure or airflow) are limited to a single grasping modality, this paper introduces a new method for incorporating multiple grasping modalities into vacuum-driven soft grippers. This is achieved by combining stiffness modulation with a bistable mechanism. The system features a bistable dome structure with a central suction cup and a set of vacuum bending actuators. Designed and optimized using fluid-structure interaction (FSI) modeling in finite element analysis (FEA), it offers three grasping modes: two reflex mechanisms (force- and contact-triggered) and one with active control. All modes rely on the structural buckling of the bistable dome, but differ in how this snap behavior is activated: by force, contact, or active control. Adjusting the airflow tunes the energy barrier of the bistable mechanism, enabling changes in triggering sensitivity and allowing swift transitions between grasping modes. This results in an exceptionally versatile gripper, capable of handling a diverse range of objects with varying sizes, shapes, stiffnesses, and roughnesses, controlled by a single parameter: airflow.
|
|
13:30-13:35, Paper WeBT3.3 | |
Multimodal Strain Sensing System for Shape Recognition of Tensegrity Structures by Combining Traditional Regression and Deep Learning Approaches |
|
Mao, Zebing | Zhejiang University |
Wang, Jianhui | ZHEJIANG UNIVERSITY |
Meng, Zijie | Zhejiang University |
Kobayashi, Ryota | Tokyo Institute of Technology |
Nabae, Hiroyuki | Institute of Science Tokyo |
Suzumori, Koichi | Tokyo Institute of Technology |
Keywords: Soft Sensors and Actuators, Soft Robot Applications, Soft Robot Materials and Design
Abstract: A tensegrity-based system is a promising approach for dynamic exploration of uneven, unpredictable, and confined environments. However, implementing such systems presents challenges in state recognition. In this study, we introduce a 6-strut tensegrity structure integrated with 24 multimodal strain sensors, employing a deep learning model to achieve smart tensegrity. By using conductive flexible tendons and leveraging a long short-term memory (LSTM) model, the system accomplishes self-shape reconstruction without the need for external sensors. The sensors operate in two modes, and we applied both a curve fitting model and an LSTM model to establish the relationship between length change and resistance change in the sensors. Our key findings demonstrate that the intelligent tensegrity system can accurately self-detect and adapt its shape. Furthermore, a human pressing process allows users to monitor and understand the tensegrity's shape changes based on the integrated models. This intelligent tensegrity-based system with self-sensing tendons showcases significant potential for future exploration, making it a versatile tool for real-world applications.
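A minimal sketch of the sequence model underlying the shape reconstruction, assuming PyTorch: an LSTM maps windows of the 24 tendon resistance signals to the corresponding length changes (the window size, layer widths, and random data are synthetic placeholders, not the paper's setup):

import torch
import torch.nn as nn

class TendonLSTM(nn.Module):
    def __init__(self, n_sensors=24, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_sensors, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_sensors)

    def forward(self, x):             # x: (batch, time, n_sensors)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # predict lengths from the last time step

model = TendonLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 50, 24)  # synthetic resistance-change windows
y = torch.randn(32, 24)      # synthetic tendon length changes
for _ in range(10):          # toy training loop
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()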
|
|
13:35-13:40, Paper WeBT3.4 | |
Vi2TaP: A Cross-Polarization Based Mechanism for Perception Transition in Tactile-Proximity Sensing with Applications to Soft Grippers |
|
Nguyen, Nhan Huu | Japan Advanced Institute of Science and Technology |
Le Dinh, Minh Nhat | The University of Danang - University of Science and Technology, |
Luu, Quan | Purdue University |
Nguyen, Tuan | Japan Advanced Institute of Science and Technology |
Ho, Van | Japan Advanced Institute of Science and Technology |
Keywords: Soft Sensors and Actuators, Mechanism Design, Grippers and Other End-Effectors
Abstract: Vision-based soft sensors have emerged as a promising solution for multi-modal sensory systems. Rather than relying on complex integrations of numerous specialized sensors, these devices are advantageous in achieving multiple perceptual capabilities within a single, unified device. However, the unsolved bottleneck for existing systems lies in preventing perceptual interference among visual fields and establishing a reliable mechanism for switching between perception domains. In this study, we present Vi2TaP, a novel mechanism for transitioning between perception domains in a multi-modal tactile-proximity sensing paradigm, leveraging the cross-polarization phenomenon. The core concept involves a specific configuration in which two polarizer films are placed back-to-back. By adjusting the Plane of Polarization (PoP) between 0 and 90 degrees, the camera can either fully open its Field-of-View (FoV) to the external environment for vision/proximity sensing or restrict it to the internal space between the polarizers, tailored for tactile sensing. We showcase the first implementation of Vi2TaP to create soft sensorized fingertips for a gripper. Additionally, we introduce efficient learning pipelines for both proximity and tactile perception, along with effective strategies for extracting valuable information. The experimental results demonstrate the advantages of the proposed multi-modal sensing scheme in executing grasping and manipulation tasks. This mechanism is anticipated to accelerate the development and adoption of multi-modal vision-based soft sensors across a wide range of practical applications.
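The transition mechanism rests on Malus's law: the intensity transmitted through the back-to-back polarizer pair falls off as the squared cosine of the angle between their planes of polarization, so sweeping the PoP from 0 to 90 degrees fades the camera's view from the external scene to the internal tactile layer. A worked example (ideal lossless films assumed):

import numpy as np

def transmission(theta_deg, I0=1.0):
    # Malus's law for the relative intensity through the polarizer pair.
    return I0 * np.cos(np.radians(theta_deg)) ** 2

for theta in (0, 30, 60, 90):
    print(f"PoP = {theta:2d} deg -> relative transmission {transmission(theta):.2f}")
# 0 deg: FoV open to the environment (proximity); 90 deg: blocked (tactile).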
|
|
13:40-13:45, Paper WeBT3.5 | |
Wearable Gait Detection Device by Perception of Proximity and Pressure at the Knee (I) |
|
Liu, Weijie | Zhejiang University |
Wang, Shihang | Zhejiang University |
Mei, Deqing | Zhejiang University |
Wang, Yancheng | Zhejiang University |
Keywords: Soft Sensors and Actuators, Sensor Fusion, Human Detection and Tracking
Abstract: Wearable long-term gait detection is of great significance for healthcare, rehabilitation, and exercise training. Current wearable gait detection devices are generally less flexible, making them unsuitable for daily and long-term usage. Furthermore, they can only extract limited gait features, such as step count, which hampers their practical applications. This study introduces a novel gait detection approach that uniquely combines flexible contactless proximity sensing with contact pressure sensing at the knee, enabling the detection of leg swings and measurement of swing force during walking. A flexible, wireless, multifunctional integrated electronic device is simultaneously developed for daily wearable gait detection; it is lightweight and fully flexible, ensuring comfortable wear. This innovative method not only captures traditional step characteristics but also provides insights into spatial gait characteristics, such as stride length and spatial leg posture, which have previously been challenging to identify using flexible gait detection devices. Our extensive multi-scenario and long-term monitoring experiments demonstrate the potential of this flexible device for daily wearable gait assessment, representing a significant advancement in the field.
|
|
13:45-13:50, Paper WeBT3.6 | |
Safe Lattice Planning for Motion Planning with Dynamic Obstacles |
|
Wiman, Emil | Linköping University |
Tiger, Mattias | AI and Integrated Computer Systems (AIICS), Linköping University |
Keywords: Task and Motion Planning, Motion and Path Planning, Human-Aware Motion Planning
Abstract: Motion planning in dynamic and uncertain real-world environments remains a critical challenge in robotics, as it is essential for the effective operation of autonomous systems. One strategy for motion planning has been to introduce a state lattice where pre-computed motion primitives can be combined with graph-based search methods to find a physically feasible motion plan. However, introducing lattice planning into dynamic, uncertain settings remains challenging. It is non-trivial to incorporate uncertain dynamic information into the planning process in real time. Thus, in this paper we propose a lattice planning framework for dynamic environments with extensions to handle safety-critical edge cases that can arise from the uncertain nature of the environment. The proposed method, the Safe Lattice Planner (SLP), extends the Receding-Horizon Lattice Planner (RHLP) with enhanced replanning and survival capabilities to handle the dynamic environment. We thoroughly evaluate SLP in a new benchmark suite against provided baselines. SLP is found to outperform the baselines in terms of safety and resilience in dynamic environments while reaching the goal state in an efficient manner. We release the benchmark and SLP to accelerate the field of safe robotics.
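As a toy illustration of lattice planning against predicted dynamic obstacles, the sketch below runs a time-indexed Dijkstra search over a handful of grid motion primitives; a real lattice stores kinodynamically feasible motions, and none of SLP's replanning or survival logic is reproduced here:

import heapq

PRIMS = [(1, 0, 1.0), (0, 1, 1.0), (1, 1, 1.4), (-1, 0, 1.0), (0, -1, 1.0)]

def lattice_plan(start, goal, blocked, t_max=10):
    # 'blocked' maps a time step to cells predicted occupied at that step.
    pq, seen = [(0.0, start, 0, [start])], set()
    while pq:
        cost, cell, t, path = heapq.heappop(pq)
        if cell == goal:
            return path
        if (cell, t) in seen or t >= t_max:
            continue
        seen.add((cell, t))
        for dx, dy, c in PRIMS:
            nxt = (cell[0] + dx, cell[1] + dy)
            if nxt not in blocked.get(t + 1, set()):  # safety check
                heapq.heappush(pq, (cost + c, nxt, t + 1, path + [nxt]))
    return None  # no safe plan found; a survival/replanning layer takes over

print(lattice_plan((0, 0), (2, 2), blocked={1: {(1, 1)}}))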
|
|
13:50-13:55, Paper WeBT3.7 | |
Grasping State Analysis of Soft Manipulator Based on Flexible Tactile Sensor and High-Dimensional Fuzzy System (I) |
|
Wang, Haoyuan | Huazhong University of Science and Technology |
Huang, Jian | Huazhong University of Science and Technology |
Ru, Hongge | Huazhong University of Science and Technology |
Fu, Zhongzheng | Huazhong University of Science and Technology |
Lei, HongLiang | Huazhong University of Science and Technology |
Wu, Hao | Huazhong University of Science and Technology |
Wu, Dongrui | Huazhong University of Science and Technology |
Keywords: Soft Sensors and Actuators, Grasping, Soft Robot Applications
Abstract: The analysis of grasping states in soft manipulators is crucial but remains under-researched. In this article, we propose a method for analyzing grasping states in soft manipulators. A flexible sensor with rapid response, excellent repeatability, and customizable size and shape is designed. We developed a soft manipulator with this sensor, named the Soft Manipulator Based on Flexible Tactile Sensor Arrays (SM-FTSA). Four grasping states, including inflation, shaking, stable, and slipping, are proposed in this study. A high-dimensional Takagi–Sugeno–Kang (TSK) fuzzy system based on switchable normalization, referred to as SR-HTSK, is proposed for grasping state classification. Static and dynamic experimental scenarios are designed to evaluate the method. The results demonstrate that SR-HTSK achieved high accuracy rates of 96.92% and 95.88% in static and dynamic experiments, respectively. Comparative analysis showed higher accuracy rates at slower movement speeds in dynamic experimental scenarios. Through ablation experiments, sensitivity analysis, and comparison with other machine learning methods and research, we validate the outstanding performance of the proposed SR-HTSK method in analyzing grasping states for soft manipulators.
|
|
WeBT4 |
404 |
Surgical Robotics: Planning |
Regular Session |
Chair: Li, Zheng | The Chinese University of Hong Kong |
Co-Chair: Gao, Anzhu | Shanghai Jiao Tong University |
|
13:20-13:25, Paper WeBT4.1 | |
Neural Network Control Method for Target Tracking of Magnetically Actuated Capsule Endoscopic Robots with Obstacle Avoidance and Noise-Resistant Capabilities |
|
Cui, Zhiwei | The Chinese University of Hong Kong |
Sun, Yichong | The Chinese University of Hong Kong |
Han, Dongming | The Chinese University of Hong Kong |
Chiu, Philip, Wai-yan | Chinese University of Hong Kong |
Li, Zheng | The Chinese University of Hong Kong |
Keywords: Surgical Robotics: Planning, Medical Robots and Systems, Neural and Fuzzy Control
Abstract: Magnetically actuated capsule endoscopic robots (MACERs) are becoming increasingly popular because they can reach deep diseased regions inside the body that are difficult or inaccessible to traditional endoscopes, without the restriction of a mechanical transmission medium. However, MACERs are highly nonlinear, hence achieving safe, stable, obstacle-avoiding target tracking control of MACERs remains a challenging research topic. Therefore, to satisfy the diagnosis and treatment needs of deep diseased regions inside the body, this paper designs a MACER target tracking neural network control method with obstacle avoidance and noise-resistant capabilities. First, the kinematics and obstacle avoidance model of the MACER is established, and a moving-target tracking control scheme with joint motion constraints and obstacle avoidance capabilities is designed. Next, a noise-resistant neural network is designed to quickly solve the control scheme of the MACER, thereby achieving safe, obstacle-avoiding, and stable target tracking control. Finally, the effectiveness and practicability of the proposed method are verified through simulation analysis and experiments on the MACER, and compared with existing methods. The experimental results show that the proposed neural network method can effectively control the MACER to track the target motion along the gastric wall curve. Compared with existing methods, the proposed method has stronger anti-noise capability, its convergence accuracy is improved by 1.3 times, and its computational burden is reduced by 26.7 times.
|
|
13:25-13:30, Paper WeBT4.2 | |
Model-Free Catheter Delivery Strategy for Robotic Transcatheter Tricuspid Valve Replacement |
|
Lin, Haichuan | School of Advanced Interdisciplinary Sciences, University of Chi |
Xie, Yiping | Centre for Artificial Intelligence and Robotics, Hong Kong Inst |
Wang, Ziqi | Institute of Automation, Chinese Academy of Sciences |
Chen, Dong | Institute of Automation, Chinese Academy of Sciences; University |
Tan, Longyue | Institute of Automation, Chinese Academy of Sciences |
Wang, Weizhao | King's College London |
Ng, Yuen Chiu | Institute of Automation, Chinese Academy of Sciences |
Hou, Xilong | Centre for Artificial Intelligence and Robotics, Hong Kong Insti |
Chen, Chen | Institute of Automation, Chinese Academy of Sciences |
Zhou, Xiao-Hu | Institute of Automation, Chinese Academy of Sciences |
Hou, Zeng-Guang | Chinese Academy of Science |
Wang, Shuangyi | Chinese Academy of Sciences |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Medical Robots and Systems
Abstract: Transcatheter tricuspid valve replacement (TTVR) has emerged as a promising minimally invasive procedure for treating severe tricuspid regurgitation (TR). However, accurate catheter delivery remains a significant challenge, primarily due to the reliance on 2D vision feedback, complex catheter kinematics, and camera-to-robot pose calibration, all of which are difficult to generalize across patients. To address these issues, this paper presents a model-free robotic catheter delivery strategy for TTVR using Data-Enabled Predictive Control (DeePC). This approach leverages data-driven control to optimize catheter positioning without prior knowledge of the system’s dynamics, eliminating the need for complex kinematic models or camera calibration. The proposed method incorporates environmental constraints to ensure the safety of the procedure, delivering the catheter to the desired location with high accuracy across varying catheters and camera poses. Experimental results demonstrate the effectiveness and versatility of the approach, suggesting its potential for broader applications in robotic-assisted surgeries. This work presents a new perspective for vision-based robotic TTVR, as well as other clinical interventions involving robotic catheter control.
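A toy, unconstrained sketch of the DeePC machinery may help fix ideas: past input/output data are stacked into block-Hankel matrices, and a future trajectory consistent with both the recent past and the reference is solved for directly, with no plant model. The paper's formulation adds environmental constraints and a proper constrained solve, which are omitted here:

import numpy as np

def hankel(w, L):
    # Block-Hankel matrix with L block rows from a signal w of shape (T, dim).
    T, d = w.shape
    cols = T - L + 1
    return np.vstack([w[i:i + cols].T for i in range(L)])

def deepc_step(u_d, y_d, u_ini, y_ini, r, lam=1e-4):
    Tini, N = len(u_ini), len(r)
    Up, Uf = np.split(hankel(u_d, Tini + N), [Tini * u_d.shape[1]])
    Yp, Yf = np.split(hankel(y_d, Tini + N), [Tini * y_d.shape[1]])
    A = np.vstack([Up, Yp, Yf])
    b = np.concatenate([u_ini.ravel(), y_ini.ravel(), r.ravel()])
    # Regularized least squares stands in for the constrained QP of real DeePC.
    g = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)
    return (Uf @ g).reshape(N, -1)[0]  # first input of the receding horizon

rng = np.random.default_rng(1)
u_d = rng.normal(size=(80, 1))                                  # excitation data
y_d = np.convolve(u_d[:, 0], [0.5, 0.3], mode="same")[:, None]  # toy plant response
print(deepc_step(u_d, y_d, u_ini=u_d[-3:], y_ini=y_d[-3:], r=np.ones((8, 1))))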
|
|
13:30-13:35, Paper WeBT4.3 | |
Evaluating Generative Models for Inverse Kinematics of Concentric Tube Robots |
|
Kang, Paul Hoseok | University of Toronto |
Lee, Connor Derrick | University of Toronto |
Nguyen, Robert Hideki | The Hospital for Sick Children |
Roshanfar, Majid | Postdoctoral Research Fellow at the Hospital for Sick Children ( |
Looi, Thomas | Hospital for Sick Children |
Podolsky, Dale | University of Toronto |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Medical Robots and Systems, Deep Learning Methods
Abstract: Concentric tube robots (CTRs) hold great potential for minimally invasive surgery, offering flexibility, small diameters, and the ability to navigate within complex anatomical structures. While machine learning models have been increasingly used to predict the kinematics of CTRs, there is a lack of an established framework for evaluating generative inverse kinematic models, which are able to solve the inverse kinematic problem by providing various joint solutions for a desired end position. In this study, we introduce a workspace-based measure to assess the diversity of solutions produced by three generative models: an invertible neural network (INN), a conditional invertible neural network (cINN), and a conditional variational autoencoder (cVAE). We find that all three models record similar end position errors (3-6 mm) on dexterous subsets of the workspace, but that a cINN outperforms the others in generating diverse solutions using a workspace-based 1-Wasserstein distance by at least 2.38 standard deviations. To further test the applicability of these models, we integrate the best-performing cINN into a CTR controller and demonstrate the first use of a generative CTR model with real-time teleoperation under task-based constraints.
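The diversity comparison can be illustrated with SciPy's 1-Wasserstein distance between solution distributions projected onto a single workspace coordinate; the paper's workspace-based construction is richer, and the samples below are synthetic:

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
reference = rng.uniform(-2, 2, 500)    # stand-in for a dexterous-workspace sample
diverse = rng.normal(0.0, 1.0, 500)    # hypothetical cINN-like solution spread
collapsed = rng.normal(0.0, 0.2, 500)  # hypothetical mode-collapsed solver

# Smaller distance to the reference indicates broader workspace coverage.
print(wasserstein_distance(diverse, reference))
print(wasserstein_distance(collapsed, reference))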
|
|
13:35-13:40, Paper WeBT4.4 | |
A Kinematics Constrained Convex Optimal Trajectory Generation Method for Robotic-Assisted Flexible Needle |
|
Ren, Fan | Nankai University |
Fang, Yongchun | Nankai University |
Yu, Ningbo | Nankai University |
Han, Jianda | Nankai University |
Wang, Xiangyu | Nankai University |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Surgical Robotics: Planning, Medical Robots and Systems
Abstract: Needle puncture is a fundamental technique in minimally invasive surgical procedures. However, the limited flexibility of flexible needles and their complex interactions with tissues make it challenging to avoid critical organs along the puncture path. Preoperative path planning, which generates feasible collision-free trajectories, can effectively reduce repeated punctures and mitigate patient discomfort. To address this challenge, a flexible needle with increased maximum curvature is designed, which introduces more complex kinematic characteristics and poses greater challenges for trajectory planning under kinematic constraints. Then, for the first time, a convex feasible set (CFS)-based flexible needle trajectory planning method is developed to tackle the non-convex optimization problem posed by obstacle avoidance in unstructured surgical environments. Specifically, our method explicitly incorporates kinematic and curvature constraints, enabling direct generation of feasible trajectories without additional post-processing. Finally, comparative experiments on a self-developed robotic-assisted flexible needle system demonstrate the superior performance of the proposed algorithm. In particular, the proposed trajectory generation method allows the flexible needle to effectively avoid obstacles and accurately reach the target.
|
|
13:40-13:45, Paper WeBT4.5 | |
FBG-Based Actuation and Data Driven Contact Detection for Smart Steerable Instruments |
|
Mehdi, Zain | KU Leuven |
Janssens, Witse | KU Leuven |
Gielen, Marijn | KU Leuven |
Vanderschueren, Emma | UZ Leuven |
Laleman, Wim | UZ Leuven |
Verslype, Chris | UZ Leuven |
Ourak, Mouloud | University of Leuven |
Vander Poorten, Emmanuel B | KU Leuven |
Keywords: Surgical Robotics: Steerable Catheters/Needles
Abstract: Catheters and guidewires are increasingly used to navigate tortuous paths, offering minimally invasive access to deeply seated locations in the body. Steering these instruments is highly challenging, in part due to poor awareness of the configuration the instrument takes on inside the body. To address this difficulty, research in physical intelligence has been conducted. To enable such smart behaviour, this paper presents a compact FBG-based drive system for controlling the bending of the distal tip of a steerable catheter. The design process establishes key constraints for selecting an appropriate FBG fiber based on the selected backbone’s characteristics. Force estimation is done using the strain measured via the fiber with an RMSE of 0.05 N, which is then used to train a Long Short-Term Memory (LSTM) network to detect possible contact with the surroundings. The trained model was able to predict the force with an RMSE of 0.012 N. The results indicate that the proposed system, incorporating FBG sensing, PAM actuation, and LSTM-based contact detection, offers a promising pathway for more precise and versatile catheter manipulation in minimally invasive interventions.
|
|
13:45-13:50, Paper WeBT4.6 | |
Real-Time Distributed Force Sensing-Based Position Feedback Control for Fiber-Driven Miniaturized Continuum Robots |
|
Xia, Jingyuan | Shanghai Jiao Tong University |
Lin, Zecai | Shanghai Jiao Tong University |
Yang, Junlin | Shanghai Jiao Tong University |
Yang, Guang-Zhong | Shanghai Jiao Tong University |
Gao, Anzhu | Shanghai Jiao Tong University |
Keywords: Surgical Robotics: Steerable Catheters/Needles
Abstract: Continuum robots are widely used in medical scenarios due to their dexterity and flexibility. However, precise end-to-end control of continuum robots remains challenging, limited by kinematic or kinetostatic model accuracy and the lack of space for additional sensor configurations. This paper proposes a precise position control method for fiber-driven continuum robots using the shape reconstructed from distributed force sensing on the same fibers, where the optical fibers serve simultaneously as robot actuation and force sensing without requiring additional sensors. First, we use single-core optical fibers (SCFs) as the actuation cables of the continuum robot, each with multiple fiber Bragg grating (FBG) sensors inscribed on it to sense distributed force along the entire cable. Then, the forward kinetostatic model of the fiber-driven continuum robot is established using the known distributed forces as inputs. Notably, the nonlinear friction between the cables and actuation channels does not require an additional estimation model. Benefiting from this, the shape can be accurately reconstructed after stiffness calibration of the continuum robot. Finally, a position controller based on real-time shape feedback is developed to achieve tip position control of the continuum robot. Experimental results demonstrate that the proposed forward kinetostatic model can achieve shape reconstruction with errors of 0.45 mm and 0.57 mm in planar and spatial bending states, respectively. Compared with the traditional constant-curvature kinematics-based control method, the proposed method achieves mean absolute errors of 0.37 and 0.6 mm in two distinct path tracking tests. The proposed method, using distributed force sensing combined with a kinetostatic model, enables real-time accurate position feedback control without modelling the nonlinear friction or adding external sensors.
|
|
13:50-13:55, Paper WeBT4.7 | |
Method for Sensing Lateral Force and Skidding on the Tool Tip in Surgical Robot Deep Bone Drilling |
|
Chen, Zheyu | Nanjing Medical University |
Li, Liang | School of Biomedical Engineering and Informatics, Nanjing Medical University
Keywords: Surgical Robotics: Planning, Medical Robots and Systems
Abstract: Bone drilling is a critical component of many clinical surgeries. In robot-assisted deep bone drilling procedures, the complex structure of bone tissue and individual variations in drilling paths often cause slender tools to skid on personalized bone surfaces, leading to deviations that significantly impact surgical precision and safety. This paper presents the development of an orthopedic surgical robot equipped with skidding sensing capabilities. A novel sensing solution for the bone drilling unit is proposed, which employs rigid body force transmission and decouples thrust and lateral force sensing. This approach addresses the challenge of acquiring force information from the deep tool tip within the body. We also introduce a tool tip skidding estimation method based on the deflection curve model and the Spatial-Beam Constraint Model (SBCM). A specialized simulation device for measuring tool tip offset and force was designed. The experimental results demonstrate that the system achieves average sensing errors of 31.8 mN and 43.5 mN for lateral forces at the tool tip along the X and Y directions, respectively. Additionally, the system's resolution for skidding estimation reaches 0.2 mm. Real bone drilling experiments confirm the system’s ability to effectively provide feedback on skidding during surgery. The proposed method enhances the safety of orthopedic surgical robots and offers crucial sensing information for lateral forces and skidding, paving the way for future autonomous bone drilling procedures.
|
|
13:55-14:00, Paper WeBT4.8 | |
Adjusting Tissue Puncture Omnidirectionally in Situ with Pneumatic Rotatable Biopsy Mechanism and Hierarchical Airflow Management in Tortuous Luminal Pathways |
|
Lin, Botao | The Chinese University of Hong Kong |
Zhang, Tinghua | The Chinese University of Hong Kong |
Yuan, Sishen | The Chinese University of Hong Kong |
Wang, Tiantian | Harbin Institute of Technology (Shenzhen) |
Wang, Jiaole | Harbin Institute of Technology, Shenzhen |
Yuan, Wu | The Chinese University of Hong Kong |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Hydraulic/Pneumatic Actuators, Mechanism Design
Abstract: In situ tissue biopsy with an endoluminal catheter is an efficient approach for disease diagnosis, featuring low invasiveness and few complications. However, the endoluminal catheter struggles to adjust the biopsy direction by distal endoscope bending or proximal twisting for tissue sampling within tortuous luminal organs, due to friction-induced hysteresis and narrow spaces. Here, we propose a pneumatically-driven robotic catheter enabling the adjustment of the sampling direction without twisting the catheter for an accurate in situ omnidirectional biopsy. The distal end of the robotic catheter consists of a pneumatic bending actuator for the catheter’s deployment in tortuous luminal organs and a pneumatic rotatable biopsy mechanism (PRBM). By hierarchical airflow control, the PRBM can adjust the biopsy direction under low airflow and deploy the biopsy needle with higher airflow, allowing for rapid omnidirectional sampling of tissue in situ. This paper describes the design, modeling, and characterization of the proposed robotic catheter, including repeated deployment assessments of the biopsy needle, puncture force measurement, and validation via phantom tests. The PRBM prototype has six sampling directions evenly distributed across 360 degrees when actuated by a positive pressure of 0.3 MPa. The pneumatically-driven robotic catheter provides a novel biopsy strategy, potentially facilitating in situ multidirectional biopsies in tortuous luminal organs with minimal invasiveness.
|
|
WeBT5 |
407 |
Kinematics, Planning and Control 2 |
Regular Session |
|
13:20-13:25, Paper WeBT5.1 | |
Enhanced Kinematic Calibration of a 4PPa-2PaR Parallel Manipulator with Subchains |
|
Luo, Jingbo | Ningbo Institute of Materials Technology and Engineering, CAS |
Chen, Silu | Ningbo Institute of Materials Technology and Engineering, CAS |
Ferreira, Antoine | INSA Centre Val De Loire |
He, Jianhui | Ningbo Institute of Materials Technology and Engineering, Chines |
Jiang, Dexing | Ningbo Institute of Materials Technology & Engineering, CAS |
Kong, Xiangjie | Ningbo Institute of Materials Technology and Engineering, Chines |
Feng, Yiyang | Ningbo Institute of Material Technology & Engineering, CAS |
Fang, Zaojun | Ningbo Institute of Materials Technology & Engineering, CAS |
Zheng, Tianjiang | Ningbo Industrial Technology Research Institute |
Zhang, Chi | Ningbo Institute of Material Technology and Engineering, CAS |
Yang, Guilin | Ningbo Institute of Material Technology and Engineering, Chines |
Keywords: Kinematics, Parallel Robots, Calibration and Identification
Abstract: This paper proposes an innovative virtual chain-based kinematic calibration method for 4PPa-2PaR parallel manipulators with subchain architectures. Conventional calibration methods for such architectures suffer from inherent limitations due to coupled parameter constraints and restricted solution spaces caused by joint displacement and structural parameter dependencies. The presented methodology introduces three fundamental advancements: (1) a novel parameter assignment strategy enabling independent joint/link parameter definition across different kinematic chains, (2) systematic transformation of the constrained optimization into an unconstrained one, and (3) significant expansion of the error parameter solution space through virtual chain modeling. Comparative experiments on a physical prototype demonstrate improvements in both orientation and position accuracy compared to existing methods.
|
|
13:25-13:30, Paper WeBT5.2 | |
Kinematic Model and Trajectory Tracking Algorithm for High-Speed Spherical Robots |
|
Zhang, Bixuan | Zhejiang University |
Hu, Tao | Zhejiang University |
Guan, Xiaoqing | Zhejiang University |
Chen, Haojie | Zhejiang University |
Wang, You | Zhejiang University |
Hao, Jie | Luoteng Hangzhou Technology Co., Ltd |
Li, Guang | Zhejiang University |
Keywords: Kinematics, Optimization and Optimal Control, Underactuated Robots
Abstract: This paper proposes a new turning theory for spherical robots that better describes their turning mechanism under turning constraints, using a pendulum-driven spherical robot as an example. Compared to the previous turning theory, the new theory shows greater alignment with real-world data, especially at high speeds. Based on this new turning theory, we construct and optimize a new kinematic model and use it to design a trajectory tracking algorithm that remains reliable even at high speeds. Physical experiments demonstrate that the new algorithm significantly improves trajectory tracking accuracy at high speeds. Through these enhancements to the trajectory tracking algorithm, this study improves the autonomous cruising speed of spherical robots.
|
|
13:30-13:35, Paper WeBT5.3 | |
Fast Real-Time Neural Network-Based Kinematics Solving of the Cosserat Rod Model for a Parallel Continuum Surgical Manipulator |
|
Wu, Xipeng | Beijing Institute of Technology |
Qian, Chao | Beijing Institute of Technology |
Diao, Jinpeng | Beijing Institute of Technology |
Duan, Xing-guang | Intelligent Robotics Institute, Beijing Institute of Technology |
Li, Changsheng | Beijing Institute of Technology |
Keywords: Kinematics, Flexible Robotics, Medical Robots and Systems
Abstract: Parallel continuum mechanisms offer distinct advantages in surgical instrument design, including enhanced stiffness, improved precision, and simplified structure compared with conventional rigid or tendon-driven systems. Traditional kinematic models based on the constant-curvature assumption are often insufficient to accurately capture the complex bending behavior of these mechanisms. In contrast, Cosserat rod theory provides a rigorous framework for accurate kinematic modeling of flexible structures. However, its computational complexity results in slow solving speeds, particularly when handling spatial points that are far apart. This paper studies a miniaturized parallel continuum manipulator, adopting the Cosserat rod model for kinematic modeling and combining it with a neural-network-based inverse kinematics solver to achieve fast real-time computation. To accelerate the inverse kinematics, a multilayer perceptron was trained on 5,000 samples generated by the Cosserat rod model, yielding a mean absolute error of 0.046 mm and a relative error of 0.41% in predicting rod lengths. Experimental validation shows that the neural network solver reduces the computation time to approximately 0.16 ms, compared with 700-3100 ms for conventional numerical methods, highlighting its potential to improve the accuracy and responsiveness of surgical systems in minimally invasive surgery.
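As an illustrative aside (not from the paper): the reported speed-up comes from replacing the iterative Cosserat solve with a single forward pass of a trained multilayer perceptron. A minimal PyTorch sketch, where the toy mapping from X to Y stands in for the Cosserat-generated (tip pose, rod length) samples:

```python
import torch
import torch.nn as nn

# Toy stand-in for the Cosserat-generated dataset. In the paper's setting,
# X would be desired tip poses and Y the rod-length commands computed by
# the (slow) Cosserat rod model.
N = 5000
X = torch.rand(N, 3)                       # hypothetical tip positions
Y = torch.stack([X.sum(1), X[:, 0] - X[:, 2], X.norm(dim=1)], dim=1)

model = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 3),                      # predicted rod lengths
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()                      # MAE, matching the reported metric

for epoch in range(200):                   # full-batch training for brevity
    opt.zero_grad()
    loss = loss_fn(model(X), Y)
    loss.backward()
    opt.step()

print("train MAE:", loss.item())
# At inference time, one forward pass replaces the numerical Cosserat solve.
```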
|
|
13:35-13:40, Paper WeBT5.4 | |
A Partition-Learning-Selection-Augmentation (PLSA) Framework to Solve Forward Kinematics of Parallel Robots |
|
Xiang, Ruiqi | Sun Yat-Sen University |
Ye, Yongyin | Sun Yat-Sen University |
Wang, Xiyu | University of California, San Diego |
Xiang, Jindong | Sun Yat-Sen University |
Liu, Han | The Hong Kong Polytechnic University |
Li, Mengtang | Shenzhen Campus of Sun Yat-Sen University |
Keywords: Parallel Robots, Kinematics, Machine Learning for Robot Control
Abstract: The persistent multi-solution challenge in parallel robots' forward kinematics (FK) has impeded high-precision real-time control. Current data-driven approaches face limitations in predicting accurate and unique solutions, ensuring cross-architectural generalizability, and validating results through continuous trajectory experiments. To address these issues, this work proposes the Partition-Learning-Selection-Augmentation (PLSA) framework, which systematically resolves FK multi-solution challenges. PLSA clusters potential solutions through data partitioning, predicts all feasible solutions in parallel using deep neural networks (DNNs), integrates a selection mechanism to identify optimal solutions, and refines accuracy via the Newton-Raphson method. Cross-configuration tests on Stewart and 3-RRS parallel robots validate PLSA's adaptability to different architectures, achieving at least 98.99% accuracy and a computation speed of approximately 30 Hz. Additionally, three neural networks (CNN, KAN, and Transformer) are implemented and compared in the Learning-based Selection module, demonstrating PLSA's generalizability across diverse networks. Comparative studies against analytical, numerical iterative, and prior data-driven methods confirm PLSA's unique multi-solution resolution capability, delivering submillimeter accuracy with millisecond-level computation, thus establishing a real-time FK calculation methodology.
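As an illustrative aside (not from the paper): the final Augmentation stage polishes a network-predicted pose with Newton-Raphson iterations on the kinematic residual. A minimal sketch for a hypothetical planar three-leg parallel robot, where `ik_lengths` plays the role of the robot's closed-form inverse kinematics (with a non-square system, the least-squares step makes this a Gauss-Newton variant):

```python
import numpy as np

def ik_lengths(pose, base, platform):
    """Closed-form IK of a hypothetical planar 3-leg parallel robot:
    leg lengths for a platform translated by pose = (x, y)."""
    return np.linalg.norm((platform + pose[:2]) - base, axis=1)

def refine_fk(pose0, lengths, base, platform, iters=10, h=1e-6):
    """Newton-Raphson polish of a network-predicted pose: drive the
    residual r(pose) = ik(pose) - measured lengths to zero."""
    pose = pose0.astype(float)
    for _ in range(iters):
        r = ik_lengths(pose, base, platform) - lengths
        # Numerical Jacobian of the residual w.r.t. the pose parameters.
        J = np.stack([(ik_lengths(pose + h * np.eye(2)[i], base, platform)
                       - ik_lengths(pose, base, platform)) / h
                      for i in range(2)], axis=1)
        pose -= np.linalg.lstsq(J, r, rcond=None)[0]
    return pose

base = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
platform = np.array([[-0.1, 0.0], [0.1, 0.0], [0.0, 0.1]])
true_pose = np.array([0.42, 0.37])
lengths = ik_lengths(true_pose, base, platform)
print(refine_fk(np.array([0.3, 0.3]), lengths, base, platform))  # ~[0.42 0.37]
```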
|
|
13:45-13:50, Paper WeBT5.6 | |
FASTNav: Fine-Tuned Adaptive Small-Language-Models Trained for Multi-Point Robot Navigation |
|
Chen, Yuxuan | Shanghai Jiao Tong University |
Han, Yixin | Shanghai Jiao Tong University |
Li, Xiao | Shanghai Jiaotong University |
Keywords: Integrated Planning and Learning, Autonomous Vehicle Navigation, Task Planning
Abstract: With the rapid development of large language models (LLMs), robots are starting to enjoy the benefits of the new interaction methods that large language models bring. Because edge computing fulfills the needs for rapid response, privacy, and network autonomy, we believe it facilitates the extensive deployment of large-scale models in robot navigation and across various industries. In this paper, we propose FASTNav - a method for using lightweight LLMs, also known as small language models (SLMs), for robot navigation. The proposed method contains three modules: fine-tuning, teacher-student iteration, and language-based multi-point robot navigation. We train and evaluate models with FASTNav in both simulation and real robots, proving that we can deploy them with low cost, high accuracy, and low response time. Compared to compression-type methods, FASTNav shows potential for the local deployment of large models and tends to be a promising solution for language-guided robot navigation.
|
|
13:50-13:55, Paper WeBT5.7 | |
Inverse-Free and Data-Driven Motion Tracking Control for Redundant Robot with Fuzzy Recurrent Neural Network |
|
Yang, Min | Hunan University |
Zhu, Siying | Hunan University |
Zhang, Hui | Hunan University |
Keywords: Motion Control, Neural and Fuzzy Control, Redundant Robots
Abstract: Precise motion tracking control with unknown structural knowledge and noise disturbance for redundant robots remains a critical and unresolved challenge. This article proposes a novel data-driven fuzzy discrete recurrent neural network (D2-FDRNN) model to address two fundamental limitations of existing models: dependency on known kinematic knowledge and fixed sampling schemes. First, a Jacobian pseudo-inverse estimator is developed to reconstruct the manipulator’s necessary kinematic knowledge using input and output data, eliminating the need for explicit Jacobian inversion. Second, a fuzzy logic-based adaptive sampling strategy dynamically adjusts the step size to balance computational efficiency and tracking precision. In addition, a Kalman filter algorithm is applied to reduce the impact of noise. Rigorous proofs confirm the model’s exponential convergence and noise immunity. To validate the proposed D2-FDRNN model, simulations and physical experiments are carried out. The source code is available at https://github.com/YingluckZ/DD-FDRNN.git.
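As an illustrative aside (not from the paper): the core of a data-driven Jacobian pseudo-inverse estimator is recovering J from recent joint and task-space increments alone, with no structural model. A minimal batch least-squares sketch (the paper's recurrent-network estimator, fuzzy sampling, and Kalman filtering are not reproduced):

```python
import numpy as np

def estimate_jacobian(dq_hist, dx_hist):
    """Estimate J from recent input/output increments, dx ~= J dq,
    by solving min_J ||J Q - X||_F with Q, X stacking the increments."""
    Q = np.asarray(dq_hist).T            # (n_joints, n_samples)
    X = np.asarray(dx_hist).T            # (n_task,   n_samples)
    return X @ np.linalg.pinv(Q)

rng = np.random.default_rng(1)
J_true = rng.normal(size=(3, 5))         # hypothetical 5-DoF arm, 3-D task space
dq = rng.normal(scale=1e-2, size=(40, 5))
dx = dq @ J_true.T + rng.normal(scale=1e-5, size=(40, 3))  # noisy observations

J_hat = estimate_jacobian(dq, dx)
print("max estimation error:", np.abs(J_hat - J_true).max())
# Tracking then uses the estimated pseudo-inverse instead of an analytic model:
dq_cmd = np.linalg.pinv(J_hat) @ np.array([0.01, 0.0, -0.01])
print("joint increment:", dq_cmd)
```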
|
|
13:55-14:00, Paper WeBT5.8 | |
Kinematic Control of Humanoid Upper Body Robot Using Virtual Flexible Joint Dynamics Primitive and Quasi-Sliding Mode Observer (I) |
|
Yin, Hong | Harbin Institute of Technology |
Jin, HongZhe | Harbin Institute of Technology |
Ju, Fengjia | Harbin Institute of Technology |
Liu, Jiaxiu | Harbin Institute of Technology |
Zhao, Mingguo | Tsinghua University |
Zhao, Jie | Harbin Institute of Technology |
Keywords: Humanoid Robot Systems, Motion Control, Dynamics
Abstract: This paper presents an innovative approach to robot kinematic control within a state-disturbance-observer framework. Inspired by the dynamics of the human arm, this study introduces, for the first time, the virtual flexible joint dynamics primitive (VFJDP) and integrates it into the kinematic control of a humanoid upper-body robot to produce smooth and precise motions. Two jounce-level control schemes utilizing the VFJDP are developed, together with a novel quasi-sliding-mode disturbance observer for state feedback control. The VFJDP-based schemes enable precise trajectory tracking within joint angle and velocity constraints, significantly improving the robot's manipulability and generating high-order joint commands. Theoretical analysis establishes the convergence of the observer and the control algorithms. Comparative simulations show that the proposed observer improves state and disturbance estimation relative to state-of-the-art methods, and simulations under noisy conditions further verify the robustness of the proposed approach. In addition, experiments involving visual servoing tasks validate the VFJDP-based scheme, achieving a 16% improvement in trajectory tracking accuracy and a 20% improvement in manipulability compared with existing methods, including the improved clamping weighted least-norm (ICWLN) and zeroing neural network (ZNN) methods. These results confirm the effectiveness of the proposed framework in addressing kinematic control challenges in humanoid robots.
|
|
WeBT6 |
301 |
Deep Learning in Grasping and Manipulation 2 |
Regular Session |
Chair: Zhang, Xuebo | Nankai University |
|
13:20-13:25, Paper WeBT6.1 | |
SemSegGrasp: Plug-And-Play Task-Oriented Grasping Via Semantic Segmentation |
|
He, Cheng | Nankai University |
Zhao, Zhenjie | Nankai University |
Zhang, Xuebo | Nankai University |
Keywords: Deep Learning in Grasping and Manipulation, Grasping
Abstract: Task-oriented grasping (TOG) involves grasping specific parts of an object based on a task instruction. Existing methods generally integrate semantic analysis with grasp detection, resulting in low efficiency and poor generalization when dealing with new tasks and hardware. To address these problems, we propose SemSegGrasp, which reformulates TOG as a semantic segmentation problem based on point cloud and text matching. By decomposing TOG into semantic segmentation and grasp detection, SemSegGrasp can significantly enhance both the efficiency and generalization performance of TOG. Moreover, it can be combined with any off-the-shelf grasp detection algorithm in a plug-and-play manner. For semantic segmentation, SemSegGrasp first utilizes a Vision-Language Model (VLM) to generate local geometric descriptions of the target object. These descriptions are then fed into a Large Language Model (LLM) along with user instructions to obtain operational guidance. Subsequently, we separately encode the input point cloud and the operational guidance to obtain their features. Leveraging multi-head cross-attention, we conduct a matching process between these two types of features to predict the probability of each point serving as a TOG grasp point, i.e., semantic segmentation. Finally, the grasp pose is determined by fusing the segmentation results with the candidate poses generated by an existing grasp detection algorithm. Experimental results on the publicly available TaskGrasp dataset and in a real-world setting show that our SemSegGrasp method achieves state-of-the-art performance on new tasks, outperforming existing methods by at least 5% and 10%, respectively.
|
|
13:25-13:30, Paper WeBT6.2 | |
Simultaneous Pick and Place Detection by Combining SE(3) Diffusion Models with Differential Kinematics |
|
Ko, Tianyi | Woven by Toyota, Inc |
Ikeda, Takuya | Woven by Toyota, Inc |
Opra, István Balázs | Woven by Toyota / University of Bonn |
Nishiwaki, Koichi | Woven by Toyota |
Keywords: Deep Learning in Grasping and Manipulation, Manipulation Planning, Grasping
Abstract: Grasp detection methods typically target the detection of a set of free-floating hand poses that can grasp the object. However, not all of the detected grasp poses are executable due to physical constraints. Even though it is straightforward to filter invalid grasp poses in post-processing, such a two-staged approach is computationally inefficient, especially when the constraint is hard. In this work, we propose an approach that takes the following two constraints into account during the grasp detection stage: (i) the picked object must be able to be placed in a predefined configuration without in-hand manipulation, and (ii) it must be reachable by the robot under the joint-limit and collision-avoidance constraints for both pick and place. Our key idea is to train an SE(3) grasp diffusion network to estimate the noise in the form of spatial velocity, and to constrain the denoising process by a multi-target differential inverse kinematics with an inequality constraint, so that the states are guaranteed to be reachable and placement can be performed without collision. In addition to an improved success ratio, we experimentally confirmed that our approach is more efficient and more consistent in computation time compared to a naive two-stage approach.
|
|
13:30-13:35, Paper WeBT6.3 | |
AffordGrasp: In-Context Affordance Reasoning for Open-Vocabulary Task-Oriented Grasping in Clutter |
|
Tang, Yingbo | Institute of Automation, Chinese Academy of Sciences |
Zhang, Shuaike | Shandong University |
Hao, Xiaoshuai | Samsung Research China - Beijing (SRC-B) |
Wang, Pengwei | Beijing Academy of Artificial Intelligence |
Wu, Jianlong | Harbin Institute of Technology (Shenzhen) |
Wang, Zhongyuan | BAAI |
Zhang, Shanghang | Peking University |
Keywords: Deep Learning in Grasping and Manipulation, Grasping, Perception for Grasping and Manipulation
Abstract: Inferring the affordance of an object and grasping it in a task-oriented manner is crucial for robots to successfully complete manipulation tasks. Affordance indicates where and how to grasp an object by taking its functionality into account, serving as the foundation for effective task-oriented grasping. However, current task-oriented methods often depend on extensive training data that is confined to specific tasks and objects, making it difficult to generalize to novel objects and complex scenes. In this paper, we introduce AffordGrasp, a novel open-vocabulary grasping framework that leverages the reasoning capabilities of vision-language models (VLMs) for in-context affordance reasoning. Unlike existing methods that rely on explicit task and object specifications, our approach infers tasks directly from implicit user instructions, enabling more intuitive and seamless human-robot interaction in everyday scenarios. Building on the reasoning outcomes, our framework identifies task-relevant objects and grounds their part-level affordances using a visual grounding module. This allows us to generate task-oriented grasp poses precisely within the affordance regions of the object, ensuring both functional and context-aware robotic manipulation. Extensive experiments demonstrate that AffordGrasp achieves state-of-the-art performance in both simulation and real-world scenarios, highlighting the effectiveness of our method. We believe our approach advances robotic manipulation techniques and contributes to the broader field of embodied AI.
|
|
13:35-13:40, Paper WeBT6.4 | |
Occlusion-Aware 6D Pose Estimation with Visual Observation Guided Diffusion Model |
|
Xiong, Yanbin | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences |
Huang, Buzhen | Southeast University |
Hui, Ma | Beijing Normal-Hong Kong Baptist University (BNBU), Shenzhen Inst |
Yu, Liu | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences |
Cheng, Jun | Shenzhen Institutes of Advanced Technology |
Keywords: Deep Learning in Grasping and Manipulation, AI-Enabled Robotics, Computer Vision for Manufacturing
Abstract: Category-level 6D pose estimation in cluttered and occluded environments is a challenging task. Most existing methods rely on deterministic point-based correspondences to estimate target poses, which cannot account for the uncertainty of occluded objects and thus result in inferior performance. In this paper, we propose a diffusion model guided by occlusion-aware observations to adaptively refine object poses in occluded and cluttered scenes. Specifically, we first extract various 2D and 3D features from an RGB-D image to construct the conditions of the diffusion model. In the reverse diffusion process, the model is guided by implicit correspondences, perception distance, and occlusion relationships to refine the noisy pose sampled from a standard Gaussian distribution. With several denoising steps, our method can produce accurate results that are consistent with image observations in occluded scenarios. The experimental results show that the proposed method outperforms baseline methods on major metrics in occlusion scenarios. Furthermore, our approach can also be applied to robotic grasping and manipulation tasks, as shown through grasping experiments in a cluttered environment on a physical UR5 robot.
|
|
13:40-13:45, Paper WeBT6.5 | |
Region-Aware 6D Grasping for Industrial Bin-Picking: A Sim2Real Label Self-Generation and Hybrid Evaluation Framework |
|
Zhong, Xungao | Xiamen University of Technology |
Gong, Tao | Xiamen University of Technology |
Zhong, Xunyu | Xiamen University |
Liu, Qiang | University of Bristol |
Hu, Huosheng | University of Essex |
Keywords: Deep Learning in Grasping and Manipulation, Grasping, Deep Learning for Visual Perception
Abstract: The integration of high-quality datasets, a generalized network model, and robust evaluation strategies sets a significant benchmark for advancing policy development in industrial bin-picking. This paper introduces the concept of region-aware grasping, a cutting-edge simulation-to-reality system designed to generate and evaluate 6D poses, empowering robots to grasp novel workpieces in stacked environments. The proposed system comprises two core components: the Sim2Real dataset, a large-scale synthetic point cloud dataset for grasp analysis, and Semantic-GraspNet, a policy framework that predicts full 6D grasp poses for stacked objects. By encoding and decoding point cloud data, Semantic-GraspNet innovatively transforms pose prediction into a semantic categorization problem. Furthermore, we present a hybrid evaluation strategy that integrates pose assessment with mechanical grasp performance analysis, thereby enhancing both grasp success rates and sorting efficiency. To extend its capabilities, Semantic-GraspNet is combined with multi-modal large models, enabling accurate object-category-specific grasping in complex bin-picking scenarios. In real-world industrial applications, the system achieves a grasp completion rate of 91.3% in cluttered scenes and 89.2% in densely stacked environments, showcasing state-of-the-art performance in robotic picking and placing tasks.
|
|
13:45-13:50, Paper WeBT6.6 | |
Learning Adaptive Dexterous Grasping from Single Demonstrations |
|
Shi, Liangzhi | Tsinghua University |
Liu, Yulin | UCSD |
Zeng, Lingqi | Hong Kong University of Science and Technology |
Ai, Bo | National University of Singapore |
Hong, Zhengdong | Zhejiang University |
Su, Hao | UCSD |
Keywords: Deep Learning in Grasping and Manipulation, Dexterous Manipulation, Reinforcement Learning
Abstract: How can robots learn dexterous grasping skills efficiently and apply them adaptively based on user instructions? This work tackles two key challenges: efficient skill acquisition from limited human demonstrations and context-driven skill selection. We introduce AdaDexGrasp, a framework that learns a library of grasping skills from a single human demonstration per skill and selects the most suitable one using a vision-language model (VLM). To improve sample efficiency, we propose a trajectory following reward that guides reinforcement learning (RL) toward states close to a human demonstration while allowing flexibility in exploration. To learn beyond the single demonstration, we employ curriculum learning, progressively increasing object pose variations to enhance robustness. At deployment, a VLM retrieves the appropriate skill based on user instructions, bridging low-level learned skills with high-level intent. We evaluate AdaDexGrasp in both simulation and real-world settings, showing that our approach significantly improves RL efficiency and enables learning human-like grasp strategies across varied object configurations. Finally, we demonstrate zero-shot transfer of our learned policies to a real-world PSYONIC Ability Hand, with a 90% success rate across objects, significantly outperforming the baseline.
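As an illustrative aside (not from the paper): a trajectory-following reward of the kind described can be as simple as a Gaussian bump around the nearest demonstrated state, which pulls the RL agent toward the demonstration without dictating its timing. A hypothetical sketch (the state representation, `sigma`, and the toy trajectory are all assumptions):

```python
import numpy as np

def trajectory_following_reward(state, demo, sigma=0.05):
    """Reward staying near *some* state of the single human demonstration,
    without prescribing when it must be reached, so exploration stays
    flexible while being guided toward demonstrated configurations."""
    d = np.linalg.norm(demo - state, axis=1).min()  # distance to nearest demo state
    return np.exp(-(d / sigma) ** 2)

demo = np.linspace([0.0, 0.0, 0.2], [0.1, 0.3, 0.05], num=50)  # toy hand trajectory
print(trajectory_following_reward(np.array([0.05, 0.16, 0.12]), demo))
```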
|
|
13:50-13:55, Paper WeBT6.7 | |
DNAct: Diffusion Guided Multi-Task 3D Policy Learning |
|
Yan, Ge | University of California San Diego |
Wu, Yueh-Hua | University of California, San Diego |
Wang, Xiaolong | UC San Diego |
Keywords: Deep Learning in Grasping and Manipulation, Imitation Learning, Representation Learning
Abstract: This paper presents DNAct, a language-conditioned multi-task policy framework that integrates neural rendering pre-training and diffusion training to enforce multi-modality learning in action sequence spaces. To learn a generalizable multi-task policy with few demonstrations, the pre-training phase of DNAct leverages neural rendering to distill 2D semantic features from foundation models such as Stable Diffusion to a 3D space, which provides a comprehensive semantic understanding regarding the scene. Consequently, it allows various applications for challenging robotic tasks requiring rich 3D semantics and accurate geometry. Furthermore, we introduce a novel approach utilizing diffusion training to learn a vision and language feature that encapsulates the inherent multi-modality in the multi-task demonstrations. By reconstructing the action sequences from different tasks via the diffusion process, the model is capable of distinguishing different modalities and thus improving the robustness and the generalizability of the learned representation. DNAct significantly surpasses SOTA NeRF-based multi-task manipulation approaches with over 30% improvement in success rate. Videos are available on dnact.github.io.
|
|
13:55-14:00, Paper WeBT6.8 | |
FFBGNet: Full-Flow Bidirectional Feature Fusion Grasp Detection Network Based on Hybrid Architecture |
|
Wan, Qin | Hunan Institute of Engineering |
Ning, ShunXing | Hunan Institute of Engineering |
Tan, Haoran | Hunan University |
Wang, Yaonan | Hunan University |
Duan, Xiaogang | Central South University |
Li, Zhi | Hunan Institute of Engineering |
Yang, Yang | Hunan Zhongnan Intelligent Equipment Co., Ltd |
Qiu, Jianhua | Hunan University |
Keywords: Deep Learning in Grasping and Manipulation, Grasping, Perception for Grasping and Manipulation
Abstract: Effectively integrating the complementary information in RGB-D images poses a significant challenge for robotic grasping. In this letter, a full-flow bidirectional feature fusion grasp detection network (FFBGNet) based on a hybrid architecture is proposed to generate accurate grasp poses from RGB-D images. First, we construct an efficient cross-modal feature fusion module as an information bridge for interaction throughout the full flow of the two branches, where fusion is applied at each encoding and decoding layer. The two branches can then fully exploit the appearance information from RGB images and the geometric information from depth images. Second, a hybrid architecture module with CNN and Transformer in parallel is developed to achieve better representation of local features and global information. Finally, qualitative and quantitative comparison experiments are conducted on the Cornell and Jacquard datasets, achieving grasp detection accuracies of 99.2% and 96.5%, respectively. Meanwhile, in physical grasping experiments, FFBGNet achieves a 96.7% success rate in cluttered scenes, which further demonstrates the effectiveness of the proposed method.
|
|
WeBT7 |
307 |
Motion and Path Planning 6 |
Regular Session |
|
13:20-13:25, Paper WeBT7.1 | |
VL-TGS: Trajectory Generation and Selection Using Vision Language Models in Mapless Outdoor Environments |
|
Song, Daeun | George Mason University |
Liang, Jing | University of Maryland |
Xiao, Xuesu | George Mason University |
Manocha, Dinesh | University of Maryland |
Keywords: Motion and Path Planning, Task and Motion Planning, Integrated Planning and Learning
Abstract: We present a multi-modal trajectory generation and selection algorithm for real-world mapless outdoor navigation in human-centered environments. Such environments contain rich features like crosswalks, grass, and curbs, which are easily interpretable by humans, but not by mobile robots. We aim to compute suitable trajectories that (1) satisfy the environment-specific traversability constraints and (2) generate human-like paths while navigating on crosswalks, sidewalks, etc. Our formulation uses a Conditional Variational Autoencoder (CVAE) generative model enhanced with traversability constraints to generate multiple candidate trajectories for global navigation. We develop a visual prompting approach and leverage the Visual Language Model's (VLM) zero-shot ability of semantic understanding and logical reasoning to choose the best trajectory given the contextual information about the task. We evaluate our method in various outdoor scenes with wheeled robots and compare the performance with other global navigation algorithms. In practice, we observe an average improvement of 20.81% in satisfying traversability constraints and 28.51% in terms of human-like navigation in four different outdoor navigation scenarios.
|
|
13:25-13:30, Paper WeBT7.2 | |
Tree-Based Grafting Approach for Bidirectional Motion Planning with Local Subsets Optimization |
|
Zhang, Liding | Technical University of Munich |
Ling, Yao | Technical University of Munich (TUM) |
Bing, Zhenshan | Technical University of Munich |
Wu, Fan | Technical University of Munich |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: Motion and Path Planning, Task and Motion Planning, Manipulation Planning
Abstract: Bidirectional motion planning often reduces planning time compared to its unidirectional counterparts. It requires connecting the forward and reverse search trees to form a continuous path. However, this process could fail and restart the asymmetric bidirectional search due to the limitations of lazy-reverse search. To address this challenge, we propose Greedy GuILD Grafting Trees (G3T*), a novel path planner that grafts invalid edge connections at both ends to re-establish tree-based connectivity, enabling rapid path convergence. G3T* employs a greedy approach using the minimum Lebesgue measure of guided incremental local densification (GuILD) subsets to optimize paths efficiently. Furthermore, G3T* dynamically adjusts the sampling distribution between the informed set and GuILD subsets based on historical and current cost improvements, ensuring asymptotic optimality. These features enhance the forward search's growth towards the reverse tree, achieving faster convergence and lower solution costs. Benchmark experiments across dimensions from R^2 to R^8 and real-world robotic evaluations demonstrate G3T*'s superior performance compared to existing single-query sampling-based planners. A video showcasing our experimental results is available at: https://youtu.be/3mfCRL5SQIU.
|
|
13:30-13:35, Paper WeBT7.3 | |
A Bezier Path Optimization Algorithm for Flying Chip-Ejection Mass Transfer Technology (I) |
|
Sun, JianFeng | Guangdong University of Technology, State Key Laboratory of Prec |
Chen, Xun | Guangdong University of Technology |
Lin, Guohuai | Guangdong University of Technology |
Yao, JingSong | Guangdong University of Technology |
Chen, Xin | Guangdong University of Technology |
Keywords: Motion and Path Planning, Task and Motion Planning, Optimization and Optimal Control
Abstract: In this paper, a path optimization algorithm using the Bezier curve for flying chip-ejection mass transfer technology is proposed. The optimization problem is first constructed by considering the flying chip ejection requirements. Then, the properties of the Bezier curve are introduced to expand the search space for the optimization problem. After that, a Bezier optimization algorithm is proposed. Compared with previous algorithms in simulations and experiments, the proposed algorithm is more suitable for optimization problems with strict constraints.
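As an illustrative aside (not from the paper): the search space of a Bezier path expands by moving its control points, while de Casteljau's algorithm evaluates the curve for any candidate control polygon. A minimal sketch with hypothetical control points:

```python
import numpy as np

def de_casteljau(ctrl, t):
    """Evaluate a Bezier curve at parameter t by repeated linear
    interpolation. ctrl: (n+1, d) control points of a degree-n curve."""
    pts = np.asarray(ctrl, dtype=float)
    while len(pts) > 1:
        pts = (1 - t) * pts[:-1] + t * pts[1:]
    return pts[0]

# A cubic planar path: moving the interior control points reshapes the
# path without touching its endpoints, which is what an optimizer can
# exploit under strict start/goal constraints.
ctrl = [[0, 0], [0.2, 0.8], [0.8, 0.9], [1, 0]]
curve = np.array([de_casteljau(ctrl, t) for t in np.linspace(0, 1, 11)])
print(curve)
```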
|
|
13:35-13:40, Paper WeBT7.4 | |
PathCluster: Pedestrian Group-Adaptive Social Navigation in Dense Crowds |
|
Gunukula, Nihal | Purdue University |
Bera, Aniket | Purdue University |
Keywords: Motion and Path Planning, Task and Motion Planning, Task Planning
Abstract: Mobile robot navigation in crowded spaces is crucial for deployment but remains challenging in extremely dense environments. While recent works have utilized predictive measures to anticipate individual trajectories, improving overall navigation, these methods often fail to scale in high-density crowds due to computational intensity. To mitigate this issue, we propose PathCluster, a novel approach that groups individuals with similar trajectories to enable efficient navigation in dense crowds. Our method introduces a group-generator algorithm that identifies and treats clusters as cohesive units, significantly decreasing the complexity of trajectory prediction while maintaining its benefits. Simulation results demonstrate that PathCluster achieves a 45% higher success rate and a 25% lower collision rate, and can tackle more challenging navigational tasks within a 48-hour time limit, compared to previous social navigation models in extremely crowded environments.
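As an illustrative aside (not from the paper): the group-generator idea can be approximated by merging pedestrians whose observed trajectories stay close on average, then predicting each group as one unit. A hypothetical greedy sketch (the paper's actual clustering criterion may differ):

```python
import numpy as np

def group_pedestrians(trajs, eps=0.5):
    """Greedy grouping: two pedestrians join the same group when their
    time-aligned trajectories stay within eps of each other on average."""
    n = len(trajs)
    labels = [-1] * n
    g = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = g
        for j in range(i + 1, n):
            if labels[j] == -1 and np.linalg.norm(trajs[i] - trajs[j], axis=1).mean() < eps:
                labels[j] = g
        g += 1
    return labels

t = np.linspace(0, 1, 20)[:, None]
a = np.hstack([t, t])          # walking diagonally
b = a + [0.2, 0.0]             # close companion -> same group as a
c = np.hstack([t, 1 - t])      # crossing pedestrian -> its own group
print(group_pedestrians([a, b, c]))  # -> [0, 0, 1]
```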
|
|
13:40-13:45, Paper WeBT7.5 | |
Heuristic Search for Path Finding with Refuelling |
|
Zhao, Shizhe | Shanghai Jiao Tong University |
Nandy, Anushtup | Carnegie Mellon University |
Choset, Howie | Carnegie Mellon University |
Rathinam, Sivakumar | TAMU |
Ren, Zhongqiang | Shanghai Jiao Tong University |
Keywords: Motion and Path Planning, Task and Motion Planning
Abstract: This paper considers a generalization of Path Finding (PF) with refuelling constraints, referred to as the Gas Station Problem (GSP). Similar to PF, given a graph where vertices are gas stations with known fuel prices and edge costs are the gas consumption between the two vertices, GSP seeks a minimum-cost path from the start to the goal vertex for a robot with a limited gas tank and a limited number of refuelling stops. While GSP is polynomial-time solvable, it remains a challenge to quickly compute an optimal solution in practice, since it requires simultaneously determining the path, where to make the stops, and the amount to refuel at each stop. This paper develops a heuristic search algorithm called Refuel A* (RF-A*) that iteratively constructs partial solution paths from the start to the goal, guided by a heuristic while leveraging dominance rules for pruning during planning. RF-A* is guaranteed to find an optimal solution and often runs 2 to 8 times faster than existing approaches in large city maps with several hundred gas stations.
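As an illustrative aside (not from the paper): the coupling of path, stop, and refuel-amount decisions becomes visible once the search state is extended to (vertex, fuel, stops used). A simplified uniform-cost sketch with integer fuel units; RF-A*'s heuristic guidance and dominance pruning are omitted, and the tiny graph is a made-up example:

```python
import heapq

def gsp_search(graph, price, start, goal, tank, max_stops):
    """Uniform-cost search for the Gas Station Problem over states
    (vertex, fuel, stops). graph: {u: [(v, fuel_needed), ...]};
    price: {u: cost per fuel unit at u}. Returns minimum money spent."""
    pq = [(0.0, start, 0, 0)]                 # (cost, vertex, fuel, stops)
    best = {}
    while pq:
        cost, u, fuel, stops = heapq.heappop(pq)
        if u == goal:
            return cost
        if best.get((u, fuel, stops), float("inf")) < cost:
            continue
        succ = []
        # Refuel here: buying any amount counts as one stop.
        if stops < max_stops:
            for amount in range(1, tank - fuel + 1):
                succ.append((cost + amount * price[u], u, fuel + amount, stops + 1))
        # Drive to a neighbour if the tank allows it.
        for v, need in graph.get(u, []):
            if fuel >= need:
                succ.append((cost, v, fuel - need, stops))
        for c2, v, f2, s2 in succ:
            if c2 < best.get((v, f2, s2), float("inf")):
                best[(v, f2, s2)] = c2
                heapq.heappush(pq, (c2, v, f2, s2))
    return None

graph = {"s": [("a", 4)], "a": [("g", 3)]}
price = {"s": 1.0, "a": 2.0, "g": 0.0}
# Optimal plan: fill the 5-unit tank cheaply at s, top up 2 units at a.
print(gsp_search(graph, price, "s", "g", tank=5, max_stops=2))  # -> 9.0
```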
|
|
13:45-13:50, Paper WeBT7.6 | |
Asymptotically Optimal Path Planning with an Approximation of the Omniscient Set |
|
Kriz, Jonas | Czech Technical University in Prague |
Vonasek, Vojtech | Czech Technical University in Prague |
Keywords: Motion and Path Planning, Task and Motion Planning
Abstract: The asymptotically optimal version of the Rapidly-exploring Random Tree (RRT*) is often used to find optimal paths in a high-dimensional configuration space. A well-known issue of RRT* is its slow convergence towards the optimal solution. A possible remedy is to draw random samples only from a subset of the configuration space that is known to contain configurations that can improve the cost of the path (the omniscient set). A fast convergence rate may be achieved by approximating the omniscient set with a low-volume set. In this paper, we propose new methods to approximate the omniscient set and to sample it effectively. First, we propose to approximate the omniscient set using several (small) hyperellipsoids defined by sections of the current best solution. The second approach approximates the omniscient set by a convex hull computed from the current solution. Both approaches ensure asymptotic optimality and work in a general n-dimensional configuration space. Experiments show the superior performance of our approaches in multiple scenarios in 3D and 6D configuration spaces. The proposed methods will be available as open source at https://github.com/BipoaroXigen/JPL.
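As an illustrative aside (not from the paper): sampling a hyperellipsoid approximation of the omniscient set reduces to a stretch-and-rotate of uniform unit-ball samples. A minimal sketch with a hypothetical path-section ellipsoid (center, semi-axes, and frame are made-up values):

```python
import numpy as np

def sample_hyperellipsoid(center, axes_lengths, rotation, rng):
    """Draw one point uniformly from a hyperellipsoid: sample the unit
    n-ball uniformly, stretch by the semi-axis lengths, rotate, translate."""
    n = len(center)
    x = rng.normal(size=n)
    x /= np.linalg.norm(x)               # uniform direction on the unit sphere
    x *= rng.uniform() ** (1.0 / n)      # radius for uniformity inside the ball
    return center + rotation @ (axes_lengths * x)

rng = np.random.default_rng(0)
# A small ellipsoid around one section of the current best path,
# elongated along the path direction as in the section-wise approximation.
center = np.array([1.0, 2.0, 0.5])
axes = np.array([0.4, 0.1, 0.1])
R = np.eye(3)                            # path-aligned frame (identity here)
pts = np.array([sample_hyperellipsoid(center, axes, R, rng) for _ in range(5)])
print(pts)
```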
|
|
13:50-13:55, Paper WeBT7.7 | |
Learning to Initialize Trajectory Optimization for Vision-Based Autonomous Flight in Unknown Environments |
|
Chen, Yicheng | Beihang University |
Li, Jinjie | The University of Tokyo |
Qin, Wenyuan | Beihang University |
Hua, Yongzhao | Beihang University |
Dong, Xiwang | Beihang University |
Li, Qingdong | Beihang University |
Keywords: Motion and Path Planning, Vision-Based Navigation, Deep Learning for Visual Perception
Abstract: Autonomous flight in unknown environments requires precise spatial and temporal trajectory planning, often involving computationally expensive nonconvex optimization prone to local optima. To overcome these challenges, we present the Neural-Enhanced Trajectory Planner (NEO-Planner), a novel approach that leverages a Neural Network (NN) Planner to provide informed initial values for trajectory optimization. The NN-Planner is trained on a dataset generated by an expert planner using batch sampling, capturing multimodal trajectory solutions. It learns to predict spatial and temporal parameters for trajectories directly from raw sensor observations. NEO-Planner starts optimization from these predictions, accelerating computation speed while maintaining explainability. Furthermore, we introduce a robust online replanning framework that accommodates planning latency for smooth trajectory tracking. Extensive simulations demonstrate that NEO-Planner reduces optimization iterations by 20%, leading to a 26% decrease in computation time compared with pure optimization-based methods. It maintains trajectory quality comparable to baseline approaches and generalizes well to unseen environments. Real-world experiments validate its effectiveness for autonomous drone navigation in cluttered, unknown environments.
|
|
13:55-14:00, Paper WeBT7.8 | |
LITE: A Learning-Integrated Topological Explorer for Multi-Floor Indoor Environments |
|
Chen, Junhao | Zhejiang University |
Zhang, Zhen | Zhejiang University |
Zhu, Chengrui | Zhejiang University |
Hou, Xiaojun | Zhejiang University |
Hu, Tianyang | Zhejiang University |
Wu, Huifeng | Hangzhou Dianzi University |
Liu, Yong | Zhejiang University |
Keywords: Motion and Path Planning
Abstract: This work focuses on multi-floor indoor exploration, which remains an open area of research. Compared to traditional methods, recent learning-based explorers have demonstrated significant potential due to their robust environmental learning and modeling capabilities, but most are restricted to 2D environments. In this paper, we propose a learning-integrated topological explorer, LITE, for multi-floor indoor environments. LITE decomposes the environment into a floor-stair topology, enabling seamless integration of learning- or non-learning-based 2D exploration methods for 3D exploration. As we incrementally build the floor-stair topology during exploration using a YOLO11-based instance segmentation model, the agent can transition between floors through a finite state machine. Additionally, we implement an attention-based 2D exploration policy that utilizes an attention mechanism to capture spatial dependencies between different regions, thereby determining the next global goal for more efficient exploration. Extensive comparison and ablation studies conducted on the HM3D and MP3D datasets demonstrate that our proposed 2D exploration policy significantly outperforms all baseline explorers in terms of exploration efficiency. Furthermore, experiments in several 3D multi-floor environments indicate that our framework is compatible with various 2D exploration methods, facilitating effective multi-floor indoor exploration. Finally, we validate our method in the real world with a quadruped robot, highlighting its strong generalization capabilities.
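As an illustrative aside (not from the paper): a floor-stair finite state machine can be captured in a few transitions. A hypothetical sketch of such transition logic (the modes and predicates are assumptions, not LITE's actual interface):

```python
from enum import Enum, auto

class Mode(Enum):
    EXPLORE_FLOOR = auto()
    GO_TO_STAIR = auto()
    TRAVERSE_STAIR = auto()
    DONE = auto()

def step(mode, floor_explored, stair_known, stair_reached, stair_done,
         all_floors_done):
    """One transition of a hypothetical floor-stair state machine."""
    if mode is Mode.EXPLORE_FLOOR and floor_explored:
        if all_floors_done:
            return Mode.DONE
        return Mode.GO_TO_STAIR if stair_known else Mode.EXPLORE_FLOOR
    if mode is Mode.GO_TO_STAIR and stair_reached:
        return Mode.TRAVERSE_STAIR
    if mode is Mode.TRAVERSE_STAIR and stair_done:
        return Mode.EXPLORE_FLOOR        # start exploring the new floor
    return mode

print(step(Mode.EXPLORE_FLOOR, True, True, False, False, False))  # GO_TO_STAIR
```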
|
|
WeBT8 |
308 |
Micro/Nano Robots 5 |
Regular Session |
Chair: Zhao, Xin | Nankai University |
Co-Chair: Liu, Na | Shanghai University, Shanghai, China |
|
13:20-13:25, Paper WeBT8.1 | |
Haptic Feedback Control Strategy for Microswarm Navigation in Flowing Environments |
|
Cao, Ying | Southeast University |
Yuan, Yanjia | Southeast University |
Yang, Qijun | Southeast University |
Luo, Shengming | Southeast University |
An, Xuanyu | Southeast University |
Zhang, Haoyu | Southeast University |
Du, Jiansheng | Southeast University |
Wang, Xiaoyu | Southeast University |
Wang, Qianqian | Southeast University |
Keywords: Micro/Nano Robots, Motion Control, Soft Sensors and Actuators
Abstract: Swarming microrobots offer great promise for targeted delivery in biofluidic environments. However, current approaches insufficiently utilize the operator's perceptual awareness and interactive decision-making capabilities. This work proposes a real-time navigation and control strategy with haptic feedback for delivering a magnetic microswarm, in which the haptic feedback system conveys microswarm-environment interaction to the operator. The real-time tracking system continuously monitors the position and shape of the microswarm in the remote environment, transmitting data to the control system for decision-making. This integration achieves real-time perception and feedback of the microswarm's state and motion. Moreover, the strategy successfully demonstrates three-dimensional (3D) navigation and shape-adaptive regulation of the microswarm under static, downstream, and upstream flow conditions. The experimental results show that the haptic feedback enables real-time trajectory and velocity adjustments during navigation, improving control robustness and delivery accuracy. Our work extends haptic-feedback-enabled microswarm control to dynamic conditions, providing an adaptive swarm control strategy for complex biomedical environments.
|
|
13:25-13:30, Paper WeBT8.2 | |
Microfluidics-Based Analysis of Controlled Mixing and Bubble Formation in Soda Solutions for Education |
|
Owusu, Eric Kwam | Shanghai University |
Sinzinkayo, Donatien | Shanghai University |
Liu, Na | Shanghai University, Shanghai, China |
Wang, Yue | Shanghai University |
Yue, Tao | Shanghai University |
Keywords: Micro/Nano Robots, Nanomanufacturing
Abstract: This paper presents a hands-on laboratory experiment for students in education and research to analyze microfluidics-controlled mixing, focusing on fluid dynamics. The experiment utilizes a high-resolution microscope and microfluidic chips, each calibrated to different pH levels. It employs the vinegar-and-baking-soda reaction, providing a safe and visually engaging method for studying fluid behavior. The setup, illustrated in Fig. 2, uses a flexible silicon-tube syringe pump to inject vinegar and soda solution into the microfluidic system. The study examines bubble formation with varying soda concentrations, revealing a non-linear relationship between soda content and bubble size. At 0.2 M, the mean bubble area was 389.06 µm², with the largest bubble measuring 2587.5 µm². At 0.4 M, the bubble size decreased to 137 µm, while the frequency increased to 105. At higher concentrations, bubble sizes stabilized, producing round shapes. The flow rates of the soda solution were set to increase from 3 μL/min to 15 μL/min. The results indicate that the microfluidic system generated tiny bubbles under stable laminar flow, with occasional larger outliers. Bubbles exhibited an irregular shape influenced by the flow dynamics and concentration, with most bubbles forming along the channel walls, highlighting the fluid interactions in microchannels. Keywords: microfluidics; education; controlled mixing; vinegar-baking soda reactions; bubble formation
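As an illustrative aside (not from the paper): per-bubble statistics like the reported mean areas are typically obtained by thresholding each frame and measuring connected components. A minimal OpenCV sketch on a synthetic frame; the calibration factor `px_area_um2` and the drawn "bubbles" are hypothetical placeholders for a real micrograph:

```python
import cv2
import numpy as np

def measure_bubbles(gray, px_area_um2=1.0):
    """Threshold a grayscale frame, label connected components, and
    return per-bubble areas (in um^2 given the pixel-area calibration)."""
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    return stats[1:, cv2.CC_STAT_AREA] * px_area_um2  # skip the background label

# Synthetic frame with two bright "bubbles" standing in for a microscope image.
frame = np.zeros((120, 160), np.uint8)
cv2.circle(frame, (40, 60), 10, 255, -1)
cv2.circle(frame, (110, 50), 4, 255, -1)
areas = measure_bubbles(frame, px_area_um2=1.6)
print("bubble areas (um^2):", areas, "mean:", areas.mean())
```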
|
|
13:30-13:35, Paper WeBT8.3 | |
A Bio-Inspired Spherical Soft Magnetic Millirobot for Gastrointestinal Applications |
|
Li, Yulin | Nankai University |
Hong, Zhaorui | Nankai University |
Zhao, Yuhao | Nankai University |
Zhang, Shuohao | Nankai University |
Zhao, Xin | Nankai University |
Yang, Liu | Nankai University |
Keywords: Micro/Nano Robots, Soft Robot Applications, Medical Robots and Systems
Abstract: Gastroscopy and colonoscopy have become the fundamental tools for gastrointestinal (GI) tract diagnosis and treatment. Conventional tethered devices usually lead to the use of anesthetic agents and patient discomfort. Capsule endoscopy has become an ideal alternative; however, the smooth capsule shape may hinder its active locomotion and retention in the GI tract. Here, we propose a spherical soft magnetic millirobot (S2M2 robot) that integrates a virus-like spherical body with protrusions and an octopus-inspired sucker design for better motion abilities. The protruding suckers (radius 2.2~4.8 mm) on its surface enable efficient movement (maximum angular velocity 8 r/s, maximum speed 180 mm/s) with strong adhesion (maximum 3.5 N), which could improve operational accuracy. The ex vivo experiment in a swine stomach verifies the robot’s locomotion ability on the slippery surface of the gastric mucosa and the effectiveness of the pressure-based drug delivery system. The in vitro and ex vivo results highlight the superior mobility and controllability, showcasing its potential as a carrier robot for next-generation capsule endoscopy.
|
|
13:35-13:40, Paper WeBT8.4 | |
Micro-Robotic Swarm of Silicone Oil-Based Ferrofluid’s Micro-Droplets |
|
Fu, Yulei | Shanghai Jiao Tong University |
Yu, Hengao | Shanghai Jiao Tong University |
Chen, Leilei | Shanghai Jiao Tong University |
Zheng, Zhiteng | Harbin Institute of Technology |
Wang, Wendong | Shanghai Jiao Tong University |
Keywords: Micro/Nano Robots, Soft Sensors and Actuators, Automation at Micro-Nano Scales
Abstract: Microscale droplet-based robotic systems have emerged as a promising platform for targeted drug delivery, minimally invasive surgery, and lab-on-a-chip applications. Here, we report a novel microrobotic swarm based on microdroplets of silicone oil-based ferrofluid, which exhibits excellent biocompatibility and chemical inertness. By modulating three-dimensional magnetic fields, we achieved reconfigurable self-organized patterns of an aggregated state, a dispersed state, and a chain state. We established a dynamic model and reproduced the three states via numerical simulations. Furthermore, we discovered two locomotion modes: sliding and rolling. Utilizing the sliding mode, we navigated the swarm through narrow and complex channels and accomplished directional transport of bubbles, enabling both translational and rotational movements.
|
|
13:40-13:45, Paper WeBT8.5 | |
Magnetically Actuated Steerable Catheter with Redundant DoF for Cardiovascular Interventions |
|
Liao, Hongzhe | Beijing Institute of Technology |
Jin, Han | Peking University First Hospital |
Du, Jialong | Beijing Institute of Technology |
Liang, Xiyue | Beijing Institute of Technology |
Li, Yuke | Beijing Institute of Technology |
Huang, Qiang | Beijing Institute of Technology |
Arai, Tatsuo | University of Electro-Communications |
Liu, Xiaoming | Beijing Institute of Technology |
Keywords: Micro/Nano Robots, Surgical Robotics: Steerable Catheters/Needles, Medical Robots and Systems
Abstract: A magnetically controlled catheter system is proposed to enhance the precision and safety of vascular interventions by reducing procedure time and radiation exposure. The system can also function as a support channel for guidewire deployment. A novel navigation approach is introduced, employing an external permanent magnet capable of controlled rotation to actuate a catheter with an embedded magnetic tip. By leveraging magnetic interactions and a redundant rotational degree of freedom, the system enables small-angle control of the catheter tip without requiring extensive spatial movement, thus improving maneuverability in confined vascular regions. The magnetic field distribution and its influence on catheter response are characterized, and a kinematic model of the actuation mechanism is established. Experimental validation is conducted under varying magnetic field strengths and orientations, demonstrating reliable steering performance. Application-based experiments in simulated clinical environments further confirm precise navigation capability. The results highlight the advantages of rotational magnetic control in enhancing flexibility and accuracy. The proposed system presents a promising solution for automating catheter-based interventions, offering improved efficiency and power in minimally invasive procedures.
|
|
13:45-13:50, Paper WeBT8.6 | |
Magnetic Microswarms with Controlled Locomotion in Liquid and Air Environments |
|
Chen, Ziheng | Shanghai University |
Yu, Jiangfan | Chinese University of Hong Kong, Shenzhen |
Liu, Na | Shanghai University, Shanghai, China |
Keywords: Micro/Nano Robots
Abstract: Magnetic microswarms have attracted significant attention in medical robotics, owing to their potential for performing complex tasks in challenging environments. However, developing microswarms that can operate effectively in both liquid and air environments remains a substantial challenge. This study presents the design and characterization of hydrogel-based microswarms composed of magnetic hydrogel particles prepared from agarose hydrogel and NdFeB magnetic microparticles. These microswarms form stable monolayer structures actuated by rotating magnetic fields at high frequencies (10 Hz) in liquid environments, enabling synchronization with the external magnet and achieving translational motion. Actuated by an oscillating magnetic field, the swarms transition from a monolayer configuration to a three-dimensional (3D) structure in the air environment. Experimental results demonstrate that the 3D swarms are capable of navigating complex terrains and interacting with tissue surfaces in air environments. Finally, we demonstrate the potential of these 3D swarms for targeted delivery and adaptive filling of gastric perforations using an ex vivo gastric tissue model, showcasing their potential for biomedical applications.
|
|
13:50-13:55, Paper WeBT8.7 | |
A Light Controlled Micromixer Using Optoelectronic Tweezers |
|
Yan, Feng | Shanghai University |
Liu, Peisen | Shanghai University |
Zheng, Lixiang | Shanghai University |
Zhang, Yuzhao | Shanghai University |
Yue, Tao | Shanghai University |
Liu, Na | Shanghai University, Shanghai, China |
Keywords: Micro/Nano Robots
Abstract: This work presents a flexible and effective micromixer based on optoelectronic tweezers (OET), which leverages both asymmetric induced-charge electro-osmosis (ICEO) and dielectrophoresis (DEP) phenomena on microscale anisotropic NdFeB particles. The asymmetric ICEO phenomenon is generated by symmetry breaking in the induced charge distributions of geometrically anisotropic NdFeB particles under AC electric field polarization. The DEP forces exerted on NdFeB particles are induced by the light-generated non-uniform electric field. Under the combined action of hydrodynamic forces from asymmetric ICEO vortices and positive DEP forces, NdFeB particles can be attracted into light-induced "virtual" electrodes and precisely track along light-defined trajectories. Experimental results demonstrate that the maximum motion speed of the NdFeB particles exceeds 300 μm/s, with the motion speed exhibiting a positive correlation with the applied voltage. Dynamically controlled virtual electrodes enable accurate capture and relocation of microparticles to arbitrary target positions. The stirring and mixing capability of the NdFeB particles is demonstrated by driving yeast cell motion.
|
|
13:55-14:00, Paper WeBT8.8 | |
Gesture Identification and Object Temperature Detection of a Robotic Hand Using a Wireless Flexible Sensing Feedback Control System |
|
Wu, Lining | Beijing Institute of Technology |
Xiao, Shuai | Beijing Institute of Technology |
Li, Chunyang | Beijing Institute of Technology |
Zhang, Fanqing | Beijing Institute of Technology |
Li, Zhongyi | Beijing Institute of Technology |
Lv, Chengzhai | Beijing Institute of Technology |
Du, Ming | Beijing Institute of Technology |
Dong, Lixin | City University of Hong Kong |
Zhao, Jing | Beijing Institute of Technology |
Keywords: Nanomanufacturing, Soft Sensors and Actuators, Haptics and Haptic Interfaces
Abstract: The ability of robotic hands to sense their environment and provide feedback is becoming increasingly vital for advanced robotic systems. Real-time tactile interaction is crucial for ensuring safe and effective human-machine collaboration. However, conventional sensors face significant challenges in accurately sensing and distinguishing multiple physical signals, thus limiting their applications in intelligent robotic systems. Herein, we developed a control system comprising a multifunctional sensor, corresponding circuitry, and a feedback control unit, enabling a robotic hand to perceive and respond to environmental strain and temperature during grasping tasks. A P(VDF-TrFE) membrane and a nanographene sheet are combined to form the sensor. A linear inverse method is proposed to decouple the strain and temperature responses, simplifying the process. Furthermore, we designed an acquisition circuit that uses Bluetooth transmission to display the decoupled signals on a mobile device. The feedback control section is then designed based on the data collected by the measurement circuit. Finally, the flexible multimodal sensor patch is attached to a robotic hand, demonstrating its ability to precisely identify varying strains and temperatures while grasping objects, with an accuracy of 0.8 °C and 0.013%. It can also provide feedback control of the robotic hand's actions in response to the sensed strain and temperature changes. The proposed sensing system integrates a flexible structure, wireless signal transmission, and feedback control, along with exceptional sensitivity and linearity in strain and temperature detection. These characteristics facilitate its application in intelligent robotic systems.
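As an illustrative aside (not from the paper): a linear inverse decoupling of two coupled channels amounts to inverting a calibrated sensitivity matrix. A minimal sketch with hypothetical sensitivities (the matrix entries are made-up values that calibration would determine):

```python
import numpy as np

# Hypothetical 2x2 sensitivity matrix M: each sensing channel responds
# linearly to both strain and temperature change.
M = np.array([[2.0e3, 0.05],    # channel 1: response per strain, per degC
              [1.0e2, 0.90]])   # channel 2

M_inv = np.linalg.inv(M)        # the "linear inverse" decoupling step

def decouple(readings):
    """Recover (strain, delta_T) from the two coupled channel readings."""
    return M_inv @ np.asarray(readings)

true = np.array([1.3e-4, 2.5])  # 0.013% strain, +2.5 degC
print(decouple(M @ true))       # -> recovers [1.3e-04, 2.5]
```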
|
|
WeBT9 |
309 |
Object Detection, Segmentation and Categorization 2 |
Regular Session |
Chair: Li, Qiang | Shenzhen Technology University |
|
13:20-13:25, Paper WeBT9.1 | |
SGDet3D: Semantics and Geometry Fusion for 3D Object Detection Using 4D Radar and Camera |
|
Bai, Xiaokai | Zhejiang University |
Yu, Zhu | Zhejiang University |
Zheng, Lianqing | TONGJI University |
Zhang, Xiaohan | Zhejiang University |
Zhou, Zili | Zhejiang University |
Zhang, Xue | Zhejiang University |
Wang, Fang | Hangzhou City University |
Bai, Jie | Hangzhou City University |
Shen, Hui-liang | Zhejaing University |
Keywords: Object Detection, Segmentation and Categorization, Sensor Fusion, Deep Learning Methods
Abstract: 4D millimeter-wave radar has gained attention as an emerging sensor for autonomous driving in recent years. However, existing 4D radar and camera fusion models often fail to fully exploit the complementary information within each modality and lack deep cross-modal interactions. To address these issues, we propose a novel 4D radar and camera fusion method, named SGDet3D, for 3D object detection. Specifically, we first introduce a dual-branch fusion module that employs geometric depth completion and semantic radar PillarNet to comprehensively leverage the geometric and semantic information within each modality. We then introduce an object-oriented attention module that employs localization-aware cross-attention to facilitate deep interactions across modalities by allowing queries in bird's-eye view (BEV) to attend to image tokens of interest. We validate SGDet3D on the TJ4DRadSet and View-of-Delft (VoD) datasets. Experimental results demonstrate that SGDet3D effectively fuses 4D radar data and camera images and achieves state-of-the-art performance. Our code will be made publicly available after acceptance.
|
|
13:25-13:30, Paper WeBT9.2 | |
Interactive Object Detection by Mitigating Uncertainty of Robot Task Plans Using Large Language Model |
|
Suzuki, Kanata | Fujitsu Limited |
Ushizaka, Akane | Waseda University |
Hori, Kazuki | Waseda University |
Ogata, Tetsuya | Waseda University |
Keywords: Object Detection, Segmentation and Categorization, AI-Based Methods, Deep Learning for Visual Perception
Abstract: Recently, many attempts have been made to integrate the foundation model with robotics. In most of those attempts, the model recognition results were treated as unique; however, the recognition results required for real robot tasks vary with the task goal. The recognition results of the foundation model are determined from the detection query; therefore, in the case of an ambiguous query, the query must be modified to match the purpose of the robot task. In this study, we propose an object recognition method that considers the task goal through application of an interactive task planning method using a large language model. The proposed method clarifies the purpose of the robot task by asking the user a question. Hence, uncertainty in the task plan due to ambiguous operation instructions is mitigated. During the task plan update arising from the dialog process, the object-detection results obtained from the query in the planning results are also updated to match the task goal. In our experiments, the proposed method's effectiveness is verified quantitatively and qualitatively via object-detection tasks conducted on a custom-built verification dataset.
|
|
13:30-13:35, Paper WeBT9.3 | |
QueryAdapter: Rapid Adaptation of Vision-Language Models in Response to Natural Language Queries |
|
Chapman, Nicolas Harvey | Queensland University of Technology |
Dayoub, Feras | The University of Adelaide |
Browne, Will | Queensland University of Technology |
Lehnert, Christopher | Queensland University of Technology |
Keywords: Object Detection, Segmentation and Categorization, Transfer Learning, Deep Learning for Visual Perception
Abstract: A domain shift exists between the large-scale, internet data used to train a Vision-Language Model (VLM) and the raw image streams collected by a robot. Existing adaptation strategies require the definition of a closed-set of classes, which is impractical for a robot that must respond to diverse natural language queries. In response, we present QueryAdapter; a novel framework for rapidly adapting a pre-trained VLM in response to a natural language query. QueryAdapter leverages unlabelled data collected during previous deployments to align VLM features with semantic classes related to the query. By optimising learnable prompt tokens and actively selecting objects for training, an adapted model can be produced in a matter of minutes. We also explore how objects unrelated to the query should be dealt with when using real-world data for adaptation. In turn, we propose the use of object captions as negative class labels, helping to produce better calibrated confidence scores during adaptation. Extensive experiments on ScanNet++ demonstrate that QueryAdapter significantly enhances object retrieval performance compared to state-of-the-art unsupervised VLM adapters and 3D scene graph methods. Furthermore, the approach exhibits robust generalization to abstract affordance queries and other datasets, such as Ego4D.
|
|
13:35-13:40, Paper WeBT9.4 | |
EFCWM-Mamba-YOLO: Real-Time Underwater Object Detection with Adaptive Feature Representation and Domain Adaptation |
|
Sun, Pan | Shenzhen Technology University |
Lu, Yu | Shenzhen Technology University |
Shi, Shijie | Shenzhen Technology University |
Li, Meng | Shenzhen Technology University |
Li, Qiang | Shenzhen Technology University |
Ge, Huilin | Jiangsu University of Science and Technology |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning Methods, AI-Based Methods
Abstract: Underwater object detection (UOD) is crucial for monitoring marine ecosystems, underwater robotics, environmental protection, and autonomous underwater vehicles (AUVs). Despite progress, many models struggle under real-world conditions due to poor visibility, dynamic lighting, and domain shifts. Traditional methods like Faster R-CNN are computationally expensive, while YOLO-based models suffer in challenging underwater scenarios. The scarcity of large-scale annotated datasets further limits model generalization. To address these challenges, we introduce UOD-SZTU-2025, a new dataset of 3,133 high-quality underwater images, sourced primarily from video platforms. The dataset is used in EFCWM (Enhanced Feature Correction and Weighting Module) to extract and refine a feature material library for detection targets. We propose EFCWM-Mamba-YOLO, a lightweight, real-time detection model designed to enhance feature representation and adapt to diverse underwater environments. The EFCWM module incorporates domain adaptation for improved robustness. Additionally, a two-stage training strategy first trains on a source domain and fine-tunes with limited target domain samples to enhance generalization. Experiments show our approach surpasses existing lightweight UOD models in accuracy, real-time performance, and robustness. Our dataset, model, and benchmark establish a strong foundation for future UOD research. The dataset for EFCWM-Mamba-YOLO is available at https://github.com/sunpan/UOD-SZTU-2025.
|
|
13:40-13:45, Paper WeBT9.5 | |
BEVPointNet3D: Fusing Bird's Eye View and Point Cloud Features for Robust 3D Lane Detection |
|
Yuan, Xia | Nanjing University of Science and Technology |
Zhai, Yanrui | Nanjing University of Science and Technology |
Jing, Zihui | Nanjing University of Science and Technology |
Keywords: Object Detection, Segmentation and Categorization, Autonomous Vehicle Navigation, Computer Vision for Transportation
Abstract: This paper introduces BEVPointNet3D, an innovative 3D lane detection model that effectively integrates Bird's Eye View (BEV) and point cloud features. The proposed approach addresses the inherent limitations of conventional methods that predominantly rely on the flat-ground assumption. BEVPointNet3D incorporates a 2D encoder to extract preliminary lane information from BEV images. For 3D feature extraction, the model employs a hierarchical local-to-global processing scheme to capture the geometric characteristics of LiDAR point clouds. A novel cross-attention mechanism is implemented to precisely align and integrate the 2D and 3D feature representations. This architectural design not only improves detection accuracy but also strengthens the adaptability and performance of the model in complex driving scenarios. Comprehensive evaluations on the K-Lane and CampusLane datasets demonstrate the superior performance of BEVPointNet3D. Notably, the model exhibits exceptional capability in accurately estimating lane spatial positions on steep inclines, thereby providing reliable support for autonomous driving systems in challenging terrain conditions.
|
|
13:45-13:50, Paper WeBT9.6 | |
Confidence-Aware Paced-Curriculum Learning by Label Smoothing for Surgical Scene Understanding (I) |
|
Xu, Mengya | National University of Singapore |
Islam, Mobarakol | University College London |
Glocker, Ben | Imperial College London |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: Object Detection, Segmentation and Categorization, Surgical Robotics: Laparoscopy, Deep Learning for Visual Perception
Abstract: Curriculum learning and self-paced learning are training strategies that gradually feed samples from easy to more complex. They have attracted increasing attention due to their excellent performance in robotic vision. Most recent works focus on designing curricula based on difficulty levels in input samples or on smoothing the feature maps. However, smoothing labels to control the learning utility in a curriculum manner is still unexplored. In this work, we design a paced curriculum by label smoothing (P-CBLS) using paced learning with uniform label smoothing (ULS) for classification tasks, and fuse uniform and spatially varying label smoothing (SVLS) for semantic segmentation tasks in a curriculum manner. In ULS and SVLS, a bigger smoothing factor enforces a heavier smoothing penalty on the true label and limits the model to learning less information. We therefore design the curriculum by label smoothing (CBLS): we set a large smoothing value at the beginning of training and gradually decrease it to zero to control the model's learning utility from lower to higher. We also design a confidence-aware pacing function and combine it with our CBLS to investigate the benefits of various curricula. The proposed techniques are validated on four robotic surgery datasets covering multi-class classification, multi-label classification, captioning, and segmentation tasks. We also investigate the robustness of our method by corrupting validation data into different severity levels. Our extensive analysis shows that the proposed method improves prediction accuracy and robustness.
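As a rough illustration of the decaying-smoothing curriculum this abstract describes, the following PyTorch sketch anneals a uniform label-smoothing factor to zero over training. The linear decay schedule and constants are assumptions, and the confidence-aware pacing function is omitted; this is not the authors' code:

    import torch
    import torch.nn.functional as F

    def smoothed_ce(logits, labels, eps):
        """Cross-entropy against a uniformly label-smoothed target."""
        n = logits.size(-1)
        log_p = F.log_softmax(logits, dim=-1)
        one_hot = F.one_hot(labels, n).float()
        target = one_hot * (1.0 - eps) + eps / n
        return -(target * log_p).sum(dim=-1).mean()

    eps0, total_epochs = 0.3, 100
    for epoch in range(total_epochs):
        # Heavy smoothing early, none later: learning utility goes low -> high.
        eps = eps0 * max(0.0, 1.0 - epoch / (0.8 * total_epochs))
        logits = torch.randn(32, 10, requires_grad=True)   # stand-in batch
        labels = torch.randint(0, 10, (32,))
        loss = smoothed_ce(logits, labels, eps)
        loss.backward()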
|
|
13:50-13:55, Paper WeBT9.7 | |
Go-SLAM: Grounded Object Segmentation and Localization with Gaussian Splatting SLAM |
|
Pham, Phu | Purdue University |
Patel, Dipam | Purdue University |
Conover, Damon | DEVCOM Army Research Laboratory |
Bera, Aniket | Purdue University |
Keywords: Object Detection, Segmentation and Categorization, Semantic Scene Understanding, RGB-D Perception
Abstract: We introduce Go-SLAM, a novel framework that combines 3D Gaussian Splatting SLAM with grounded object segmentation and open-vocabulary querying to enable object-aware 3D scene reconstruction. Go-SLAM incrementally builds high-fidelity 3D maps from RGB-D inputs while embedding semantic information by assigning unique object identifiers to Gaussian primitives. This integration allows the system to support flexible, natural language queries and accurately localize objects in complex, static environments. To achieve robust semantic mapping, Go-SLAM leverages object detection and segmentation models, enabling consistent object identification across frames without relying on predefined categories. We evaluate Go-SLAM across diverse indoor scenes, demonstrating improvements over existing baselines in both reconstruction quality and object localization accuracy. Our results show that Go-SLAM effectively bridges the gap between geometric mapping and semantic understanding, supporting real-time scene interaction and object retrieval in open-world environments.
|
|
13:55-14:00, Paper WeBT9.8 | |
MovSAM: A Single-Image Moving Object Segmentation Framework Based on Deep Thinking |
|
Nie, Chang | SJTU |
Xu, Yiqing | China University of Mining and Technology |
Wang, Guangming | University of Cambridge |
Liu, Zhe | Shanghai Jiao Tong University |
Miao, Yanzi | China University of Mining and Technology |
Keywords: Object Detection, Segmentation and Categorization, Semantic Scene Understanding, Deep Learning for Visual Perception
Abstract: Moving object segmentation (MOS) plays a vital role in understanding dynamic visual environments. While existing methods rely on multi-frame image sequences to identify moving objects, single-image MOS is critical for applications like motion intention prediction and handling camera frame drops. However, segmenting moving objects from a single image remains challenging for existing methods due to the absence of temporal cues. To address this gap, we propose MovSAM, the first framework for single-image moving object segmentation. MovSAM leverages a Multimodal Large Language Model (MLLM) enhanced with Chain-of-Thought (CoT) prompting to search for the moving object and generate text prompts based on deep thinking for segmentation. These prompts are cross-fused with visual features from the Segment Anything Model (SAM) and a Vision-Language Model (VLM), enabling logic-driven moving object segmentation. The segmentation results then undergo a deep-thinking refinement loop, allowing MovSAM to iteratively improve its understanding of the scene context and inter-object relationships with logical reasoning. This approach enables MovSAM to segment moving objects in single images by considering scene understanding. We implement MovSAM in the real world to validate its practical application and effectiveness for autonomous driving scenarios where multi-frame methods fail. Furthermore, despite the inherent advantage of multi-frame methods in utilizing temporal information, MovSAM achieves state-of-the-art performance across public MOS benchmarks, reaching 92.5% on J&F. Our implementation will be available at https://github.com/IRMVLab/MovSAM.
|
|
WeBT10 |
310 |
Recognition 1 |
Regular Session |
|
13:20-13:25, Paper WeBT10.1 | |
ToSA: Token Merging with Spatial Awareness |
|
Huang, Hsiang-Wei | University of Washington |
Chai, Wenhao | University of Washington |
Chen, Kuang-Ming | University of Washington |
Yang, Cheng-Yen | University of Washington |
Hwang, Jenq-Neng | University of Washington |
Keywords: Recognition, Deep Learning Methods, Computer Vision for Automation
Abstract: Token merging has emerged as an effective strategy to accelerate Vision Transformers (ViT) by reducing computational costs. However, existing methods primarily rely on the visual token's feature similarity for token merging, overlooking the potential of integrating spatial information, which can serve as a reliable criterion for token merging in the early layers of ViT, where the visual tokens only possess weak visual information. In this paper, we propose ToSA, a novel token merging method that combines both semantic and spatial awareness to guide the token merging process. ToSA leverages the depth image as input to generate pseudo spatial tokens, which serve as auxiliary spatial information for the visual token merging process. With the introduced spatial awareness, ToSA achieves a more informed merging strategy that better preserves critical scene structure. Experimental results demonstrate that ToSA outperforms previous token merging methods across multiple benchmarks on visual and embodied question answering while largely reducing the runtime of the ViT, making it an efficient solution for ViT acceleration. The code will be available at: https://github.com/hsiangwei0903/ToSA.
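A loose sketch of similarity-guided token merging with an added spatial term, in the spirit of ToSA. Real ToMe-style methods use bipartite matching and careful score normalization; here we merge only the single most similar token pair per call, and the mixing weight alpha and the raw-scale combination of the two terms are assumptions:

    import torch
    import torch.nn.functional as F

    def merge_once(tokens, coords, alpha=0.5):
        """tokens: (N, D) features; coords: (N, 2) pseudo spatial positions."""
        f = F.normalize(tokens, dim=-1)
        feat_sim = f @ f.T
        spat_sim = -torch.cdist(coords, coords)    # nearer => more similar
        # Note: the two terms live on different scales; a real implementation
        # would normalize them before mixing.
        score = alpha * feat_sim + (1 - alpha) * spat_sim
        score.fill_diagonal_(float("-inf"))
        i, j = divmod(int(score.argmax()), tokens.size(0))
        tokens = tokens.clone()
        tokens[i] = (tokens[i] + tokens[j]) / 2    # average the merged pair
        keep = [k for k in range(tokens.size(0)) if k != j]
        return tokens[keep], coords[keep]

    toks, xy = torch.randn(16, 64), torch.rand(16, 2)
    toks, xy = merge_once(toks, xy)                # 16 -> 15 tokens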
|
|
13:25-13:30, Paper WeBT10.2 | |
G3CN: Gaussian Topology Refinement Gated Graph Convolutional Network for Skeleton-Based Action Recognition |
|
Ren, Haiqing | Institute of Software Chinese Academy of Sciences |
Luo, ZhongKai | Institute of Software Chinese Academy of Sciences |
Fan, Heng | University of North Texas |
Yuan, Xiaohui | University of North Texas |
Wang, Guanchen | Chadwick School |
Zhang, Libo | Iscas |
Keywords: Recognition, Deep Learning Methods, Human and Humanoid Motion Analysis and Synthesis
Abstract: Graph Convolutional Networks (GCNs) have proven highly effective for skeleton-based action recognition, primarily due to their ability to leverage graph topology for feature aggregation, a key factor in extracting meaningful representations. However, despite their success, GCNs often struggle to distinguish between ambiguous actions, revealing limitations in the representation of learned topological and spatial features. To address this challenge, we propose Gaussian Topology Refinement Gated Graph Convolution (G^3CN), a novel approach for distinguishing ambiguous actions in skeleton-based action recognition. G^3CN incorporates a Gaussian filter to refine the skeleton topology graph, improving the representation of ambiguous actions. Additionally, Gated Recurrent Units (GRUs) are integrated into the GCN framework to enhance information propagation between skeleton points. Our method shows strong generalization across various GCN backbones. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA benchmarks demonstrate that G^3CN effectively improves action recognition, particularly for ambiguous samples.
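One plausible reading of "Gaussian filter to refine the skeleton topology graph" is a Gaussian affinity over joint features added to the fixed skeleton adjacency before graph convolution. The sketch below shows that idea only; sigma and the mixing weight are assumptions, and G^3CN's GRU gating is omitted:

    import torch

    def gaussian_refined_adjacency(A, joint_feats, sigma=1.0, lam=0.5):
        """A: (V, V) base skeleton adjacency; joint_feats: (V, D)."""
        d2 = torch.cdist(joint_feats, joint_feats).pow(2)
        G = torch.exp(-d2 / (2 * sigma ** 2))   # Gaussian affinity between joints
        A_ref = A + lam * G
        deg = A_ref.sum(-1, keepdim=True).clamp(min=1e-6)
        return A_ref / deg                      # row-normalize

    V, D = 25, 16                               # e.g., NTU RGB+D skeletons have 25 joints
    A = (torch.rand(V, V) > 0.8).float()
    A = ((A + A.T) > 0).float()                 # symmetrize the toy skeleton
    feats = torch.randn(V, D)
    A_ref = gaussian_refined_adjacency(A, feats)
    out = A_ref @ feats                         # one graph-convolution step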
|
|
13:30-13:35, Paper WeBT10.3 | |
IMVPR: Implicit BEV-Enhanced Multi-View Aggregation for Visual Place Recognition |
|
Cao, Xu | Beijing Institute of Technology |
Zhang, Caibo | Zhejiang University |
Liu, Ziming | Beijing Institute of Technology |
Zhong, Xuchang | Beijing Institute of Technology |
Fang, Hao | Beijing Institute of Technology |
Keywords: Recognition, Localization, Representation Learning
Abstract: Visual Place Recognition (VPR) is essential for robotics and autonomous driving, enabling localization by matching current observations with a database of known places. While monocular VPR methods rely on visual features, they are sensitive to environmental changes, and multimodal approaches using LiDAR or radar incur high costs and complexity. Multi-view camera configurations offer a cost-effective alternative by expanding perception range and providing richer structural information. In this work, we propose IMVPR, an implicit BEV-enhanced multi-view place recognition network that achieves consistent and parallel multi-view feature fusion and place descriptor aggregation. Unlike methods that explicitly construct BEV features, we introduce descriptor queries to implicitly represent 3D spatial locations, facilitating spatial point projection-based fusion. A cross-attention mechanism further enables end-to-end multi-view feature aggregation. We evaluate IMVPR on four scenes from the nuScenes dataset, including both in-domain and out-of-domain scenarios, demonstrating its superior accuracy and generalization compared to state-of-the-art methods, including multimodal approaches. Our results highlight the potential of multi-view vision-based methods as a scalable and robust solution for VPR.
|
|
13:35-13:40, Paper WeBT10.4 | |
TIETracker: A CLIP-Based RGB-T Tracking Via Feature Interaction and Semantic Enhancement |
|
Xia, Weidai | Central South University |
Xingliang, Mao | Hunan University of Technology and Business |
Wu, Wei | Central South University |
Zhu, Chengzhang | Central South University |
Fangfang, Li | Central South University |
Keywords: Recognition, Representation Learning
Abstract: The goal of RGB-T tracking is to enhance accuracy and robustness by leveraging the complementary features of the RGB and TIR modalities in complex scenarios. Previous methods have overlooked the power of semantic features in extracting valuable information from different modalities and improving interactions across them. Moreover, using bounding boxes (BBox) for target initialization can cause issues like bounding box blurring and tracking drift when the target's appearance changes or the target gets occluded. To address these challenges, we propose TIETracker, a CLIP-based RGB-T tracking algorithm that aims to exploit the complementary advantages of multiple modalities more effectively using textual information. Textual descriptions direct the backbone network to learn target representations across modalities and facilitate the interaction of multi-modal features. Additionally, in scenarios of occlusion and scale transformation that lead to missing or altered target features, textual information adaptively supplements the target representation. This approach also improves the response in the image region of the target, addressing issues with bounding box accuracy and tracking drift. Our extensive evaluation on three leading RGB-T tracking benchmarks demonstrates that TIETracker achieves competitive performance compared to state-of-the-art methods, effectively countering feature loss from changes in target appearance and occlusion.
|
|
13:40-13:45, Paper WeBT10.5 | |
A New Structural Relation Extraction Framework for SAR Occluded Target Recognition (I) |
|
Liu, Jiaxiang | Northwestern Polytechnical University |
Liu, Zhunga | Northwestern Polytechnical University |
Wang, Longfei | School of Automation, Northwestern Polytechnical University |
Zhang, Zuowei | Northwestern Polytechnical University |
Keywords: Recognition
Abstract: Partially occluded target recognition is a pressing issue in synthetic aperture radar (SAR) target recognition. Occlusion causes the loss of crucial information, such as target structure details. This paper proposes a new structural relation extraction framework to address partial occlusion, achieved through the tailored design of counterfactual sample synthesis and jigsaw mutual learning (CSS-JML). The attributed scattering center (ASC) model parameters have clear physical meanings, aiding in understanding local structural changes. By integrating SAR and ASC images, effective structural relationship representations are extracted, mitigating occlusion effects. The CSS module is designed to generate occluded counterfactual SAR and ASC images using pairs of target data. No further annotation is required because the generation process is constrained by the recognition task. The JML module employs mutual learning to complete jigsaw puzzle tasks in both modalities, and in this process we design two types of similarity constraints to facilitate the extraction of unified structural information across different modalities. The FA module interacts with recognition features, facilitating the classification and identification of partially occluded targets. Experimental results on MSTAR-based and FUSAR-Ship-based test datasets with three occlusion patterns demonstrate the method's superiority in most occluded conditions, confirming its effectiveness in SAR occluded target recognition.
|
|
13:45-13:50, Paper WeBT10.6 | |
VIOMA: Video-Based Intelligent Ocular Misalignment Assessment (I) |
|
Zheng, Yang | Xidian University |
Fu, Hong | The Education University of Hong Kong |
Li, Ruimin | Xidian University |
Lam, Carly | School of Optometry, the Hong Kong Polytechnic University |
Liang, Jimin | Xidian University |
Guo, Kaitai | Xidian University |
Lo, Wai Lun | Hong Kong Chu Hai College |
Keywords: Health Care Management, Robotics and Automation in Life Sciences, Recognition
Abstract: The measurement of ocular alignment is critical for the diagnosis of strabismus. Current clinical methods for assessing ocular misalignment are subjective and frequently rely on the expertise of practitioners and the extent of patient cooperation. Computer-aided diagnosis methods in recent years have improved the automation and precision of measurement but still fall short of the requirements of clinical practice. In this study, a video-based intelligent ocular misalignment assessment (VIOMA) system is proposed to provide an objective, repeatable, user-friendly, and highly automated alternative for clinical ocular misalignment measurement. Automatic cover tests are performed under a control and motor unit while eye movements are simultaneously tracked using a motion-capture module and assessed through video analysis techniques, determining the presence, type, and magnitude of eye deviation. For system evaluation, an automatic cover-test video dataset for strabismus (StrabismusACT-76) was established, consisting of data from 76 participants. The Bland-Altman plot comparing the results of the VIOMA system and a human expert showed a mean value of 1.26 prism diopters (PD) and a half-width of the 95% limit of agreement of ±7.17 PD. The VIOMA system presented a mean absolute error of 3.04 PD in measuring deviation magnitude, within a 5 PD error tolerance. Additionally, the system's measurements were strongly correlated with those of video labeling, with a mean value of -0.26 PD, a half-width of the 95% limit of agreement of ±3.56 PD, and an average error of 1.31 PD. The experimental results indicate that the proposed method can offer accurate and efficient assessment of ocular misalignment. Note to Practitioners: The motivation behind this work stems from the need for an accurate and efficient system for automated ocular misalignment assessment. The subjectivity of the manual cover test performed by examiners has led to variability in outcomes, and certain existing computerized methods have limitations in terms of automation, measurement accuracy, and applicability in clinical practice. Faced with these challenges, we proposed VIOMA, establishing an apparatus for automatic implementation of cover tests and developing assessment algorithms based on strabismic video analysis. This system can objectively and precisely measure ocular misalignment, offering a promising practical solution for clinical intelligent diagnosis of strabismus. The VIOMA system's potential applications are not limited to strabismus but may extend to other eye-related conditions and beyond.
|
|
13:50-13:55, Paper WeBT10.7 | |
Embodied Escaping: End-To-End Reinforcement Learning for Robot Navigation in Narrow Environment |
|
Zheng, Han | Shanghai Jiao Tong University |
Zhang, Jiale | Shanghai Jiao Tong University |
Jiang, Mingyang | Shanghai Jiao Tong University |
Liu, Peiyuan | Cleanix Robotics Co., Ltd |
Liu, Danni | Cleanix Robotics Co., Ltd., |
Qin, Tong | Shanghai Jiao Tong University |
Yang, Ming | Shanghai Jiao Tong University |
Keywords: Reactive and Sensor-Based Planning, AI-Enabled Robotics, Motion and Path Planning
Abstract: Autonomous navigation is a fundamental task for robot vacuum cleaners in indoor environments. Since their core function is to clean entire areas, robots inevitably encounter dead zones in cluttered and narrow scenarios. Existing planning methods often fail to escape due to complex environmental constraints, high-dimensional search spaces, and high-difficulty maneuvers. To address these challenges, this paper proposes an embodied escaping model that leverages a reinforcement learning-based policy with an efficient action mask for dead-zone escaping. To alleviate the issue of sparse rewards in training, we introduce a hybrid training policy that improves learning efficiency. To handle redundant and ineffective action options, we design a novel action representation that reshapes the discrete action space with a uniform turning radius. Furthermore, we develop an action mask strategy to select valid actions quickly, balancing precision and efficiency. In real-world experiments, our robot is equipped with a Lidar, an IMU, and two wheel encoders. Extensive quantitative and qualitative experiments across varying difficulty levels demonstrate that our robot can consistently escape from challenging dead zones. Moreover, our approach significantly outperforms competing path planning and reinforcement learning methods in terms of success rate and collision avoidance. A video showcasing our methodology and real-world demonstrations is available at https://youtu.be/kBaaYWGhNuE.
|
|
WeBT11 |
311A |
Reinforcement Learning 6 |
Regular Session |
|
13:20-13:25, Paper WeBT11.1 | |
Towards Safe and Efficient Learning in the Wild: Guiding RL with Constrained Uncertainty-Aware Movement Primitives |
|
Padalkar, Abhishek | German Aerospace Center, Institute of Robotics and Mechatronics, |
Stulp, Freek | DLR - Deutsches Zentrum Für Luft Und Raumfahrt E.V |
Neumann, Gerhard | Karlsruhe Institute of Technology |
Silvério, João | German Aerospace Center (DLR) |
Keywords: Reinforcement Learning, Learning from Demonstration, AI-Enabled Robotics
Abstract: Guided Reinforcement Learning (RL) presents an effective approach for robots to acquire skills efficiently, directly in real-world environments. Recent works suggest that incorporating hard constraints into RL can expedite the learning of manipulation tasks, enhance safety, and reduce the complexity of the reward function. In parallel, learning from demonstration (LfD) using movement primitives is a well-established method for initializing RL policies. In this paper, we propose a constrained, uncertainty-aware movement primitive representation that leverages both demonstrations and hard constraints to guide RL. By incorporating hard constraints, our approach aims to facilitate safer and sample-efficient learning, as the robot need not violate these constraints during the learning process. At the same time, demonstrations are employed to offer a baseline policy that supports exploration. Our method improves state-of-the-art techniques by introducing a projector that enables state-dependent noise derived from demonstrations while ensuring that the constraints are respected throughout training. Collectively, these elements contribute to safe and efficient learning alongside streamlined reward function design. We validate our framework through an insertion task involving a torque-controlled, 7-DoF robotic manipulator.
|
|
13:25-13:30, Paper WeBT11.2 | |
ReDBN: An Interpretable Deep Belief Network for Fan Fault Diagnosis in Iron and Steel Production Lines (I) |
|
Liao, Xiaoqiang | Shanghai Jiao Tong University |
Wang, Dong | Shanghai Jiaotong University |
Ming, Xinguo | Shanghai Jiao Tong University |
Qiu, Siqi | Shanghai Jiao Tong University |
Keywords: Representation Learning, Probabilistic Inference, Deep Learning Methods
Abstract: Fan fault diagnosis in steelmaking production lines is critical for production safety and environmental protection. Deep neural networks (DNNs) have achieved some success in fan fault diagnosis, but these models are unable to provide explanations for diagnostic decisions due to the DNN's opaque structure. From the perspective of neural-symbolic integration, researchers have gradually paid attention to extracting relational knowledge from DNNs to provide an explainable representation of a DNN's feature learning and reasoning. This study introduces a neural-symbolic model, the reverse deep belief network (ReDBN), in which interpretable logic representations (CR-rules) are derived from the integration of confidence and rough rules so as to tackle uncertain fan diagnosis decision-making. To extract confidence rules, a k-logic restricted Boltzmann machine (k-LRBM) is deployed and evaluated. In k-LRBMs, confidence rules can be extracted by considering the effect of k different literal clusters on a neuron's activation. Besides, this article introduces a symbolic language, termed rough rules, which can handle uncertain reasoning during fan fault diagnosis. Rough rules, assigning a belief value to attribute variables, can represent the probability of how likely a sample belongs to specific fault labels. Verified on two fan datasets from a fan testbed and a real production site in Shanghai, ReDBN achieves better performance than other typical models.
|
|
13:30-13:35, Paper WeBT11.3 | |
M3PO: Massively Multi-Task Model-Based Policy Optimization |
|
Narendra, Aditya | Moscow Institute of Physics and Technology |
Makarov, Dmitry | Federal Research Center “Computer Science and Control” of Russia |
Panov, Aleksandr | AIRI |
Keywords: Reinforcement Learning, Deep Learning Methods, Machine Learning for Robot Control
Abstract: We introduce Massively Multi-Task Model-Based Policy Optimization (M3PO), a scalable model-based reinforcement learning (MBRL) framework designed to address the challenges of sample efficiency in single-task settings and generalization in multi-task domains. Existing model-based approaches like DreamerV3 rely on generative world models that prioritize pixel-level reconstruction, often at the cost of control-centric representations, while model-free methods such as PPO suffer from high sample complexity and limited exploration. M3PO integrates an implicit world model, trained to predict task outcomes without reconstructing observations, with a hybrid exploration strategy that combines model-based planning and model-free uncertainty-driven bonuses. This approach eliminates the bias-variance tradeoff inherent in prior methods (e.g., POME’s exploration bonuses) by using the discrepancy between model-based and model-free value estimates to guide exploration while maintaining stable policy updates via a trust-region optimizer. M3PO is introduced as an advanced alternative to existing model-based policy optimization methods.
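The exploration signal described here, using the gap between model-based and model-free value estimates as an uncertainty bonus, reduces to a few lines. A hedged sketch (the bonus form and coefficient are assumptions, not M3PO's exact formulation):

    import torch

    def value_discrepancy_bonus(v_model_based, v_model_free, beta=0.1):
        """Both inputs: (batch,) value estimates for the same states.
        A large gap suggests the world model is unsure there, so explore."""
        return beta * (v_model_based - v_model_free).abs()

    rewards = torch.randn(32)
    bonus = value_discrepancy_bonus(torch.randn(32), torch.randn(32))
    shaped = rewards + bonus   # fed into the trust-region policy update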
|
|
13:35-13:40, Paper WeBT11.4 | |
Mastering the Labyrinth Game: Efficient Multimodal Reinforcement Learning with Selective Reconstruction |
|
Bi, Thomas | ETH Zurich |
Marot, Ethan | ETH ZÜRICH |
Ramachandran Venkatapathy, Aswin Karthik | ETH Zürich |
D'Andrea, Raffaello | ETHZ |
Keywords: Reinforcement Learning, Representation Learning, Visual Learning
Abstract: In previous work, model-based reinforcement learning was applied to a real-world labyrinth game to demonstrate sample-efficient learning using world models. In this paper, we further enhance sample efficiency and autonomy by introducing selective reconstruction: instead of reconstructing the full visual observation, our approach reconstructs only the low-dimensional physical state signals (e.g., marble position and plate inclination), while still leveraging the complete visual input for decision-making. This targeted reconstruction focuses the world model on learning dynamics-relevant information, thereby reducing computational overhead and model complexity. Additionally, we incorporate prioritized experience replay to accelerate learning in newly explored regions of the maze and implement an autonomous marble reloader to eliminate manual resets. Together, these enhancements reduce the required collected experience from 5 hours to 1.5 hours while achieving comparable performance, and enable fully autonomous learning without human supervision.
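Selective reconstruction, as described, means the encoder still sees the full image but the decoder head regresses only the low-dimensional physical state. A minimal PyTorch sketch under assumed shapes (a 4-D state for marble position plus plate inclination); not the authors' architecture:

    import torch
    import torch.nn as nn

    class SelectiveWorldModel(nn.Module):
        def __init__(self, state_dim=4, latent_dim=32):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 16, 4, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, latent_dim))
            self.state_head = nn.Linear(latent_dim, state_dim)  # low-dim target only

        def forward(self, image):
            z = self.encoder(image)
            return z, self.state_head(z)

    model = SelectiveWorldModel()
    img = torch.randn(8, 3, 64, 64)
    true_state = torch.randn(8, 4)    # e.g., marble (x, y) + plate tilt (a, b)
    z, pred_state = model(img)
    # No pixel reconstruction term: the world model only explains the dynamics-
    # relevant signals, which is what cuts model complexity.
    loss = nn.functional.mse_loss(pred_state, true_state)
    loss.backward()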
|
|
13:40-13:45, Paper WeBT11.5 | |
Decentralized but Not Compromised: Modular Architecture with Refined Observation for Multi-Agent Model-Based Reinforcement Learning |
|
Wang, Shuqi | Zhejiang University |
Liu, Meiqin | Zhejiang University |
Zheng, Ronghao | Zhejiang University |
Dong, Shanling | Zhejiang University |
Wei, Ping | Xi'an Jiaotong University |
Keywords: Reinforcement Learning, Multi-Robot Systems, Distributed Robot Systems
Abstract: Multi-agent adversarial tasks, such as swarm robotics and autonomous vehicle coordination, demand efficient decentralized collaboration under partial observability. While model-free multi-agent RL (MF-MARL) methods suffer from the need for extensive environment interactions, most existing multi-agent model-based RL (MA-MBRL) methods fail to align with the Centralized Training with Decentralized Execution (CTDE) paradigm, which limits system flexibility. This paper proposes a novel modular architecture with refined observations (MARO) to achieve the CTDE paradigm by decoupling agents from the world model. Key innovations include: 1) an enhanced world model with weighted loss and history-augmented rollout for high-quality data generation; 2) a dual-stream semantic decomposition network (DSDN) that performs fine-grained decomposition of observations to refine action mapping and mitigate performance degradation from information loss. Extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) demonstrate superior performance over opponents, validating the effectiveness and advancement of MARO.
|
|
13:45-13:50, Paper WeBT11.6 | |
Leveraging Temporally Extended Behavior Sharing for Multi-Task Reinforcement Learning |
|
Lee, Gawon | Seoul National University |
Cho, Daesol | Seoul National University |
Kim, H. Jin | Seoul National University |
Keywords: Reinforcement Learning, Machine Learning for Robot Control, Deep Learning in Grasping and Manipulation
Abstract: Multi-task reinforcement learning (MTRL) offers a promising approach to improve sample efficiency and generalization by training agents across multiple tasks, enabling knowledge sharing between them. However, applying MTRL to robotics remains challenging due to the high cost of collecting diverse task data. To address this, we propose MT-Lévy, a novel exploration strategy that enhances sample efficiency in MTRL environments by combining behavior sharing across tasks with temporally extended exploration inspired by Lévy flight. MT-Lévy leverages policies trained on related tasks to guide exploration towards key states, while dynamically adjusting exploration levels based on task success ratios. This approach enables more efficient state-space coverage, even in complex robotics environments. Empirical results demonstrate that MT-Lévy significantly improves exploration and sample efficiency, supported by quantitative and qualitative analyses. Ablation studies further highlight the contribution of each component, showing that combining behavior sharing with adaptive exploration strategies can significantly improve the practicality of MTRL in robotics applications.
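Lévy-flight-style exploration means holding a random action for a heavy-tailed duration, so occasional long "flights" reach distant states. A small self-contained sketch; the Pareto parameters are assumptions, and the paper's success-ratio gating and cross-task behavior sharing are not shown:

    import random

    def levy_hold_duration(alpha=1.5, max_hold=50):
        """Pareto-distributed action-repeat length via inverse-CDF sampling."""
        u = 1.0 - random.random()        # u in (0, 1], avoids division by zero
        return min(max_hold, max(1, int(u ** (-1.0 / alpha))))

    def explore(env_step, sample_action, n_steps=1000):
        t = 0
        while t < n_steps:
            action, hold = sample_action(), levy_hold_duration()
            for _ in range(hold):        # hold the action for the sampled span
                env_step(action)
                t += 1
                if t >= n_steps:
                    break

    # Toy usage with a stand-in environment and action sampler.
    explore(env_step=lambda a: None, sample_action=lambda: random.choice([0, 1, 2, 3]))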
|
|
13:50-13:55, Paper WeBT11.7 | |
RecoveryChaining: Learning Local Recovery Policies for Robust Manipulation |
|
Vats, Shivam | Brown University |
Jha, Devesh | Mitsubishi Electric Research Laboratories |
Likhachev, Maxim | Carnegie Mellon University |
Kroemer, Oliver | Carnegie Mellon University |
Romeres, Diego | Mitsubishi Electric Research Laboratories |
Keywords: Reinforcement Learning, Failure Detection and Recovery, Manipulation Planning
Abstract: Model-based planners and controllers are commonly used to solve complex manipulation problems as they can efficiently optimize diverse objectives and generalize to long horizon tasks. However, they often fail during deployment due to noisy actuation, partial observability and imperfect models. To enable a robot to recover from such failures, we propose to use hierarchical reinforcement learning to learn a recovery policy. The recovery policy is triggered when a failure is detected based on sensory observations and seeks to take the robot to a state from which it can complete the task using the nominal model-based controllers. Our approach, called RecoveryChaining, uses a hybrid action space, where the model-based controllers are provided as additional nominal options which allows the recovery policy to decide how to recover, when to switch to a nominal controller and which controller to switch to, even with sparse rewards. We evaluate our approach in three multi-step manipulation tasks with sparse rewards, where it learns significantly more robust recovery policies than those learned by baselines. We successfully transfer recovery policies learned in simulation to a physical robot to demonstrate the feasibility of sim-to-real transfer with our method.
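A hybrid action space of this kind mixes low-level motions with "options" that hand control to a nominal model-based controller. The sketch below is purely illustrative; all names and the placeholder dynamics are assumptions, not the authors' API:

    import random

    NOMINAL_CONTROLLERS = ["grasp_controller", "place_controller"]
    PRIMITIVE_DELTAS = [(0.01, 0.0), (-0.01, 0.0), (0.0, 0.01), (0.0, -0.01)]
    ACTION_SPACE = PRIMITIVE_DELTAS + NOMINAL_CONTROLLERS

    def recovery_step(choice, state, run_nominal, apply_delta):
        if isinstance(choice, str):            # a nominal option was selected:
            return run_nominal(choice, state)  # the model-based controller runs
        return apply_delta(state, choice)      # otherwise take a small motion step

    # Toy usage with placeholder dynamics.
    state = (0.0, 0.0)
    run_nominal = lambda name, s: s            # stand-in nominal-controller rollout
    apply_delta = lambda s, d: (s[0] + d[0], s[1] + d[1])
    state = recovery_step(random.choice(ACTION_SPACE), state, run_nominal, apply_delta)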
|
|
13:55-14:00, Paper WeBT11.8 | |
MM-Geo: Multi-Scale and Multi-Positive UAV-View Geo-Localization |
|
Ai, Pan | Meituan |
Zhang, Xichen | Northeastern University |
Cheng, Senmao | Meituan |
Huang, Penghui | Meituan |
Liu, Jiacheng | Tsinghua University |
Zhai, Fengguang | Meituan |
Mao, Yinian | Meituan-Dianping Group |
Huang, Guoquan (Paul) | University of Delaware |
Keywords: Representation Learning, Localization
Abstract: UAV-view geo-localization is crucial in many applications, such as material transportation and security inspection, particularly in GPS-denied urban environments. However, most existing methods assume a known drone flight altitude and divide satellite maps into tiles that approximate the scale of drone images, which is often inapplicable to real-world UAV scenarios where flight altitudes vary. In this paper, we propose a novel UAV-view geo-localization method, termed MM-Geo, to address this issue. In particular, we partition the satellite imagery map into tiles of uniform size and retrieve the matching tiles in real time using online drone images of smaller field of view (FOV) at different altitudes. To address the multi-scale problem due to varying altitudes, we design a patch-vote reranking with match attention, and to tackle the multi-positive sample issue arising in this continuous setting, a normalized InfoNCE loss is incorporated to provide finer supervision during contrastive learning. The proposed MM-Geo is extensively validated on our own large-scale urban dataset MT-UAV as well as the public dataset UAV-VisLoc, outperforming state-of-the-art (SOTA) approaches and achieving remarkable performance in practical drone delivery operations. To benefit the community, we will release the VisLoc-related code at: https://github.com/MM-Geo-2025/MM-Geo.
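When several satellite tiles legitimately match one drone query, a multi-positive InfoNCE averages the log-likelihood over all positives instead of assuming a single one. A common variant is sketched below in PyTorch; MM-Geo's exact normalization may differ, and the temperature is an assumption:

    import torch
    import torch.nn.functional as F

    def multi_positive_infonce(q, tiles, pos_mask, tau=0.07):
        """q: (D,) query; tiles: (N, D); pos_mask: (N,) bool with >=1 True entry."""
        logits = F.normalize(tiles, dim=-1) @ F.normalize(q, dim=-1) / tau
        log_p = F.log_softmax(logits, dim=-1)
        return -(log_p[pos_mask]).mean()   # normalized over the positive set

    q = torch.randn(128)
    tiles = torch.randn(100, 128)
    mask = torch.zeros(100, dtype=torch.bool)
    mask[[3, 17]] = True                   # two overlapping tiles are positives
    loss = multi_positive_infonce(q, tiles, mask)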
|
|
WeBT12 |
311B |
Robotic Imitation Learning 2 |
Regular Session |
|
13:20-13:25, Paper WeBT12.1 | |
SIME: Enhancing Policy Self-Improvement with Modal-Level Exploration |
|
Jin, Yang | Shanghai Jiao Tong University |
Lv, Jun | Shanghai Jiao Tong University |
Yu, Wenye | Shanghai Jiao Tong University |
Fang, Hongjie | Shanghai Jiao Tong University |
Li, Yong-Lu | Shanghai Jiao Tong University |
Lu, Cewu | Shanghai Jiao Tong University |
Keywords: Imitation Learning, Learning from Experience, Deep Learning in Grasping and Manipulation
Abstract: Self-improvement requires robotic systems to initially learn from human-provided data and then gradually enhance their capabilities through interaction with the environment. This is similar to how humans improve their skills through continuous practice. However, achieving effective self-improvement is challenging, primarily because robots tend to repeat their existing abilities during interactions, often failing to generate new, valuable data for learning. In this paper, we identify the key to successful self-improvement: modal-level exploration and data selection. By incorporating a modal-level exploration mechanism during policy execution, the robot can produce more diverse and multi-modal interactions. At the same time, we select the most valuable trials and high-quality segments from these interactions for learning. We successfully demonstrate effective robot self-improvement on both simulation benchmarks and real-world experiments. The capability for self-improvement will enable us to develop more robust and high-success-rate robotic control strategies at a lower cost. Our code and experiment scripts are available at https://ericjin2002.github.io/SIME.
|
|
13:25-13:30, Paper WeBT12.2 | |
Knowledge-Driven Imitation Learning: Enabling Generalization across Diverse Conditions |
|
Miao, Zhuochen | Shanghai Jiao Tong University |
Lv, Jun | Shanghai Jiao Tong University |
Fang, Hongjie | Shanghai Jiao Tong University |
Jin, Yang | Shanghai Jiao Tong University |
Lu, Cewu | Shanghai Jiao Tong University |
Keywords: Imitation Learning, Learning from Demonstration, Deep Learning in Grasping and Manipulation
Abstract: Imitation learning has emerged as a powerful paradigm in robot manipulation, yet its generalization capability remains constrained by object-specific dependencies in limited expert demonstrations. To address this challenge, we propose knowledge-driven imitation learning, a framework that leverages external structural semantic knowledge to abstract object representations within the same category. We introduce a novel semantic keypoint graph as a knowledge template and develop a coarse-to-fine template-matching algorithm that optimizes both structural consistency and semantic similarity. Evaluated on three real-world robotic manipulation tasks, our method achieves superior performance, surpassing image-based diffusion policies with only one-quarter of the expert demonstrations. Extensive experiments further demonstrate its robustness across novel objects, backgrounds, and lighting conditions. This work pioneers a knowledge-driven approach to data-efficient robotic learning in real-world settings. Code and more materials are available on knowledge-driven.github.io.
|
|
13:30-13:35, Paper WeBT12.3 | |
CLAP: A Closed-Loop Diffusion Transformer Action Foundation Model for Robotic Manipulation |
|
Li, Mu | South China University of Technology |
Dong, Yubo | South China University of Technology |
Zhou, Yang | The Chinese University of Hong Kong |
Yang, Chenguang | University of Liverpool |
Keywords: Imitation Learning, AI-Enabled Robotics, Manipulation Planning
Abstract: The development of large Vision-Language-Action (VLA) models has enhanced robots' ability to manipulate objects in unseen scenarios based on language instructions. While existing VLAs have demonstrated promise in various scenarios, they still struggle with effective multi-modal feature extraction and lack a closed-loop inference framework. In this paper, we propose an advanced VLA model. Unlike previous works that repurpose a VLM for action prediction using simple action quantization, we componentize the VLA architecture with a specialized action module conditioned on the model output and a critic module for inference. We demonstrate the performance improvement of diffusion action transformers in modeling continuous temporal actions, with the critic module applied during inference to form a closed-loop model. Extensive experiments on real robots demonstrate that our model significantly outperforms existing methods, with the ability to handle complex, high-precision tasks and generalize to unseen objects and backgrounds.
|
|
13:35-13:40, Paper WeBT12.4 | |
Out-Of-Distribution Recovery with Object-Centric Keypoint Inverse Policy for Visuomotor Imitation Learning |
|
Gao, Jiayuan | University of Pennsylvania |
Li, Tianyu | University of Pennsylvania |
Figueroa, Nadia | University of Pennsylvania |
Keywords: Imitation Learning, Failure Detection and Recovery, Continual Learning
Abstract: We propose an object-centric recovery (OCR) framework to address the challenges of out-of-distribution (OOD) scenarios in visuomotor policy learning. Previous behavior cloning (BC) methods rely heavily on large amounts of labeled data for coverage, failing in unfamiliar spatial states. Without relying on extra data collection, our approach learns a recovery policy constructed by an inverse policy inferred from the object keypoint manifold gradient in the original training data. The recovery policy serves as a simple add-on to any base visuomotor BC policy, agnostic to a specific method, guiding the system back towards the training distribution to ensure task success even in OOD situations. We demonstrate the effectiveness of our object-centric framework in both simulation and real robot experiments, achieving an improvement of 77.7% over the base policy in OOD scenarios. Furthermore, we show OCR's capacity to autonomously collect demonstrations for continual learning. Overall, we believe this framework represents a step toward improving the robustness of visuomotor policies in real-world settings. Project Website: https://sites.google.com/view/ocr-penn
|
|
13:40-13:45, Paper WeBT12.5 | |
Let Me Show You: Learning by Retrieving from Egocentric Video for Robotic Manipulation |
|
Zhu, Yichen | Midea Group |
Feng, Feifei | Midea Group |
Keywords: Imitation Learning, Learning from Demonstration
Abstract: Robots operating in complex and uncertain environments face considerable challenges. Advanced robotic systems often rely on extensive datasets to learn manipulation tasks. In contrast, when humans are faced with unfamiliar tasks, such as assembling a chair, a common approach is to learn by watching video demonstrations. In this paper, we propose a novel method for learning robot policies by Retrieving-from-Video (RfV), using analogies from human demonstrations to address manipulation tasks. Our system constructs a video bank comprising recordings of humans performing diverse daily tasks. To enrich the knowledge from these videos, we extract mid-level information, such as object affordance masks and hand motion trajectories, which serve as additional inputs to enhance the robot model's learning and generalization capabilities. We further feature a dual-component system: a video retriever that taps into an external video bank to fetch task-relevant video based on task specification, and a policy generator that integrates this retrieved knowledge into the learning cycle. This approach enables robots to craft adaptive responses to various scenarios and generalize to tasks beyond those in the training data. Through rigorous testing in multiple simulated and real-world settings, our system demonstrates a marked improvement in performance over conventional robotic systems, showcasing a significant breakthrough in the field of robotics.
|
|
13:45-13:50, Paper WeBT12.6 | |
A Simple Approach to Constraint-Aware Imitation Learning with Application to Autonomous Racing |
|
Cao, Shengfan | UC Berkeley |
Joa, Eunhyek | Zoox |
Borrelli, Francesco | University of California, Berkeley |
Keywords: Imitation Learning, Optimization and Optimal Control, Autonomous Vehicle Navigation
Abstract: Guaranteeing constraint satisfaction is challenging in imitation learning (IL), particularly in tasks that require operating near a system's handling limits. Traditional IL methods, such as Behavior Cloning (BC), often struggle to enforce constraints, leading to suboptimal performance in high-precision tasks. In this paper, we present a simple approach to incorporating safety into the IL objective. Through simulations, we empirically validate our approach on an autonomous racing task with both full-state and image feedback, demonstrating improved constraint satisfaction and greater consistency in task performance compared to BC.
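One simple way to fold safety into the BC objective, in the spirit of what this paper proposes, is to add a penalty on predicted actions whose resulting states violate the constraint. The penalty form, weight, and the assumed state layout (lateral offset in column 1) below are illustrative assumptions, not the paper's exact formulation:

    import torch

    def constraint_aware_bc_loss(pred_actions, expert_actions, next_states,
                                 track_halfwidth=1.0, w=10.0):
        bc = (pred_actions - expert_actions).pow(2).mean()
        lateral = next_states[:, 1]                      # assumed lateral offset
        violation = (lateral.abs() - track_halfwidth).clamp(min=0.0)
        return bc + w * violation.pow(2).mean()          # penalize leaving the track

    loss = constraint_aware_bc_loss(
        torch.randn(16, 2), torch.randn(16, 2), torch.randn(16, 4))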
|
|
13:50-13:55, Paper WeBT12.7 | |
CDP: Constrained Diffusion Policies with Mirror Diffusion Model for Safety-Assured Imitation Learning |
|
Ha, Taeoh | Samsung Electronics AI Center |
Cha, Hyunsoo | Samsung Electronics |
Ji, Daehyun | Samsung Advanced Institute of Technology |
Keywords: Imitation Learning, Autonomous Vehicle Navigation, Collision Avoidance
Abstract: This paper presents a novel imitation learning framework, called constrained diffusion policy (CDP). The primary objective of CDP is to ensure that learned policies strictly adhere to safety constraints while imitating expert demonstrations. To achieve this, we define a polytopic constraint that represents the safe boundary of the obstacle-free region. We introduce a novel mirror map and its inverse function to incorporate a generalized polytopic constraint manifold into the mirror diffusion model. By mapping sampled data onto a constrained manifold, the mirror diffusion model generates actions that satisfy safety constraints. This approach successfully addresses the safety issues commonly encountered in conventional imitation learning models. We apply the proposed framework to mobile navigation tasks in robotics, using the Isaac Gym simulator and the Unitree Go2 quadrupedal robot. Experimental results demonstrate that the proposed framework can successfully train policies that imitate expert behaviors while strictly maintaining safety constraints, thereby achieving safety-assured imitation learning.
|
|
13:55-14:00, Paper WeBT12.8 | |
Elastic Motion Policy: An Adaptive Dynamical System for Robust and Efficient One-Shot Imitation Learning |
|
Li, Tianyu | University of Pennsylvania |
Sun, Sunan | University of Pennsylvania |
Aditya, Shubhodeep Shiv | University of Pennsylvania |
Figueroa, Nadia | University of Pennsylvania |
Keywords: Learning from Demonstration, Imitation Learning, Machine Learning for Robot Control
Abstract: Behavior cloning (BC) has become a staple imitation learning paradigm in robotics due to its ease of teaching robots complex skills directly from expert demonstrations. However, BC suffers from an inherent generalization issue. To solve this, the status quo solution is to gather more data. Yet, regardless of how much training data is available, out-of-distribution performance is still sub-par, lacks any formal guarantee of convergence and success, and is incapable of allowing and recovering from physical interactions with humans. These are critical flaws when robots are deployed in ever-changing human-centric environments. Thus, we propose Elastic Motion Policy (EMP), a one-shot imitation learning framework that allows robots to adjust their behavior based on the scene change while respecting the task specification. Trained from a single demonstration, EMP follows the dynamical systems paradigm where motion planning and control are governed by first-order differential equations with convergence guarantees. We leverage Laplacian editing in full end-effector space, ℝ³ × SO(3), and online convex learning of Lyapunov functions, to adapt EMP online to new contexts, avoiding the need to collect new demonstrations. We extensively validate our framework in real robot experiments, demonstrating its robust and efficient performance in dynamic environments, with obstacle avoidance and multi-step task capabilities. Project Website: https://elastic-motion-policy.github.io/EMP/
|
|
WeBT13 |
311C |
Deep Learning for Visual Perception 6 |
Regular Session |
|
13:20-13:25, Paper WeBT13.1 | |
RainforestDepth: Monocular Depth Estimation Targeting Rainforest Environments |
|
Tangellapalli, Srisai Anirudh | University of Nebraska-Lincoln |
Peschel, Joshua | Iowa State University |
Duncan, Brittany | University of Nebraska, Lincoln |
Keywords: Deep Learning for Visual Perception, Vision-Based Navigation, Aerial Systems: Perception and Autonomy
Abstract: The primary objective of this paper is to introduce a new monocular depth estimation (MDE) model targeting under-represented environments using a novel dataset combining synthetic and real images. The proposed model is small and fast to allow use for UAS navigation and data collection in rainforest environments. Prior works on MDEs target outdoor environments while focusing on urban, ground-level viewpoints due to interest in self-driving or autonomous package delivery applications and data availability. However, under-represented environments, such as rainforests, can benefit from targeted, environment-specific MDEs because existing general MDEs cannot adapt to extreme environmental differences, leading to high error rates. Our model is trained using a distinct rainforest dataset that combines images generated using a synthetic dataset pipeline and depth images collected from aerial robot deployments in the Children's Eternal Rainforest in Costa Rica. The proposed model will allow for improved rainforest navigation without using expensive LIDAR sensors and can improve the navigation of a UAS in rainforest environments by providing more accurate and useful measurements for object manipulation, such as leaf sampling. Our model outperforms MiDaS across the board and has over a 75% improvement, specifically in the relative error metrics, while maintaining a low runtime. The resulting model matches the performance of state-of-the-art monocular depth estimation models designed for common environments, i.e., urban and indoor environments, and outperforms them when used in a rainforest environment.
|
|
13:25-13:30, Paper WeBT13.2 | |
RDN: An Efficient Denoising Network for 4D Radar Point Clouds |
|
Huang, Ningyuan | Northeastern University |
Li, Zhiheng | Northeastern University |
Pang, Chenglin | Northeastern University |
Fang, Zheng | Northeastern University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, SLAM
Abstract: Accurate point cloud information is important for robot perception and autonomous driving. Although advanced 4D radar can provide point clouds with higher resolution than 3D radar, its data still contain a significant amount of noise due to the measurement principle. To solve this issue, we propose RDN (Radar Denoising Network), a denoising network specifically designed for 4D radar. RDN includes three innovative modules. First, to overcome the noisy nature of radar points, we design a feature similarity-based farthest point sampling module (FS-FPS), which can extract representative sampling points from the noisy point cloud. Second, to address feature propagation issues caused by the sparse and long-range characteristics of 4D radar points, we introduce a virtual feature point prediction (VFP) module and an iterative upsampling (IUS) module. The VFP module generates virtual feature points through the network to serve as bridges for information transmission, while the IUS module uses an iterative approach to gradually refine feature propagation. Experiments on the MSC-RAD4D and NTU4DRadLM datasets demonstrate the effectiveness and generalization of our method. Besides, odometry experiments prove the practical value of point cloud denoising in improving robot perception.
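FS-FPS, as we read it, runs farthest point sampling but measures "distance" at least partly in feature space, so representative points survive even when noisy returns cluster spatially. A sketch below blends Euclidean and feature distance with a weight w; that particular combination is an assumption, not the paper's exact rule:

    import torch

    def fs_fps(points, feats, k, w=0.5):
        """points: (N, 3), feats: (N, D); returns indices of k sampled points."""
        N = points.size(0)
        dist = torch.full((N,), float("inf"))
        idx = [0]                                  # seed with an arbitrary point
        for _ in range(k - 1):
            last = idx[-1]
            d_xyz = (points - points[last]).norm(dim=-1)
            d_feat = (feats - feats[last]).norm(dim=-1)
            # Keep, per point, its distance to the nearest already-selected sample.
            dist = torch.minimum(dist, w * d_xyz + (1 - w) * d_feat)
            idx.append(int(dist.argmax()))         # pick the farthest remaining point
        return torch.tensor(idx)

    pts, f = torch.randn(256, 3), torch.randn(256, 16)
    sampled = fs_fps(pts, f, k=64)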
|
|
13:30-13:35, Paper WeBT13.3 | |
Mitigating Hallucinations in YOLO-Based Object Detection Models: A Revisit to Out-Of-Distribution Detection |
|
He, Weicheng | Université Grenoble Alpes |
Wu, Changshun | Université Grenoble Alpes |
Cheng, Chih-Hong | Chalmers University of Technology |
Huang, Xiaowei | University of Liverpool |
Bensalem, Saddek | University Grenoble |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization, Robot Safety
Abstract: Object detection systems must reliably perceive objects of interest without being overly confident to ensure safe decision-making in dynamic environments. Filtering techniques based on out-of-distribution (OoD) detection are commonly added as an extra safeguard to filter hallucinations caused by overconfidence in novel objects. Nevertheless, evaluating YOLO-family detectors and their filters under existing OoD benchmarks often leads to unsatisfactory performance. This paper studies the underlying reasons for performance bottlenecks and proposes a methodology to improve performance fundamentally. Our first contribution is a calibration of all existing evaluation results: although images in existing OoD benchmark datasets are claimed not to contain objects within in-distribution (ID) classes (i.e., categories defined in the training dataset), around 13% of objects detected by the object detector are actually ID objects. Dually, an ID dataset containing OoD objects can also negatively impact the decision boundary of filters. These issues ultimately lead to a significantly imprecise performance estimation. Our second contribution is to consider the task of hallucination reduction as a joint pipeline of detectors and filters. By developing a methodology to carefully synthesize an OoD dataset that semantically resembles the objects to be detected, and using the crafted OoD dataset in the fine-tuning of YOLO detectors to suppress the objectness score, we achieve an 88% reduction in overall hallucination error with a combined fine-tuned detection and filtering system on the self-driving benchmark BDD-100K. Our code and dataset are available at: https://gricad-gitlab.univ-grenoble-alpes.fr/dnn-safety/m-hood
|
|
13:35-13:40, Paper WeBT13.4 | |
Efficient Multi-Modal 3D Object Detector Via Instance Level Contrastive Distillation |
|
Su, Zhuoqun | National University of Defense Technology |
Lu, Huimin | National University of Defense Technology |
Jiao, Shuaifeng | National University of Defense Technology |
Xiao, Junhao | National University of Defense Technology |
Wang, Yaonan | Hunan University |
Chen, Xieyuanli | National University of Defense Technology |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization
Abstract: Multimodal 3D object detectors leverage the strengths of both geometry-aware LiDAR point clouds and semantically rich RGB images to enhance detection performance. However, the inherent heterogeneity between these modalities, including unbalanced convergence and modal misalignment, presents significant challenges. Meanwhile, the large size of the detection-oriented feature also constrains existing fusion strategies to capture long-range dependencies for the 3D detection tasks. In this work, we introduce a fast yet effective multimodal 3D object detector, incorporating our proposed Instance-level Contrastive Distillation (ICD) framework and Cross Linear Attention Fusion Module (CLFM). ICD aligns instance-level image features with LiDAR representations through object-aware contrastive distillation, ensuring fine-grained modality consistency. Meanwhile, CLFM presents an efficient and scalable fusion strategy that enhances cross-modal global interactions within sizable multimodal BEV features. Extensive experiments on the KITTI and nuScenes 3D object detection benchmarks demonstrate the effectiveness of our methods. Notably, our 3D object detector outperforms state-of-the-art (SOTA) methods while achieving better efficiency. The implementation of our method has been released as open-source at: https://github.com/nubot-nudt/ICD-Fusion.
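Instance-level contrastive distillation of the kind described here pulls each object's image embedding toward the LiDAR embedding of the same instance and pushes it from other instances in the batch, i.e., an InfoNCE over row-aligned instance pairs. A hedged PyTorch sketch; the projection heads, pooling, and temperature are assumptions:

    import torch
    import torch.nn.functional as F

    def instance_contrastive_distill(img_emb, lidar_emb, tau=0.1):
        """img_emb, lidar_emb: (M, D) pooled per-instance features, row-aligned
        so that row i of each tensor describes the same object instance."""
        img = F.normalize(img_emb, dim=-1)
        lid = F.normalize(lidar_emb, dim=-1)
        logits = img @ lid.T / tau            # (M, M) cross-modal similarities
        labels = torch.arange(img.size(0))    # matched pairs sit on the diagonal
        return F.cross_entropy(logits, labels)

    loss = instance_contrastive_distill(torch.randn(12, 256), torch.randn(12, 256))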
|
|
13:40-13:45, Paper WeBT13.5 | |
Coherent Online Road Topology Estimation and Reasoning with Standard-Definition Maps |
|
Pham, Khanh Son | Technical University of Munich, CARIAD SE |
Witte, Christian | CARIAD SE |
Behley, Jens | University of Bonn |
Betz, Johannes | Technical University of Munich |
Stachniss, Cyrill | University of Bonn |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Autonomous Vehicle Navigation
Abstract: Most autonomous cars rely on the availability of high-definition (HD) maps. Current research aims to address this constraint by directly predicting HD map elements from onboard sensors and reasoning about the relationships between the predicted map and traffic elements. Despite recent advancements, the coherent online construction of HD maps remains a challenging endeavor, as it necessitates modeling the high complexity of road topologies in a unified and consistent manner. To address this challenge, we propose a coherent approach to predict lane segments and their corresponding topology, as well as road boundaries, all by leveraging prior map information represented by commonly available standard-definition (SD) maps. We propose a network architecture, which leverages hybrid lane segment encodings comprising prior information and denoising techniques to enhance training stability and performance. Furthermore, we leverage past frames for temporal consistency. Our experimental evaluation demonstrates that our approach outperforms previous methods by a large margin, highlighting the benefits of our modeling scheme.
|
|
13:45-13:50, Paper WeBT13.6 | |
CA-W3D: Leveraging Context-Aware Knowledge for Weakly Supervised Monocular 3D Detection |
|
Liu, Chupeng | University of Sydney |
Zhao, Runkai | University of Sydney |
Cai, Weidong | University of Sydney |
Keywords: Deep Learning for Visual Perception, Deep Learning Methods
Abstract: Weakly supervised monocular 3D detection, while less annotation-intensive, often struggles to capture the global context required for reliable 3D reasoning. Conventional label-efficient methods focus on object-centric features, neglecting contextual semantic relationships that are critical in complex scenes. In this work, we propose Context-Aware Weak Supervision for Monocular 3D object detection, namely CA-W3D, to address this limitation in a two-stage training paradigm. Specifically, we first introduce a pre-training stage employing Region-wise Object Contrastive Matching (ROCM), which aligns regional object embeddings derived from a trainable monocular 3D encoder and a frozen open-vocabulary 2D visual grounding model. This alignment encourages the monocular encoder to discriminate scene-specific attributes and acquire richer contextual knowledge. In the second stage, we incorporate a pseudo-label training process with a Dual-to-One Distillation (D2OD) mechanism, which effectively transfers contextual priors into the monocular encoder while preserving spatial fidelity and maintaining computational efficiency during inference. Extensive experiments conducted on the public KITTI benchmark demonstrate the effectiveness of our approach, surpassing the SoTA method across all metrics and highlighting the importance of context-aware knowledge in weakly supervised monocular 3D detection. For implementation details: https://github.com/AustinLCP/CAW3D.git
|
|
13:50-13:55, Paper WeBT13.7 | |
TeX-NeRF: Neural Radiance Fields for Novel HADAR View Synthesis |
|
Zhong, Chonghao | Beijing Institute of Technology |
Xu, Chao | Beijing Institute of Technology |
Hao, Rihua | Beijing Institute of Technology |
Zhao, Hao | Tsinghua University |
Keywords: Deep Learning for Visual Perception
Abstract: Most existing NeRF methods rely on RGB images, making them unsuitable for scenarios with darkness, low light, or adverse weather conditions. To address this limitation, we propose TeX-NeRF, a NeRF framework based on heat sensing, designed for a new task: novel HADAR view synthesis. Our approach leverages Pseudo-TeX Vision to effectively transform heat sensing images through a structured mapping process. We introduce a loss function tailored to the transformed representation and incorporate temperature gradient embedding to enhance the capture of thermal information. Additionally, we construct 3D-TeX, a high-quality heat sensing dataset, to validate our method. Extensive experiments demonstrate that TeX-NeRF significantly improves pose estimation success rates for heat sensing images and outperforms existing approaches in novel HADAR view synthesis.
|
|
13:55-14:00, Paper WeBT13.8 | |
Leveraging Text-Driven Semantic Variation for Robust OOD Segmentation |
|
Song, Seungheon | Kookmin University |
Lee, Jaekoo | Kookmin University |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization, AI-Based Methods
Abstract: In autonomous driving and robotics, ensuring road safety and reliable decision-making critically depends on out-of-distribution (OOD) segmentation. While numerous methods have been proposed to detect anomalous objects on the road, leveraging the vision-language space, which provides rich linguistic knowledge, remains an underexplored field. We hypothesize that incorporating these linguistic cues can be especially beneficial in the complex contexts found in real-world autonomous driving scenarios. To this end, we present a novel approach that trains a Text-Driven OOD Segmentation model to learn a semantically diverse set of objects in the vision-language space. Concretely, our approach combines a vision-language model's encoder with a transformer decoder, employs Distance-Based OOD prompts located at varying semantic distances from in-distribution (ID) classes, and utilizes OOD Semantic Augmentation for OOD representations. By aligning visual and textual information, our approach effectively generalizes to unseen objects and provides robust OOD segmentation in diverse driving environments. We conduct extensive experiments on publicly available OOD segmentation datasets such as Fishyscapes, Segment-Me-If-You-Can, and Road Anomaly, demonstrating that our approach achieves state-of-the-art performance across both pixel-level and object-level evaluations. This result underscores the potential of vision-language-based OOD segmentation to bolster the safety and reliability of future autonomous driving systems.
|
|
WeBT14 |
311D |
Learning from Demonstration 2 |
Regular Session |
Co-Chair: Liang, Xiao | Nankai University |
|
13:20-13:25, Paper WeBT14.1 | |
Scalable Learning of High-Dimensional Demonstrations with Composition of Linear Parameter Varying Dynamical Systems |
|
Agrawal, Shreenabh | Indian Institute of Science, Bangalore |
Kussaba, Hugo Tadashi | University of Brasília |
Chen, Lingyun | Technical University of Munich |
Binny, Allen Emmanuel | Indian Institute of Technology Kharagpur |
Jagtap, Pushpak | Indian Institute of Science |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Swikir, Abdalla | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Learning from Demonstration, Imitation Learning, Machine Learning for Robot Control
Abstract: Learning from Demonstration (LfD) techniques enable robots to learn and generalize tasks from user demonstrations, eliminating the need for coding expertise among end-users. One established technique to implement LfD in robots is to encode demonstrations in a stable Dynamical System (DS). However, finding a stable dynamical system entails solving an optimization problem with bilinear matrix inequality (BMI) constraints, a non-convex problem which, depending on the number of scalar constraints and variables, demands significant computational resources and is susceptible to numerical issues such as floating-point errors. To address these challenges, we propose a novel compositional approach that enhances the applicability and scalability of learning stable DSs with BMIs.
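For context, the stability problem behind the BMI constraints mentioned above takes the following standard LPV-DS form in the LfD literature (a reference formulation; the paper's exact parameterization may differ). The demonstrations are encoded as

\[ \dot{x} = f(x) = \sum_{k=1}^{K} \gamma_k(x)\,(A_k x + b_k), \qquad \gamma_k(x) \ge 0, \quad \sum_{k=1}^{K} \gamma_k(x) = 1, \]

and global asymptotic stability at the attractor \(x^*\) is certified by a quadratic Lyapunov function \(V(x) = (x - x^*)^\top P (x - x^*)\) under

\[ P \succ 0, \qquad A_k^{\top} P + P A_k \prec 0, \qquad b_k = -A_k x^*, \quad k = 1, \dots, K. \]

Because both \(A_k\) and \(P\) are unknowns, the second condition is bilinear in the decision variables, which is exactly the BMI whose poor scaling the compositional approach targets.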
|
|
13:25-13:30, Paper WeBT14.2 | |
Adversarial Augmentation for Task-Parameterized Underwater Skill Learning Via Digital Twins |
|
Tu, Zhangpeng | Zhejiang University |
Xing, Zilin | State Key Laboratory of Fluid Power and Mechatronic Systems of Zhejiang University
Wu, Xin | Zhejiang University |
Zhang, Suohang | Zhejiang University |
Yang, Canjun | Zhejiang University |
Keywords: Learning from Demonstration, Human-Robot Teaming, Marine Robotics
Abstract: Learning from Demonstration (LfD) provides an efficient approach to acquiring diverse underwater skills, with task-parameterized learning enhancing the generalization of policies. However, collecting comprehensive underwater demonstrations across various conditions remains a significant challenge. In this work, we propose an adversarial trajectory augmentation method for Task Parameterized Hidden Semi-Markov Models (TP-HSMM) based on digital twins, inspired by adversarial example generation. Our method aims to improve the performance of motion policies by utilizing adversarial trajectory generation and retraining, leveraging low-cost demonstrations from digital twins. We evaluate the proposed adversarial trajectory augmentation method on two datasets. Comparative experiments demonstrate its effectiveness in reducing trajectory generation errors in new scenarios. Finally, we validate the method through an underwater humanoid plugging experiment, showing that it achieves similar performance to the baseline with fewer demonstrations.
|
|
13:30-13:35, Paper WeBT14.3 | |
Understanding and Imitating Human-Robot Motion with Restricted Visual Fields |
|
Bhatt, Maulik | University of California, Berkeley |
Zhen, Honghao | Stanford University |
Kennedy, Monroe | Stanford University |
Mehr, Negar | University of California Berkeley |
Keywords: Learning from Demonstration, Sensor-based Control, Motion and Path Planning
Abstract: When working around other agents such as humans, it is important to model their perception capabilities to predict and make sense of their behavior. In this work, we consider agents whose perception capabilities are determined by their limited field of view, viewing range, and the potential to miss objects within their viewing range. By considering the perception capabilities and observation model of agents independently from their motion policy, we show that we can better predict the agents' behavior; i.e., by reasoning about the perception capabilities of other agents, one can better make sense of their actions. We perform a user study where human operators navigate a cluttered scene while scanning the region for obstacles with a limited field of view and range. We show that by reasoning about the limited observation space of humans, a robot can better learn a human's strategy for navigating an environment and navigate with minimal collision with dynamic and static obstacles. We also show that this learned model enables a physical hardware vehicle to navigate successfully in real time.
|
|
13:35-13:40, Paper WeBT14.4 | |
Enhanced Robotic Navigation in Deformable Environments Using Learning from Demonstration and Dynamic Modulation |
|
Chen, Lingyun | Technical University of Munich |
Zhao, Xinrui | Technical University of Munich |
de Souza Campanha, Marcos Paulo | Technical University of Munich |
Wegener, Alexander | Technical University of Munich |
Naceri, Abdeldjallil | Technical University of Munich |
Swikir, Abdalla | Mohamed Bin Zayed University of Artificial Intelligence |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Learning from Demonstration, Motion and Path Planning
Abstract: This paper presents a novel approach for robot navigation in environments containing deformable obstacles. By integrating Learning from Demonstration (LfD) with Dynamical Systems (DS), we enable adaptive and efficient navigation in complex environments where obstacles consist of both soft and hard regions. We introduce a dynamic modulation matrix within the DS framework, allowing the system to distinguish between traversable soft regions and impassable hard areas in real-time, ensuring safe and flexible trajectory planning. We validate our method through extensive simulations and robot experiments, demonstrating its ability to navigate deformable environments. Additionally, the approach provides control over both trajectory and velocity when interacting with deformable objects, including at intersections, while maintaining adherence to the original DS trajectory and dynamically adapting to obstacles for smooth and reliable navigation.
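The modulation idea can be made concrete with the standard DS obstacle-avoidance construction (shown here as background in the Khansari-Zadeh and Billard style; the paper's deformable-obstacle variant extends it):

\[ \dot{x} = M(x)\, f(x), \qquad M(x) = E(x)\, D(x)\, E(x)^{-1}, \]

where \(f(x)\) is the nominal learned DS, \(E(x)\) stacks the obstacle-boundary normal and tangent directions at \(x\), and the diagonal matrix \(D(x)\) scales the normal component toward zero as the robot approaches a hard boundary. Letting the normal eigenvalue remain positive inside soft regions, rather than vanishing at the boundary, is one natural way to obtain the controlled traversal of deformable material described above, while hard regions keep the classical impenetrability condition.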
|
|
13:40-13:45, Paper WeBT14.5 | |
LAMPS: A Novel Robot Generalization Framework for Learning Adaptive Multi-Periodic Skills |
|
Liu, Zezhi | Nankai University |
Luo, Hanqian | Nankai University, the Hong Kong Polytechnic University |
Liang, Xiao | Nankai University |
Fang, Yongchun | Nankai University |
Keywords: Learning from Demonstration, Manipulation Planning, Motion Control
Abstract: Learning from Demonstrations (LfD) methods are applied to transfer human skills to robots from expert demonstrations, enabling them to perform complex tasks. However, existing methods often struggle to handle such long-horizon human skills as cleaning or wiping stains on the surface, which involve multiple periodic and transitional movement primitives. To address this limitation, this paper proposes a novel framework for segmenting, learning, and generalizing multi-periodic human skills, enabling robots to effectively learn different movement primitives and execute these skills in new environments. Specifically, the framework introduces an unsupervised learning method to segment long-horizon human demonstrations into periodic and discrete movement primitives. Further, a novel type of discrete dynamical movement primitives, namely transitional movement primitives, is employed to enhance the fluidity of combining different periodic movement primitives in skills. These primitives collectively form a lightweight state machine during task execution, where state transitions are governed by visual perception, thereby enabling generalization to long-horizon tasks composed of arbitrary numbers of periodic subtasks. To validate the effectiveness of the proposed approach, we conduct extensive experimental evaluations, including step-by-step validation of each method in simulation and the implementation of the entire presented framework in the real world. The results confirm that the proposed framework accurately learns and generalizes multi-periodic human skills, providing a feasible solution for transferring complex multi-periodic demonstrations to robots in practical applications.
|
|
13:45-13:50, Paper WeBT14.6 | |
ManiDP: Manipulability-Aware Diffusion Policy for Posture-Dependent Bimanual Manipulation |
|
Li, Zhuo | The Chinese University of Hong Kong |
Liu, Junjia | The Chinese University of Hong Kong |
Li, Dianxi | The Chinese University of Hong Kong |
Teng, Tao | The Chinese University of Hong Kong & Hong Kong Centre for Logistics Robotics
Li, Miao | Wuhan University |
Calinon, Sylvain | Idiap Research Institute |
Caldwell, Darwin G. | Istituto Italiano Di Tecnologia |
Chen, Fei | T-Stone Robotics Institute, the Chinese University of Hong Kong |
Keywords: Learning from Demonstration, Bimanual Manipulation
Abstract: Recent work has demonstrated the potential of diffusion models in robot bimanual skill learning. However, existing methods ignore the learning of posture-dependent task features, which are crucial for adapting dual-arm configurations to meet specific force and velocity requirements in dexterous bimanual manipulation. To address this limitation, we propose Manipulability-Aware Diffusion Policy (ManiDP), a novel imitation learning method that not only generates plausible bimanual trajectories, but also optimizes dual-arm configurations to better satisfy posture-dependent task requirements. ManiDP achieves this by extracting bimanual manipulability from expert demonstrations and encoding the encapsulated posture features using Riemannian-based probabilistic models. These encoded posture features are then incorporated into a conditional diffusion process to guide the generation of task-compatible bimanual motion sequences. We evaluate ManiDP on six real-world bimanual tasks, where the experimental results demonstrate a 39.33% increase in average manipulation success rate and a 0.45 improvement in task compatibility compared to baseline methods. This work highlights the importance of integrating posture-relevant robotic priors into bimanual skill diffusion to enable human-like adaptability and dexterity.
|
|
13:50-13:55, Paper WeBT14.7 | |
Cross-Embodiment Robotic Manipulation Synthesis Via Guided Demonstrations through CycleVAE and Human Behavior Transformer |
|
Dastider, Apan | University of Central Florida |
Fang, Hao | University of Central Florida |
Lin, Mingjie | University of Central Florida
Keywords: Learning from Demonstration, Dexterous Manipulation, Deep Learning Methods
Abstract: Cross-embodiment robotic manipulation synthesis for complicated tasks is challenging, partially due to the scarcity of paired cross-embodiment datasets and the difficulty of designing intricate controllers. Inspired by robotic learning via guided human expert demonstration, we propose a novel cross-embodiment robotic manipulation algorithm based on a CycleVAE and a human behavior transformer. First, we utilize an unsupervised CycleVAE together with a bidirectional subspace alignment algorithm to align latent motion sequences between cross-embodiments. Second, we propose a causal human behavior transformer design to learn the intrinsic motion dynamics of human expert demonstrations. At test time, we leverage the proposed transformer to generate human expert demonstrations, which are aligned using the CycleVAE for the final human-robot manipulation synthesis. We validated our proposed algorithm through extensive experiments using a dexterous robotic manipulator with a robotic hand. Our results successfully generate smooth trajectories across many intricate tasks, outperforming prior learning-based robotic motion planning algorithms. These results have implications for performing unsupervised cross-embodiment alignment and for future autonomous robotics design. Complete video demonstrations of our experiments can be found at https://sites.google.com/view/humanrobots/home
|
|
WeBT15 |
206 |
Computer Vision 2 |
Regular Session |
|
13:20-13:25, Paper WeBT15.1 | |
Generalizable Image Repair for Robust Visual Control |
|
Sobolewski, Carson | University of Florida |
Mao, Zhenjiang | University of Florida |
Vejre, Kshitij Maruti | University of Florida |
Ruchkin, Ivan | University of Florida |
Keywords: Computer Vision for Transportation, Robust/Adaptive Control, Machine Learning for Robot Control
Abstract: Vision-based control relies on accurate perception to achieve robustness. However, image distribution changes caused by sensor noise, adverse weather, and dynamic lighting can degrade perception, leading to suboptimal control decisions. Existing approaches, including domain adaptation and adversarial training, improve robustness but struggle to generalize to unseen corruptions while introducing computational overhead. To address this challenge, we propose a real-time image repair module that restores corrupted images before they are used by the controller. Our method leverages generative adversarial models, specifically CycleGAN and pix2pix, for image repair. CycleGAN enables unpaired image-to-image translation to adapt to novel corruptions, while pix2pix exploits paired image data when available to improve the quality. To ensure alignment with control performance, we introduce a control-focused loss function that prioritizes perceptual consistency in repaired images. We evaluated our method in a simulated autonomous racing environment with various visual corruptions. The results show that our approach significantly improves performance compared to baselines, mitigating distribution shift and enhancing controller reliability.
|
|
13:25-13:30, Paper WeBT15.2 | |
OrchardDepth++: Binned KL-Flood Regularization for Monocular Depth Estimation of Orchard Scene |
|
Zheng, Zhichao | The University of Auckland
Williams, Henry | University of Auckland |
Gee, Trevor | The University of Auckland |
MacDonald, Bruce | University of Auckland |
Keywords: Computer Vision for Automation, Agricultural Automation, Robotics and Automation in Agriculture and Forestry
Abstract: Monocular depth estimation is a fundamental problem for robotic perception systems and downstream applications. However, depth estimation from a single image is an inherently ill-posed problem due to the information lost when projecting from 3D to 2D. Recent studies handle the discrepancy between camera parameters with learning-based methods that unify the camera model into a canonical camera space or bipolar representation, enabling metric depth models to be trained across datasets with different camera parameters. The previous study, OrchardDepth, introduced a sparse-dense depth consistency loss to learn dense depth distributions from urban autonomous driving scenes and improve model performance in the orchard. Instead of enforcing strict consistency between sparse and dense depth, this work introduces a KL divergence that encourages the network to adapt to the depth distributions of different sensors, penalizing deviations in reliable regions while tolerating errors in unreliable areas. Furthermore, we enhance the depth consistency loss by integrating bins into the supervised discretised depth distribution. This significantly improves the robustness and performance of our previous method, reducing the absolute relative error on the orchard dataset by 17.3% and 16.2% compared to the SILog loss and the OrchardDepth baseline, respectively, and establishing a new training paradigm for depth estimation in the orchard scene.
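A minimal sketch of what a binned, KL-regularized sparse-dense consistency term could look like, assuming metric depth maps and a validity mask marking reliable regions; the bin layout, temperature, and all names are assumptions for illustration, not the authors' code:

```python
import math
import torch
import torch.nn.functional as F

def binned_kl_depth_loss(pred_depth, sparse_depth, valid_mask,
                         num_bins=64, d_min=0.5, d_max=80.0, tau=0.05):
    """Binned KL consistency between dense predictions and sparse depth.

    pred_depth, sparse_depth: (B, H, W) metric depth maps.
    valid_mask: (B, H, W) bool, True where the sparse depth is reliable.
    """
    centers = torch.logspace(math.log10(d_min), math.log10(d_max),
                             num_bins, device=pred_depth.device)
    pred = pred_depth[valid_mask]   # only reliable regions are penalized
    gt = sparse_depth[valid_mask]
    # Differentiable soft assignment of predictions to log-spaced depth bins.
    pred_soft = (-(pred.log().unsqueeze(-1) - centers.log()).abs() / tau).softmax(-1)
    pred_p = pred_soft.mean(0)      # predicted bin distribution, shape (num_bins,)
    # Hard histogram for the sparse ground truth (no gradient needed).
    gt_idx = torch.bucketize(gt, centers).clamp(max=num_bins - 1)
    gt_p = torch.bincount(gt_idx, minlength=num_bins).float()
    gt_p = gt_p / gt_p.sum()
    # KL(gt || pred), smoothed so the divergence stays finite for empty bins.
    return F.kl_div((pred_p + 1e-8).log(), gt_p + 1e-8, reduction="sum")
```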
|
|
13:30-13:35, Paper WeBT15.3 | |
Enhancing Single Image to 3D Generation Using Gaussian Splatting and Hybrid Diffusion Priors |
|
Basak, Hritam | Stony Brook University |
Tabatabaee, Hadi | Amazon Lab126 |
Gayaka, Shreekant | Amazon |
Li, Ming-Feng | National Tsing Hua University |
Yang, Xin | Amazon |
Kuo, Cheng-Hao | Amazon |
Sen, Arnab | Amazon |
Sun, Min | National Tsing Hua University |
Yin, Zhaozheng | Stony Brook University |
Keywords: Computer Vision for Automation, Deep Learning Methods, RGB-D Perception
Abstract: 3D object generation from a single unposed RGB image is essential for robotic perception, as reconstructing complete geometry and texture is crucial for precise manipulation, grasping, and scene understanding, which are key for autonomous navigation and dexterous interaction. Recent advancements in image-to-3D employ Gaussian Splatting with pre-trained 2D or 3D diffusion models, but a disparity exists: 2D models generate high-fidelity textures yet lack geometric consistency, while 3D models ensure structural coherence but produce overly smooth textures. To address this, we introduce a two-stage frequency-based distillation loss integrated with Gaussian Splatting, leveraging geometric priors from a 3D diffusion model's low-frequency spectrum for structural consistency and a 2D diffusion model's high-frequency details for sharper textures. Our approach achieves state-of-the-art 3D reconstruction quality, significantly improving robotic perception pipelines. Additionally, we demonstrate the easy adaptability of our method to highly accurate object pose estimation and tracking, which is critical for precise robotic grasping, manipulation, and scene understanding.
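A minimal sketch of the frequency split underlying such a hybrid distillation, assuming a rendered image is supervised by a 3D-consistent target at low frequencies and a high-fidelity 2D target at high frequencies; the box-filter cutoff, loss weighting, and all names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def frequency_split(img, cutoff=0.1):
    """Split an image into low/high-frequency parts with an FFT box filter.
    img: (B, C, H, W); cutoff: fraction of the spectrum kept as 'low'."""
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    _, _, H, W = img.shape
    cy, cx = H // 2, W // 2
    ry, rx = int(H * cutoff), int(W * cutoff)
    mask = torch.zeros(H, W, device=img.device)
    mask[cy - ry:cy + ry, cx - rx:cx + rx] = 1.0   # keep the central (low) band
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
    return low, img - low

def hybrid_distillation_loss(render, target_3d, target_2d, cutoff=0.1):
    """Low frequencies follow the 3D-consistent target, high ones the 2D target."""
    low_r, high_r = frequency_split(render, cutoff)
    low_3d, _ = frequency_split(target_3d, cutoff)
    _, high_2d = frequency_split(target_2d, cutoff)
    return F.mse_loss(low_r, low_3d) + F.mse_loss(high_r, high_2d)
```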
|
|
13:35-13:40, Paper WeBT15.4 | |
IoU-Aware Clustering for Anchor Configuration Determination in Efficient Defect Detection |
|
Zhao, Yuhao | Institute of Automation, Chinese Academy of Sciences |
Ma, Hongxuan | Institute of Automation, Chinese Academy of Sciences |
Zou, Wei | Chinese Academy of Sciences, University of Chinese Academy of Sciences
Liu, Zhe | Chemical Defense Institute, Academy of Military Sciences |
Su, Hu | Institute of Automation, Chinese Academy of Science |
Liu, Song | ShanghaiTech University |
Keywords: Computer Vision for Automation, Computer Vision for Manufacturing, Factory Automation
Abstract: Deep-learning-based object detection has gained widespread application in surface defect inspection, with anchor-based detectors achieving remarkable success by utilizing dense anchors to align with defects. Determining the optimal anchor configuration, i.e., sizes and aspect ratios of anchor boxes, remains a critical challenge, particularly when addressing defects with significant shape variations. While previous studies have predominantly focused on developing more efficient network architectures and learning strategies, the problem of anchor configuration determination has not been thoroughly explored. To address this gap, this paper proposes the IoU-Aware Clustering (IAC) algorithm, which autonomously learns suitable anchor configurations by extracting shape priors from diverse defects. IAC takes the training bounding boxes as potential clustering centers and selects a subset that aligns with the shape distribution of the training samples. The algorithm involves only a single hyper-parameter, the anchor number k, making it highly adaptable to various scenarios. Experimental results demonstrate that IAC can effectively generate anchor configurations tailored to defect shapes, significantly improving the mean Average Precision (mAP) by 6.9% and 14.4% on two industrial defect datasets with substantial shape variations.
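Since IAC is described as selecting anchors from the training boxes themselves, one plausible reading is a k-medoids procedure under a (1 - IoU) distance, sketched below; the exact selection criterion in the paper may differ, and all names here are hypothetical:

```python
import numpy as np

def wh_iou(wh1, wh2):
    """IoU between boxes given by (width, height), anchored at a shared origin."""
    inter = np.minimum(wh1[:, None, 0], wh2[None, :, 0]) * \
            np.minimum(wh1[:, None, 1], wh2[None, :, 1])
    union = wh1.prod(-1)[:, None] + wh2.prod(-1)[None, :] - inter
    return inter / union

def iou_aware_anchor_selection(boxes_wh, k, iters=20, seed=0):
    """k-medoids with a (1 - IoU) distance; medoids are actual training boxes,
    mirroring the abstract's idea of selecting centers from the data."""
    rng = np.random.default_rng(seed)
    medoids = boxes_wh[rng.choice(len(boxes_wh), size=k, replace=False)]
    for _ in range(iters):
        assign = wh_iou(boxes_wh, medoids).argmax(axis=1)  # best-matching anchor
        new_medoids = []
        for c in range(k):
            members = boxes_wh[assign == c]
            if len(members) == 0:                          # keep empty clusters fixed
                new_medoids.append(medoids[c])
                continue
            # The medoid is the member box with the highest mean IoU to its cluster.
            scores = wh_iou(members, members).mean(axis=1)
            new_medoids.append(members[scores.argmax()])
        medoids = np.stack(new_medoids)
    return medoids
```

For example, iou_aware_anchor_selection(train_wh, k=9) returns nine (width, height) anchors drawn directly from the training data, so each anchor matches at least one real defect shape; only the anchor number k is a hyper-parameter, as the abstract notes.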
|
|
13:40-13:45, Paper WeBT15.5 | |
MAD-GS: 3D Gaussian Splatting for Motion and Defocus Images in Robotic Vision |
|
Zeng, Tianle | Guangdong University of Technology |
Zeng, Bi | Guangdong University of Technology |
Zhang, Boquan | Guangdong University of Technology |
Zheng, Ziqi | Guangdong University of Technology |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception, Mapping
Abstract: Recent advancements in 3D Gaussian Splatting (3DGS) have significantly improved novel view synthesis, playing a crucial role in robotic vision and scene reconstruction. However, 3DGS relies heavily on precise camera poses and sharp images, which are often difficult to obtain in real-world robotic applications due to motion and defocus blur. Directly applying 3DGS to blurred images results in severe degradation, limiting its effectiveness in tasks such as autonomous navigation and manipulation. To address this challenge, we propose MAD-GS, a novel deblurring framework based on 3DGS, specifically designed for robotic vision tasks. MAD-GS effectively mitigates motion and defocus blur while refining imprecise camera poses, enhancing 3D scene reconstruction under real-world uncertainties. Additionally, we introduce a blur segmentation mask to identify and adaptively refine heavily blurred regions, improving visual quality and downstream robotic decision-making. Extensive experiments on synthetic and real-world datasets demonstrate that MAD-GS outperforms existing methods, leading to superior image clarity and fidelity, thereby advancing robust robot perception in dynamic environments.
|
|
13:45-13:50, Paper WeBT15.6 | |
Chain-Of-Imagination for Reliable Instruction Following in Decision Making |
|
Zhou, Enshen | BeiHang University |
Qin, Yiran | The Chinese University of Hong Kong, Shenzhen
Yin, Zhenfei | The University of Sydney |
Shi, Zhelun | Beihang University |
Huang, Yuzhou | The Chinese University of Hong Kong, Shenzhen |
Zhang, Ruimao | The Chinese University of Hong Kong (Shenzhen) |
Sheng, Lu | Beihang University (BUAA) |
Shao, Jing | Shanghai AI Laboratory |
Keywords: Computer Vision for Automation, Agent-Based Systems
Abstract: Enabling the embodied agent to imagine step-by-step the future states and sequentially approach these situation-aware states can enhance its capability to make reliable action decisions from textual instructions. In this work, we introduce a simple but effective mechanism called Chain-of-Imagination (CoI), which repeatedly employs a Multimodal Large Language Model (MLLM) equipped with a diffusion model to imagine and act upon a series of intermediate situation-aware visual sub-goals one by one, resulting in more reliable instruction-following capability. Based on the CoI mechanism, we propose an embodied agent, DecisionDreamer, as a low-level controller that can be adapted to different open-world scenarios. Extensive experiments demonstrate that DecisionDreamer achieves more reliable and accurate decision-making and significantly outperforms state-of-the-art generalist agents in the Minecraft and CALVIN sandbox simulators regarding instruction-following capability. For more demos, please see https://sites.google.com/view/decisiondreamer.
|
|
13:50-13:55, Paper WeBT15.7 | |
EdgeSpotter: Multi-Scale Dense Text Spotting for Industrial Panel Monitoring |
|
Fu, Changhong | Tongji University |
Lin, Hua | Tongji University |
Zuo, Haobo | University of Hong Kong |
Yao, Liangliang | Tongji University |
Zhang, Liguo | Tongji University |
Keywords: Computer Vision for Automation, AI-Based Methods, Computer Vision for Manufacturing
Abstract: Text spotting for industrial panels is a key task for intelligent monitoring. However, achieving efficient and accurate text spotting for complex industrial panels remains challenging due to issues such as cross-scale localization and ambiguous boundaries in dense text regions. Moreover, most existing methods primarily focus on representing a single text shape, neglecting a comprehensive exploration of multi-scale feature information across different texts. To address these issues, this work proposes a novel multi-scale dense text spotter for an edge AI-based vision system (EdgeSpotter) to achieve accurate and robust industrial panel monitoring. Specifically, a novel Transformer with an efficient mixer is developed to learn the interdependencies among multi-level features, integrating multi-layer spatial and semantic cues. In addition, a new feature sampling scheme based on Catmull-Rom splines is designed, which explicitly encodes the shape, position, and semantic information of text, thereby alleviating missed detections and reducing recognition errors caused by multi-scale or dense text regions. Furthermore, a new benchmark dataset for industrial panel monitoring (IPM) is constructed. Extensive qualitative and quantitative evaluations on this challenging benchmark dataset validate the superior performance of the proposed method in different challenging panel monitoring tasks. Finally, practical tests based on the self-designed edge AI-based vision system demonstrate the practicality of the method. The code and demo are available at https://github.com/vision4robotics/EdgeSpotter.
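A minimal sketch of uniform Catmull-Rom sampling of the kind such a feature-sampling module builds on, assuming 2-D control points placed along a text instance; the sampling density and function names are illustrative assumptions:

```python
import numpy as np

def catmull_rom(p0, p1, p2, p3, t):
    """Point on the uniform Catmull-Rom segment between p1 and p2, t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    return 0.5 * ((2 * p1) + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t3)

def sample_spline(ctrl, samples_per_seg=8):
    """Densely sample the Catmull-Rom interpolation of a control polyline.
    ctrl: (K, 2) control points, K >= 4; returns (M, 2) sample points."""
    ctrl = np.asarray(ctrl, dtype=float)
    pts = []
    for i in range(1, len(ctrl) - 2):
        for t in np.linspace(0.0, 1.0, samples_per_seg, endpoint=False):
            pts.append(catmull_rom(ctrl[i - 1], ctrl[i], ctrl[i + 1], ctrl[i + 2], t))
    pts.append(ctrl[-2])  # close at the last interpolated control point
    return np.stack(pts)
```

Because the spline passes through its interior control points, features sampled at these locations stay attached to the text boundary even for curved or dense layouts, which is the property the spotter exploits.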
|
|
13:55-14:00, Paper WeBT15.8 | |
Quality and Quantity Control of Mitochondria Injection into Single Cells with Robot-Aided Micro-Manipulation System (I) |
|
Shakoor, Adnan | King Fahd University of Petroleum & Minerals (KFUPM), Saudi Arabia
Xie, Mingyang | Nanjing University of Aeronautics & Astronautics |
Gao, Wendi | Xi'an Jiaotong University |
Gulzar, Muhammad Majid | King Fahd University of Petroleum & Minerals (KFUPM), Saudi Arabia
Sun, Jiayu | City University of Hong Kong |
Sun, Dong | City University of Hong Kong |
Keywords: Biological Cell Manipulation, Computer Vision for Automation, Micro/Nano Robots
Abstract: Mitochondrial dysfunction plays a significant role in aging and in the development of fatal diseases such as cancer and Alzheimer's. Transferring mitochondria to cells is a new and promising treatment for mitochondrial DNA (mtDNA)-related illnesses. This article describes a novel technique to control the quality and quantity of mitochondria injected into single live cells using a robot-aided microneedle and optical tweezers (OTs)-based micromanipulation system. Isolated mitochondria and cells are patterned in a 1-dimensional (1-D) array in a microfluidic device, and a robot-aided microneedle collects a predefined number of functional mitochondria with the help of OTs. The microneedle then precisely and non-invasively injects these collected mitochondria into single live cells. Given that the two manipulation tools of OTs and microneedle are used, a switching controller strategy is developed to enable mitochondria trapping with the OTs and injection with the microneedle-based micromanipulator, respectively. The effectiveness of the developed robotic system is experimentally demonstrated with automated injections of isolated mitochondria into HeLa cells and mesenchymal stem cells (aMSCs). Precise and efficient quality and quantity control of mitochondria injection was made possible by using OTs in conjunction with microneedle and microfluidics technologies. The quality- and quantity-control ability is analyzed and compared with the traditional mitochondria transfer method (co-culture). Biological tests are further conducted to assess the viability of mitochondria recipient cells. Experimental results demonstrate that the developed system can non-invasively transfer healthy mitochondria into single live cells while precisely controlling the quantity and quality of injected mitochondria. The proposed mitochondrial transfer method has the potential to advance precision medicine, particularly for cellular therapy of mtDNA-related diseases.
|
|
WeBT16 |
207 |
Prosthetics and Exoskeletons 2 |
Regular Session |
|
13:20-13:25, Paper WeBT16.1 | |
System Design of a Soft Underwater Exosuit to Reduce Metabolic Cost across Multiple Aquatic Movements During Diving |
|
Wang, Xiangyang | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Chen, Chunjie | Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Sun, Jianquan | Shenzhen Institutes of Advanced Technology |
Du, Sida | Shenzhen Institute of Advanced Technology, CAS |
Ma, Yue | Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Wu, Xinyu | CAS |
Keywords: Prosthetics and Exoskeletons, Human Performance Augmentation, Physically Assistive Devices, Scuba-diving Assistive Exoskeleton
Abstract: Assisting underwater movements improves divers' efficiency and reduces the risk of decompression sickness from physical activity. Although exoskeletons have been developed for numerous land-based scenarios, their application in underwater diving remains unexplored. This paper proposes a soft underwater lower-limb exosuit designed to assist three aquatic movements: flutter kick, breaststroke kick, and underwater walk. We present the mechanical design of the exosuit, which assists bidirectional leg movements over the full kicking/gait cycle while ensuring natural leg mobility without impeding normal leg function. A cascade force integral controller is also designed to resolve issues related to uncontrollable states and stiffness variations within the system. To verify the assistive performance of the system, experiments were conducted with nine participants to assess how the proposed exosuit reduces metabolic cost across various motion patterns and frequencies. The findings indicate that the underwater exosuit effectively reduces the air consumption rate by 29.77 ± 7.68% during flutter kick, 25.70 ± 5.99% during breaststroke kick, and 18.35 ± 4.53% during underwater walk.
|
|
13:30-13:35, Paper WeBT16.3 | |
Bi-Directional Cable-Driven Ankle Exoskeleton Coupled with Series Elastic Actuator for Compliant Gait Assisting |
|
Tu, Yao | Guangdong Laboratory of Artificial Intelligence and Digital Economy
Song, Jiyuan | Guangming Laboratory, Guangdong Laboratory of Artificial Intelligence and Digital Economy
Zhu, Aibin | Xi'an Jiaotong University |
Zhang, Bo | Shenzhen University, Shenzhen 518060, China |
Yu, F. Richard | Shenzhen University |
Li, Qingquan | Shenzhen University |
Keywords: Prosthetics and Exoskeletons, Compliance and Impedance Control, Mechanism Design
Abstract: This paper presents a lightweight bidirectional cable-driven ankle exoskeleton system (total mass: 2.6 kg) based on a series elastic actuation architecture (actuator module mass: 1.05 kg). The system utilizes a waist-mounted drive unit and Bowden cables to deliver bidirectional assistance to the ankle joint (nominal force: 460 N, peak force: 680 N). By integrating a dynamically coupled adaptive oscillator (AO), the system achieves robust gait synchronization across a range of walking speeds (0.6–1.8 m/s, phase estimation RMSE <2.48%, stride frequency estimation RMSE <0.1 Hz). This is complemented by a Gaussian Process (GP)-based torque planner and a cascaded torque control framework, ensuring seamless coordination with natural gait. Experimental characterization of the actuator demonstrates its high dynamic performance (torque bandwidth: 12.5 Hz) and low-impedance characteristics (peak passive backdrive torque: 0.97 N·m). Human trials involving five participants show that the system significantly expands the ankle joint range of motion (up to [-15.64°, 20.67°] at high speeds) while reducing peak muscle activation levels in the tibialis anterior (18.54%–30.21%) and gastrocnemius (19.34%–25.45%). This design, combining lightweight construction with adaptive control strategies, provides a highly effective solution for daily mobility assistance and rehabilitation applications.
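For background, gait synchronization of this kind is typically built on an adaptive-frequency oscillator (shown here in the standard Righetti-style form; the paper's dynamically coupled AO may differ in its coupling terms), which entrains a phase estimate \(\phi\) and frequency estimate \(\omega\) to a periodic gait signal \(F(t)\), such as a joint angle or IMU reading:

\[ \dot{\phi} = \omega - \epsilon\, F(t) \sin\phi, \qquad \dot{\omega} = -\epsilon\, F(t) \sin\phi, \]

where the gain \(\epsilon\) trades adaptation speed against noise sensitivity. After convergence, \(\phi\) tracks the gait phase and \(\omega\) the stride frequency, which is what the reported phase RMSE (<2.48%) and stride-frequency RMSE (<0.1 Hz) quantify.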
|
|
13:35-13:40, Paper WeBT16.4 | |
EIC Framework for Hand Exoskeletons Based on a Multimodal Large Language Model |
|
Li, Houcheng | University of Chinese Academy of Sciences |
Su, Zhenchan | China University of Geosciences Beijing |
Guo, Honglei | China University of Geosciences Beijing |
Wang, Yifan | Institute of Automation, Chinese Academy of Sciences |
Liu, Zeyu | Institute of Automation, Chinese Academy of Sciences
Cheng, Long | Chinese Academy of Sciences |
Keywords: Prosthetics and Exoskeletons, AI-Enabled Robotics, Intention Recognition
Abstract: Current hand exoskeleton interaction methods primarily focus on recognizing a limited range of hand motion intentions and rely on pre-programmed control to execute predefined commands. However, these approaches face significant limitations when confronted with unanticipated or non-predefined scenarios, such as performing various gestures or grasping different objects. To address this challenge, this paper proposes an embodied interaction control (EIC) framework for hand exoskeletons based on a multimodal large language model (MLLM). First, an embodied interaction method leveraging multi-modal fusion of speech and image information is developed, enabling more intuitive, hands-free, accurate, and robust human-robot interaction. By utilizing multi-modal data, the MLLM infers the user's hand motion intentions and generates corresponding motion plans for the exoskeleton. The underlying control strategy is then used to execute the motion planning. Notably, leveraging the advanced reasoning and code-generation capabilities of MLLMs, the framework can generate undefined gestures and grasping actions. Finally, experimental results validate the effectiveness and generalizability of the EIC framework.
|
|
13:40-13:45, Paper WeBT16.5 | |
Force-Sensor-Free Contact Estimation for Lower Limb Exoskeleton Robots Based on Probabilistic Modeling and Fusion |
|
Ye, Weigen | University of Electronic Science and Technology of China |
Zhang, Xinhao | University of Electronic Science and Technology of China |
Jiang, Ziyi | University of Electronic Science and Technology of China |
Zou, Chaobin | University of Electronic Science and Technology of China |
Zhang, Jingting | University of Electronic Science and Technology of China |
Song, Guangkui | University of Electronic Science and Technology of China |
Cheng, Hong | University of Electronic Science and Technology |
Keywords: Prosthetics and Exoskeletons, Contact Modeling, Probability and Statistical Methods
Abstract: Lower limb exoskeletons (LLEs) play a crucial role in assisting paraplegic patients with walking in outdoor environments characterized by complex terrains, including various stairs, slopes, and uneven grounds. However, most existing control methods for LLEs rely on predefined joint angles, lacking the flexibility to adapt to diverse terrains. This deficiency often leads to unexpected contacts between the feet of the LLEs and the ground, thereby disrupting the walking balance of the LLEs. In this paper, a novel force-sensor-free contact estimation method is proposed to tackle this problem. This method utilizes only the sensors already present on the LLEs, eliminating the need for any additional force sensors. The proposed approach is founded on the probabilistic modeling of gait phases, knee joint torques, foot heights, and the displacement of the center of mass. Moreover, Kalman filtering is employed to enhance the contact estimation accuracy by integrating multiple probabilistic models. Experiments were carried out on both robot simulation platforms and real exoskeleton robots. The experimental results demonstrate that the proposed approach can accurately estimate contacts during walking on flat ground and stairs. Specifically, it achieves an accuracy of 99% with a time deviation of 8 ms on the flat ground and an accuracy of 95% with a time deviation of 10 ms on stairs.
|
|
13:45-13:50, Paper WeBT16.6 | |
Hybrid Data-Driven Predictive Control for Robust and Reactive Exoskeleton Locomotion Synthesis |
|
Li, Kejun | California Institute of Technology |
Kim, Jeeseop | Caltech |
Brunet, Maxime | MINES ParisTech
Petriaux, Marine | Wandercraft |
Yue, Yisong | California Institute of Technology |
Ames, Aaron | Caltech |
Keywords: Prosthetics and Exoskeletons, Motion Control, Humanoid and Bipedal Locomotion
Abstract: Robust bipedal locomotion in exoskeletons requires the ability to dynamically react to changes in the environment in real time. This paper introduces the hybrid data-driven predictive control (HDDPC) framework, an extension of the data-enabled predictive control, that addresses these challenges by simultaneously planning foot contact schedules and continuous domain trajectories. The proposed framework utilizes a Hankel matrix-based representation to model system dynamics, incorporating step-to-step (S2S) transitions to enhance adaptability in dynamic environments. By integrating contact scheduling with trajectory planning, the framework offers an efficient, unified solution for locomotion motion synthesis that enables robust and reactive walking through online replanning. We validate the approach on the Atalante exoskeleton, demonstrating improved robustness and adaptability.
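As background on the data-enabled predictive control that HDDPC extends (the canonical DeePC problem; the paper's hybrid variant additionally chooses the contact schedule and S2S structure), recorded input/output data are arranged in Hankel matrices \(U_p, Y_p, U_f, Y_f\), and each control step solves

\[ \min_{g,\,u,\,y} \sum_{k=0}^{N-1} \left( \|y_k - r_k\|_Q^2 + \|u_k\|_R^2 \right) \quad \text{s.t.} \quad \begin{bmatrix} U_p \\ Y_p \\ U_f \\ Y_f \end{bmatrix} g = \begin{bmatrix} u_{\mathrm{ini}} \\ y_{\mathrm{ini}} \\ u \\ y \end{bmatrix}, \]

where the most recent window \((u_{\mathrm{ini}}, y_{\mathrm{ini}})\) implicitly fixes the current latent state, and, under persistency-of-excitation assumptions, any trajectory in the column span of the Hankel blocks is a valid system trajectory. Replanning this problem online with a selectable contact schedule is what enables the reactive walking described above.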
|
|
13:50-13:55, Paper WeBT16.7 | |
HannesImitation: Grasping with the Hannes Prosthetic Hand Via Imitation Learning |
|
Alessi, Carlo | Istituto Italiano Di Tecnologia |
Vasile, Federico | Istituto Italiano Di Tecnologia |
Ceola, Federico | Istituto Italiano Di Tecnologia |
Pasquale, Giulia | Istituto Italiano Di Tecnologia |
Boccardo, Nicolò | IIT - Istituto Italiano Di Tecnologia |
Natale, Lorenzo | Istituto Italiano Di Tecnologia |
Keywords: Prosthetics and Exoskeletons, AI-Enabled Robotics, Grasping
Abstract: Recent advancements in control of prosthetic hands have focused on increasing autonomy through the use of cameras and other sensory inputs. These systems aim to reduce the cognitive load on the user by automatically controlling certain degrees of freedom. In robotics, imitation learning has emerged as a promising approach for learning grasping and complex manipulation tasks while simplifying data collection. Its application to the control of prosthetic hands remains, however, largely unexplored. Bridging this gap could enhance dexterity restoration and enable prosthetic devices to operate in more unconstrained scenarios, where tasks are learned from demonstrations rather than relying on manually annotated sequences. To this end, we present HannesImitationPolicy, an imitation learning-based method to control the Hannes prosthetic hand, enabling object grasping in unstructured environments. Moreover, we introduce the HannesImitationDataset comprising grasping demonstrations in table, shelf, and human-to-prosthesis handover scenarios. We leverage such data to train a single diffusion policy and deploy it on the prosthetic hand to predict the wrist orientation and hand closure for grasping. Experimental evaluation demonstrates successful grasps across diverse objects and conditions. Finally, we show that the policy outperforms a segmentation-based visual servo controller in unstructured scenarios. Additional material is provided on our project page: https://hsp-iit.github.io/HannesImitation.
|
|
13:55-14:00, Paper WeBT16.8 | |
Weight Regression for a Generalized Motion Primitive Formulation in Cooperative Hand Placement Tasks with Upper Limb Prosthesis |
|
Cai, Hongjun | Johns Hopkins University |
Greene, Rebecca J. | Johns Hopkins University |
Hunt, Christopher | Infinite Biomedical Technologies |
Thakor, Nitish V. | Johns Hopkins University, Baltimore, USA |
Keywords: Prosthetics and Exoskeletons, Human and Humanoid Motion Analysis and Synthesis, Human-Robot Collaboration
Abstract: Recent years have seen a growing interest in the development of shared control strategies for upper limb prostheses. In this work, we take a critical step towards developing transhumeral devices by proposing a biomimetic control strategy for cooperative hand placement. This is achieved through a novel adaptation of Dynamic Movement Primitives (DMPs), enabling the generation of smooth trajectories from a rest position to arbitrary points within a user’s reach space. Our method revolves around a key observation that DMP forcing-function weights can be well modeled (p<0.05) for > 90% of values as a simple function of Cartesian position, achieving median R^2 values of 0.63, 0.43, and 0.02 (horizontal, vertical, and depth). Validation on 519 trajectories via 5-fold cross-validation showed significant improvements (p<0.01) over Extended DMP and kernel-based methods. Real-time human-in-the-loop experiments revealed a median minimum cumulative-distance deviation of 0.0733m (8.5% error) between motion with a prosthesis as compared to that with an intact limb. To our knowledge, this is the first study to explore shared control for transhumeral prostheses, and our observations on human motion modeling may inspire future Learning-from-Demonstration (LfD) studies.
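For readers unfamiliar with the formulation, a discrete DMP (standard Ijspeert-style form, consistent with the weight-regression observation above) drives each coordinate \(y\) with

\[ \tau \dot{z} = \alpha_z\big(\beta_z (g - y) - z\big) + f(x), \qquad \tau \dot{y} = z, \qquad \tau \dot{x} = -\alpha_x x, \]
\[ f(x) = \frac{\sum_i \psi_i(x)\, w_i}{\sum_i \psi_i(x)}\, x\,(g - y_0), \qquad \psi_i(x) = \exp\!\big(-h_i (x - c_i)^2\big). \]

The study's central observation is that the forcing-function weights \(w_i\) need not be stored per demonstration: for most basis functions they can be regressed as a simple function \(w_i \approx r_i(g)\) of the Cartesian goal \(g\), so a single regression model synthesizes reach trajectories to arbitrary targets in the user's reach space.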
|
|
WeBT17 |
210A |
Intelligent Transportation Systems 2 |
Regular Session |
|
13:20-13:25, Paper WeBT17.1 | |
DriveGen: Towards Infinite Diverse Traffic Scenarios with Large Models |
|
Zhang, Shenyu | Shanghai Jiao Tong University |
Tian, Jiaguo | Shanghai Jiao Tong University |
Zhu, Zhengbang | Shanghai Jiao Tong University |
Huang, Shan | Changan Automobile Group Co., Ltd |
Yang, JuCheng | Chongqing Changan Technology Co., LTD |
Zhang, Weinan | Shanghai Jiao Tong University |
Keywords: AI-Based Methods, Deep Learning Methods, Intelligent Transportation Systems
Abstract: Microscopic traffic simulation has become an important tool for autonomous driving training and testing. Although recent data-driven approaches advance realistic behavior generation, their learning still relies primarily on a single real-world dataset, which limits their diversity and thereby hinders downstream algorithm optimization. In this paper, we propose DriveGen, a novel traffic simulation framework with large models for more diverse traffic generation that supports further customized designs. DriveGen consists of two internal stages: the initialization stage uses a large language model and retrieval techniques to generate map and vehicle assets; the rollout stage outputs trajectories with waypoint goals selected by a vision-language model and a specifically designed diffusion planner. Through this two-staged process, DriveGen fully utilizes large models' high-level cognition and reasoning about driving behavior, obtaining greater diversity beyond datasets while maintaining high realism. To support effective downstream optimization, we additionally develop DriveGen-CS, an automatic corner-case generation pipeline that uses failures of the driving algorithm as additional prompt knowledge for large models without the need for retraining or fine-tuning. Experiments show that our generated scenarios and corner cases outperform state-of-the-art baselines. Downstream experiments further verify that the synthesized traffic of DriveGen better optimizes the performance of typical driving algorithms, demonstrating the effectiveness of our framework.
|
|
13:25-13:30, Paper WeBT17.2 | |
CoDifFu: Diffusion-Based Collaborative Perception with Efficient Heterogeneous Feature Fusion |
|
Meng, Zeyu | Xi'an Jiaotong University |
Song, Yonghong | Xi'an Jiaotong University |
Zhang, Yuanlin | Xi'an JiaoTong University |
Bai, Zenan | Xi'an Jiaotong University
Duan, Jiayi | Xi'an Jiaotong University
Keywords: Intelligent Transportation Systems, Autonomous Vehicle Navigation, Computer Vision for Transportation
Abstract: Multi-agent collaborative perception is currently experiencing a surge in attention as a novel approach to addressing autonomous driving challenges. Despite advances in previous efforts, challenges remain due to various dilemmas in the perception process, such as imperfect localization and collaboration heterogeneity. To tackle these issues, we propose CoDifFu, a novel diffusion-based collaborative perception framework that enhances robustness against localization uncertainty and improves efficiency in heterogeneous feature fusion. A diffusion-based detection head progressively denoises object centers through a learnable reverse process. During training, the center coordinates of objects diffuse from the ground truth to a Gaussian distribution, and the network learns to reverse the diffusion process. At inference, the model progressively refines a set of random box centers to align with the ground-truth centers. Moreover, we devise a confidence-guided multi-agent communication module (CMC), which uses the confidence map as guidance to effectively achieve complementary fusion of multi-agent features and alleviate collaboration heterogeneity. To thoroughly evaluate CoDifFu, we consider 3D object detection in both real-world and simulation scenarios. Extensive experiments demonstrate the superiority of CoDifFu and the effectiveness of all its vital components. The code will be released.
|
|
13:30-13:35, Paper WeBT17.3 | |
Enhancing Autonomous Driving Safety with Collision Scenario Integration |
|
Wang, Zi | Carnegie Mellon University |
Lan, Shiyi | NVIDIA |
Sun, Xinglong | Stanford & UIUC |
Chang, Nadine | Nvidia |
Li, Zhenxin | Fudan University, NVIDIA |
Yu, Zhiding | NVIDIA |
Alvarez, Jose | NVIDIA |
Keywords: Intelligent Transportation Systems, Motion and Path Planning, Robot Safety
Abstract: Autonomous vehicle safety is crucial for the successful deployment of self-driving cars. However, most existing planning methods rely heavily on imitation learning, which limits their ability to leverage collision data effectively. Moreover, collecting collision or near-collision data is inherently challenging, as it involves risks and raises ethical and practical concerns. In this paper, we propose SafeFusion, a training framework to learn from collision data. Instead of over-relying on imitation learning, SafeFusion integrates safety-oriented metrics during training to enable collision avoidance learning. In addition, to address the scarcity of collision data, we propose CollisionGen, a scalable data generation pipeline to generate diverse, high-quality scenarios using natural language prompts, generative models, and rule-based filtering. Experimental results show that our approach improves planning performance in collision-prone scenarios by 56% over previous state-of-the-art planners while maintaining effectiveness in regular driving situations. Our work provides a scalable and effective solution for advancing the safety of autonomous driving systems.
|
|
13:35-13:40, Paper WeBT17.4 | |
Cross-Level Fusion: Integrating Object Lists with Raw Sensor Data for 3D Object Tracking |
|
Liu, Xiangzhong | Fortiss GmbH, Research Institute of the Free State of Bavaria for Software-Intensive Systems
Wang, Xihao | Technische Universität München |
Shen, Hao | Technische Universität München |
Keywords: Intelligent Transportation Systems, Sensor Fusion, Deep Learning for Visual Perception
Abstract: Smart sensors and Vehicle-to-Everything (V2X) modules are commonly utilized in automotive perception systems, which primarily provide processed object lists rather than raw data. However, high-level fusion approaches suffer from significant information loss and representational misalignment due to the inherently abstract and sparse nature of these high-level outputs. We propose a novel cross-level fusion paradigm that enables bidirectional information flow between object lists and raw vision features within an end-to-end Transformer framework for 3D object detection and tracking. Our approach extracts inherent positional and dimensional cues from object lists to generate two outputs: structured query features that are fused with the initial learnable queries in the Transformer decoder, and soft Gaussian attention masks that guide feature extraction. This integrated mechanism not only improves tracking accuracy by synergistically combining object priors with fine-grained vision data but also promotes hardware economy and AI model sustainability by adapting legacy sensors to evolving sensor setups. To overcome the lack of dedicated datasets, we develop a pseudo object list generation pipeline that simulates realistic sensor tracking behavior. Experiments on the nuScenes dataset demonstrate significant performance gains over vision-only baselines and robust generalization across diverse noise levels, validating the efficacy of our cross-level fusion strategy. The code is available at: https://github.com/CesarLiu/DNF.git.
|
|
13:40-13:45, Paper WeBT17.5 | |
MIAT: Maneuver-Intention-Aware Transformer for Spatio-Temporal Trajectory Prediction |
|
Raskoti, Chandra | University of Tennessee, Knoxville |
Islam, Md Iftekharul | University of Tennessee, Knoxville |
Wang, Xuan | George Mason University |
Li, Weizi | University of Tennessee, Knoxville |
Keywords: Intelligent Transportation Systems, Motion and Path Planning, Autonomous Vehicle Navigation
Abstract: Accurate vehicle trajectory prediction is critical for safe and efficient autonomous driving, especially in mixed traffic environments where both human-driven and autonomous vehicles co-exist. However, uncertainties introduced by inherent driving behaviors, such as acceleration, deceleration, and left and right maneuvers, pose significant challenges for reliable trajectory prediction. We introduce a Maneuver-Intention-Aware Transformer (MIAT) architecture, which integrates a maneuver intention awareness control mechanism with spatiotemporal interaction modeling to enhance long-horizon trajectory predictions. We systematically investigate the impact of varying awareness of maneuver intention on both short- and long-horizon trajectory predictions. Evaluated on the real-world NGSIM dataset and benchmarked against various transformer- and LSTM-based methods, our approach achieves improvements of up to 4.7% in short-horizon predictions and 1.6% in long-horizon predictions compared to other intention-aware benchmark methods. Moreover, by leveraging the intention awareness control mechanism, MIAT realizes an 11.1% performance boost in long-horizon predictions, with a modest drop in short-horizon performance. The source code and datasets are available at https://github.com/cpraskoti/MIAT.
|
|
13:45-13:50, Paper WeBT17.6 | |
CooperRisk: A Driving Risk Quantification Pipeline with Multi-Agent Cooperative Perception and Prediction |
|
Lei, Mingyue | Tongji University |
Zhou, Zewei | University of California, Los Angeles |
Li, Hongchen | Tongji University |
Hu, Jia | Tongji University |
Ma, Jiaqi | University of California, Los Angeles |
Keywords: Intelligent Transportation Systems, Vision-Based Navigation
Abstract: Risk quantification is a critical component of safe autonomous driving; however, it is constrained by the limited perception range and occlusions of single-vehicle systems in complex and dense scenarios. The vehicle-to-everything (V2X) paradigm is a promising solution for sharing complementary perception information; nevertheless, how to ensure risk interpretability while understanding multi-agent interaction with V2X remains an open question. In this paper, we introduce the first V2X-enabled risk quantification pipeline, CooperRisk, to fuse perception information from multiple agents and quantify the scenario driving risk at multiple future timestamps. The risk is represented as a scenario risk map to ensure interpretability based on risk severity and exposure, and the multi-agent interaction is captured by a learning-based cooperative prediction model. We carefully design a risk-oriented transformer-based prediction model with multi-modality and multi-agent considerations. It aims to ensure scene-consistent future behaviors of multiple agents and to avoid conflicting predictions that could lead to overly conservative risk quantification and cause the ego vehicle to become overly hesitant to drive. The temporal risk maps can then guide a model predictive control planner. We evaluate the CooperRisk pipeline on a real-world V2X dataset, V2XPnP, and the experiments demonstrate its superior performance in risk quantification, showing a 44.35% decrease in conflict rate between the ego vehicle and background traffic participants.
|
|
13:50-13:55, Paper WeBT17.7 | |
VTD: Visual and Tactile Dataset for Driver State and Behavior Detection |
|
Wang, Jie | Tongji University, China |
Cai, Mobing | University of Oxford |
Zhu, Zhongpan | University of Shanghai for Science and Technology |
Ding, Hongjun | CATARC Automotive Technology (Shanghai) Co., LTD |
Yi, Jiwei | Tongji University |
Du, Aimin | Tongji University |
Keywords: Intelligent Transportation Systems, Human Factors and Human-in-the-Loop, Human Detection and Tracking
Abstract: In the domain of autonomous vehicles, the human-vehicle co-pilot system has garnered significant research attention. To address the subjective uncertainties in driver state and interaction behaviors, which are pivotal to the safety of Human-in-the-loop co-driving systems, we introduce a novel visual-tactile detection method. Utilizing a driving simulation platform, a comprehensive dataset has been developed that encompasses multi-modal data under fatigue and distraction conditions. The experimental setup integrates driving simulation with signal acquisition, yielding 600 minutes of driver state and behavior data from 15 subjects and 102 takeover experiments with 17 drivers. The dataset, synchronized across modalities, serves as a robust resource for advancing cross-modal driver behavior detection algorithms.
|
|
13:55-14:00, Paper WeBT17.8 | |
The Oxford RobotCycle Project: A Multimodal Urban Cycling Dataset for Assessing the Safety of Vulnerable Road Users (I) |
|
Panagiotaki, Efimia | University of Oxford |
Thuremella, Divya | University of Oxford, Robotics Institute |
Baghabrah, Jumana | University of Oxford |
Sze, Samuel | University of Oxford |
Fu, Lanke Frank Tarimo | University of Oxford |
Hardin, Benjamin | University of Oxford |
Reinmund, Tyler | University of Oxford |
Flatscher, Tobit | Oxford Robotics Institute, Department of Engineering Science, University of Oxford
Marques, Daniel | University of Oxford |
Prahacs, Chris | University of Oxford |
Kunze, Lars | UWE Bristol |
De Martini, Daniele | University of Oxford |
Keywords: Datasets for Human Motion, Intelligent Transportation Systems, Safety in HRI
Abstract: The Oxford RobotCycle Project is a novel initiative aiming to understand how road and traffic infrastructure influence road users’ behavior, affecting cyclists’ journeys and safety. By leveraging state-of-the-art technology and methods used in autonomous vehicles (AVs), this project introduces a novel multimodal dataset, capturing dynamic cycling data in complex and diverse urban traffic environments. The dataset consists of range, visual, and inertial sensors, mounted on a backpack, and eye gaze tracking glasses, coupled with an analysis of road infrastructure and interactions with other road users. Enhanced by annotated maps, reconstructed 3-D point clouds, and a detailed ontology capturing static and dynamic agents and their relations, the dataset provides a comprehensive framework for analyzing and understanding traffic dynamics. Heatmaps derived from the cyclists’ vision reveal attention patterns and focal points during various traffic scenarios. We also analyze traffic interactions and risk, either perceived or actual, and correlate them with road infrastructure and traffic volumes. To complement the dataset, we also provide a complete set of tools for risk and traffic analysis, visualization, automatic calibration, and data annotation. The dataset can also be used to evaluate the robustness of odometry estimation methods, due to the highly dynamic cyclist movements. Combining multimodal data with traffic and risk analysis, the Oxford RobotCycle Project facilitates identifying safety-critical scenarios to derive actionable insights for safer, cyclist-friendly road design. This work contributes toward improving cycling safety, enhancing urban mobility, and supporting sustainable transportation initiatives. The dataset and tools are made available at https://ori-mrg.github.io/robotcycle-dataset/
|
|
WeBT18 |
210B |
Multi-Modular Robot Systems 1 |
Regular Session |
Chair: Guo, Meng | Peking University |
Co-Chair: Lam, Tin Lun | The Chinese University of Hong Kong, Shenzhen |
|
13:20-13:25, Paper WeBT18.1 | |
DEXTER-LLM: Dynamic and Explainable Coordination of Multi-Robot Systems in Unknown Environments Via Large Language Models |
|
Zhu, Yuxiao | Duke Kunshan University |
Chen, Junfeng | Peking University |
Zhang, Xintong | Duke Kunshan University |
Guo, Meng | Peking University |
Li, Zhongkui | Peking University |
Keywords: Multi-Robot Systems, Planning, Scheduling and Coordination, AI-Based Methods
Abstract: Online coordination of multi-robot systems in open and unknown environments faces significant challenges, particularly when semantic features detected during operation dynamically trigger new tasks. Recent large language model (LLM)-based approaches for scene reasoning and planning primarily focus on one-shot, end-to-end solutions in known environments, lacking both dynamic adaptation capabilities for online operation and explainability in the planning process. To address these issues, we propose DEXTER-LLM, a novel framework for dynamic task planning in unknown environments that integrates four modules: (i) a mission comprehension module that resolves the partial ordering of tasks specified in natural language or as linear temporal logic (LTL) formulas; (ii) an online subtask generator based on LLMs that improves the accuracy and explainability of task decomposition via multi-stage reasoning; (iii) an optimal subtask assigner and scheduler that allocates subtasks to robots via search-based optimization; and (iv) a dynamic adaptation and human-in-the-loop verification module that implements multi-rate, event-based updates for both subtasks and their assignments to cope with new features and tasks detected online. The framework effectively combines LLMs' open-world reasoning capabilities with the optimality of model-based assignment methods, simultaneously addressing the critical issues of online adaptability and explainability. Experimental evaluations demonstrate exceptional performance, with 100% success rates across all scenarios, 160 tasks and 480 subtasks completed on average (3 times the baselines), 62% fewer queries to LLMs during adaptation, and superior plan quality (2 times higher) for compound tasks. Project page: https://tcxm.github.io/DEXTER-LLM/
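The abstract couples LLM-generated decompositions with a search-based assigner. As a rough illustration of the assignment step only, the greedy makespan scheduler below stands in for the paper's (unspecified) search procedure; the `cost` oracle, robot list, and subtask list are hypothetical placeholders.

```python
def assign_subtasks(subtasks, robots, cost):
    """Greedy stand-in for a search-based subtask assigner: each subtask
    goes to the robot with the smallest resulting finish time.
    cost(robot, subtask) -> duration is a hypothetical oracle
    (e.g., estimated travel plus execution time)."""
    finish = {r: 0.0 for r in robots}
    plan = {r: [] for r in robots}
    for task in subtasks:
        best = min(robots, key=lambda rb: finish[rb] + cost(rb, task))
        plan[best].append(task)
        finish[best] += cost(best, task)
    return plan, max(finish.values())  # assignment and makespan
```

A true search-based optimizer would explore alternative orderings as well; the greedy rule here only illustrates the interface between decomposition and assignment.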
|
|
13:25-13:30, Paper WeBT18.2 | |
FedEMA: Federated Exponential Moving Averaging with Negative Entropy Regularizer in Autonomous Driving |
|
Kou, Wei-Bin | The University of Hong Kong |
Zhu, Guangxu | Shenzhen Research Institute of Big Data |
Cheng, Bingyang | The University of Hong Kong |
Wang, Shuai | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences |
Tang, Ming | Southern University of Science and Technology |
Wu, Yik-Chung | The University of Hong Kong |
Keywords: Multi-Robot Systems, Deep Learning for Visual Perception, AI-Enabled Robotics
Abstract: Street Scene Semantic Understanding (denoted as S3U) is a crucial but complex task for autonomous driving (AD) vehicles. Their inference models typically face poor generalization due to domain shift. Federated Learning (FL) has emerged as a promising paradigm for enhancing the generalization of AD models through privacy-preserving distributed learning. However, these FL AD models face significant temporal catastrophic forgetting when deployed in dynamically evolving environments, where continuous adaptation causes abrupt erosion of historical knowledge. This paper proposes Federated Exponential Moving Average (FedEMA), a novel framework that addresses this challenge through two integral innovations: (I) preservation of the server-side model's historical fitting capability by fusing the current FL round's aggregated model with the previous FL round's exponential moving average (EMA) model; (II) vehicle-side negative entropy regularization to prevent FL models from overfitting to EMA-introduced temporal patterns. These two strategies endow FedEMA with a dual-objective optimization that balances model generalization and adaptability. Extensive experiments on both the Cityscapes and CamVid datasets demonstrate FedEMA's superiority over existing approaches, showing 7.12% higher mean Intersection-over-Union (mIoU).
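FedEMA's two innovations map naturally onto a few lines of PyTorch. The sketch below shows one plausible reading of the server-side EMA fusion and the vehicle-side negative-entropy regularizer; the decay rate, regularizer weight, and function names are assumptions, not the paper's actual hyperparameters.

```python
import torch
import torch.nn.functional as F

def server_ema_fuse(ema_state, agg_state, decay=0.9):
    """Server side: blend the previous round's EMA model with this round's
    aggregated model to preserve historical fitting capability.
    `decay` is an assumed hyperparameter."""
    return {k: decay * ema_state[k] + (1.0 - decay) * agg_state[k]
            for k in agg_state}

def vehicle_loss(logits, labels, lam=0.1):
    """Vehicle side: task loss plus a negative-entropy penalty that
    discourages over-confident predictions, i.e., overfitting to
    EMA-introduced temporal patterns. `lam` is an assumed weight."""
    probs = F.softmax(logits, dim=1)
    neg_entropy = (probs * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    return F.cross_entropy(logits, labels) + lam * neg_entropy
```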
|
|
13:30-13:35, Paper WeBT18.3 | |
Exploring Spontaneous Social Interaction Swarm Robotics Powered by Large Language Models |
|
Jiang, Yitao | Dartmouth College |
Zhao, Luyang | Dartmouth College |
Quattrini Li, Alberto | Dartmouth College |
Chen, Muhao | University of Kentucky |
Balkcom, Devin | Dartmouth College |
Keywords: Swarm Robotics, AI-Enabled Robotics, Multi-Robot Systems
Abstract: Traditional swarm robots rely on specific communication and planning strategies to coordinate particular tasks. Human swarms exhibit distinctive characteristics due to their capacity for language-based communication and active reasoning. This paper presents an exploratory approach to robotic swarm intelligence that leverages Large Language Models (LLMs) to emulate human-like active problem-solving behaviors. We introduce a decentralized multi-robot system where each robot initially has only its local information and is unaware of the others' existence. The robots utilize LLMs for reasoning and natural language for inter-robot communication, enabling them to discover peers, share information, and coordinate actions dynamically. In a series of experiments in zero-shot settings, we observed human-like social behaviors, including mutual discovery, identification, information exchange, collaboration, negotiation, and error correction. While the technical approach is straightforward, the main contribution lies in exploring the interactive societies that LLM-driven robots form -- a form of robot social dynamics (or robotic social behavior analysis), examining how human-like communication protocols and collaborative structures emerge among robots through language-based interaction. In this context, we use the term "robot anthropomorphism" to describe the interaction patterns that arise within robot collectives, inspired by but distinct from traditional human anthropology.
|
|
13:35-13:40, Paper WeBT18.4 | |
Transformable Modular Robots: A CPG-Based Approach to Independent and Collective Locomotion |
|
Ding, Jiayu | Syracuse University |
Jakkula, Rohit Kumar | Syracuse University |
Xiao, Ruixuan | Syracuse University |
Gan, Zhenyu | Syracuse University |
Keywords: Cellular and Modular Robots, Distributed Robot Systems, Multi-Robot Systems
Abstract: Modular robotics offers a promising approach for developing versatile and adaptive robotic systems capable of autonomous reconfiguration. This paper presents a novel modular robotic system in which each module is equipped with independent actuation, battery power, and control, enabling both individual mobility and coordinated locomotion. The system employs a hierarchical Central Pattern Generator (CPG) framework, where a low-level CPG governs the motion of individual modules, while a high-level CPG facilitates inter-module synchronization, allowing for seamless transitions between independent and collective behaviors. To validate the proposed system, we conduct both simulations in MuJoCo and hardware experiments, evaluating the system’s locomotion capabilities under various configurations. We first assess the fundamental motion of a single module, followed by two-module cooperative and four-module locomotion. The results demonstrate the effectiveness of the CPG-based control framework in achieving robust, flexible, and scalable locomotion. The proposed modular architecture has potential applications in search-and-rescue operations, environmental monitoring, and autonomous exploration, where adaptability and reconfigurability are essential for mission success.
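A hierarchical CPG of the kind described here is commonly built from coupled phase oscillators. The minimal Kuramoto-style sketch below illustrates how a coupling matrix can switch modules between independent and synchronized locomotion; the paper's actual oscillator model and parameters may differ.

```python
import numpy as np

def step_cpg(phases, omega, K, dt=0.01):
    """One Euler step of a Kuramoto-style CPG network. Entry K[i, j] couples
    module i to module j: a block-diagonal K yields independent modules,
    while nonzero inter-module terms act as the high-level CPG that
    synchronizes them for collective locomotion."""
    coupling = (K * np.sin(phases[None, :] - phases[:, None])).sum(axis=1)
    return phases + dt * (omega + coupling)

def joint_targets(phases, amplitude=0.5, offset=0.0):
    """Map oscillator phases to sinusoidal joint position commands."""
    return offset + amplitude * np.sin(phases)

# Example: four modules, weak all-to-all coupling drives synchronization.
phases = np.random.default_rng(0).uniform(0, 2 * np.pi, 4)
omega = np.full(4, 2 * np.pi)                 # 1 Hz intrinsic frequency
K = 0.5 * (np.ones((4, 4)) - np.eye(4))       # inter-module coupling
for _ in range(1000):
    phases = step_cpg(phases, omega, K)
```

Zeroing the off-diagonal blocks of K recovers fully independent single-module gaits, which is one way to realize the paper's transition between individual and collective behaviors.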
|
|
13:40-13:45, Paper WeBT18.5 | |
MODUR: A Modular Dual-Reconfigurable Robot |
|
Gu, Jie | Fudan University |
Lam, Tin Lun | The Chinese University of Hong Kong, Shenzhen |
Tian, Chunxu | Fudan University |
Xia, Zhihao | Fudan University |
Xing, Yongheng | Fudan University |
Zhang, Dan | The Hong Kong Polytechnic University |
Keywords: Cellular and Modular Robots, Multi-Robot Systems, Parallel Robots
Abstract: Modular Self-Reconfigurable Robot (MSRR) systems are a class of robots capable of forming higher-level robotic systems by altering the topological relationships between modules, offering enhanced adaptability and robustness in various environments. This paper presents a novel MSRR called MODUR, featuring dual-level reconfiguration capabilities designed to integrate reconfigurable mechanisms into MSRR. Specifically, MODUR can perform high-level self-reconfiguration among modules to create different configurations, while each module is also able to change its shape to execute basic motions. The design of MODUR primarily includes a compact connector and scissor linkage groups that provide actuation, forming a parallel mechanism capable of achieving both connector motion decoupling and adjacent position migration capabilities. Furthermore, the workspace, considering the interdependent connectors, is comprehensively analyzed, laying a theoretical foundation for the design of the module’s basic motion. Finally, the motion of MODUR is validated through a series of experiments.
|
|
13:45-13:50, Paper WeBT18.6 | |
Topology-Driven Trajectory Optimization for Modelling Controllable Interactions within Multi-Vehicle Scenario |
|
Ma, Changjia | Fudan University |
Zhao, Yi | Fudan University |
Gan, Zhongxue | Fudan University |
Gao, Bingzhao | Tongji University |
Ding, Wenchao | Fudan University |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Multi-Robot Systems
Abstract: Trajectory optimization in multi-vehicle scenarios faces challenges due to its non-linear, non-convex properties and sensitivity to initial values, making interactions between vehicles difficult to control. In this paper, inspired by topological planning, we propose a differentiable local homotopy invariant metric to model the interactions. By incorporating this topological metric as a constraint into multi-vehicle trajectory optimization, our framework is capable of generating multiple interactive trajectories from the same initial values, achieving controllable interactions as well as supporting user-designed interaction patterns. Extensive experiments demonstrate its superior optimality and efficiency over existing methods. We will release open-source code to advance related research.
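One differentiable quantity often used to pin down the homotopy class of a pairwise interaction is the accumulated winding angle of the relative position vector. The form below is illustrative only; the paper's local homotopy invariant metric may be defined differently.

```latex
% Illustrative winding-angle quantity for vehicles i and j (an assumption,
% not the paper's exact metric):
h_{ij} = \int_{t_0}^{t_f} \dot{\theta}_{ij}(t)\, dt, \qquad
\theta_{ij}(t) = \operatorname{atan2}\big(y_i(t) - y_j(t),\; x_i(t) - x_j(t)\big)
```

Constraining the sign or value of h_{ij} then selects which side vehicles i and j pass each other on, which is one way to make interaction patterns controllable inside a gradient-based optimizer.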
|
|
13:50-13:55, Paper WeBT18.7 | |
Prescribed-Time Robust Synchronization of Networked Heterogeneous Euler-Lagrange Systems (I) |
|
Zuo, Gewei | Huazhong University of Science and Technology |
Xu, Yaohang | Huazhong University of Science and Technology |
Li, Mengmou | Hiroshima University |
Zhu, Lijun | Huazhong University of Science and Technology |
Ding, Han | Huazhong University of Science and Technology |
Keywords: Multi-Robot Systems, Networked Robots, Distributed Robot Systems
Abstract: In this paper, we propose a prescribed-time synchronization (PTS) algorithm for networked Euler-Lagrange systems subjected to external disturbances. Notably, the system matrix and the state of the leader agent are not accessible to all agents. The algorithm consists of distributed prescribed-time observers and local prescribed-time tracking controllers, dividing the PTS problem into prescribed-time convergence of distributed estimation errors and local tracking errors. Unlike most existing prescribed-time control methods, which achieve prescribed-time convergence by introducing specific time-varying gains and adjusting feedback values, we establish a class of K_T functions and incorporate them into comparison functions to represent time-varying gains. By analyzing the properties of class-K_T and comparison functions, we ensure the prescribed-time convergence of distributed estimation errors and local tracking errors, as well as the uniform boundedness of internal signals in the closed-loop systems. External disturbances are handled and dominated by the time-varying gains that tend to infinity as time approaches the prescribed time, while the control signal is still guaranteed to be bounded. Finally, a numerical example and a practical experiment demonstrate the effectiveness and innovation of the algorithm.
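For intuition, a canonical prescribed-time gain, of the kind this paper generalizes via class-K_T comparison functions, is the following (this specific form is an assumption, not the paper's construction):

```latex
\mu(t) = \left(\frac{T}{T - t}\right)^{p}, \qquad t \in [0, T), \quad p \geq 1
```

Since \mu(t) grows without bound as t approaches T from below, bounded disturbances can be dominated within the prescribed time T; the analysis must then show, as this paper does, that the resulting control signal nonetheless remains bounded.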
|
|
13:55-14:00, Paper WeBT18.8 | |
Control of Multiple Identical Mobile Microrobots for Collaborative Tasks Using External Distributed Magnetic Fields (I) |
|
Cui, Guangming | Tsinghua University |
Qu, Juntian | Tsinghua University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Distributed Robot Systems, Multi-Robot Systems
Abstract: The collaboration of microrobot teams has attracted considerable attention, particularly in the field of micro/nano manipulation. Achieving independent control and motion planning of multiple magnetic microrobots for coordinated movements remains one of the most important unsolved tasks. In this paper, a 12 × 12 coil array system is developed to generate a series of localized magnetic fields that enable independent control of multiple identical magnetic microrobots, allowing teams of microrobots to collaborate in parallel for micromanipulation tasks. The structure of the microcoil is optimized on the basis of the finite element model to increase the strength and gradient of the magnetic field, which in turn enhances the driving performance of the system. By incorporating the Conflict-Based Search (CBS) algorithm, collaborative planning is also achieved for multiple magnetic microrobots. The developed system is tested with extensive physical experiments, and the results demonstrate the effectiveness of the devised system and the proposed methods.
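Conflict-Based Search resolves inter-robot collisions by branching on constraints rather than searching the joint space directly. The skeleton below shows the high-level loop under the assumption of a grid-like workspace; `low_level_search` is a placeholder for a single-robot A*-style planner that respects the accumulated constraints.

```python
import heapq

def first_conflict(paths):
    """Return the first (robot_a, robot_b, cell, t) vertex conflict, or None."""
    horizon = max(len(p) for p in paths)
    for t in range(horizon):
        occupied = {}
        for i, path in enumerate(paths):
            cell = path[min(t, len(path) - 1)]  # robots wait at their goals
            if cell in occupied:
                return occupied[cell], i, cell, t
            occupied[cell] = i
    return None

def cbs(num_robots, low_level_search):
    """High-level CBS loop. low_level_search(robot, constraints) is an
    assumed A*-style planner returning a path (list of cells) that avoids
    every (robot, cell, t) constraint, or None if infeasible."""
    paths = [low_level_search(i, []) for i in range(num_robots)]
    open_list = [(sum(len(p) for p in paths), [], paths)]
    while open_list:
        _, constraints, paths = heapq.heappop(open_list)
        conflict = first_conflict(paths)
        if conflict is None:
            return paths  # conflict-free joint plan
        a, b, cell, t = conflict
        for robot in (a, b):  # branch: forbid one robot from the cell at t
            child = constraints + [(robot, cell, t)]
            new_paths = list(paths)
            new_paths[robot] = low_level_search(robot, child)
            if new_paths[robot] is not None:
                cost = sum(len(p) for p in new_paths)
                heapq.heappush(open_list, (cost, child, new_paths))
    return None  # no conflict-free solution found
```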
|
|
WeBT19 |
210C |
Biologically-Inspired Robots 2 |
Regular Session |
Chair: Manoonpong, Poramate | Vidyasirimedhi Institute of Science and Technology (VISTEC) |
|
13:20-13:25, Paper WeBT19.1 | |
Design, Manufacturing, and Experiments of an Origami-Based Parallel-Legged Structure for Insect-Scale Robots |
|
Zhu, Qunwei | National University of Defense Technology |
Xia, Minghai | National University of Defense Technology |
Jiang, Tao | National University of Defense Technology |
Lu, Zhongyue | National University of Defense Technology |
Zhu, Yiming | The University of Manchester |
Luo, Zirong | National University of Defense Technology |
Keywords: Additive Manufacturing, Micro/Nano Robots, Biologically-Inspired Robots
Abstract: To address the challenges associated with complex manufacturing processes and the difficulty of batch-producing insect-scale robots, a mechatronic origami mechanism applied to an insect-scale parallel-legged structure is designed, manufactured, and tested. The origami mechanism is constructed using a multilayer composite laminate, which allows for the integrated fabrication of robotic hinges, linkages, and actuators. Utilizing the origami mechanism, it becomes feasible to fold and create the insect-scale parallel-legged structure. This enables the rapid assembly of various types of insect-scale robots, including monopod, bipedal, quadrupedal, and hexapod robots. We built the experimental prototype and test environments to validate the kinematic performance of the insect-scale parallel-legged structure. The monopod robot, weighing 200 mg and featuring the parallel leg, possesses the ability to rotate around the center of an adaptive rotating platform at a speed of 5 cm/s. The bipedal robot demonstrates the ability to navigate the rotating platform by performing alternating leg swings. The quadrupedal robot, designed with four parallel-legged structures, exhibited a movement speed of 1.9 cm/s when actuated at a frequency of 20 Hz. In contrast, the hexapod robot achieved a superior speed of 3.25 cm/s under the same actuation frequency of 20 Hz. The origami mechanism and the insect-scale parallel-legged structure provide a new method for the design and fabrication of insect-scale robots.
|
|
13:25-13:30, Paper WeBT19.2 | |
Adaptive Morphing and Environmental-Phase-Transition Enables Effective Locomotion Inside Granular Media |
|
Wang, Yiliang | School of Mechanical Engineering, Tiangong University |
Xiao, Xuan | Tiangong University |
He, Shuqian | Beijing University of Chemical Technology |
Han, Yanxiang | Tiangong University |
Kang, Shuai | Beijing University of Chemical Technology |
Asano, Fumihiko | Japan Advanced Institute of Science and Technology |
Tokuda, Isao T. | Ritsumeikan University |
Li, Longchuan | Beijing University of Chemical Technology |
Keywords: Biologically-Inspired Robots, Biomimetics, Field Robots
Abstract: This study introduces a novel burrowing robot that achieves effective locomotion inside granular media through the synergistic integration of high-frequency vibration-induced environmental phase transition (EPT) and adaptive morphing. The robotic system employs three key innovations: 1) an asymmetric arm trajectory mechanism generating directional propulsion, 2) a vibration-mediated granular fluidization system reducing environmental resistance, and 3) passively adaptive claws demonstrating phase-dependent configuration changes. Experimental results demonstrate that the synchronization of morphologically adaptive claws and high-frequency vibration significantly improves locomotion performance. Additionally, numerical simulations based on Adams-EDEM coupling provide deeper insights into the interaction mechanisms between the robot and granular media. This work advances fundamental understanding of terradynamic locomotion by demonstrating environmental modification as a viable strategy for resistance reduction, while providing a bio-inspired framework for developing versatile robotic systems capable of navigating complex particulate environments.
|
|
13:30-13:35, Paper WeBT19.3 | |
RoboNotonecta: A Backswimmer-Inspired Swimming Miniature Robot with Efficient Low-Power Propulsion and Agile Aquatic Maneuverability |
|
Wu, Chaofeng | Shanghai Jiao Tong University |
Zhao, Jiaxin | Xiamen University of Technology |
Zhang, Yichen | Shanghai Jiao Tong University |
Guo, Qingcheng | Shanghai Jiao Tong University |
Cui, Feng | Shanghai Jiao Tong University |
Wu, Xiaosheng | Shanghai Jiao Tong University |
Liu, Wu | Shanghai Jiao Tong University |
Keywords: Biologically-Inspired Robots, Biomimetics, Micro/Nano Robots
Abstract: In this letter, we present the design, manufacturing, and performance test of a centimeter-scale two-legged swimming robot, RoboNotonecta, inspired by the efficient and agile locomotion of the backswimmer (Notonectidae), an aquatic insect. The robot utilizes a crank-slider and slider-rocker paddling mechanism for propulsion and steering. Its swimming legs are made from Smart Composite Microstructures (SCM) and feature an asymmetric anti-bending stiffness design. This design allows the legs to bend during the recovery stroke and expand during the power stroke, enabling efficient thrust generation. RoboNotonecta has a total mass of 10.3 g and a body length (BL) of 6.4 cm. It can swim at speeds up to 16.2 cm/s (2.5 BL/s), with a minimum turning radius of 4.7 cm (0.73 BL). Equipped with an onboard battery, it can move freely on the water surface while avoiding obstacles. These capabilities demonstrate its potential for environmental monitoring and exploration applications.
|
|
13:35-13:40, Paper WeBT19.4 | |
A Mole-Inspired Incisor-Burrowing Robotic Platform for Planetary Exploration |
|
Xu, Ran | Beihang University |
Liu, Jiabin | Guangdong University of Technology |
Liang, ZhaoFeng | Guangdong University of Technology |
Zheng, Hongmin | Guangdong University of Technology |
Zheng, Kunquan | Guangdong University of Technology |
Chen, Zibiao | Guangdong University of Technology |
Chen, Jiawei | Beihang University |
Zhang, Tao | Beihang University |
Xu, Kun | Beihang University |
Ding, Xilun | Beijing University of Aeronautics & Astronautics (BUAA) |
Keywords: Biologically-Inspired Robots, Space Robotics and Automation, Legged Robots
Abstract: Planetary exploration requires efficient methods for subsurface sampling, especially under extreme energy limitations. Traditional drilling methods are often energy-intensive and require large platforms, limiting their applicability. Bio-inspired burrowing techniques, inspired by animals like moles, offer lightweight, low-power alternatives suitable for small robotic platforms. This paper presents a novel bio-inspired robotic platform, the Mole-like Incisor-Burrowing Robotic Platform (MIRP), designed to mimic the incisor-burrowing behavior of naked mole rats. The MIRP features an 11-DOF mechanism with a compact design (220 mm × 140 mm × 80 mm) and uses servomotors to achieve low energy consumption. The robot combines a quadrupedal locomotion mechanism with an incisor-burrowing mechanism, allowing it to navigate granular terrains and perform excavation tasks. Kinematic analysis, including inverse kinematics and closed-chain analysis, was conducted to optimize the robot's motion strategy. A prototype was developed and tested in a simulated lunar regolith environment to evaluate its maneuverability and burrowing performance. The power consumption of the prototype is below 10 W. This work validates the feasibility of bio-inspired incisor-burrowing for planetary exploration, offering a cost-effective and efficient solution for future extraterrestrial missions.
|
|
13:40-13:45, Paper WeBT19.5 | |
Development of a Hard Matter Crushing Peristaltic Bioreactor Inspired by an Avian Gizzard Structure for Fermentation Acceleration |
|
Kikyodani, Kentaro | Chuo University |
Enomoto, Yuki | Chuo University |
Uchino, Masataka | Tokyo University of Agriculture |
Nomura, Kaho | Tokyo University of Agriculture |
Nishihama, Rie | Chuo University |
Nakamura, Taro | Chuo University |
Keywords: Biologically-Inspired Robots, Biomimetics, Human-Centered Automation
Abstract: In this work, we developed a peristaltic bioreactor with an enhanced crushing capability, inspired by the structure of the avian gizzard. Existing peristaltic bioreactors have limited ability to crush boluses, which makes the fermentation of substances such as agar gel time-consuming. To improve crushing capacity, we focused on the avian gizzard. Birds utilize pebbles in their gizzards to aid in food crushing. Our approach replicates this mechanism by incorporating both fixed and freely movable spherical solids, which are compressed during operation, inside the bioreactor. An agar gel crushing experiment demonstrated improved crushing efficiency. Furthermore, in a mixed fermentation experiment using milk agar gel and yogurt, the pH value improved by 51.4% compared with that observed using a conventional device, indicating an increase in lactic acid bacteria. These results confirm that the proposed method effectively enhances fermentation.
|
|
13:45-13:50, Paper WeBT19.6 | |
Micro-UAV with Ant-Inspired Bistable Gripper for Adaptive Perching and Wildlife Detection |
|
Liu, Yuan | Beijing University of Posts and Telecommunications |
Mo, Yadong | Beijing University of Posts and Telecommunications |
Liang, Xuexiu | China Software Testing Center (MIIT Software and Integrated Circ |
Jiang, Yongkang | Tongji University |
Li, Jian | Beihang University & National Research Center for Rehabilitation |
Wei, Shimin | Beijing University of Posts and Telecommunications |
Keywords: Biologically-Inspired Robots, Aerial Systems: Applications, Grippers and Other End-Effectors
Abstract: With the global ecological environment facing continuous deterioration, effective monitoring of arboreal birds in complex canopy environments remains challenging due to limitations of conventional drones in endurance, size, and habitat disturbance. To address these challenges, this paper presents an ant-inspired micro quadrotor UAV equipped with a lightweight bistable gripper system mimicking the mandibular morphology of leafcutter ants. The design integrates shape memory alloy (SMA)-driven actuation and thermoplastic polyurethane (TPU)-based adaptive grippers, enabling rapid deformation (71 ms switching time) and energy-efficient operation (zero power consumption during perching). Experimental results demonstrate exceptional adaptability in grasping irregular objects (e.g., branches, pen caps) with an 8:1 payload-to-weight ratio. Field tests confirm stable navigation through dense foliage and reliable perching at heights exceeding 5 meters. The system’s compact dimensions (7 cm diameter, 70.5 g weight) and biomimetic approach offer a non-invasive solution for prolonged wildlife observation. This work advances bistable actuator design by combining bio-inspired structural optimization with rapid energy transition principles, showing potential in agile robotics and environmental sensing.
|
|
13:50-13:55, Paper WeBT19.7 | |
Performance Consequences of Information-Based Centralization Arising from Neural and Mechanical Coupling in a Walking Robot |
|
Liu, Ellen | Georgia Institute of Technology |
Asawalertsak, Naris | Vidyasirimedhi Institute of Science and Technology (VISTEC) |
Sponberg, Simon N. | Georgia Institute of Technology |
Manoonpong, Poramate | Vidyasirimedhi Institute of Science and Technology (VISTEC) |
Keywords: Biologically-Inspired Robots, Legged Robots, Performance Evaluation and Benchmarking
Abstract: Legged animals still outperform many terrestrial robots due to the complex interplay of various component subsystems. Centralization is a potential integrated design axis to help improve the performance of legged robots in variable terrain environments. Centralization arises from the coupling of multiple limbs and joints through mechanics or feedback control. Strong couplings contribute to a whole-body coordinated response (centralized) and weak couplings result in localized responses (decentralized). Rarely are both mechanical and neural couplings considered together in designing centralization. In this study, we use an empirical information theory-based approach to evaluate the emergent centralization of a hexapod robot. We independently vary the mechanical and neural coupling through adjustable joint stiffness and variable coupling of leg controllers, respectively. We found an increase in centralization as neural coupling increased. Changes in mechanical coupling did not significantly affect centralization during walking, but did change the total information processing of the neuromechanical control architecture. Information-based centralization increased with robotic performance in terms of cost of transport and speed, implying that this may be a useful metric in robotic design.
|
|
WeBT20 |
210D |
Grasping & Manipulation 2 |
Regular Session |
Co-Chair: Kheddar, Abderrahmane | CNRS-AIST |
|
13:20-13:25, Paper WeBT20.1 | |
Fusion-Perception-To-Action Transformer: Enhancing Robotic Manipulation with 3D Visual Fusion Attention and Proprioception |
|
Liu, Yangjun | University of Macau |
Liu, Sheng | Southern University of Science and Technology |
Chen, Binghan | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences |
Yang, Zhi-Xin | University of Macau |
Xu, Sheng | Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences |
Keywords: Manipulation Planning, Learning from Demonstration, Perception for Grasping and Manipulation, Bio-inspired Robot Learning
Abstract: Most prior robot learning methods focus on image-based observations, limiting their capability in 3D robotic manipulation. Voxel representation naturally delivers rich spatial features but remains underutilized. This paper proposes a novel Fusion-Perception-to-Action Transformer (FP2AT) with cross-layer feature aggregation to handle fine-grained manipulation in 3D space. In particular, a multi-scale 3D visual fusion attention mechanism is devised to draw attention to local regions of interest and maintain awareness of global scenes, thereby boosting the capabilities of visual perception and action planning. Meanwhile, a 3D visual mutual attention mechanism is designed and it can also enhance spatial perception. Besides, we further explore the potential of FP2AT by developing its coarse-to-fine version, which progressively refines the action space for more precise predictions. Furthermore, a new metric, the average number of key actions (ANKA), is introduced to evaluate efficiency and planning capability. In various simulated and real-robot examples, our methods significantly outperform state-of-the-art 3D-vision-based methods in success rate and ANKA metrics.
|
|
13:25-13:30, Paper WeBT20.2 | |
Multi-Critic Reinforcement Learning for Garment Handling: Addressing Unpredictability in Temporal-Phase Continuous Contact Tasks (I) |
|
Zhang, Yukuan | Tohoku University |
Chen, Dayuan | Tohoku University |
He, Weizan | Tohoku University |
Petrilli Barceló, Alberto Elías | Tohoku University |
Salazar Luces, Jose Victorio | Tohoku University |
Hirata, Yasuhisa | Tohoku University |
Keywords: Manipulation Planning, Reinforcement Learning, Simulation and Animation
Abstract: This research unveils a novel Multi-Critic Reinforcement Learning framework designed to navigate the multifaceted challenges associated with multi-phased garment handling tasks, notably marked by persistent contact and erratic deformations between textiles and solid bodies. These tasks, ubiquitous in domestic and industrial environments, encompass activities such as dressing, fabric printing, and pressing, and are complicated by the unpredictability of textile states and the intricacy of devising control strategies. Our reinforcement learning model combines multiple time-sequenced Critic networks with traditional Deep Deterministic Policy Gradient (DDPG) techniques, thereby equipping the system to adapt to the diverse effects of fabric distortions throughout various stages. The effectiveness of this approach is demonstrated through a multi-phase pre-printing operation and further validated by real-world implementations, showing significant improvements in coverage and a substantial reduction in wrinkle formation, with its versatility further confirmed by a complex vertical dressing task. We anticipate future applications of this framework in a range of complex problems, not just garment handling. The model used in this paper can be found at https://github.com/jkk5454/multiddpg.git.
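The core idea, selecting a time-sequenced critic per task phase inside an otherwise standard DDPG update, can be sketched compactly. Function names, the phase index, and the omission of target networks are simplifications, not details taken from the paper.

```python
import torch

def multicritic_ddpg_losses(actor, critics, batch, phase_id, gamma=0.99):
    """One DDPG-style update using the critic assigned to the current task
    phase. Simplifications vs. standard DDPG: target networks are omitted
    and phase_id is assumed to come from an external phase detector."""
    state, action, reward, next_state = batch
    critic = critics[phase_id]  # time-sequenced critic for this phase
    with torch.no_grad():
        q_target = reward + gamma * critic(next_state, actor(next_state))
    critic_loss = torch.mean((critic(state, action) - q_target) ** 2)
    actor_loss = -critic(state, actor(state)).mean()  # deterministic PG
    return critic_loss, actor_loss
```

Keeping one critic per phase lets each value function specialize in the fabric dynamics of its stage, which is the property the abstract credits for handling erratic deformations across phases.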
|
|
13:30-13:35, Paper WeBT20.3 | |
Minimal Impact Pokes to Place Objects on Planar Surfaces |
|
Zermane, Ahmed | CNRS-LIRMM |
Moussafir, Leo | CNRS-UM, LIRMM |
Yan, Youcan | The French National Center for Scientific Research (CNRS) |
Kheddar, Abderrahmane | CNRS-AIST |
Keywords: Manipulation Planning, Constrained Motion Planning, Contact Modeling
Abstract: We present a planning and control method that computes a minimal sequence of pokes to slide a given object from an initial pose to a desired final one (or as close to it as possible) on a planar surface. Both planning and control are based on impact models to generate pokes. Our framework takes into account the object's dynamics with a rich contact model and parameters to plan the poking sequence. The planning is conducted in the joint space and generates trajectories tracked using an impact-aware QP controller, which corrects for post-poke errors using discrete visual feedback. We implemented our method on a Panda robot arm and assessed its versatility and robustness. The experimental results show that the proposed poking approach can bring the object to the desired position and orientation with minimal errors (0.05 m for translation and 0.2 rad for rotation), highlighting its potential application in diverse industrial scenarios such as logistics.
|
|
13:35-13:40, Paper WeBT20.4 | |
Planning for Quasi-Static Manipulation Tasks Via an Intrinsic Haptic Metric: A Book Insertion Case Study |
|
Yang, Lin | Nanyang Technological University |
Turlapati, Sri Harsha | Nanyang Technological University |
Lv, Chen | Nanyang Technological University |
Campolo, Domenico | Nanyang Technological University |
Keywords: Manipulation Planning, Contact Modeling
Abstract: Contact-rich manipulation often requires strategic interactions with objects, such as pushing, to accomplish specific tasks. We propose a novel scenario in which a robot inserts a book into a crowded shelf by pushing aside neighboring books to create space before slotting the new book into place. Classical planning algorithms fail in this context due to limited space and their tendency to avoid contact. Additionally, they do not handle indirectly manipulable objects or consider force interactions. Our key contributions are: i) re-framing quasi-static manipulation as a planning problem on an implicit manifold derived from equilibrium conditions; ii) utilizing an intrinsic haptic metric instead of ad-hoc cost functions; and iii) proposing an adaptive algorithm that simultaneously updates robot states, object positions, contact points, and haptic distances. We evaluate our method on the crowded bookshelf insertion task, but the formulation generalizes to rigid-body manipulation tasks. We propose proxies to capture contact points and forces, with superellipses representing objects; this simplified model guarantees differentiability. Our framework autonomously discovers strategic wedging-in policies, and our simplified contact model achieves behavior similar to real-world scenarios. We also vary the stiffness and initial positions to analyze our framework comprehensively. The video can be found at https://youtu.be/eab8umZ3AQ0.
|
|
13:40-13:45, Paper WeBT20.5 | |
Beyond Anthropomorphism: Enhancing Grasping and Eliminating a Degree of Freedom by Fusing the Abduction of Digits Four and Five |
|
Fritsch, Simon | ETH Zurich |
Achenbach, Liam | ETH Zurich |
Bianco, Riccardo | ETH Zurich |
Irmiger, Nicola | ETH Zurich |
Marti, Gawain | ETH Zurich |
Visca, Samuel Maximilian | ETH Zürich |
Yang, Chenyu | ETH Zurich |
Liconti, Davide | ETH Zurich |
Cangan, Barnabas Gavin | ETH Zurich |
Malate, Robert Jomar | ETH Zurich |
Hinchet, Ronan | ETH Zurich |
Katzschmann, Robert Kevin | ETH Zurich |
Keywords: Multifingered Hands, Mechanism Design, Product Design, Development and Prototyping
Abstract: This paper presents the SABD hand, a 16-degree-of-freedom (DoF) robotic hand that departs from purely anthropomorphic designs to achieve an expanded grasp envelope, enable manipulation poses beyond human capability, and reduce the required number of actuators. This is achieved by combining the adduction/abduction (Add/Abd) joint of digits four and five into a single joint with a large range of motion. The combined joint increases the workspace of the digits by 400% and reduces the required DoFs while retaining dexterity. Experimental results demonstrate that the combined Add/Abd joint enables the hand to grasp objects with a side distance of up to 200 mm. Reinforcement learning-based investigations show that the design enables grasping policies that are effective not only for handling larger objects but also for achieving enhanced grasp stability. In teleoperated trials, the hand successfully performed 86% of attempted grasps on suitable YCB objects, including challenging non-anthropomorphic configurations. These findings validate the design’s ability to enhance grasp stability, flexibility, and dexterous manipulation without added complexity, making it well-suited for a wide range of applications.
|
|
13:45-13:50, Paper WeBT20.6 | |
Functional Eigen-Grasping Using Approach Heatmaps |
|
Aburub, Malek | Osaka University |
Higashi, Kazuki | Osaka University |
Wan, Weiwei | Osaka University |
Harada, Kensuke | Osaka University |
Keywords: Multifingered Hands, Grasping
Abstract: This work presents a framework for a robot with a multi-fingered hand to freely utilize daily tools, including functional parts like buttons and triggers. An approach heatmap is generated by selecting a functional finger, indicating optimal palm positions on the object's surface that enable the functional finger to contact the tool's functional part. Once the palm position is identified through the heatmap, achieving the functional grasp becomes a straightforward process where the fingers stably grasp the object with low-dimensional inputs using the eigengrasp. As our approach does not need human demonstrations, it can easily adapt to various sizes and designs, extending its applicability to different objects. In our approach, we use directional manipulability to obtain the approach heatmap. In addition, we add two kinds of energy functions, i.e., palm energy and functional energy functions, to realize the eigengrasp. Using this method, each robotic gripper can autonomously identify its optimal workspace for functional grasping, extending its applicability to non-anthropomorphic robotic hands. We show that several daily tools, such as sprays, drills, and remote controls, can be efficiently used by not only an anthropomorphic Shadow hand but also a non-anthropomorphic Barrett hand.
|
|
13:50-13:55, Paper WeBT20.7 | |
Robotic Haptic Exploration of Object Shape with Autonomous Symmetry Detection |
|
Bonzini, Aramis | Queen Mary University of London |
Seminara, Lucia | University of Genova-DITEN |
Macciò, Simone | University of Genoa |
Carfì, Alessandro | University of Genoa |
Jamone, Lorenzo | University College London |
Keywords: Shape Exploration, Perception for Grasping and Manipulation, Force and Tactile Sensing, Probability and Statistical Methods
Abstract: Haptic robotic exploration aims to control the movements of a robot with the objective of touching an object and retrieving physical information about it. In this work, we present an innovative exploration strategy to simultaneously detect symmetries in a 3D object and use this information to enhance shape estimation. This is achieved by leveraging a novel formulation of Gaussian Process models that allows the modeling of symmetric surfaces. Our procedure does not assume any prior knowledge about the object, neither about its shape nor about the presence and type of symmetry, necessitating only an approximate estimate of the size and boundaries (bounding box). We report experimental results both in simulation and in the real world, showing that using symmetric models leads to a reduction in shape estimation error, exploration time, and in the number of physical contacts performed by a robot when exploring objects that have symmetries.
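The symmetric Gaussian Process surface model can be obtained by averaging a base kernel over a candidate symmetry map. The sketch below uses the standard group-averaging construction for an involutive isometry S (e.g., a reflection); the paper's exact formulation may differ.

```python
import numpy as np

def rbf(x, y, ell=0.1):
    """Squared-exponential base kernel; x is (n, d), y is (m, d)."""
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    return np.exp(-0.5 * (d / ell) ** 2)

def symmetric_kernel(x, y, S, ell=0.1):
    """Averaging the base kernel over the group {I, S} makes GP sample
    functions satisfy f(p) = f(Sp), so touching one side of a symmetric
    object also informs the estimate of the mirrored side."""
    return 0.5 * (rbf(x, y, ell) + rbf(x, y @ S.T, ell))

# Example: reflection about the y-z plane as the candidate symmetry map.
S = np.diag([-1.0, 1.0, 1.0])
```

Comparing the evidence of symmetric and non-symmetric models on the same contact data is one natural way to detect whether, and which, symmetry is present.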
|
|
WeBT21 |
101 |
Force and Tactile Sensing 2 |
Regular Session |
Chair: Jiang, Xin | Harbin Institute of Technology, Shenzhen |
|
13:20-13:25, Paper WeBT21.1 | |
Vision-Based Tactile Sensor Using Light-Conductive Plate for Enhanced Force Sensing Capability |
|
Liu, Zhitong | Harbin Institute of Technology, Shenzhen |
Liao, Wenxi | Harbin Institute of Technology |
Jiang, Xin | Harbin Institute of Technology, Shenzhen |
Keywords: Force and Tactile Sensing
Abstract: In recent years, tactile sensors have become essential for robotic systems, particularly in tasks requiring high-precision interaction and manipulation. The Vision-Based Tactile Sensor (VBTS) represents a significant advancement in tactile sensing, utilizing cameras to monitor the deformation of soft materials at the sensor tip. Pressure applied to the sensor alters the light propagation path, thereby changing the image captured by the camera. By combining image processing and deep learning, VBTS provides highly accurate estimates of contact position and force, achieving micrometer-level resolution. This paper presents a novel VBTS design that leverages a light-conductive plate and a silicone membrane to enhance the sensor's sensitivity to force perception. The soft, thin nature of the silicone membrane allows for precise detection of minimal forces, making it suitable for tasks involving highly deformable objects. Experimental results demonstrate the sensor's capability in detecting contact areas and force distributions, which can be applied in diverse domains such as soft object assembly, medical assistance, and food processing. Moreover, the proposed VBTS outperforms traditional sensors by utilizing computationally efficient algorithms that maintain real-time performance without compromising resolution.
|
|
13:25-13:30, Paper WeBT21.2 | |
Robotic Hand Tool Use with Contact-Based Demonstration: The Case of Cucumber Peeling |
|
Xie, Lingzi | South China University of Technology |
Wang, Shuai | Tencent |
Chen, Jingxiang | South China University of Technology |
Huang, Bidan | Tencent |
Zhang, Yi | South China University of Technology |
Yang, Sicheng | Tencent |
Chen, Yuyuan | South China University of Technology |
Lee, Wang Wei | Tencent |
Yang, Jialong | South China University of Technology; Peng Cheng Laboratory |
Liu, Tianliang | Harbin Institute of Technology |
Zheng, Yu | Tencent |
Yang, Chenguang | University of Liverpool |
Keywords: Force and Tactile Sensing, Dexterous Manipulation, Grippers and Other End-Effectors
Abstract: Robotic hand tool use has garnered significant attention from robotics researchers, because it enhances dexterity beyond the limitations imposed by manipulators with fixed tool configurations and human-involved manual tool changes. Despite extensive research, current methodologies predominantly focus on imitating human hand trajectories, often neglecting the pivotal role of tool-environment interaction. This study addresses this gap by exploring the task of cucumber peeling as a case study to implement contact-based demonstration strategies in robotic tool use. Our approach concentrates on the subtle tool contact behaviors that manifest through contact dynamics. Specifically, we select appropriate tool stiffness for the peeling tasks, which is captured via a handheld teaching device equipped with optical tactile sensors. Subsequently, object-level stiffness control strategies are employed to emulate these behaviors using a three-fingered robotic hand. Experimental results from real-world cucumber peeling trials substantiate our methodology, illustrating that the robotic hand can adjust contact through finger movements, thereby achieving human-like peeling efficiency without necessitating alterations to the tool structure. This study not only demonstrates the feasibility of sophisticated tool use by robotic hands, but also highlights the critical importance of integrating tactile feedback to refine interaction with the environment.
|
|
13:30-13:35, Paper WeBT21.3 | |
Bio-Skin: A Cost-Effective Thermostatic Tactile Sensor with Multi-Modal Force and Temperature Detection |
|
Guo, Haoran | Oklahoma State University |
Wang, Haoyang | Oklahoma State University |
Li, Zhengxiong | University of Colorado Denver |
Tao, Lingfeng | Kennesaw State University |
Keywords: Force and Tactile Sensing, Multi-Modal Perception for HRI, Soft Sensors and Actuators
Abstract: Tactile sensors can significantly enhance the perception of humanoid robotics systems by providing contact information that facilitates human-like interactions. However, existing commercial tactile sensors focus on improving the resolution and sensitivity of single-modal detection with high-cost components and densely integrated design, incurring complex manufacturing processes and unaffordable prices. In this work, we present Bio-Skin, a cost-effective multi-modal tactile sensor that utilizes single-axis Hall-effect sensors for planar normal force measurement and bar-shaped piezoresistors for 2D shear force measurement. A thermistor coupled with a heating wire is integrated into a silicone body to achieve temperature sensation and thermostatic function analogous to human skin. We also present a cross-reference framework to validate the two modalities of the force sensing signal, improving the sensing fidelity in a complex electromagnetic environment. Bio-Skin has a multi-layer design, and each layer is manufactured sequentially and subsequently integrated, thereby offering a fast production pathway. After calibration, Bio-Skin demonstrates performance metrics (including signal-to-range ratio, sampling rate, and measurement range) comparable to current commercial products at one-tenth the cost. The sensor's real-world performance is evaluated using an Allegro hand in object grasping tasks, while its temperature regulation functionality is assessed in a material detection task.
|
|
13:35-13:40, Paper WeBT21.4 | |
R-Tac0: A Rounded High-Frequency Transferable Monochrome Vision-Based Tactile Sensor for Shape Reconstruction |
|
Li, Wanlin | Beijing Institute for General Artificial Intelligence (BIGAI) |
Lin, Pei | ShanghaiTech University |
Wang, Meng | Beijing Institute for General Artificial Intelligence |
Xiao, Chenxi | ShanghaiTech University |
Althoefer, Kaspar | Queen Mary University of London |
Su, Yao | Beijing Institute for General Artificial Intelligence |
Jiao, Ziyuan | Beijing Institute for General Artificial Intelligence |
Liu, Hangxin | Beijing Institute for General Artificial Intelligence (BIGAI) |
Keywords: Force and Tactile Sensing
Abstract: The curved surfaces of rounded vision-based tactile fingers are essential for dexterous robotic manipulation, as they offer richer contact with the environment. However, current rounded designs are constrained by a low sensing frequency (30–60 Hz) and the need for recalibration when adapting to new sensors due to the reliance on multi-channel captures, which hinders their performance in dynamic robotic tasks and large-scale deployment. In this work, we introduce R-FTact, a low-cost rounded VBTS engineered for high-resolution and high-speed perception. The key innovation is a monochrome vision-based sensing principle: utilizing a black-and-white camera to capture the reflection properties of the compound rounded elastomer under monochromatic illumination. This single-channel imaging significantly reduces data volume and simplifies computational complexity, enabling 120 Hz tactile perception. A lightweight neural network can calibrate the sensor to achieve a depth reconstruction accuracy of 0.169 mm per pixel, while exhibiting surprisingly good transferability to new sensors. In experiments, we demonstrate the advantages of R-FTact’s rounded design by evaluating its performance under different contact angles, its high-frequency perception in slip detection, and its effectiveness in robotic dynamic pose estimation.
|
|
13:40-13:45, Paper WeBT21.5 | |
Vibrotactile Sensing for Detecting Misalignments in Precision Manufacturing |
|
Zhang, Kevin | Carnegie Mellon University |
Chang, Chris | Carnegie Mellon University |
Aggarwal, Shobhit | Carnegie Mellon University |
Veloso, Manuela | Carnegie Mellon University |
Temel, Zeynep | Carnegie Mellon University |
Kroemer, Oliver | Carnegie Mellon University |
Keywords: Force and Tactile Sensing, Assembly, Failure Detection and Recovery
Abstract: Small and medium-sized enterprises (SMEs) often struggle with automating high-mix, low-volume (HMLV) manufacturing due to the inflexibility and high cost of traditional automation solutions. This paper presents a novel approach to robotic manipulation for HMLV environments that leverages vibrotactile sensing. We propose integrating vibrotactile sensors, which capture subtle vibrations and acoustic signals, to provide real-time feedback during manipulation tasks. This approach enables the robot to detect subtle misalignments, which can assist in refining vision-based policies and improving the robot's overall manipulation skills. We demonstrate the effectiveness of this method in several representative insertion tasks, showing how vibrotactile feedback can be used to predict success or failure of an insertion task as well as predict initial contact between an object grasped in-hand and the placement location. Our results suggest that vibrotactile sensing offers a promising pathway towards more robust and adaptable robotic systems that can better empower SMEs to embrace automation.
|
|
13:45-13:50, Paper WeBT21.6 | |
Coil-Tac: Coiled Capacitor Mechanism with Liquid Metal for Tactile Sensing |
|
Jenkinson, George | University of Bristol |
Conn, Andrew | University of Bristol |
Tzemanaki, Antonia | University of Bristol |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators, Mechanism Design
Abstract: Exploiting the high conductivity and fluidity of liquid metal, Coil-Tac is a soft transduction mechanism based on measuring the change in capacitance as a flowable liquid metal core moves within a conductive helical coil. Using finite difference methods, a model is derived that estimates the response of Coil-Tac with various coil pitches to within 0.0502 pF of our experimental results in the range of 0-2.5 pF, corresponding to a lateral liquid metal movement of up to 35 mm. The Coil-Tac mechanism is demonstrated to be capable of oscillatory tactile sensing at 5 Hz and touch location estimation when coupled to a soft interface. The mechanism coupled with the interface is sensitive enough to locate the centre of the contact to within 0.23 mm, and estimate the incident angle between the axis of the dome and a flat surface to within 0.53°.
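Before the finite-difference model, an idealized closed form gives useful intuition: treating the filled length x of the helix as a coaxial capacitor with inner (liquid-metal) radius a and effective coil radius b, an approximation of ours rather than the paper's model,

```latex
C(x) \approx \frac{2\pi \varepsilon\, x}{\ln(b/a)}
```

so capacitance grows roughly linearly with insertion length, consistent with the reported 0-2.5 pF span mapping to 35 mm of liquid-metal travel.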
|
|
13:50-13:55, Paper WeBT21.7 | |
Soft Tactile Sensors for Robot Grippers Using Acoustic Sensing |
|
Xu, Kevin | Carnegie Mellon University |
Chan, Justin | Carnegie Mellon University |
Keywords: Force and Tactile Sensing
Abstract: We present a low-cost, soft tactile sensor using common, easily sourced materials that can be integrated with existing robotic gripper systems without requiring complex fabrication techniques or expensive components. Our approach includes two designs: a flexible linear sensor constructed from a rubber tube and a planar sensor made with a rubber membrane stretched over an enclosure. Both sensors contain an embedded speaker and microphones that leverage active acoustic sensing to map the unique acoustic resonant response of the cavity's structure to deformations that occur when the robotic gripper is grasping an object. Experimental results demonstrate that, using a support vector machine, the linear sensor achieves contact point estimation with an RMSE of 6mm, while the planar sensor achieves an RMSE of 0.57-0.62mm. Additionally, the planar sensor classifies six objects with an accuracy of 97.7%. These results demonstrate the potential for active acoustics to be an accessible method for enabling tactile sensing capabilities for robotic systems.
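The reported contact-point estimation uses a support vector machine on the cavity's acoustic response. The sketch below shows one plausible pipeline on synthetic stand-in data; the feature representation, sensing range, and hyperparameters are assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic stand-in data: each row is the magnitude spectrum of the
# cavity's response to the emitted excitation; targets are contact
# positions along the tube in mm (range assumed, not from the paper).
X = rng.normal(size=(200, 64))   # (n_samples, n_freq_bins)
y = 100.0 * rng.random(200)      # contact position in mm

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X, y)
rmse = np.sqrt(np.mean((model.predict(X) - y) ** 2))
print(f"training RMSE: {rmse:.1f} mm")  # paper reports ~6 mm on real data
```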
|
|
WeBT22 |
102A |
Mechanism Design 1 |
Regular Session |
Co-Chair: Seo, TaeWon | Hanyang University |
|
13:20-13:25, Paper WeBT22.1 | |
Differential-Driven Wheeled Mobile Robot Mechanism with High Step-Climbing Ability |
|
Lee, Woojae | HD Hyundai Robotics |
Kim, Taehyun | HD Hyundai Robotics |
Kim, Jeongeun | HD Hyundai Robotics |
Seo, TaeWon | Hanyang University |
Keywords: Mechanism Design, Actuation and Joint Mechanisms, Compliant Joints and Mechanisms
Abstract: Differential-driven wheeled mobile robots, such as logistics robots and mobile manipulators, are used for various tasks on flat ground but remain highly constrained to such environments. This study proposes a differential-driven wheeled mobile robot mechanism with high step-climbing ability. We show that the novel wheel and differential-driven module improve stair-climbing ability through transfer of the center of mass (COM). A sub-wheel connected to a passive joint of the wheel converts the drive into a vertical force, improving the robot's ability to climb high stairs; by pitching the robot body with the resulting reaction force, the robot can also climb stairs in reverse using the reaction force of the wall. The prototype robot was tested on stairs and high steps, as well as on a deformable slope while climbing obstacles. Even when the center of mass lies in the driving direction, the novel wheel allows the front wheel to overcome the step and improves the climbing performance of the rear wheel. We expect this method to be applicable to various differential-driven wheeled mobile robot mechanisms, especially in confined spaces.
|
|
13:25-13:30, Paper WeBT22.2 | |
A Redundant Parallel Continuum Manipulator with Stiffness-Varying and Force-Sensing Capability (I) |
|
Tang, Shujie | Shanghai Jiao Tong University |
Liang, Zhenkun | Shanghai Jiao Tong University |
Zhang, Zhuang | Fudan University |
Duan, Xuyang | Shanghai Jiao Tong University |
Wang, Hao | Shanghai Jiao Tong University |
Chen, Genliang | Shanghai Jiao Tong University |
Keywords: Mechanism Design, Compliant Joints and Mechanisms, Redundant Robots
Abstract: This paper presents the design, analysis, and validation of a novel redundant planar parallel continuum manipulator (PCM) consisting of four flexible links coupled at the rigid mid- and end-platform. To address complex geometry/static hybrid constraints at the rigid platforms, a general framework is developed for the kinetostatics modeling and analysis. Benefiting from the redundant actuation design, the Cartesian stiffness of the studied PCM can be further adjusted as a secondary task to positioning. Utilizing the proposed deflection-based force sensing method, the external load exerted on the end-effector can also be identified by measuring the pose of rigid platforms. To validate the proposed design, a prototype is built, on which a series of experiments have been conducted. The results show that, with the proposed kinetostatics models, the prototype exhibits a mean position and orientation error of 0.52 mm and 0.41°. Finally, the capability of stiffness-varying and force-sensing is demonstrated to verify the feasibility and potential applications of the designed redundant PCM.
|
|
13:30-13:35, Paper WeBT22.3 | |
Joint-Repositionable Inner-Wireless Planar Snake Robot |
|
Kanada, Ayato | Kyushu University |
Takahashi, Ryo | The University of Tokyo |
Hayashi, Keito | Kyushu University |
Hosaka, Ryusuke | Department of Mechanical Engineering, Graduate School of Engineering |
Yukita, Wakako | The University of Tokyo |
Nakashima, Yasutaka | Kyushu University |
Yokota, Tomoyuki | The University of Tokyo |
Someya, Takao | University of Tokyo |
Kamezaki, Mitsuhiro | The University of Tokyo |
Kawahara, Yoshihiro | The University of Tokyo |
Yamamoto, Motoji | Kyushu University |
Keywords: Mechanism Design, Compliant Joints and Mechanisms, Soft Robot Applications
Abstract: Bio-inspired multi-joint snake robots offer the advantages of terrain adaptability due to their limbless structure and high flexibility. However, a series of dozens of motor units in typical multi-joint snake robots results in a heavy body structure and high power consumption of hundreds of watts. This paper presents a joint-repositionable, inner-wireless snake robot that enables multi-joint-like locomotion using a low-powered underactuated mechanism. The snake robot, consisting of a series of flexible passive links, can dynamically change its joint coupling configuration by repositioning motor-driven joint units along rack gears inside the robot. Additionally, a soft robot skin wirelessly powers the internal joint units, avoiding the risk of wire tangling and disconnection caused by the movable joint units. The combination of the joint-repositionable mechanism and the wireless-charging-enabled soft skin achieves a high degree of bending, along with a lightweight structure of 1.3 kg and energy-efficient wireless power transmission of 7.6 watts.
|
|
13:35-13:40, Paper WeBT22.4 | |
A Small Water Surface Jumping Robot Utilizing Efficient Hydrodynamic Resistance Based on Rapid Recoil Design |
|
Xia, Yingjun | Harbin Institute of Technology |
Zhang, Xin | Harbin Institute of Technology |
Li, Hangfei | Dalian Shipbuilding Industry Co., Ltd |
Yan, Jihong | Harbin Institute of Technology |
Zhao, Jie | Harbin Institute of Technology |
Keywords: Mechanism Design, Dynamics, Kinematics
Abstract: The energy of water surface jumping is easily lost due to the fluidity and splashing of water. As the robot's size and mass increase, its jumping performance is more significantly affected. The key to increasing the robot’s takeoff velocity is enhancing the impulse generated by efficient hydrodynamic resistance, which is primarily related to the stored energy and the leg paddling velocity. By designing and optimizing an energy storage mechanism based on the coupling of latex and carbon fiber, and a torque-reversing driving mechanism based on a double-four-bar linkage, the movement speed of the driving plate was increased, resulting in greater hydrodynamic resistance and thus enhancing the robot's take-off impulse. The robot has dimensions of 270 mm × 195 mm × 135 mm and weighs 185 g. Its jumping performance is analyzed through jumping dynamic simulations on the water surface and verified by experiments. The load capacity of the robot was also tested. The robot's takeoff velocity is 4.11 m/s, with a maximum jumping height of 748 mm and jumping distance of 1176 mm.
|
|
13:40-13:45, Paper WeBT22.5 | |
Design and Preliminary Evaluation of a Handheld Compliant Robot for Membrane Peeling Using Negative Pressure Adsorption (I) |
|
Zheng, Yu | Nanjing University of Posts and Telecommunications
Liu, Jianjun | School of Mechanical Engineering and Automation, Beihang University
Yang, Yang | Beihang University |
Yu, Jingjun | Beihang University |
Guang, Chenhan | North China University of Technology |
Wang, Zhaodong | Beihang University, Beijing, 100191, China |
Keywords: Mechanism Design, Force Control
Abstract: In retinal surgery, membrane peeling is a challenging procedure that requires surgeons to limit the peeling force and manage disturbance forces at the millinewton level. This article presents a novel handheld compliant robot designed to assist in membrane peeling, featuring a five-chain parallel compliant module and a force-sensing tube. The compliant module actively adjusts the tip's transverse position, while the tube adsorbs the membrane and senses the peeling force. The Jacobian matrix of the proposed robot is derived. Then, a data processing method based on wavelet decomposition and Kalman filtering is introduced to identify the disturbance force, and a control method based on the Jacobian matrix is proposed to compensate for it. Finally, the proposed robot is evaluated through bench and handheld tests, in which chicken chorioallantoic membrane is used as the ex vivo model. The handheld test results show that the root mean square and maximum of the disturbance force are reduced by 44.1% and 52.7%, respectively. The membrane detaches from the tube when the peeling force approaches 15 mN, which provides a passive safety guarantee for membrane peeling.
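A minimal scalar sketch of the disturbance-identification step, assuming the disturbance force evolves as a slow random walk observed through noisy millinewton-level measurements; the noise covariances are placeholders, and the wavelet-decomposition stage described in the abstract is omitted.

import numpy as np

def kalman_disturbance(measurements, q=1e-4, r=1e-2):
    """Track a slowly varying disturbance force (mN) from noisy samples.
    Model: x_k = x_{k-1} + w (random walk), z_k = x_k + v (noisy sensor)."""
    x, p = 0.0, 1.0
    estimates = []
    for z in measurements:
        p += q                 # predict: uncertainty grows
        k = p / (p + r)        # Kalman gain
        x += k * (z - x)       # update with the new measurement
        p *= 1.0 - k
        estimates.append(x)
    return np.array(estimates)

# Example: a 5 mN disturbance buried in sensor noise.
rng = np.random.default_rng(0)
z = 5.0 + 0.5 * rng.standard_normal(200)
print(kalman_disturbance(z)[-1])  # converges near 5 mN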
|
|
13:45-13:50, Paper WeBT22.6 | |
Optimized Design Method Based on Parallel Stepwise Hierarchical Constraints and Its Application for One-DOF Six-Bar Finger (I) |
|
Liao, Jinnong | Harbin Institute of Technology |
Liu, Gangfeng | Harbin Institute of Technology |
Chen, Jinghua | Harbin Institute of Technology |
Zhao, Jie | Harbin Institute of Technology |
Keywords: Mechanism Design, Grippers and Other End-Effectors, Grasping
Abstract: The function and adaptability of a manipulator are critical in specific operational tasks such as gripping, plugging, screwing, and shearing, among which gripping is a prerequisite for performing the other functions. This paper proposes a linkage design method based on parallel stepwise hierarchical constraints (PSHC) that satisfies the design requirements of complex linkages while improving design efficiency and optimization accuracy. Taking advantage of the simple control and high rigidity of one-DOF linkages, constraints such as the reference point, closed-loop rotational angle, and envelope angle are applied using the PSHC method to efficiently design multiple sets of Watt I linkage fingers with envelope angles greater than 180 degrees in about 20 minutes. A case in this paper achieved an envelope angle of 213.47 degrees, currently the largest among linkage fingers, and demonstrated a human-like appearance and strong grasping performance. This study provides a useful reference for the design of other single degree-of-freedom mechanisms and for subsequent flexible hand designs for single-finger/multi-finger systems.
|
|
13:50-13:55, Paper WeBT22.7 | |
A Novel Contact-Aided Continuum Robotic System: Design, Modeling, and Validation |
|
Yang, Zheshuai | Xi'an Jiaotong University |
Yang, Laihao | Xi'an Jiaotong University |
Sun, Yu | Xi'an Jiaotong University |
Chen, Xuefeng | Xi'an Jiaotong University |
Keywords: Mechanism Design, Kinematics, Compliant Joint/Mechanism, Soft Robot Materials and Design
Abstract: Tendon-driven continuum robots (TDCRs) hold great promise for dexterous manipulation in long, narrow spaces, such as the in-situ maintenance of aero-engines, due to their slender bodies and compliant hyper-redundant architecture. However, major challenges in implementation come from mechanical design and morphology estimation: (1) torsion and buckling issues induced by the intrinsically compliant architecture and the coupling of system gravity and distal loads; and (2) low-accuracy morphology models under complex load conditions. In this article, inspired by contact-aided compliant mechanisms (CACMs), a novel continuum robotic system using a bearing-based CACM is developed to overcome the two intrinsic issues (i.e., torsion and buckling) while eliminating the wear due to friction at joint/socket interfaces without adversely affecting stiffness. Subsequently, based on the chained beam constraint model, a comprehensive kinetostatic modeling framework is systematically derived, focusing on mechanism-oriented strategies (i.e., tendon routing friction, physical joint constraints, and section buckling estimation). Finally, various experiments are performed to verify the effectiveness of the proposed system.
|
|
13:55-14:00, Paper WeBT22.8 | |
Multi-Bifurcation and Environmental Adaptability of a Novel Line-Symmetric Double-Centered Metamorphic Mechanism |
|
Lin, Song | Shenyang Institute of Automation, Chinese Academy of Sciences |
Dai, Jian | School of Natural and Mathematical Sciences, King's College London
Song, Yifeng | Shenyang Institute of Automation, Chinese Academy of Sciences |
Wang, Hongguang | Shenyang Institute of Automation, Chinese Academy of Sciences
Yuan, Bingbing | Shenyang Institute of Automation, Chinese Academy of Sciences |
Feng, YingBin | Shenyang Ligong University |
Keywords: Mechanism Design, Kinematics, Wheeled Robots
Abstract: To improve the environmental adaptability of the mechanism, this letter proposes a metamorphic mechanism configuration synthesis method based on environmental constraint characteristics. In particular, a novel 6R line-symmetric double-centered metamorphic mechanism is synthesized, and its multi-bifurcation characteristics and environmental adaptability are analyzed and verified. Firstly, a configuration synthesis method is proposed based on the bifocal geometric constraints of the elliptical oblique section of the pipe. The variation of the instantaneous mobility with the number of links is derived, and m = n = 3 is shown to be the minimum number of links for which the mechanism is movable. Then, for the 6R mechanism, the joint velocity solution spaces of all the motion branches are revealed through closed-loop and higher-order kinematics. Further, the correlation between the joint angle, configuration, and environment layers is demonstrated. The results show that the 6R mechanism has four motion branches (MB1, MB2, MB3, and MB4) with different topological structures corresponding to different pipe environments, as well as four 4R serial motion branches (SMB1, SMB2, SMB3, and SMB4). Finally, a wheeled metamorphic robot is designed, and prototype experiments confirm the controllable motion-branch transformation of the 6R mechanism and the robot's adaptability to pipe environments.
|
|
WeBT23 |
102B |
Path Planning for Multiple Mobile Robots or Agents 2 |
Regular Session |
|
13:20-13:25, Paper WeBT23.1 | |
Managing Conflicting Tasks in Heterogeneous Multi-Robot Systems through Hierarchical Optimization |
|
De Benedittis, Davide | University of Pisa |
Garabini, Manolo | Università Di Pisa |
Pallottino, Lucia | Università Di Pisa |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Optimization and Optimal Control, Task and Motion Planning
Abstract: The robotics research community has explored several model-based techniques for multi-robot and multi-task control. Through constrained optimization, robot-specific characteristics can be taken into account when controlling robots and accomplishing tasks. However, in scenarios with multiple conflicting tasks, existing methods struggle to enforce strict prioritization among them, allowing less important tasks to interfere with more important ones. In this paper, we propose a novel control framework that enables robots to execute multiple prioritized tasks concurrently while maintaining a strict task priority order. The framework exploits hierarchical optimization within a model predictive control structure. It formulates a convex minimization problem in which all the tasks are encoded as linear equality and inequality constraints. The proposed approach is validated through simulations using a team of heterogeneous robots performing multiple tasks.
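The strict prioritization described above can be illustrated with a null-space-projection least-squares sketch: each lower-priority task is solved only within the freedom left by higher-priority ones. This is a generic hierarchical scheme, not the paper's MPC formulation with inequality constraints.

import numpy as np

def lexicographic_ls(tasks):
    """Solve prioritized tasks (A_i, b_i): lower priorities act only in the
    null space of higher ones, so they can never interfere upward."""
    n = tasks[0][0].shape[1]
    x = np.zeros(n)
    N = np.eye(n)  # null-space projector of all tasks solved so far
    for A, b in tasks:
        An = A @ N
        x = x + N @ (np.linalg.pinv(An) @ (b - A @ x))
        N = N @ (np.eye(n) - np.linalg.pinv(An) @ An)
    return x

# Two conflicting tasks on a 2-DoF variable: priority 1 is met exactly,
# priority 2 is satisfied as well as the remaining freedom allows.
x = lexicographic_ls([(np.array([[1.0, 0.0]]), np.array([1.0])),
                      (np.array([[1.0, 1.0]]), np.array([0.0]))])
print(x)  # -> [ 1. -1.]: task 1 fixes x0 = 1; task 2 uses x1 to reach sum 0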
|
|
13:25-13:30, Paper WeBT23.2 | |
Multi-Agent Pickup and Delivery with Mobile Pickups |
|
Flammini, Benedetta | Politecnico Di Milano |
Hawes, Nick | University of Oxford |
Lacerda, Bruno | University of Oxford |
Keywords: Path Planning for Multiple Mobile Robots or Agents
Abstract: In Multi-Agent Pickup and Delivery (MAPD), a team of agents must find collision-free paths to service an online stream of tasks, which are composed of pickup and delivery locations that have to be visited sequentially. This paper addresses the novel problem of MAPD with mobile pickups, which involves two types of agents, the suppliers and the deliverers. Suppliers are large robots that can transport many items, but cannot navigate tight spaces or manipulate objects, while deliverers can navigate to rooms to deliver items, but can only carry one item at a time. Deliverers have to collect items from the suppliers, and bring them to the assigned delivery locations. This introduces a new challenge which is not tackled in classical MAPD: deciding where and when the exchange of items should happen. We propose Token Passing with Exchange Locations (TP-EL), an extension of the widely used Token Passing (TP) algorithm with a task allocation mechanism that considers which supplier to pick items from, and when and where to do so. We experiment in several simulated domains, demonstrating the superiority of TP-EL over baselines that do not consider mobile pickups or use alternative methods to decide pickup locations.
|
|
13:30-13:35, Paper WeBT23.3 | |
LMMCoDrive: Cooperative Driving with Large Multimodal Models |
|
Liu, Haichao | The Hong Kong University of Science and Technology |
Yao, Ruoyu | The Hong Kong University of Science and Technology (Guangzhou) |
Huang, Zhenmin | The Hong Kong University of Science and Technology |
Shen, Shaojie | Hong Kong University of Science and Technology |
Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Task and Motion Planning, Optimization and Optimal Control
Abstract: To address the intricate challenges of cooperative scheduling and motion planning in Autonomous Mobility-on-Demand (AMoD) systems, this paper introduces LMMCoDrive, a novel cooperative driving framework that leverages a Large Multimodal Model (LMM) to improve traffic efficiency and passenger experience in dynamic urban environments. This framework seamlessly integrates the scheduling and motion planning processes to ensure the effective operation of Cooperative Autonomous Vehicles (CAVs). The spatial relationship between CAVs and passenger requests is abstracted into a Bird's-Eye View (BEV) image to fully exploit the multimodal understanding ability of LMMs. In addition, trajectories are carefully refined for each CAV while ensuring collision avoidance through safety constraints. A decentralized optimization strategy, facilitated by the Alternating Direction Method of Multipliers (ADMM) within the LMM framework, is proposed to drive the graph evolution of CAVs. Simulation results in diverse urban scenarios demonstrate the pivotal role and significant impact of the LMM in optimizing CAV scheduling and seamlessly serving a decentralized cooperative optimization process for each CAV. This marks a substantial stride towards practical, efficient, and safe AMoD systems poised to revolutionize urban transportation. The code is available at https://github.com/henryhcliu/LMMCoDrive.
|
|
13:35-13:40, Paper WeBT23.4 | |
Mixed Integer Conic Programming for Multi-Agent Motion Planning in Continuous Space |
|
Zhao, Shizhe | Shanghai Jiao Tong University |
Liu, Yongce | Shanghai Jiao Tong University |
Choset, Howie | Carnegie Mellon University |
Ren, Zhongqiang | Shanghai Jiao Tong University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning
Abstract: Multi-Agent Motion Planning (MAMP) seeks collision-free trajectories for multiple agents from their respective start to goal locations among static obstacles, while minimizing some cost function over the trajectories. Existing approaches to this problem are graph-based, Mixed-Integer Programming (MIP)-based, or trajectory optimization-based, each with its own limitations. This paper introduces a new approach to MAMP based on a Mixed Integer Conic Programming (MICP) formulation that complements these existing approaches. We show that our formulation is valid and test our approach against various baselines, including a graph-based method that combines search and sampling, as well as different MIP formulations. The numerical results show that the solutions found by our approach are sometimes eight times closer to the true optimum than those found by the baselines under the same runtime limit. We also validate our approach with multiple drones in a lab setting.
|
|
13:40-13:45, Paper WeBT23.5 | |
ESCoT: An Enhanced Step-Based Coordinate Trajectory Planning Method for Multiple Car-Like Robots |
|
Jiang, Junkai | Tsinghua University |
Chen, Yihe | Tsinghua University |
Yang, Yibin | Tsinghua University |
Li, Ruochen | Tsinghua University |
Xu, Shaobing | Tsinghua University |
Wang, Jianqiang | Tsinghua University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Planning, Scheduling and Coordination
Abstract: Multi-vehicle trajectory planning (MVTP) is one of the key challenges in multi-robot systems (MRSs) and has broad applications across various fields. This paper presents ESCoT, an enhanced step-based coordinate trajectory planning method for multiple car-like robots. ESCoT incorporates two key strategies: collaborative planning for local robot groups and replanning for duplicate configurations. These strategies effectively enhance the performance of step-based MVTP methods. Through extensive experiments, we show that ESCoT 1) in sparse scenarios, significantly improves solution quality compared to baseline step-based method, achieving up to 70% improvement in typical conflict scenarios and 34% in randomly generated scenarios, while maintaining high solving efficiency; and 2) in dense scenarios, outperforms all baseline methods, maintains a success rate of over 50% even in the most challenging configurations. The results demonstrate that ESCoT effectively solves MVTP, further extending the capabilities of step-based methods. Finally, practical robot tests validate the algorithm's applicability in real-world scenarios.
|
|
13:45-13:50, Paper WeBT23.6 | |
GIANT - Global Path Integration and Attentive Graph Networks for Multi-Agent Trajectory Planning |
|
le Fevre Sejersen, Jonas | Aarhus University |
Suzumura, Toyotaro | University of Tokyo |
Kayacan, Erdal | Paderborn University |
Keywords: Multi-Robot Systems, Collision Avoidance, Path Planning for Multiple Mobile Robots or Agents
Abstract: This paper presents a novel approach to multi-robot collision avoidance that integrates global path planning with local navigation strategies, utilizing attentive graph neural networks to manage dynamic interactions among agents. We introduce a local navigation model that leverages pre-planned global paths, allowing robots to adhere to optimal routes while dynamically adjusting to environmental changes. The model’s robustness is enhanced through the introduction of noise during training, resulting in superior performance in complex, dynamic environments. Our approach is evaluated against established baselines, including NH-ORCA, DRL-NAV, and GA3C-CADRL, across various structurally diverse simulated scenarios. The results demonstrate that our model achieves consistently higher success rates, lower collision rates, and more efficient navigation, particularly in challenging scenarios where baseline models struggle. This work offers an advancement in multi-robot navigation, with implications for robust performance in complex, dynamic environments with varying degrees of complexity, such as those encountered in logistics, where adaptability is essential for accommodating unforeseen obstacles and unpredictable changes.
|
|
13:50-13:55, Paper WeBT23.7 | |
Advancing Learnable Multi-Agent Pathfinding Solvers with Active Fine-Tuning |
|
Andreychuk, Anton | Artificial Intelligence Research Institute |
Yakovlev, Konstantin | Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences
Panov, Aleksandr | AIRI |
Skrynnik, Alexey | AIRI |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Imitation Learning, Planning, Scheduling and Coordination
Abstract: Multi-agent pathfinding (MAPF) is a common abstraction of multi-robot trajectory planning problems, where multiple homogeneous robots simultaneously move in a shared environment. While solving MAPF optimally has been proven to be NP-hard, scalable and efficient solvers are vital for real-world applications like logistics and search-and-rescue. To this end, decentralized suboptimal MAPF solvers that leverage machine learning have come on stage. Building on the success of the recently introduced MAPF-GPT, a pure imitation learning solver, we introduce MAPF-GPT-DDG. This novel approach effectively fine-tunes the pre-trained MAPF model using centralized expert data. Leveraging a novel delta-data generation mechanism, MAPF-GPT-DDG accelerates training while significantly improving performance at test time. Our experiments demonstrate that MAPF-GPT-DDG surpasses all existing learning-based MAPF solvers, including the original MAPF-GPT, in solution quality across many testing scenarios. Remarkably, it can work with MAPF instances involving up to 1 million agents in a single environment, setting a new milestone for scalability in MAPF domains.
|
|
13:55-14:00, Paper WeBT23.8 | |
Where to Wait: Postponing the Decision about Waiting Locations in Multi-Agent Path Finding |
|
Zahrádka, David | Czech Institute of Informatics, Robotics and Cybernetics, Czech |
Kulich, Miroslav | Czech Technical University in Prague |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning, Multi-Robot Systems
Abstract: Multi-Agent Path Finding (MAPF) is a problem of finding collision-free paths for a group of agents in a shared discrete environment. The agents often need to wait in place for one or more discrete time steps to avoid each other, and they frequently have multiple locations where they can wait. While the locations may be equally good from the waiting agent's perspective, they impact the rest of the fleet because no other agent can pass through the location in the meantime. Where exactly an agent waits is decided while planning its path and only takes into consideration the agent's own preferences. Giving the other agents the option to influence the waiting location can improve the quality of solutions found by MAPF solvers, and in case of solvers which do not re-plan, even improve their success rate. We present the Partially Safe Interval (PSI) which allows to postpone the decision about exact waiting locations while preserving safety. PSI can be obtained by a simple post-processing procedure, and by following an exact set of rules, the waiting locations can be decided whenever an exact path is necessary or when there is only one remaining option. We demonstrate the benefits of PSI using an extension of the Prioritized Safe Interval Path Planning algorithm, which improves the average Sum of Delays by up to 4.12% and the success rate by up to 5% on benchmark maps. We also provide context for the improvement by comparing the results with the state-of-the-art suboptimal methods PIBT and LaCAM.
|
|
WeBT24 |
102C |
Sensor Fusion 6 |
Regular Session |
|
13:20-13:25, Paper WeBT24.1 | |
CBGTE: Neural Network Aided Extended Kalman Filter for Dual-Band Infrared Attitude Estimation (I) |
|
Han, Xiaoyu Han | The School of Information Science and Technology, Nantong University
Xu, Miaomiao | Nantong University |
Keywords: Sensor Fusion, Aerial Systems: Applications, AI-Based Methods
Abstract: Owing to the numerous advantages of dual-band infrared radiation (DBIR) attitude measurement (AM) technology, it has attracted considerable attention from both industry and academia. However, geometric errors induced by sensor measurement noise, assembly position, motor disturbances, and other sensor-related factors, together with random errors caused by the sensitivity of DBIR features to infrared radiation interference, collectively undermine the reliability of the estimated attitude information. To address this problem, a bidirectional gated recurrent unit (Bi-GRU) and a Transformer are combined to assist an extended Kalman filter (EKF) for DBIR attitude estimation (CBGTE). The core concept involves two aspects: 1) utilizing the EKF to mitigate geometric errors in the DBIR AM process, and 2) combining the Bi-GRU and Transformer to help the EKF compensate for random errors in the measurement process. A semi-physical experimental platform is established to validate the performance of CBGTE. Through experimental validation with real-world data, the proposed CBGTE algorithm achieves higher accuracy than several state-of-the-art algorithms, attaining a roll angle error of ±0.4° and a pitch angle error of ±0.2°.
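The two-part scheme in the abstract can be sketched as an EKF whose innovation is corrected by a learned random-error estimate. The placeholder below substitutes a simple moving-average detrend for the Bi-GRU/Transformer predictor; everything here is illustrative, not the authors' network.

import numpy as np

def learned_residual(history):
    """Stand-in for the Bi-GRU/Transformer: given recent measurements,
    predict the random error to remove (here, a moving-average detrend)."""
    h = np.asarray(history)
    return h[-1] - h.mean(axis=0)

def aided_ekf_update(x, P, z, H, R, history):
    """One EKF measurement update with the innovation compensated by the
    network's random-error estimate, the structure the abstract describes."""
    y = (z - learned_residual(history)) - H @ x   # corrected innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return x + K @ y, (np.eye(len(x)) - K @ H) @ P

# Example: direct noisy observation of [roll, pitch] in degrees.
rng = np.random.default_rng(0)
hist = [np.array([0.5, -0.2]) + 0.05 * rng.standard_normal(2) for _ in range(10)]
x, P = aided_ekf_update(np.zeros(2), np.eye(2), hist[-1], np.eye(2),
                        0.1 * np.eye(2), hist)
print(x)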
|
|
13:25-13:30, Paper WeBT24.2 | |
Robust LiDAR-Camera Calibration with 2D Gaussian Splatting |
|
Zhou, Shuyi | The University of Tokyo |
Xie, Shuxiang | The University of Tokyo |
Ishikawa, Ryoichi | The University of Tokyo |
Oishi, Takeshi | The University of Tokyo |
Keywords: Sensor Fusion, Calibration and Identification, Computer Vision for Transportation
Abstract: LiDAR-camera systems have become increasingly popular in robotics recently. A critical and initial step in integrating the LiDAR and camera data is the calibration of the LiDAR-camera system. Most existing calibration methods rely on auxiliary target objects, which often involve complex manual operations, whereas targetless methods have yet to achieve practical effectiveness. Recognizing that 2D Gaussian Splatting (2DGS) can reconstruct geometric information from camera image sequences, we propose a calibration method that estimates LiDAR-camera extrinsic parameters using geometric constraints. The proposed method begins by reconstructing colorless 2DGS using LiDAR point clouds. Subsequently, we update the colors of the Gaussian splats by minimizing the photometric loss. The extrinsic parameters are optimized during this process. Additionally, we address the limitations of the photometric loss by incorporating the reprojection and triangulation losses, thereby enhancing the calibration robustness and accuracy.
|
|
13:30-13:35, Paper WeBT24.3 | |
LXLv2: Enhanced LiDAR Excluded Lean 3D Object Detection with Fusion of 4D Radar and Camera |
|
Xiong, Weiyi | Beihang University |
Zou, Zean | Continental Autonomous Mobility (Shanghai) Co., Ltd |
Zhao, Qiuchi | Beihang University |
He, Fengchun | Continental Autonomous Mobility (Shanghai) Co., Ltd |
Zhu, Bing | Beihang University |
Keywords: Sensor Fusion, Deep Learning for Visual Perception
Abstract: As the previous state-of-the-art 4D radar-camera fusion-based 3D object detection method, LXL utilizes the predicted image depth distribution maps and radar 3D occupancy grids to assist the sampling-based image view transformation. However, the depth prediction lacks accuracy and consistency, and the concatenation-based fusion in LXL impedes the model robustness. In this work, we propose LXLv2, where modifications are made to overcome the limitations and improve the performance. Specifically, considering the position error in radar measurements, we devise a one-to-many depth supervision strategy via radar points, where the radar cross section (RCS) value is further exploited to adjust the supervision area for object-level depth consistency. Additionally, a channel and spatial attention-based fusion module named CSAFusion is introduced to improve feature adaptiveness. Experimental results on the View-of-Delft and TJ4DRadSet datasets show that the proposed LXLv2 can outperform LXL in detection accuracy, inference speed and robustness, demonstrating the effectiveness of the model.
|
|
13:35-13:40, Paper WeBT24.4 | |
Multi-Sensor Fusion for Quadruped Robot State Estimation Using Invariant Filtering and Smoothing |
|
Nistico, Ylenia | IIT |
Kim, Hajun | Korea Advanced Institute of Science and Technology |
Soares, João Carlos Virgolino | Istituto Italiano Di Tecnologia |
Fink, Geoff | Thompson Rivers University |
Park, Hae-Won | Korea Advanced Institute of Science and Technology |
Semini, Claudio | Istituto Italiano Di Tecnologia |
Keywords: Sensor Fusion, Legged Robots, Localization
Abstract: This letter introduces two multi-sensor state estimation frameworks for quadruped robots, built on the Invariant Extended Kalman Filter (InEKF) and Invariant Smoother (IS). The proposed methods, named E-InEKF and E-IS, fuse kinematics, IMU, LiDAR, and GPS data to mitigate position drift, particularly along the z-axis, a common issue in proprioceptive-based approaches. To integrate LiDAR odometry and GPS into InEKF and IS, we derived observation models that satisfy group-affine properties. LiDAR odometry is incorporated using Iterative Closest Point (ICP) registration on a parallel thread, preserving the computational efficiency of proprioceptive-based state estimation. We evaluate E-InEKF and E-IS with and without exteroceptive sensors, benchmarking them against LiDAR-based odometry methods in indoor and outdoor experiments using the KAIST HOUND2 robot. Our methods achieve lower Relative Position Errors (RPE) and significantly reduce Absolute Trajectory Error (ATE), with improvements of up to 28% indoors and 40% outdoors compared to LIO-SAM and FAST-LIO2. Additionally, we compare E-InEKF and E-IS in terms of computational efficiency and accuracy.
|
|
13:40-13:45, Paper WeBT24.5 | |
State Estimation by Joint Approach with Dynamic Modeling and Observer for Soft Actuator |
|
Ma, Huichen | Beijing Institute of Technology |
Zhou, Junjie | Beijing Institute of Technology |
Yeow, Chen-Hua | National University of Singapore |
Meng, Lijun | Beijing Institute of Technology |
Keywords: Sensor Fusion, Sensor-based Control, Dynamics
Abstract: To achieve a significant reduction in state estimation error and improved convergence speed while ensuring real-time responsiveness and computational efficiency, this article proposes a joint approach that combines dynamic modeling and observers to achieve accurate nonlinear state estimation for a functional soft actuator. First, inspired by the viscoelastic model, a general framework for modeling the 2D dynamics of the pneumatic-network soft actuator under external conditions is studied. The dimensionless dynamic model of the soft actuator's bending deformation is derived through dimensional analysis. Then, an adaptive extended Kalman particle filter (aEKPF) is used for state estimation; it suppresses noise from the pressure sensors and reduces drift error from the rate gyroscopes. The closed-loop performance of the nonlinear pose estimation combined with a conventional control method was experimentally assessed using soft actuators and a soft crawling module. Results show that the aEKPF can accurately estimate the state from noisy sensor measurements. Compared with the conventional EKF, the aEKPF improves performance by more than 50% in terms of state estimation error and convergence speed. Moreover, in the rectilinear crawling test, the mean centroid offset in different environments is less than 3% of the soft crawling module width, verifying the effectiveness and robustness of this strategy for accurate state estimation and stable control.
|
|
13:45-13:50, Paper WeBT24.6 | |
GIVL-SLAM: A Robust and High-Precision SLAM System by Tightly Coupled GNSS RTK, Inertial, Vision, and LiDAR (I) |
|
Wang, Xuanbin | Wuhan University |
Li, Xingxing | Wuhan University |
Yu, Hui | Wuhan University |
Chang, Hanyu | Wuhan University |
Zhou, Yuxuan | Wuhan University |
Li, Shengyu | Wuhan University |
Keywords: Sensor Fusion, SLAM, Mapping
Abstract: In this article, we present GIVL-SLAM, a factor graph optimization-based framework that tightly fuses double-differenced pseudorange and carrier phase observations of the GNSS with inertial, visual, and LiDAR information for high-level simultaneous localization and mapping (SLAM) performance in large-scale environments. A sliding-window-based factor graph estimator is designed to explore the potential of heterogeneous observations from multiple sensors for achieving robust and high-accuracy state estimation. The integer ambiguity resolution of GNSS carrier phase observation is also considered in our method to leverage the high-precision characteristic of the carrier phase. We extensively evaluated the proposed method in real-world experiments including both GNSS-challenged environment tests and urban night environment tests. The results demonstrate that the proposed GIVL-SLAM significantly improves the global drift-free ability of the visual-inertial-LiDAR system in large-scale conditions and achieves continuous centimeter to decimeter-level localization performance in GNSS harsh-signal conditions. The maximum improvement of the 3-D location availability (<1 m) in GNSS severely degraded situations is more than 60% compared with the existing loosely coupled GNSS/SLAM fusion methods.
|
|
13:50-13:55, Paper WeBT24.7 | |
Invariant-EKF-Based GNSS/INS/Vision Integration with High Convergence and Accuracy (I) |
|
Xia, Chunxi | Wuhan University |
Li, Xingxing | Wuhan University |
Li, Shengyu | Wuhan University |
Zhou, Yuxuan | Wuhan University |
Keywords: Sensor Fusion, SLAM, Visual-Inertial SLAM
Abstract: Nowadays, the tight integration of the global navigation satellite system (GNSS), inertial navigation system (INS), and visual odometry has become a prevalent way to obtain continuous and drift-less pose estimation. Unfortunately, convergence, a prerequisite for the estimator to correctly fuse multi-source information into a single coherent state estimate, remains challenging in complex and varied operating environments. Moreover, certain beneficial information, such as GNSS multi-frequency resources, generally goes untapped in such systems. To improve accuracy and convergence, we propose an invariant extended Kalman filter (IEKF)-based framework that tightly couples stereo vision, INS, and GNSS, supporting both precise point positioning (PPP) and real-time kinematic (RTK) modes. By maximizing the advantage of triple-band GNSS measurements, the proposed system ensures high-precision pose estimation, shortens the initialization process, and raises the fixing rate. Meanwhile, the error propagation of the proposed system is log-linear and relatively independent of the state prediction, contributing to its observability and convergence, which are examined through both theoretical derivation and experimental Monte Carlo tests. The open-sourced GVINS dataset and a field experiment are used for system evaluation in different GNSS observing environments, and the results indicate that the proposed system outperforms state-of-the-art approaches in both positioning accuracy and convergence capability under large initial perturbations.
|
|
WeBT25 |
103A |
Legged Robots 2 - Learning |
Regular Session |
|
13:20-13:25, Paper WeBT25.1 | |
HAC-LOCO: Learning Hierarchical Active Compliance Control for Quadruped Locomotion under Continuous External Disturbances |
|
Zhou, Xiang | Sun Yat-Sen University |
Zhang, Xinyu | Sun Yat-Sen University |
Wu, Tong | Harbin Institute of Technology |
Zhang, Qingrui | Sun Yat-Sen University |
Zhang, Lixian | Harbin Institute of Technology |
Keywords: Legged Robots, Bioinspired Robot Learning, Compliance and Impedance Control
Abstract: Despite recent remarkable achievements in quadruped control, it remains challenging to ensure robust and compliant locomotion in the presence of unforeseen external disturbances. Existing methods prioritize locomotion robustness over compliance, often leading to stiff, high-frequency motions and energy inefficiency. This paper therefore presents a two-stage hierarchical learning framework that learns to react actively to external force disturbances based on force estimation. In the first stage, a velocity-tracking policy is trained alongside an auto-encoder to distill historical proprioceptive features. A neural network-based estimator is learned through supervised learning, which estimates body velocity and external forces from proprioceptive measurements. In the second stage, a compliance action module, inspired by impedance control, is learned based on the pre-trained encoder and policy. This module actively adjusts velocity commands in response to external forces based on real-time force estimates. With the compliance action module, a quadruped robot can robustly handle minor disturbances while appropriately yielding to significant forces, striking a balance between robustness and compliance. Simulations and real-world experiments demonstrate that our method achieves superior performance in terms of robustness, energy efficiency, and safety. Experimental comparisons show that our method outperforms state-of-the-art RL-based locomotion controllers, and ablation studies highlight the critical role of the compliance action module.
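The compliance action module can be pictured as an admittance-style adjustment of the velocity command: small estimated forces are rejected for robustness, large ones are yielded to. The gains and threshold below are assumptions for illustration, not the learned module from the paper.

import numpy as np

def compliant_velocity(v_cmd, f_est, damping=40.0, f_deadband=10.0):
    """Adjust the velocity command from the real-time force estimate:
    ignore minor disturbances, yield along significant ones."""
    v = np.asarray(v_cmd, dtype=float)
    f = np.asarray(f_est, dtype=float)
    if np.linalg.norm(f) < f_deadband:  # robustness to small pushes
        return v
    return v + f / damping              # compliance to large forces

print(compliant_velocity([0.5, 0.0], [5.0, 0.0]))   # -> [0.5 0. ]
print(compliant_velocity([0.5, 0.0], [60.0, 0.0]))  # -> [2.  0. ]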
|
|
13:25-13:30, Paper WeBT25.2 | |
ARC: Robots Adaptive Risk-Aware Robust Control Via Distributional Reinforcement Learning |
|
Wu, Junlong | Tsinghua University |
Cheng, Yi | Tsinghua University |
Liu, Hang | University of Michigan |
Liu, Houde | Shenzhen Graduate School, Tsinghua University |
Keywords: Legged Robots, Reinforcement Learning, Robust/Adaptive Control
Abstract: Locomotion remains an unsolved challenge for robots, particularly those with complex structures operating in dynamic environments. Consequently, the control systems for such robots must place greater emphasis on risk mitigation and safety to ensure reliable and stable operation. Existing studies have explicitly incorporated risk factors into policy training but lack the ability to adaptively adjust risk sensitivity in hazardous environments. This deficiency impacts the agent's exploration during training and thus prevents it from selecting the optimal action. We introduce adaptive risk-aware control (ARC) policies based on Distributional Reinforcement Learning (Dist.RL), a novel framework that dynamically adjusts risk sensitivity in response to changing environmental conditions. Our approach integrates two key components: (1) the Inter-Quartile Range (IQR) for quantifying intrinsic environmental uncertainty, and (2) Random Network Distillation (RND) for evaluating parameter uncertainty. This dual-mechanism architecture represents a significant advancement in risk assessment methodologies. Simulations conducted on a variety of robots demonstrate that our method achieves significantly more robust performance than other approaches. Furthermore, sim2real validation on a humanoid robot confirms the practical viability of our approach.
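The IQR-based risk term maps directly onto the quantile outputs of a distributional critic, as in the sketch below: two actions with equal mean return are ranked differently once the inter-quartile spread is penalized. The adaptive weight beta stands in for the paper's RND-driven adjustment and is set by hand here.

import numpy as np

def risk_adjusted_value(quantiles, beta):
    """Score an action from predicted return quantiles, penalizing a wide
    inter-quartile range (intrinsic environmental uncertainty) by beta."""
    q = np.sort(np.asarray(quantiles, dtype=float))
    iqr = np.percentile(q, 75) - np.percentile(q, 25)
    return q.mean() - beta * iqr

# Equal means, different spreads: a risk-averse beta prefers the safe action.
safe = risk_adjusted_value([9, 10, 10, 11], beta=0.5)   # ~9.75
risky = risk_adjusted_value([2, 6, 14, 18], beta=0.5)   # ~5.0
print(safe > risky)  # True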
|
|
13:30-13:35, Paper WeBT25.3 | |
Dynamic Quadrupedal Legged and Aerial Locomotion Via Structure Repurposing |
|
Wang, Chenghao | Northeastern University |
Venkatesh Krishnamurthy, Kaushik | Northeastern University |
Pitroda, Shreyansh | Northeastern University |
Salagame, Adarsh | Northeastern University |
Mandralis, Ioannis | Caltech |
Sihite, Eric | California Institute of Technology |
Ramezani, Alireza | Northeastern University |
Gharib, Morteza | Caltech
Keywords: Legged Robots, Biologically-Inspired Robots, Biomimetics
Abstract: Multi-modal ground-aerial robots have been extensively studied, with a significant challenge lying in the integration of conflicting requirements across different modes of operation. The Husky robot family, developed at Northeastern University, and specifically the Husky v.2 discussed in this study, addresses this challenge by incorporating posture manipulation and thrust vectoring into multi-modal locomotion through structure repurposing. This quadrupedal robot features leg structures that can be repurposed for dynamic legged locomotion and flight. In this paper, we present the hardware design of the robot and report primary results on dynamic quadrupedal legged locomotion and hovering.
|
|
13:35-13:40, Paper WeBT25.4 | |
Thruster-Enhanced Locomotion: A Decoupled Model Predictive Control with Learned Contact Residuals |
|
Wang, Chenghao | Northeastern University |
Ramezani, Alireza | Northeastern University |
Keywords: Legged Robots, Machine Learning for Robot Control, Optimization and Optimal Control
Abstract: Husky-β, a robot developed by Northeastern University, serves as a research platform to explore multi-modal legged-aerial locomotion through appendage repurposing. Unlike conventional quadrupeds, its expanded degrees of freedom enable enhanced control authority via thrusters during legged motion, facilitating lateral thruster-assisted locomotion. While a unified Model Predictive Control (MPC) framework optimizing both ground reaction forces and thruster force could theoretically address this control problem, its feasibility is limited by the low torque-control bandwidth of the system’s lightweight actuators. To overcome this challenge, we propose a decoupled control architecture: a Raibert-type controller governs legged locomotion using position-based control, while an MPC regulates the thrusters augmented by learned Contact Residual Dynamics (CRD) to account for leg-ground impacts. This separation bypasses the torque-control rate bottleneck while retaining the thruster MPC to explicitly account for leg-ground impact dynamics through learned residuals. We validate this approach through both simulation and hardware experiments, showing that the decoupled control architecture with CRD performs more stable behavior in terms of push recovery and cat-type walking gait compared to the decoupled controller without CRD.
|
|
13:40-13:45, Paper WeBT25.5 | |
SF-TIM: A Simple Framework for Enhancing Quadrupedal Robot Jumping Agility by Combining Terrain Imagination and Measurement |
|
Wang, Ze | Zhejiang University |
Li, Yang | Deeprobotics |
Xu, Long | Zhejiang University |
Shi, Hao | Zhejiang University |
Ma, ZunWang | DeepRobotics |
Chu, Zhen | Deep Robotics |
Li, Chao | Deep Robotics |
Gao, Fei | Zhejiang University |
Yang, Kailun | Hunan University |
Wang, Kaiwei | Zhejiang University |
Keywords: Legged Robots, Reinforcement Learning
Abstract: Dynamic jumping on high platforms and over gaps differentiates legged robots from wheeled counterparts. Dynamic locomotion on abrupt surfaces, as opposed to walking on rough terrains, demands the integration of proprioceptive and exteroceptive perception to enable explosive movements. In this paper, we propose SF-TIM (Simple Framework combining Terrain Imagination and Measurement), a single-policy method that enhances quadrupedal robot jumping agility, while preserving their fundamental blind walking capabilities. In addition, we introduce a terrain-guided reward design specifically to assist quadrupedal robots in high jumping, improving their performance in this task. To narrow the simulation-to-reality gap in quadrupedal robot learning, we introduce a stable and high-speed elevation map generation framework, enabling zero-shot simulation-to-reality transfer of locomotion ability. Our algorithm has been deployed and validated on both the small-/large-size quadrupedal robots, demonstrating its effectiveness in real-world applications: the robot has successfully traversed various high platforms and gaps, showing the robustness of our proposed approach. A demo video has been made available at https://flysoaryun.github.io/SF-TIM.
|
|
13:45-13:50, Paper WeBT25.6 | |
Relative Tilt Suppression of a Carried Object Using Base Link Angle Adjustment on a Quadruped-Wheeled Robot |
|
Kanno, Kimikage | Tokyo University of Agriculture and Technology |
Hashimoto, Kenji | Waseda University |
Mizuuchi, Ikuo | Tokyo University of Agriculture and Technology |
Keywords: Legged Robots, Wheeled Robots, Dynamics
Abstract: In this paper, we consider the case of carrying an object with a quadruped-wheeled robot and propose and verify an angle planning method for the robot's base link (load section) that suppresses the relative tilt of the carried object. We formulated the base link angle at which the carried object does not begin to tilt relative to the base link, and from it derived an angle planning method for the base link that suppresses this relative tilting. We then verified the method in simulation and on a quadruped-wheeled robot, MELEW-3 (Meiji Leg-Wheeled Robot No. 3). As a result, both in simulation and on the actual robot, the object was carried without turning over, with the base link tilted according to the planned angles.
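The no-tilt condition the abstract formulates is, in its textbook form, a support-polygon argument: a box-shaped load starts to overturn once the line of action of gravity passes outside the support edge. A minimal version follows, assuming no sliding; this is the standard condition, not the paper's exact formulation.

import math

def max_base_tilt(width, cg_height):
    """Largest base-link tilt (rad) before a box-shaped load begins to tip:
    tipping starts when tan(theta) = (width / 2) / cg_height."""
    return math.atan2(width / 2.0, cg_height)

# A 0.3 m wide load with its centre of gravity 0.4 m above the base link
# tolerates about 20.6 degrees of tilt before it begins to overturn.
print(math.degrees(max_base_tilt(0.3, 0.4)))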
|
|
13:50-13:55, Paper WeBT25.7 | |
Imitation-Enhanced Reinforcement Learning with Privileged Smooth Transition for Hexapod Locomotion |
|
Zhang, Zhelin | Harbin Institute of Technology |
Liu, Tie | Harbin Institute of Technology |
Ding, Liang | Harbin Institute of Technology |
Wang, Haoyu | Harbin Institute of Technology |
Xu, Peng | Harbin Institute of Technology |
Yang, Huaiguang | Harbin Institute of Technology |
Gao, Haibo | Harbin Institute of Technology |
Deng, Zongquan | Harbin Institute of Technology |
Pajarinen, Joni | Aalto University |
Keywords: Legged Robots, Reinforcement Learning, Field Robots
Abstract: Deep reinforcement learning (DRL) methods have shown significant promise in controlling the movement of quadruped robots. However, for systems like hexapod robots, which feature a higher-dimensional action space, it remains challenging for an agent to devise an effective control strategy directly, and no hexapod robot has yet demonstrated highly dynamic motion. To address this, we propose imitation-enhanced reinforcement learning (IERL), a two-stage approach enabling hexapod robots to achieve dynamic motion through direct control using RL methods. Initially, imitation learning (IL) replicates a basic positional control method, creating a pre-trained policy for basic locomotion. Subsequently, the parameters of this model are used as the starting point for the reinforcement learning process to train the agent. Moreover, we incorporate a smooth transition (ST) method that lets IERL handle the changes in network inputs between the two stages and adapt to various complex network architectures incorporating latent features. Extensive simulations and real-world experiments confirm that our method effectively tackles the high-dimensional action space challenges of hexapod robots, significantly enhancing learning efficiency and enabling more natural, efficient, and dynamic movements than existing methods.
|
|
13:55-14:00, Paper WeBT25.8 | |
UniLegs: Universal Multi-Legged Robot Control through Morphology-Agnostic Policy Distillation |
|
Xi, Weijie | Jiangnan University |
Cao, Zhanxiang | Shanghai Jiao Tong University |
Ming, Chenlin | Shanghai Jiao Tong University |
Zheng, Jianying | Beihang University |
Zhou, Guyue | Tsinghua University |
Keywords: Legged Robots
Abstract: Developing controllers that generalize across diverse robot morphologies remains a significant challenge in legged locomotion. Traditional approaches either create specialized controllers for each morphology or compromise performance for generality. This paper introduces a two-stage teacher-student framework that bridges this gap through policy distillation. First, we train specialized teacher policies optimized for individual morphologies, capturing the unique optimal control strategies for each robot design. Then, we distill this specialized expertise into a single Transformer-based student policy capable of controlling robots with varying leg configurations. Our experiments across five distinct legged morphologies demonstrate that our approach preserves morphology-specific optimal behaviors, with the Transformer architecture achieving 94.47% of teacher performance on training morphologies and 72.64% on unseen robot designs. Comparative analysis reveals that Transformer-based architectures consistently outperform MLP baselines by leveraging attention mechanisms to effectively model joint relationships across different kinematic structures. We validate our approach through successful deployment on a physical quadruped robot, demonstrating the practical viability of our morphology-agnostic control framework. This work presents a scalable solution for developing universal legged robot controllers that maintain near-optimal performance while generalizing across diverse morphologies.
|
|
WeBT26 |
103B |
Localization 2 |
Regular Session |
|
13:20-13:25, Paper WeBT26.1 | |
ThermalLoc: A Vision Transformer-Based Approach for Robust Thermal Camera Relocalization in Large-Scale Environments |
|
Liu, Yu | National University of Defense Technology |
Meng, Yangtao | National University of Defense Technology |
Pan, Xianfei | National University of Defense Technology |
Jiang, Jie | National University of Defense Technology, College of Intelligence Science and Technology
Chen, Changhao | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Localization, SLAM, Data Sets for SLAM
Abstract: Thermal cameras capture environmental data through heat emission, a fundamentally different mechanism compared to visible light cameras, which rely on pinhole imaging. As a result, traditional visual relocalization methods designed for visible light images are not directly applicable to thermal images. Despite significant advancements in deep learning for camera relocalization, approaches specifically tailored for thermal camera-based relocalization remain underexplored. To address this gap, we introduce ThermalLoc, a novel end-to-end deep learning method for thermal image relocalization. ThermalLoc effectively extracts both local and global features from thermal images by integrating EfficientNet with Transformers, and performs absolute pose regression using two MLP networks. We evaluated ThermalLoc on both the publicly available thermal-odometry dataset and our own dataset. The results demonstrate that ThermalLoc outperforms existing representative methods employed for thermal camera relocalization, including AtLoc, MapNet, PoseNet, and RobustLoc, achieving superior accuracy and robustness.
|
|
13:25-13:30, Paper WeBT26.2 | |
LGPR: Local Feature Learning Brings More Generalizable Visual Place Recognition |
|
Su, Shuai | Tongji University, China |
Yang, Jingwei | Tongji University |
Du, Jiayuan | Tongji University |
Pan, Xianghui | Tongji University |
Liu, Chengju | Tongji University |
Chen, Qijun | Tongji University |
Keywords: Localization, Deep Learning for Visual Perception, Computer Vision for Automation
Abstract: We propose a Visual Place Recognition (VPR) framework that shares lightweight keypoint extraction modules for local features. Research on the joint learning of local keypoint matching and VPR is relatively scarce, and deploying real-time spatial computing on edge devices carries a high learning cost. There is also a significant difference in spatial structure between the scenes addressed by existing VPR methods and those found in practical applications. To address these issues, we design a joint learning framework for local keypoint extraction and VPR, which shares local features and fuses irregularly distributed key features in space through self-attention and cross-attention mechanisms. Our framework achieves excellent results on several VPR datasets. In particular, we introduce a new VPR dataset, called TJPark, whose spatial information differs significantly from common street-view data. Our method demonstrates that local features with strong generalization capabilities effectively enhance the generalization of VPR. Our open-source code and dataset are available at: https://github.com/ShuaiAlger/LGPR.
|
|
13:30-13:35, Paper WeBT26.3 | |
Efficient End-To-End Visual Localization for Autonomous Driving with Decoupled BEV Neural Matching |
|
Miao, Jinyu | Tsinghua University |
Wen, Tuopu | Tsinghua University |
Luo, Ziang | TsingHua University |
Qian, Kangan | Tsinghua University |
Fu, Zheng | Tsinghua University |
Wang, Yulong | Tsinghua University |
Jiang, Kun | Tsinghua University |
Yang, Mengmeng | Tsinghua University |
Huang, Jin | Tsinghua University |
Zhong, Zhihua | Tsinghua University |
Yang, Diange | Tsinghua University |
Keywords: Localization, SLAM, Representation Learning
Abstract: Accurate localization plays an important role in high-level autonomous driving systems. Conventional map-matching-based localization methods solve for the poses by explicitly matching map elements with sensor observations; they are generally sensitive to perception noise and therefore require costly hyper-parameter tuning. In this paper, we propose an end-to-end localization neural network that directly estimates vehicle poses from surrounding images, without explicitly matching perception results with HD maps. To ensure efficiency and interpretability, a decoupled BEV neural matching-based pose solver is proposed, which estimates poses in a differentiable sampling-based matching module. Moreover, the sampling space is greatly reduced by decoupling the feature representation affected by each DoF of the pose. The experimental results demonstrate that the proposed network performs decimeter-level localization with mean absolute errors of 0.19 m, 0.13 m, and 0.39 degrees in longitudinal position, lateral position, and yaw angle, while exhibiting a 68.8% reduction in inference memory usage.
|
|
13:35-13:40, Paper WeBT26.4 | |
KDMOS: Knowledge Distillation for Motion Segmentation |
|
Cao, Chunyu | South China Normal University |
Cheng, Jintao | South China Normal University |
Chen, Zeyu | South China Normal University |
Zhan, Linfan | South China Normal University |
He, Zhijian | Shenzhen Technology University |
Fan, Rui | Tongji University |
Tang, Xiaoyu | South China Normal University |
Keywords: Localization, Range Sensing, Mapping
Abstract: Moving object segmentation (MOS) is crucial for autonomous driving, enhancing localization, path planning, map construction, scene flow estimation, and future state prediction. While existing methods achieve strong performance, balancing accuracy and real-time inference remains a challenge. To address this, we propose a logits-based knowledge distillation framework for MOS, aiming to improve accuracy while maintaining real-time efficiency. Specifically, we adopt a Bird's Eye View (BEV) projection-based model as the student and a non-projection model as the teacher. To handle the severe imbalance between the moving and non-moving classes, we decouple them and apply tailored distillation strategies, allowing the teacher model to better guide the learning of key motion-related features. This approach significantly reduces false positives and false negatives. Additionally, we introduce dynamic upsampling and optimize the network architecture, achieving a 7.69% reduction in parameter count and reducing overfitting. Our method achieves a notable IoU of 78.8% on the hidden test set of the SemanticKITTI-MOS dataset and delivers competitive results on the Apollo dataset. The KDMOS implementation is available at https://github.com/SCNU-RISLAB/KDMOS.
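The logits-based, class-decoupled distillation can be sketched as a weighted KL divergence between softened teacher and student distributions, with the scarce moving class up-weighted; the temperature and weights below are assumed placeholders, not the paper's tuned values.

import numpy as np

def softmax(z, t=1.0):
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / t)
    return e / e.sum(axis=-1, keepdims=True)

def decoupled_kd_loss(student_logits, teacher_logits, moving_mask,
                      t=2.0, w_moving=2.0, w_static=1.0):
    """Per-point KL(teacher || student) on softened logits, weighted so the
    under-represented moving class contributes more to the loss."""
    p_t = softmax(teacher_logits, t)
    p_s = softmax(student_logits, t)
    kl = (p_t * (np.log(p_t + 1e-9) - np.log(p_s + 1e-9))).sum(axis=-1)
    w = np.where(moving_mask, w_moving, w_static)
    return (w * kl).mean() * t * t  # t^2 keeps gradient scale, as in standard KD

# Per-point logits over [static, moving] for student and teacher models.
s = np.array([[2.0, 0.1], [0.3, 1.5]])
te = np.array([[2.5, -0.5], [-0.2, 2.2]])
print(decoupled_kd_loss(s, te, moving_mask=np.array([False, True])))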
|
|
13:40-13:45, Paper WeBT26.5 | |
Fast(er) Robust Point Cloud Alignment Using Lie Algebra |
|
Sexton, Jean-Thomas | Université Laval |
Morin, Michael | Université Laval |
Giguère, Philippe | Université Laval |
Gaudreault, Jonathan | Laval University |
Keywords: Localization, Probability and Statistical Methods, Software Tools for Robot Programming
Abstract: We present a novel Lie algebra based Iterative Reweighted Least Squares (IRLS) algorithm for robust 3D point cloud alignment. We reformulate the optimal update computation into a compact form that requires only one pass through the data. Although this reformulation does not alter the asymptotic computational complexity, it is well suited to contemporary hardware architectures, yielding significant practical speedups. In extensive experiments on challenging benchmark datasets with added correspondence corruption, the method is consistently at least four times faster than prior methods while being mathematically equivalent, demonstrating that it is well suited for time-critical applications.
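The single-pass structure is visible in a generic IRLS step on SE(3): the normal equations are accumulated once over the correspondences with a robust reweighting, then solved for a Lie-algebra increment. This sketch shows the general scheme, not the paper's compact reformulation or its speedups.

import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])

def expm_so3(w):
    """Rodrigues' formula: exponential map from so(3) to SO(3)."""
    th = np.linalg.norm(w)
    if th < 1e-9:
        return np.eye(3)
    K = skew(w / th)
    return np.eye(3) + np.sin(th) * K + (1.0 - np.cos(th)) * K @ K

def irls_step(R, t, P, Q, c=0.1):
    """One robust IRLS update: accumulate J^T W J and J^T W r in a single
    pass over the correspondences, then solve for [dt, dtheta] and retract."""
    H = np.zeros((6, 6))
    g = np.zeros(6)
    for p, q in zip(P, Q):
        r = R @ p + t - q                         # residual
        w = (c / (c + r @ r)) ** 2                # Geman-McClure-style weight
        J = np.hstack([np.eye(3), -skew(R @ p)])  # d r / d [t, theta]
        H += w * J.T @ J
        g += w * J.T @ r
    dx = -np.linalg.solve(H, g)
    return expm_so3(dx[3:]) @ R, t + dx[:3]

# Recover a known rigid transform from clean correspondences.
rng = np.random.default_rng(1)
P = rng.standard_normal((100, 3))
R_true, t_true = expm_so3(np.array([0.05, -0.02, 0.1])), np.array([0.1, 0.0, -0.05])
Q = P @ R_true.T + t_true
R_est, t_est = np.eye(3), np.zeros(3)
for _ in range(10):
    R_est, t_est = irls_step(R_est, t_est, P, Q)
print(np.allclose(R_est, R_true, atol=1e-5), np.allclose(t_est, t_true, atol=1e-5))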
|
|
13:45-13:50, Paper WeBT26.6 | |
CVD-SfM: A Cross-View Deep Front-End Structure-From-Motion System for Sparse Localization in Multi-Altitude Scenes |
|
Li, Yaxuan | Stevens Institute of Technology |
Huang, Yewei | Dartmouth College |
Gaudel, Bijay | Stevens Institute of Technology |
Jafarnejadsani, Hamidreza | Stevens Institute of Technology |
Englot, Brendan | Stevens Institute of Technology |
Keywords: Localization, Computer Vision for Automation, Data Sets for Robotic Vision
Abstract: We present a novel multi-altitude camera pose estimation system, addressing the challenges of robust and accurate localization across varied altitudes when only considering sparse image input. The system effectively handles diverse environmental conditions and viewpoint variations by integrating the cross-view transformer, deep features, and structure-from-motion into a unified framework. To benchmark our method and foster further research, we introduce two newly collected datasets specifically tailored for multi-altitude camera pose estimation; datasets of this nature remain rare in the current literature. The proposed framework has been validated through extensive comparative analyses on these datasets, demonstrating that our system achieves superior performance in both accuracy and robustness for multi-altitude sparse pose estimation tasks compared to existing solutions, making it well suited for real-world robotic applications such as aerial navigation, search and rescue, and automated inspection.
|
|
13:50-13:55, Paper WeBT26.7 | |
Image-Based Relocalization and Alignment for Long-Term Monitoring of Dynamic Underwater Environments |
|
Gorry, Beverley | Queensland University of Technology, QUT Centre for Robotics |
Fischer, Tobias | Queensland University of Technology |
Milford, Michael J | Queensland University of Technology |
Fontan, Alejandro | Queensland University of Technology |
Keywords: Localization
Abstract: Effective monitoring of underwater ecosystems is crucial for tracking environmental changes, guiding conservation efforts, and ensuring long-term ecosystem health. However, automating underwater ecosystem management with robotic platforms remains challenging due to the complexities of underwater imagery, which pose significant difficulties for traditional visual localization methods. We propose an integrated pipeline that combines Visual Place Recognition (VPR), feature matching, and image segmentation on images extracted from video sequences. This method enables robust identification of revisited areas, estimation of rigid transformations, and downstream analysis of ecosystem changes. Furthermore, we introduce the SQUIDLE+ VPR Benchmark—the first large-scale underwater VPR benchmark designed to leverage an extensive collection of unstructured data from multiple robotic platforms, spanning time intervals from days to years. The dataset encompasses diverse trajectories with varying overlap and diverse seafloor types captured under different environmental conditions, including differences in depth, lighting, and turbidity. Our code is available at: https://github.com/bev-gorry/underloc
|
|
13:55-14:00, Paper WeBT26.8 | |
Applicability Analysis for Optical Cooperative Localization |
|
Li, Yixian | Beijing Institute of Technology |
Wang, Qiang | Beijing Institute of Technology |
Wu, Jiaxing | Beijing Institute of Technology |
Zhao, Wuhong | Beijing Institute of Technology |
Hu, Shengrong | Beijing Institute of Technology |
Hao, Zhonghu | Beijing Institute of Technology |
Keywords: Localization
Abstract: For optical cooperative localization, which employs optical beacons with prior features as cooperative targets, a fundamental prerequisite is that the beacons are always captured by the vision sensors during the entire localization process. In other words, optical cooperative localization has an applicability issue with respect to the relative range between beacons and vision sensors, yet a corresponding analysis method has so far been missing. In this work, we propose a general applicability analysis method for optical cooperative localization to fill this gap. We translate the problem into constructing a multi-constraint model incorporating geometrics and radiometrics to describe the relationship between optical sensor parameters and relative range or depth. For parameterized beacons and vision sensors, the geometric constraint is related to the imaging quantities, and the radiometric constraint is determined by the radiation properties. Numerical evaluations are performed over the range of parameters encountered in practice, and real-world experiments validate the proposed applicability analysis. The results demonstrate the effectiveness of the method and are instructive for the real-world deployment of optical cooperative localization.
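In its simplest point-source pinhole form, the multi-constraint idea reduces to taking the tighter of a geometric and a radiometric range bound, as sketched below; all parameter values are illustrative, and the full model in the paper includes further imaging and radiation terms.

def max_applicable_range(f_mm, beacon_dia_m, pixel_um, min_pixels,
                         intensity_w_sr, min_irradiance_w_m2):
    """Applicable range as the tighter of two bounds:
    geometric:   image size f * D / R >= min_pixels * pixel pitch
    radiometric: irradiance I / R^2  >= minimum detectable irradiance."""
    r_geometric = (f_mm * 1e-3) * beacon_dia_m / (min_pixels * pixel_um * 1e-6)
    r_radiometric = (intensity_w_sr / min_irradiance_w_m2) ** 0.5
    return min(r_geometric, r_radiometric)

# 16 mm lens, 0.1 m beacon, 3.45 um pixels, a >= 4 px image requirement,
# 0.5 W/sr beacon intensity, and a 1e-7 W/m^2 detection floor.
print(max_applicable_range(16, 0.1, 3.45, 4, 0.5, 1e-7))  # ~116 m (geometry-limited)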
|
|
WeBT27 |
103C |
Performance Evaluation and Benchmarking 2 |
Regular Session |
Co-Chair: Zhu, Jihong | University of York |
|
13:20-13:25, Paper WeBT27.1 | |
IGDrivSim: A Benchmark for the Imitation Gap in Autonomous Driving |
|
Grislain, Clémence | Sorbonne University |
Vuorio, Risto | University of Oxford |
Lu, Cong | University of British Columbia |
Whiteson, Shimon | University of Oxford |
Keywords: Reinforcement Learning, Imitation Learning, Simulation and Animation
Abstract: Developing autonomous vehicles that can navigate complex environments with human-level safety and efficiency is a central goal in self-driving research. A common approach to achieving this is imitation learning, where agents are trained to mimic human expert demonstrations collected from real-world driving scenarios. However, discrepancies between human perception and the self-driving car’s sensors can introduce an imitation gap, leading to imitation learning failures. In this work, we introduce IGDrivSim, a benchmark built on top of the Waymax simulator, designed to investigate the effects of the imitation gap in learning autonomous driving policy from human expert demonstrations. Our experiments show that this perception gap between human experts and self-driving agents can hinder the learning of safe and effective driving behaviors. We further show that combining imitation with reinforcement learning, using a simple penalty reward for prohibited behaviors, effectively mitigates these failures. All code developed for this work is released as open source at https://github.com/clemgris/IGDrivSim.
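The penalty-augmented objective described above can be sketched in a few lines; the function names and the penalty weighting are hypothetical stand-ins, not IGDrivSim's API.

```python
def shaped_reward(imitation_log_prob, prohibited, penalty=1.0):
    """Combine imitation with a simple penalty for prohibited behaviors.

    `imitation_log_prob` is the policy's log-likelihood of the expert
    action; `prohibited` flags events such as collisions or off-road
    driving. The weighting is illustrative, not the paper's setting.
    """
    return imitation_log_prob - penalty * float(prohibited)

# A trajectory-level objective would sum this per step, e.g.:
steps = [(-0.2, False), (-0.4, True), (-0.1, False)]
print(sum(shaped_reward(lp, bad) for lp, bad in steps))
```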
|
|
13:25-13:30, Paper WeBT27.2 | |
Benchmark for Evaluating Long-Term Localization in Indoor Environments under Substantial Static and Dynamic Scene Changes |
|
Trekel, Niklas | University of Bonn |
Guadagnino, Tiziano | University of Bonn |
Läbe, Thomas | University of Bonn |
Wiesmann, Louis | University of Bonn |
Aguiar, Perrine | University of Bonn |
Behley, Jens | University of Bonn |
Stachniss, Cyrill | University of Bonn |
Keywords: Localization, Performance Evaluation and Benchmarking
Abstract: Accurate localization is crucial for the autonomous operation of mobile robots. Specifically for indoor scenarios, localization algorithms typically rely on a previously generated map. However, many real-world sites like warehouses or healthcare environments violate the underlying assumption that the robot's surroundings are mainly static. In this paper, we introduce a new dataset and a benchmark that enable evaluating and comparing indoor localization methods in complex and changing real-world scenarios. While several datasets for indoor scenes exist, only a few combine the long-term localization aspect of repeatedly revisiting the same environment under varying conditions with precise ground truth over multiple rooms. Our dataset comprises various sequences recorded with a wheeled robot covering an office environment. We provide data from two 2D LiDARs, multiple consumer-grade RGB-D cameras, and the robot's wheel odometry. By densely placing fiducial markers on every room ceiling, we can also provide accurate pose information within a single global frame for the whole environment, estimated through an additional upward-facing camera. We evaluate existing localization algorithms on our data and make the dataset together with a server-based benchmark evaluation publicly available. This facilitates an unbiased evaluation of localization approaches and enables further research on their application in challenging indoor scenarios.
|
|
13:30-13:35, Paper WeBT27.3 | |
A Multi-Modal Benchmark for Long-Range Depth Evaluation in Adverse Weather Conditions |
|
Walz, Stefanie | Torc Robotics |
Ramazzina, Andrea | Mercedes-Benz AG - Technical University of Munich |
Scheuble, Dominik | Mercedes-Benz AG |
Brucker, Samuel | Torc Robotics |
Zuber, Alexander | Mercedes-Benz |
Ritter, Werner | Daimler AG |
Bijelic, Mario | Princeton University |
Heide, Felix | Princeton University & Torc Robotics |
Keywords: RGB-D Perception, Data Sets for Robotic Vision, Deep Learning Methods
Abstract: Depth estimation is a cornerstone computer vision application, critical for scene understanding and autonomous driving. In real-world scenarios, achieving reliable depth perception under adverse weather—e.g. in fog and rain—is crucial to ensure safety and system robustness. However, quantitatively evaluating the performance of depth estimation methods in these scenarios is challenging due to the difficulty of obtaining ground truth data. A promising approach is the use of weather chambers for simulating diverse weather conditions in a controlled environment. However, current datasets are limited in distance and lack a dense ground truth. To address this gap, we introduce a novel evaluation benchmark that extends depth evaluation up to 200 meters under clear, foggy, and rainy conditions. To this end, we employ a multi-modal sensor setup, including state-of-the-art stereo RGB, RCCB, and Gated camera systems, and a long-range LiDAR sensor. Moreover, using a high-end laser scanner, we collect a dense ground truth geometry with millimeter-level accuracy. This comprehensive benchmark enables more precise and accurate evaluation of different models and multiple sensing modalities, including at far distances. Data and code will be released upon publication.
|
|
13:35-13:40, Paper WeBT27.4 | |
OSMa-Bench: Evaluating Open Semantic Mapping under Varying Lighting Conditions |
|
Popov, Maxim | ITMO University |
Kurkova, Regina | ITMO University |
Iumanov, Mikhail | ITMO University |
Mahmoud, Jaafar | ITMO University |
Kolyubin, Sergey | ITMO University |
Keywords: Performance Evaluation and Benchmarking, Semantic Scene Understanding, Mapping
Abstract: Open Semantic Mapping (OSM) is a key technology in robotic perception, combining semantic segmentation and SLAM techniques. This paper introduces a dynamically configurable and highly automated LLM/LVLM-powered pipeline for evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark). The study focuses on evaluating state-of-the-art semantic mapping algorithms under varying indoor lighting conditions, a critical challenge in indoor environments. We introduce a novel dataset with simulated RGB-D sequences and ground truth 3D reconstructions, facilitating the rigorous analysis of mapping performance across different lighting conditions. Through experiments on leading models such as ConceptGraphs, BBQ, and OpenScene, we evaluate the semantic fidelity of object recognition and segmentation. Additionally, we introduce a Scene Graph evaluation method to analyze the ability of models to interpret semantic structure. The results provide insights into the robustness of these models, informing future research directions for developing resilient and adaptable robotic systems. Our code is available at https://be2rlab.github.io/OSMa-Bench/.
|
|
13:40-13:45, Paper WeBT27.5 | |
ConfigBot: Adaptive Resource Allocation for Robot Applications in Dynamic Environments |
|
Dwivedula, Rohit | UT Austin |
Modak, Sadanand | The University of Texas at Austin |
Akella, Aditya | UT Austin |
Biswas, Joydeep | The University of Texas at Austin |
Kim, Daehyeok | The University of Texas at Austin |
Rossbach, Christopher | The University of Texas at Austin |
Keywords: Performance Evaluation and Benchmarking, Failure Detection and Recovery, Software Tools for Benchmarking and Reproducibility
Abstract: The growing use of service robots in dynamic environments requires flexible management of on-board compute resources to optimize the performance of diverse tasks such as navigation, localization, and perception. Current robot deployments often rely on static OS configurations and system over-provisioning. However, these are suboptimal because they do not account for variations in resource usage, resulting in poor system-wide behavior such as robot instability or inefficient resource use. This paper presents ConfigBot, a novel system designed to adaptively reconfigure robot applications to meet a predefined performance specification by leveraging runtime profiling and automated configuration tuning. Through experiments on multiple real robots, each running a different stack with diverse and potentially context-dependent performance requirements, we illustrate ConfigBot's efficacy in maintaining system stability and optimizing resource allocation. Our findings highlight the promise of automatic system configuration tuning for robot deployments, including adaptation to dynamic changes. Source code is available at: https://github.com/ldos-project/configbot
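A toy sketch of the profile-check-retune loop the abstract describes follows; the profiler, the latency specification, and the tuning rule are invented stand-ins, not ConfigBot's implementation.

```python
import random  # stand-in for a real runtime profiler

def profile():
    """Hypothetical runtime profile: per-task latency in milliseconds."""
    return {"navigation": random.uniform(20, 120),
            "perception": random.uniform(30, 150)}

SPEC = {"navigation": 100.0, "perception": 120.0}  # latency bounds (ms)

def retune(config, violations):
    """Toy tuner: shift CPU shares toward tasks violating the spec."""
    for task in violations:
        config[task] = min(1.0, config[task] + 0.1)
    total = sum(config.values())
    return {t: s / total for t, s in config.items()}  # renormalize shares

config = {"navigation": 0.5, "perception": 0.5}
for _ in range(5):  # monitoring loop, one iteration per profiling window
    latencies = profile()
    violations = [t for t, lat in latencies.items() if lat > SPEC[t]]
    if violations:
        config = retune(config, violations)
print(config)
```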
|
|
13:45-13:50, Paper WeBT27.6 | |
Investigating the Fitness of Finger Grippers for Dynamic Tactile Manipulation under Static Object Conditions |
|
Yildirim, Mehmet Can | Technical University of Munich |
Choong, Dee Hva | Technical University of Munich |
Ringwald, Johannes | Technische Universität München |
Kirschner, Robin Jeanne | TU Munich, Institute for Robotics and Systems Intelligence |
Le Mesle, Valentin | Technical University of Munich |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Performance Evaluation and Benchmarking, Grippers and Other End-Effectors
Abstract: Robotic system development must adopt a holistic approach for tactile and dynamic tasks, shifting from the decoupled design of end-effectors and robot manipulators for traditional sequential tasks. Although established metrics exist for traditional tasks, such as pick-and-place, they lack the nuanced evaluation required for dynamic and tactile operations. Accordingly, this paper introduces an integrated framework that defines and unifies decoupled and coupled gripper metrics into a single perspective. We categorise gripper metrics based on their interaction with the robot manipulator, which can be entirely decoupled, coupled by time-sequence, or fully coupled. Using this classification, we propose 16 metrics to evaluate force control, force reaction, and efficiency. We introduce three new experimental setups and describe the corresponding procedures to quantify these metrics. Results from three commercial finger grippers demonstrate the efficacy of the proposed metrics, revealing each gripper’s strengths and limitations when integrated into different manipulator systems. Incorporating these metrics into performance reviews provides a comprehensive evaluation of robotic system fitness, considering dynamic, real-time challenges. This supports informed design choices and enhances tactile manipulation tasks.
|
|
WeBT28 |
104 |
Marine Robotics 6 |
Regular Session |
|
13:20-13:25, Paper WeBT28.1 | |
Dual-Mode Passive Fault-Tolerant Control for Underwater Vehicles with Actuator Faults and Time-Varying Disturbances |
|
Chen, Yizong | Hunan University |
Wei, Jun | Hunan University |
Miao, Zhiqiang | Hunan University |
Liu, Kangcheng | Hunan University (HNU); Previously with the California Institute |
Wang, Yaonan | Hunan University |
Keywords: Marine Robotics, Robust/Adaptive Control, Failure Detection and Recovery
Abstract: This paper investigates the control problem of underwater vehicles subject to time-varying external disturbances and actuator faults. A novel passive fault-tolerant control (PFTC) scheme is developed to address the coupled disturbance-fault dynamics inherent in underwater vehicle systems. The proposed dual-mode architecture comprises: 1) a robust fault-tolerant control scheme based on high-order sliding mode observers (HOSMOs) for minor fault scenarios, which effectively compensates for bounded disturbances and partial actuator degradation; 2) a conditionally triggered estimation mechanism integrated with fault-tolerant control allocation (FTCA) and HOSMOs for severe fault conditions, enabling fault estimation and model compensation via event-triggered parameter updating. The hybrid architecture ensures computational efficiency by activating the estimation module only when predefined triggering conditions are violated. Comprehensive experimental results validate the superiority of the proposed method in maintaining stability and performance under various fault conditions. This work provides a systematic solution for underwater vehicle control under coupled disturbance-fault conditions, with verified real-time performance and implementation feasibility.
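The dual-mode switching logic can be illustrated with a small sketch; the residual threshold and mode names are assumptions made for illustration only, not the paper's triggering condition.

```python
def select_mode(residual_norm, threshold=0.5):
    """Dual-mode switching sketch: stay in the lightweight robust
    controller for minor faults, and trigger the estimation plus
    control-allocation path only when the observer residual violates
    the predefined condition. The threshold value is illustrative."""
    if residual_norm <= threshold:
        return "robust_hosmo"          # minor-fault mode, no estimation
    return "ftca_with_estimation"      # severe-fault mode, event-triggered

for r in (0.1, 0.4, 0.9):
    print(r, select_mode(r))
```

Keeping the estimator dormant until the trigger fires is what gives the hybrid architecture its computational efficiency.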
|
|
13:25-13:30, Paper WeBT28.2 | |
MVINS: A Magnetism&Vision Aided Inertial Navigation System for Autonomous Underwater Vehicles |
|
Zhang, Bingbing | Zhejiang University |
Liu, Shuo | Zhejiang University |
Ji, Daxiong | Zhejiang University |
Wang, Tao | Zhejiang University |
Zhou, Shanmin | Zhejiang University |
Wang, Zhengfei | Zhejiang University |
Qi, Xiaokang | Zhejiang University |
Xu, Wen | Zhejiang University |
Keywords: Marine Robotics, Sensor Fusion, SLAM
Abstract: We present a robust underwater navigation system that integrates magnetic, visual, and inertial measurements from commercial off-the-shelf sensors. Visual Inertial Navigation Systems (VINS) face challenges when used for Autonomous Underwater Vehicle (AUV) localization in perceptually degraded environments. First, traditional VINS methods struggle to accurately detect sufficient loops due to several factors: feature scarcity, environmental similarities, limited visibility, orientation changes, and constrained computational resources. Second, the yaw is unobservable in VINS and it may drift rapidly without distinct features. To address these issues, we propose a novel system that enhances loop closure by fusing magnetic signatures from a low-cost alternating magnetic field coil with multi-scale mapping and hierarchical place recognition. Additionally, we utilize geomagnetic fields to align feature descriptors, improving robustness to orientation variations. Our system also refines yaw estimations by leveraging geomagnetic data, aligning them with global references to mitigate drift. Experimental results validate the improved performance of the proposed system.
|
|
13:30-13:35, Paper WeBT28.3 | |
Decentralized Gaussian Process Classification and an Application in Subsea Robotics |
|
Gao, Yifei | Virginia Tech |
He, Hans | Virginia Tech |
Stilwell, Daniel | Virginia Tech |
McMahon, James | The Naval Research Laboratory |
Keywords: Marine Robotics
Abstract: Teams of cooperating autonomous underwater vehicles (AUVs) rely on acoustic communication for coordination, yet this communication medium is constrained by limited range, multi-path effects, and low bandwidth. One way to address the uncertainty associated with acoustic communication is to learn the communication environment in real-time. We address the challenge of a team of robots building a map of the probability of communication success from one location to another in real-time. This is a decentralized classification problem -- communication events are either successful or unsuccessful -- where AUVs share a subset of their communication measurements to build the map. The main contribution of this work is a rigorously derived data sharing policy that selects measurements to be shared among AUVs. We experimentally validate our proposed sharing policy using real acoustic communication data collected from teams of Virginia Tech 690 AUVs, demonstrating its effectiveness in underwater environments.
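One common way to formalize such a sharing policy is greedy selection of the measurements that most reduce Gaussian-process posterior variance over the map. The sketch below uses that generic criterion with an RBF kernel; it is only a proxy for the paper's rigorously derived policy, and all sizes and length scales are illustrative.

```python
import numpy as np

def rbf(A, B, ls=20.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def greedy_share(X_local, X_query, k, noise=0.1):
    """Pick k local measurement sites whose sharing most reduces the
    summed GP posterior variance at the query grid (a standard
    informativeness proxy, not the paper's actual policy)."""
    chosen = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(len(X_local)):
            if i in chosen:
                continue
            S = X_local[chosen + [i]]
            K = rbf(S, S) + noise * np.eye(len(S))
            Kq = rbf(X_query, S)
            var = 1.0 - np.einsum("ij,jk,ik->i", Kq, np.linalg.inv(K), Kq)
            score = -var.sum()  # less remaining variance is better
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen

X_local = np.random.rand(30, 2) * 100   # candidate measurement sites (m)
X_query = np.random.rand(50, 2) * 100   # map locations to predict
print(greedy_share(X_local, X_query, k=3))
```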
|
|
13:35-13:40, Paper WeBT28.4 | |
Reservoir Computing-Enhanced Tube-MPC: Real-Time Self-Healing Control for Robust AUV Path Following under Dynamic Faults |
|
Xu, Lie | Zhejiang University |
Ji, Daxiong | Zhejiang University |
Tan, Yan Zhi | National University of Singapore |
Goh, Eng Wei | BEEX |
Ang Jr, Marcelo H | National University of Singapore |
Keywords: Marine Robotics
Abstract: This paper presents a novel control framework that integrates reservoir computing (RC) with Tube model predictive control (Tube-MPC) for robust path following in quadrotor autonomous underwater vehicles (QAUVs) under sudden fault conditions. The proposed RC-Tube-MPC leverages the dynamic modeling capabilities of RC to efficiently approximate complex nonlinear behaviors, while Tube correction ensures robust performance despite model uncertainties and external disturbances. Comparative simulations demonstrate that RC-Tube-MPC outperforms alternative approaches in terms of path following accuracy and computational efficiency. Additionally, the influence of training data length on learning performance is analyzed, revealing that the proposed method maintains superior performance across various data regimes. Notably, in severe fault scenarios, such as a fault factor of 0.3, RC-Tube-MPC uniquely restores convergence to the reference path. These results underscore the potential of the integrated RC-Tube-MPC approach for real-time control applications in dynamic, fault-prone underwater environments.
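A reservoir-computing model of the kind referenced above can be sketched as an echo state network: a fixed random recurrent layer plus a ridge-regression readout. Sizes, the spectral-radius scaling, and the placeholder training data below are illustrative defaults, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

class Reservoir:
    """Minimal echo state network with a leaky-integrator state."""
    def __init__(self, n_in, n_res=200, rho=0.9, leak=0.3):
        W = rng.standard_normal((n_res, n_res))
        W *= rho / max(abs(np.linalg.eigvals(W)))   # set spectral radius
        self.W, self.leak = W, leak
        self.W_in = rng.standard_normal((n_res, n_in)) * 0.1
        self.x = np.zeros(n_res)

    def step(self, u):
        pre = self.W @ self.x + self.W_in @ u
        self.x = (1 - self.leak) * self.x + self.leak * np.tanh(pre)
        return self.x

def train_readout(states, targets, ridge=1e-3):
    """Ridge regression: the only trained part of the model."""
    X = np.asarray(states)
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ np.asarray(targets))

res = Reservoir(n_in=3)
states = [res.step(rng.standard_normal(3)).copy() for _ in range(500)]
targets = rng.standard_normal((500, 2))       # placeholder AUV outputs
W_out = train_readout(states, targets)        # predict via states @ W_out
```

Because only the linear readout is trained, such models can approximate nonlinear vehicle dynamics cheaply, which is what makes them attractive inside a real-time MPC loop.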
|
|
13:40-13:45, Paper WeBT28.5 | |
A Tightly Coupled Inertial-Sonar Fusion for Localization of Underwater Robots |
|
Jibo, Bai | University of Shanghai for Science and Technology, |
Zhu, Daqi | USST |
Mingzhi, Chen | The School of Mechanical Engineering, University of Shanghai For |
Yuan, Liu | The School of Mechanical Engineering, University of Shanghai For |
Hongfei, Li | The School of Mechanical Engineering, University of Shanghai For |
Keywords: Marine Robotics
Abstract: This paper proposes a tightly coupled fusion method for inertial and forward-looking sonar (FLS) data, integrating underwater image observations from the FLS into the inertial odometry for underwater robot localization. Since the FLS images provide only horizontal plane information, this work focuses on the positioning of the underwater robot in a 2D plane. In underwater navigation systems, relying solely on inertial measurements often leads to error accumulation and suboptimal localization results, as demonstrated in previous studies. To address this issue, we integrate the FLS data into the inertial odometry. Specifically, we convert the sonar images into 2D underwater point clouds and use an Error State Kalman Filter (ESKF) to fuse Inertial Measurement Unit (IMU) data with the sonar point cloud data for joint estimation of the initial pose. Next, edge feature point clouds are extracted from the sonar images using a horizontal scanning method. Finally, by constructing edge feature error terms, we constrain the relative position changes between two adjacent sonar frames. Through experiments in an underwater simulation environment (Dave) and a real pool, the results show that the proposed fusion method can significantly improve the localization accuracy of underwater robots compared to using inertial odometry or sonar odometry alone.
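The IMU-plus-sonar fusion step can be illustrated with a simplified planar Kalman filter. A real ESKF propagates an error state including IMU biases; the sketch below, with invented noise values, keeps only the predict/update structure.

```python
import numpy as np

# Minimal planar Kalman fusion in the spirit of the paper's ESKF:
# IMU dead-reckoning predicts [x, y, yaw]; a pose from sonar
# point-cloud registration corrects it.
x = np.zeros(3)                    # [x, y, yaw]
P = np.eye(3) * 0.01
Q = np.diag([0.02, 0.02, 0.005])   # process noise (illustrative)
R = np.diag([0.1, 0.1, 0.02])      # sonar registration noise

def predict(x, P, v, omega, dt):
    F = np.array([[1, 0, -v * np.sin(x[2]) * dt],
                  [0, 1,  v * np.cos(x[2]) * dt],
                  [0, 0, 1]])
    x = x + np.array([v * np.cos(x[2]) * dt,
                      v * np.sin(x[2]) * dt,
                      omega * dt])
    return x, F @ P @ F.T + Q

def update(x, P, z_sonar):
    H = np.eye(3)                  # sonar observes the full planar pose
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x = x + K @ (z_sonar - H @ x)
    return x, (np.eye(3) - K @ H) @ P

x, P = predict(x, P, v=1.0, omega=0.1, dt=0.1)
x, P = update(x, P, z_sonar=np.array([0.1, 0.0, 0.01]))
```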
|
|
13:45-13:50, Paper WeBT28.6 | |
RUSSO: Robust Underwater SLAM with Sonar Optimization against Visual Degradation (I) |
|
Pan, Shu | Harbin Institute of Technology, Shenzhen |
Hong, Ziyang | Heriot-Watt University |
Hu, Zhangrui | Harbin Institute of Technology, Shenzhen |
Xu, Xiandong | Tianjin University |
Lu, Wenjie | Harbin Institute of Technology (Shenzhen) |
Hu, Liang | Harbin Institute of Technology, Shenzhen |
Keywords: Marine Robotics, SLAM, Mapping
Abstract: Visual degradation in underwater environments poses unique and significant challenges, which distinguishes underwater SLAM from popular vision-based SLAM on the ground. In this paper, we propose RUSSO, a robust underwater SLAM system which fuses a stereo camera, an inertial measurement unit (IMU), and imaging sonar to achieve robust and accurate localization in challenging underwater environments for 6 degrees of freedom (DoF) estimation. During visual degradation, the system is reduced to a sonar-inertial system estimating 3-DoF poses. The sonar pose estimation serves as a strong prior for IMU propagation, thereby enhancing the reliability of pose estimation with IMU propagation. Additionally, we propose a SLAM initialization method that leverages the imaging sonar to counteract the lack of visual features during the initialization stage of SLAM. We extensively validate RUSSO through experiments in simulation, pool, and sea scenarios. The results demonstrate that RUSSO achieves better robustness and localization accuracy compared to the state-of-the-art visual-inertial SLAM systems, especially in visually challenging scenarios. To the best of our knowledge, this is the first work to fuse a stereo camera, IMU, and imaging sonar to realize robust underwater SLAM against visual degradation.
|
|
13:50-13:55, Paper WeBT28.7 | |
Adaptive SNDO-STSM Hierarchical Robust Control of Autonomous Underwater Vehicle: Theory and Experimental Validation (I)
|
Xiong, Xinyang | Huazhong University of Science and Technology |
Xiang, Xianbo | Huazhong University of Science and Technology |
Duan, Yu | Huazhong University of Science and Technology |
Yang, Shaolong | Huazhong University of Science and Technology |
Keywords: Marine Robotics
Abstract: This article mainly investigates the depth tracking problem for an underactuated autonomous underwater vehicle (AUV), considering system dynamic uncertainties and unknown external disturbances. First, a three-layer hierarchical control system is developed for the underactuated AUV, enhancing system flexibility and control performance. Second, an adaptive line-of-sight guidance law is presented, which estimates the attack angle and mitigates the influence of unknown disturbances on the guidance layer. Third, a finite-time control law based on a second-order sliding-mode algorithm is proposed along with a modified sliding surface-based nonlinear observer for approximating complex uncertainties. This provides additional robustness to overcome intense transient impact. Furthermore, the sensitivity to precise modeling and the chattering phenomenon in actuators are relaxed. In addition, an adaptive saturation compensator is given to overcome the adverse effects caused by control input saturation constraints without singularities. Finally, the proposed hierarchical control system is validated in practice based on a lightweight AUV prototype. Comparative experiments are conducted under the influence of various types of complicated time-varying disturbances. The experimental results demonstrate the effectiveness and advantages of the proposed control system in practical scenarios.
|
|
13:55-14:00, Paper WeBT28.8 | |
OASIS: Real-Time Opti-Acoustic Sensing for Intervention Systems in Unstructured Environments |
|
Phung, Amy | MIT-WHOI Joint Program |
Camilli, Richard | Woods Hole Oceanographic Institution |
Keywords: Marine Robotics, Field Robots, Mapping
Abstract: High resolution underwater 3D scene reconstruction is crucial for various applications, including construction, infrastructure maintenance, monitoring, exploration, and scientific investigation. Prior work has leveraged the complementary sensing modalities of imaging sonars and optical cameras for opti-acoustic 3D scene reconstruction, demonstrating improved results over methods which rely solely on either sensor. However, while most existing approaches focus on offline reconstruction, real-time spatial awareness is essential for both autonomous and piloted underwater vehicle operations. This paper presents OASIS, an opti-acoustic fusion method that integrates data from optical images with voxel carving techniques to achieve real-time 3D reconstruction of unstructured underwater workspaces. Our approach utilizes an “eye-in-hand” configuration, which leverages the dexterity of robotic manipulator arms to capture multiple workspace views across a short baseline. We validate OASIS through tank-based experiments and present qualitative and quantitative results that highlight its utility for underwater manipulation tasks.
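Voxel carving itself is a classic operation: a voxel survives only if it projects inside the object silhouette in every view. The sketch below shows this with a plain pinhole camera; the intrinsics, the single identity-rotation pose, and the permissive mask are illustrative, and OASIS additionally fuses acoustic cues.

```python
import numpy as np

def carve(voxels_world, poses_cam_from_world, K, masks):
    """Keep voxels that project inside the silhouette mask in all views."""
    keep = np.ones(len(voxels_world), dtype=bool)
    H, W = masks[0].shape
    for T, mask in zip(poses_cam_from_world, masks):
        p_cam = (T[:3, :3] @ voxels_world.T + T[:3, 3:4]).T
        in_front = p_cam[:, 2] > 1e-6
        uvw = (K @ p_cam.T).T
        u = uvw[:, 0] / np.clip(uvw[:, 2], 1e-6, None)
        v = uvw[:, 1] / np.clip(uvw[:, 2], 1e-6, None)
        inside = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        ui = np.clip(u.astype(int), 0, W - 1)
        vi = np.clip(v.astype(int), 0, H - 1)
        keep &= inside & mask[vi, ui]
    return voxels_world[keep]

vox = np.stack(np.meshgrid(*[np.linspace(-1, 1, 10)] * 3), -1).reshape(-1, 3)
K = np.array([[300.0, 0, 160], [0, 300.0, 120], [0, 0, 1]])
T = np.eye(4); T[2, 3] = 3.0            # camera 3 m behind the volume
mask = np.ones((240, 320), dtype=bool)  # permissive example silhouette
print(len(carve(vox, [T], K, [mask])))
```

The short-baseline "eye-in-hand" views in the paper supply the multiple poses that such carving needs.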
|
|
WeBT29 |
105 |
SLAM: Localization 1 |
Regular Session |
Co-Chair: Shao, Shiliang | SIA |
|
13:20-13:25, Paper WeBT29.1 | |
KINND: A Keyframe Insertion Framework Via Neural Network Decision-Making for VSLAM |
|
Dong, Yanchao | Tongji University |
Li, Peitong | Tongji University |
Zhang, Lulu | Tongji University |
Zhou, Xin | Tongji University |
He, Bin | Tongji University |
Tang, Jie | Tongji University |
Keywords: SLAM, Deep Learning Methods
Abstract: Keyframe insertion is critical for the performance and robustness of SLAM systems. However, traditional heuristic-based methods often lead to suboptimal keyframe selection, compromising the accuracy of localization and mapping. To address this, we propose KINND, a lightweight neural network-based framework for real-time keyframe insertion. The framework introduces a novel foundational paradigm for learning-based keyframe insertion, encompassing the model architecture and training methodology. A neural network model is designed using a hierarchical weighted self-attention mechanism to encode real-time SLAM state information into high-dimensional representations, producing keyframe insertion decisions. To overcome the absence of ground truth for keyframe insertion, a composite loss function is developed by integrating pose error and system state information, providing a metric for this task. Additionally, a novel training mode enhances the model’s real-time decision-making capabilities. Experimental results on public and private datasets demonstrate that KINND operates in real time without requiring a GPU and, with a single training session on a public dataset, achieves superior generalization performance on other datasets. The code is publicly available at https://github.com/peitonglee/KINND.
|
|
13:25-13:30, Paper WeBT29.2 | |
ILoc: An Adaptive, Efficient, and Robust Visual Navigation System |
|
Yin, Peng | City University of Hong Kong |
Zhao, Shiqi | City University of Hong Kong |
Wang, Jing | City University of Hong Kong |
Ge, Ruohai | Carnegie Mellon Univeristy |
Ji, Jianmin | University of Science and Technology of China |
Hu, Yeping | University of California, Berkeley |
Liu, Huaping | Tsinghua University |
Han, Jianda | Nankai University |
Keywords: SLAM, Localization, Autonomous Vehicle Navigation, Recognition
Abstract: We introduce iLoc, an innovative visual navigation system designed to enhance the autonomy and adaptability of robotic agents in long-term and large-scale applications. iLoc specializes in: 1) Extracting stable and consistent descriptors for place recognition, unaffected by changes in viewpoint and illumination. 2) Performing swift and precise global relocalization to establish a robot's position within a large and complex environment. 3) Generating real-time tracking trajectories aligned with reference maps, ensuring continual orientation within known spaces. Distinctively, iLoc incorporates a transformer-based learning module and an attention-enhanced recognition approach, enabling it to adapt to diverse environmental and viewpoint conditions. iLoc leverages a coarse-to-fine global feature matching technique for enhanced localization and integrates robust state estimation combining visual odometry and loop closures through local refinement and pose graph optimization. iLoc demonstrates remarkable proficiency in place recognition, achieving localization over distances of up to 2 km within 0.5 seconds, with an average accuracy of 0.5 m. It maintains stable localization.
|
|
13:30-13:35, Paper WeBT29.3 | |
RaI-SLAM: Radar-Inertial SLAM for Autonomous Vehicles |
|
Casado Herraez, Daniel | University of Bonn & CARIAD SE |
Zeller, Matthias | CARIAD SE |
Wang, Dong | University of Würzburg |
Behley, Jens | University of Bonn |
Heidingsfeld, Michael | CARIAD SE |
Stachniss, Cyrill | University of Bonn |
Keywords: SLAM, Localization, Autonomous Vehicle Navigation
Abstract: Simultaneous localization and mapping are essential components for the operation of autonomous vehicles in unknown environments. While localization focuses on estimating the vehicle's pose, mapping captures the surrounding environment to enhance future localization and decision-making. Localization is commonly achieved using external GNSS systems combined with inertial measurement units, LiDARs, and/or cameras. Automotive radars offer an attractive onboard sensing alternative due to their robustness to adverse weather and low lighting conditions, compactness, affordability, and widespread integration into consumer vehicles. However, they output comparably sparse and noisy point clouds that are challenging for pose estimation, easily leading to noisy trajectory estimates. We propose a modular approach that performs radar-inertial SLAM by fully leveraging the characteristics of automotive consumer-vehicle radar sensors. Our system achieves smooth and accurate onboard simultaneous localization and mapping by combining automotive radars with an IMU and exploiting the additional velocity and radar cross-section information provided by radar sensors, without relying on GNSS data. Specifically, radar scan-matching and IMU measurements are first incorporated into a local pose graph for odometry estimation. We then correct the accumulated drift through a global pose graph backend that optimizes detected loop closures. Contrary to existing radar SLAM methods, our graph-based approach is divided into distinct submodules and all components are designed specifically to exploit the characteristics of automotive radar sensors for scan matching and loop closure detection, leading to enhanced system performance. Our method achieves state-of-the-art accuracy on public autonomous driving data.
|
|
13:35-13:40, Paper WeBT29.4 | |
Get It for Free: Radar Segmentation without Expert Labels and Its Application in Odometry and Localization |
|
Li, Siru | Harbin Institute of Technology, Shenzhen |
Hong, Ziyang | Heriot-Watt University |
Chen, YuShuai | Harbin Institute of Technology, Shenzhen |
Hu, Liang | Harbin Institute of Technology, Shenzhen |
Qin, Jiahu | University of Science and Technology of China |
Keywords: SLAM, Localization, Deep Learning for Visual Perception
Abstract: This paper presents a novel weakly supervised semantic segmentation method for radar segmentation, where existing LiDAR semantic segmentation models are employed to generate semantic labels, which then serve as supervision signals for training a radar semantic segmentation model. The obtained radar semantic segmentation model outperforms LiDAR-based models, providing more consistent and robust segmentation under all-weather conditions, particularly in snow, rain, and fog. To mitigate potential errors in LiDAR semantic labels, we design a dedicated refinement scheme that corrects erroneous labels based on structural features and distribution patterns. The semantic information generated by our radar segmentation model is used in two downstream tasks, achieving significant performance improvements. In large-scale radar-based localization using OpenStreetMap, it leads to a localization error reduction of 20.55% over prior methods. For the odometry task, it improves translation accuracy by 16.4% compared to the second-best method, securing first place in the radar odometry competition at the Radar in Robotics workshop of ICRA 2024, Japan.
|
|
13:40-13:45, Paper WeBT29.5 | |
Anti-Degeneracy Scheme for Lidar SLAM Based on Particle Filter in Geometry Feature-Less Environments |
|
Li, Yanbin | Beijing University of Posts and Telecommunications |
Zhang, Wei | Beijing University of Posts and Telecommunications |
Zhang, Zhiguo | Beijing University of Posts and Telecommunications |
Shi, Xiaogang | Beijing University of Posts and Telecommunications |
Li, Ziruo | China Agricultural University |
Zhang, Mingming | Beihang University |
Hongping, Xie | State Grid Jiangsu Electric Power |
Chi, Wenzheng | Soochow University |
Keywords: SLAM, Localization, Deep Learning Methods
Abstract: Simultaneous localization and mapping (SLAM) based on particle filtering has been extensively employed in indoor scenarios due to its high efficiency. However, in geometry feature-less scenes, the accuracy is severely reduced due to a lack of constraints. In this article, we propose an anti-degeneracy system based on deep learning. Firstly, we design a scale-invariant linear mapping to convert coordinates in continuous space into discrete indexes, in which a data augmentation method based on a Gaussian model is proposed to ensure model performance by effectively mitigating the impact of changes in the number of particles on the feature distribution. Secondly, we develop a degeneracy detection model (DD-Model) using a residual neural network (ResNet) and a transformer, which identifies degeneracy by scrutinizing the distribution of the particle population. Thirdly, an adaptive anti-degeneracy strategy is designed, which first applies fusion and perturbation to the resampling process to provide rich and accurate initial values for pose optimization, and then uses a hierarchical pose optimization combining coarse and fine matching that adaptively adjusts the optimization frequency and sensor trustworthiness according to the degree of degeneracy, enhancing the ability to find the globally optimal pose. Finally, we demonstrate the optimality of the model, as well as the inference-time improvements brought by the image matrix method and GPU acceleration, through ablation experiments. The effectiveness of our method is verified through comparison with SOTA methods.
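The scale-invariant linear mapping and the Gaussian augmentation can be sketched as follows; the bin count and jitter scale are illustrative choices, not the paper's values.

```python
import numpy as np

def to_index(coords, lo, hi, bins):
    """Scale-invariant linear mapping from continuous coordinates to
    discrete indexes: the same relative layout maps to the same cells
    regardless of the map extent."""
    return np.clip(((coords - lo) / (hi - lo) * bins).astype(int), 0, bins - 1)

def gaussian_augment(indices, bins, sigma=1.0, rng=np.random.default_rng(0)):
    """Gaussian augmentation sketch: jitter each particle's cell so the
    learned feature distribution is robust to changes in the number of
    particles."""
    noisy = indices + rng.normal(0.0, sigma, size=indices.shape)
    return np.clip(np.round(noisy).astype(int), 0, bins - 1)

particles = np.random.default_rng(1).uniform(-5.0, 5.0, size=(100, 2))
idx = to_index(particles, lo=-5.0, hi=5.0, bins=64)
idx_aug = gaussian_augment(idx, bins=64)
```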
|
|
13:45-13:50, Paper WeBT29.6 | |
BEV-LSLAM: A Novel and Compact BEV LiDAR SLAM for Outdoor Environment |
|
Cao, Fengkui | Shenyang Institute of Automation |
Wang, Shaocong | Shenyang Institute of Automation Chinese Academy of Sciences |
Chen, Xieyuanli | National University of Defense Technology |
Wang, Ting | Robotics Lab., Shenyang Institute of Automation, CAS |
Liu, Lianqing | Shenyang Institute of Automation |
Keywords: SLAM, Localization, Mapping
Abstract: LiDAR-based SLAM is an essential technology for autonomous robots, benefiting from its high accuracy and scale invariance. Researchers have recently focused increasingly on building simple, efficient, yet effective LiDAR SLAM systems. In this paper, we propose BEV-LSLAM, a novel and compact LiDAR-only SLAM system that leverages visual features in the BEV for all steps of the pipeline, including pose estimation, mapping, loop closing, and back-end graph optimization. The proposed BEV features are more stable than traditional geometric features and can be adapted to various LiDAR sensors without changing the hyperparameters. In addition, by filtering vulnerable features based on tracking across consecutive frames, only high-quality feature points are used for lightweight point cloud map construction. Extensive experiments on UrbanLoco, KITTI, and our 16-channel LiDAR datasets prove the superiority of our approach compared with state-of-the-art LiDAR SLAM methods. The code will be available at: https://github.com/ROBOTWSC/BEV-LSLAM.git.
|
|
13:50-13:55, Paper WeBT29.7 | |
Point-Line LIVO Using Patch-Based Gradient Optimization for Degenerate Scenes |
|
Shi, Tong | Southeast University |
Qian, Kun | Southeast University |
Fang, Yixin | Southeast University |
Zhang, Yun | Southeast University |
Yu, Hai | State Grid Smart Grid Research Institute |
Keywords: SLAM, Localization, Mapping
Abstract: Simultaneous localization and mapping based on 3-D light detection and ranging (LiDAR) tends to degenerate in structure-less environments, leading to a distinct reduction in localization accuracy and mapping precision. This article proposes a point-line LiDAR-visual-inertial odometry (PL-LIVO) for robust localization in LiDAR-degenerate scenes. The key idea is integrating both points and lines into the proposed direct visual odometry subsystem (PL-DVO). By minimizing the patch-based gradient residuals for state optimization, PL-DVO provides additional constraints complementary to LiDAR. Furthermore, a LiDAR-map-assisted visual feature depth extraction (LM-VDE) method is proposed to recover 3-D positions of visual features by mapping them onto the 3-D planes of the LiDAR map. This method is independent of a single scan's density and notable for superior generalization across various LiDAR sensors. Extensive experiments on both public datasets and our datasets demonstrate that our system ensures robust pose estimation and outperforms other state-of-the-art systems in LiDAR-degenerate scenes.
|
|
13:55-14:00, Paper WeBT29.8 | |
SGT-LLC: LiDAR Loop Closing Based on Semantic Graph with Triangular Spatial Topology |
|
Wang, Shaocong | Shenyang Institute of Automation Chinese Academy of Sciences |
Cao, Fengkui | Shenyang Institute of Automation |
Wang, Ting | Robotics Lab., Shenyang Institute of Automation, CAS |
Chen, Xieyuanli | National University of Defense Technology |
Shao, Shiliang | SIA |
Keywords: SLAM, Localization, Mapping
Abstract: Inspired by how humans perceive, remember, and understand the world, semantic graphs have become an efficient solution for scene representation and location. However, many current graph-based LiDAR loop closing methods focus on extracting adjacency matrices or semantic histograms to describe the scene, which discards much of the multifaceted topological information for the sake of efficiency. In this paper, we propose a LiDAR loop closing method based on a semantic graph with triangular spatial topology (SGT-LLC), which fully considers both semantic and spatial topological information. To ensure that descriptors contain robust spatial information while maintaining good rotation invariance, a local descriptor based on semantic topological encoding and triangular spatial topology is proposed, which can effectively correlate scenes and estimate 6-DoF poses. In addition, we aggregate local descriptors from various nodes in the graph using fuzzy classification to create a lightweight database and enable efficient global search. Extensive experiments on the KITTI, KITTI360, Apollo, MulRAN, and MCD datasets prove the superiority of our approach compared with state-of-the-art methods. The code will be available at: https://github.com/ROBOT-WSC/SGT-LLC.git.
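The triangular-topology idea can be sketched by encoding each node triple with its sorted side lengths paired with the opposite-vertex semantic labels, which yields rotation invariance. This is a simplified illustration of the principle, not SGT-LLC's exact descriptor; the labels and rounding are made up.

```python
import numpy as np
from itertools import combinations

def triangle_descriptors(centroids, labels):
    """Rotation-invariant local descriptors from a semantic graph:
    each side length is paired with the label of its opposite vertex,
    and sorting by side length removes any dependence on vertex order."""
    descs = []
    for i, j, k in combinations(range(len(centroids)), 3):
        tri = [(np.linalg.norm(centroids[a] - centroids[b]), labels[c])
               for a, b, c in ((i, j, k), (j, k, i), (k, i, j))]
        tri.sort()
        sides = tuple(round(s, 1) for s, _ in tri)   # coarse quantization
        sems = tuple(lab for _, lab in tri)
        descs.append(sides + sems)
    return descs

cents = np.array([[0, 0, 0], [4, 0, 0], [0, 3, 0], [5, 5, 1]], float)
labs = ["pole", "trunk", "sign", "trunk"]
print(len(triangle_descriptors(cents, labs)))   # C(4,3) = 4 descriptors
```

Because the encoding depends only on pairwise distances and labels, two scans of the same place match regardless of the viewing direction.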
|
|
WeBT30 |
106 |
Aerial Systems: Mechanics and Control 2 |
Regular Session |
Chair: Lu, Peng | The University of Hong Kong |
|
13:20-13:25, Paper WeBT30.1 | |
FAPP: Fast and Adaptive Perception and Planning for UAVs in Dynamic Cluttered Environments |
|
Lu, Minghao | The University of Hong Kong |
Fan, Xiyu | Hong Kong University |
Chen, Han | The Hongkong Polytechnic University |
Lu, Peng | The University of Hong Kong |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Perception and Autonomy, Aerial Systems: Applications, Autonomous Vehicle Navigation
Abstract: Obstacle avoidance for Unmanned Aerial Vehicles (UAVs) in cluttered environments is significantly challenging. Existing obstacle avoidance for UAVs either focuses on fully static environments or static environments with only a few dynamic objects. In this paper, we take the initiative to consider the obstacle avoidance of UAVs in dynamic cluttered environments in which dynamic objects are the dominant objects. This type of environment poses significant challenges to both perception and planning. Multiple dynamic objects possess various motions, making it extremely difficult to estimate and predict their motions using one motion model. The planning must be highly efficient to avoid cluttered dynamic objects. This paper proposes Fast and Adaptive Perception and Planning (FAPP) for UAVs flying in complex dynamic cluttered environments. A novel and efficient point cloud segmentation strategy is proposed to distinguish static and dynamic objects. To address multiple dynamic objects with different motions, an adaptive estimation method with covariance adaptation is proposed to quickly and accurately predict their motions. Our proposed trajectory optimization algorithm
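Innovation-based covariance adaptation is one standard way to realize the adaptive estimation with covariance adaptation mentioned above. The 1-D sketch below re-estimates the measurement noise from the innovation sequence with a forgetting factor; it is a generic recipe rather than FAPP's estimator, and all constants are illustrative.

```python
import numpy as np

def adaptive_kf_1d(zs, q=0.01, r0=1.0, alpha=0.95):
    """Track one coordinate of a dynamic obstacle while adapting the
    measurement-noise variance R to the observed innovation statistics,
    so objects with different motions get matched noise levels."""
    x, p, r = zs[0], 1.0, r0
    estimates = []
    for z in zs[1:]:
        p += q                        # constant-position prediction
        innov = z - x
        # forgetting-factor update of R from the innovation sequence
        r = alpha * r + (1 - alpha) * max(innov**2 - p, 1e-6)
        k = p / (p + r)
        x += k * innov
        p *= (1 - k)
        estimates.append(x)
    return np.array(estimates)

zs = np.cumsum(np.random.default_rng(0).normal(0.1, 0.5, 100))
print(adaptive_kf_1d(zs)[-1])
```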
|
|
13:25-13:30, Paper WeBT30.2 | |
A Hybrid Quadrotor with a Passively Reconfigurable Wheeled Leg Capable of Robust Terrestrial Maneuvers |
|
Yu, Size | City University of Hong Kong |
Pu, Bingxuan | City University of Hong Kong |
Dong, Kaixu | City University of Hong Kong |
Bai, Songnan | City University of Hong Kong |
Chirarattananon, Pakpong | City University of Hong Kong |
Keywords: Aerial Systems: Mechanics and Control, Mechanism Design, Wheeled Robots
Abstract: We present a hybrid aerial-ground robot that combines the versatility of a quadcopter with enhanced terrestrial mobility. The vehicle features a passive, reconfigurable single-wheeled leg, enabling seamless transitions between flight and two ground modes: a stable stance and a dynamic cruising configuration. The cruising mode exhibits exceptional turning performance, achieving a centrifugal acceleration of 0.55g, over 30% higher than previous records, due to an inherent yaw stabilization effect. This mechanism also reduces control effort and enhances roll stability, enabling reliable navigation on irregular surfaces. While this passive design achieves structural simplicity, it trades off power efficiency for enhanced maneuverability. We provide a comprehensive analysis of the system's dynamics and experimentally demonstrate agile movements across various scenarios.
|
|
13:30-13:35, Paper WeBT30.3 | |
Pointillism Wall Painting Drone Using Bouncing Frequency Control |
|
Susbielle, Pierre | Gipsa-Lab CNRS |
Dumon, Jonathan | GIPSA-LAB |
Hably, Ahmad | Grenoble-Inp |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications, Art and Entertainment Robotics
Abstract: This study presents a new robotic airborne solution for autonomous wall painting using a pointillism technique. The proposed dot-painting drone is a quadcopter equipped with an additional forward propulsion unit and a spring-mounted painting pad. It is designed to bounce on a vertical wall in order to print dots at a controlled frequency along a predefined trajectory. A dynamic model of the system is derived and used to accurately control the bouncing frequency as well as the position of the robot. The performance of the system is validated experimentally, demonstrating successful indoor pointillism painting on vertical walls. This work represents a first step toward fully autonomous, large-scale mural reproduction using aerial robotics.
|
|
13:35-13:40, Paper WeBT30.4 | |
Aerial Robots Carrying Flexible Cables: Dynamic Shape Optimal Control Via Spectral Method Model |
|
Shen, Yaolei | University of Twente |
Franchi, Antonio | University of Twente / Sapienza University of Rome |
Gabellieri, Chiara | University of Twente |
Keywords: Aerial Systems: Mechanics and Control, Motion Control, Mobile Manipulation, Aerial Manipulation
Abstract: In this work, we present a model-based optimal boundary control design for an aerial robotic system composed of a quadrotor carrying a flexible cable. The whole system is modeled by partial differential equations (PDEs) combined with boundary conditions described by ordinary differential equations (ODEs). The proper orthogonal decomposition (POD) method is adopted to project the original infinite-dimensional system on a finite low-dimensional space spanned by orthogonal basis functions. Based on such a reduced order model, nonlinear model predictive control (NMPC) is implemented online to realize both position and shape trajectory tracking of the flexible cable in an optimal predictive fashion. The proposed POD-based reduced modeling and optimal control paradigms are verified in simulation using an accurate high-dimensional finite difference method (FDM) based model and experimentally using a real quadrotor and a cable. The results show the viability of the POD-based predictive control approach (which closes the control loop on the full system state) and its superior performance compared to an optimally tuned PID controller (which closes the control loop on the quadrotor state only).
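The POD step can be sketched with an SVD of a snapshot matrix: the leading left singular vectors form the reduced basis onto which the PDE state is projected. The snapshot data below is random and purely illustrative; the paper builds snapshots from the cable dynamics.

```python
import numpy as np

# POD model reduction sketch: columns are state snapshots over time.
rng = np.random.default_rng(0)
snapshots = rng.standard_normal((200, 500))   # 200 spatial DoF x 500 times

U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
energy = np.cumsum(s**2) / np.sum(s**2)
r = int(np.searchsorted(energy, 0.99)) + 1    # modes capturing 99% energy
Phi = U[:, :r]                                # reduced orthogonal basis

a = Phi.T @ snapshots[:, 0]    # project a full state to r coordinates
x_rec = Phi @ a                # lift back to the full state
print(r, np.linalg.norm(snapshots[:, 0] - x_rec))
```

The NMPC then runs on the r-dimensional coordinates `a` instead of the full state, which is what makes the online optimization tractable.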
|
|
13:40-13:45, Paper WeBT30.5 | |
A New Overactuated Multirotor: Prototype Design, Dynamics Modeling, and Control (I) |
|
Yang, Yipeng | Harbin Institute of Technology |
Yu, Xinghu | Ningbo Institute of Intelligent Equipment Technology Co. Ltd |
Li, Zhan | Harbin Institute of Technology |
Basin, Michael | Autonomous University of Nuevo Leon |
Keywords: Aerial Systems: Mechanics and Control, Redundant Robots
Abstract: Multirotors have been extensively used due to their simple structure and flexible flight capabilities. However, traditional coplanar multirotors are underactuated in such a way that their 3-degree-of-freedom (DoF) position and 3-DoF attitude cannot be controlled simultaneously. This article proposes a new overactuated flight platform using a biaxial-tilt actuation unit (BTAU), called Quad3DV, with the capability of antidisturbance 6-DoF trajectory tracking. The structure of Quad3DV is simpler than other published multirotors with BTAUs, and it can hover at a 90° pitch with higher flight efficiency. First, the 6-DoF multibody dynamic model of the whole system is given considering variable inertia parameters, gyroscopic effects of the BTAU, and coupling effects of translational and rotational motions. Second, based on this model, an antidisturbance Lyapunov stable 6-DoF trajectory tracking controller is designed. An extended state observer is adopted to estimate the unmodeled error and external disturbances. Third, an admissible-wrench-space optimal geometric control allocation method is proposed, which fully exploits the controllability and attitude accessibility of Quad3DV. Fourth, simulations were carried out in high-fidelity dynamics simulation software to compare with other controllers. Finally, flight experiments were conducted to verify the effectiveness of the whole system.
|
|
13:45-13:50, Paper WeBT30.6 | |
Design and Control of a Tilt-Rotor Tailsitter Aircraft with Pivoting VTOL Capability |
|
Ma, Ziqing | Delft University of Technology |
Smeur, Ewoud | Delft University of Technology |
de Croon, Guido | Delft University of Technology |
Keywords: Aerial Systems: Mechanics and Control, Field Robots
Abstract: Tailsitter aircraft attract considerable interest due to their capabilities of both agile hover and high speed forward flight. However, traditional tailsitters that use aerodynamic control surfaces face the challenge of limited control effectiveness and associated actuator saturation during vertical flight and transitions. Conversely, tailsitters relying solely on tilting rotors have the drawback of insufficient roll control authority in forward flight. This paper proposes a tilt-rotor tailsitter aircraft with both elevons and tilting rotors as a promising solution. By implementing a cascaded weighted least squares (WLS) based incremental nonlinear dynamic inversion (INDI) controller, the drone successfully achieved autonomous waypoint tracking in outdoor experiments at a cruise airspeed of 16 m/s, including transitions between forward flight and hover without actuator saturation. Wind tunnel experiments confirm improved roll control compared to tilt-rotor-only configurations, while comparative outdoor flight tests highlight the vehicle’s superior control over elevon-only designs during critical phases such as vertical descent and transitions. Finally, we also show that the tilt-rotors allow for an autonomous takeoff and landing with a unique pivoting capability that demonstrates stability and robustness under wind disturbances.
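A single weighted-least-squares allocation step of the kind used inside WLS-INDI controllers can be sketched as a stacked least-squares problem: trade off achieving the commanded virtual-control increment against actuator usage, then clip to limits. The effectiveness matrix, weights, and limits below are invented for illustration and are not this vehicle's parameters.

```python
import numpy as np

def wls_allocate(B, dv, W_v, W_u, u, u_min, u_max):
    """Solve min ||W_v (B du - dv)||^2 + ||W_u du||^2 for the actuator
    increment du, then apply it subject to position limits."""
    A = np.vstack([W_v @ B, W_u])
    b = np.concatenate([W_v @ dv, np.zeros(B.shape[1])])
    du, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.clip(u + du, u_min, u_max)

B = np.array([[1.0, 1.0, 0.2, 0.2],      # effectiveness of 4 actuators
              [0.3, -0.3, 1.0, -1.0]])   # on 2 virtual control axes
u = np.zeros(4)
u_new = wls_allocate(B, dv=np.array([0.5, -0.1]),
                     W_v=np.diag([10.0, 10.0]), W_u=0.1 * np.eye(4),
                     u=u, u_min=-1.0, u_max=1.0)
print(u_new)
```

Weighting the axes differently is what lets such a scheme prioritize, for example, roll authority when elevons approach saturation.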
|
|
13:50-13:55, Paper WeBT30.7 | |
Time-Synchronized Estimator-Based ADP for Spacecraft Optimal Pose Tracking (I) |
|
Yang, Haoyang | University of Warwick |
Hu, Qinglei | Beihang University |
Shao, Xiaodong | Beihang University |
Li, Dongyu | Beihang University |
Keywords: Aerial Systems: Mechanics and Control, Robust/Adaptive Control, Optimization and Optimal Control
Abstract: This paper addresses the optimal attitude-position integrated (pose) tracking for spacecraft proximity operations without exact knowledge of dynamics parameters. To tackle this challenge, this work proposes a time-synchronized estimator-based adaptive dynamic programming (ADP) scheme to eliminate reliance on exact parameter knowledge in optimal pose tracking problems. Specifically, the concept of time-synchronized convergence is introduced into the estimator design to ensure that mass and inertia estimation errors converge synchronously within a finite time. This synchronization is crucial for solving the optimal tracking problem online. Subsequently, the estimator-based ADP is developed under the dual quaternion framework. This approach demonstrates that optimal pose tracking can be achieved through online learning, without dependence on the exact mass and inertia parameters. Finally, a series of typical simulations and experiments are presented to demonstrate the effectiveness and superiority of our technical findings.
|
|
13:55-14:00, Paper WeBT30.8 | |
DOP-Based Drift Correction Control for UAVs Operating in Urban Canyon (I) |
|
Xiong, Shangyi | University of Toronto |
Liu, Hugh H.-T. | University of Toronto |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications, Autonomous Vehicle Navigation
Abstract: The wide utilization of Global Navigation Satellite System (GNSS) technology in Unmanned Aerial Vehicles (UAVs) has greatly improved the positioning accuracy of UAVs, thereby enhancing flight safety and expanding their applications. However, in densely forested areas or urban canyons where satellite signals are occasionally obstructed, GNSS signals can drift, and the positioning accuracy is compromised. This research paper addresses this GNSS drift challenge by designing an integrated, custom-built control framework for UAVs, including novel drift estimation and correction for tracking precision. The estimation and correction are based on the concept of dilution of precision (DOP), a term that quantifies the effect of satellite geometry on positioning and timing precision. An experimental investigation mimicking a drone in urban canyon conditions confirms the effectiveness and demonstrates the promising features of the proposed design.
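DOP itself has a standard closed form: with unit line-of-sight vectors stacked into the GNSS geometry matrix H = [u_i^T, 1], GDOP = sqrt(trace((H^T H)^{-1})). The sketch below contrasts a well-spread constellation with a canyon-like clustered one; the satellite directions are made up for illustration.

```python
import numpy as np

def gdop(los_unit_vectors):
    """Geometric dilution of precision from receiver-to-satellite
    unit vectors: one geometry-matrix row [u_x, u_y, u_z, 1] per
    satellite, GDOP = sqrt(trace((H^T H)^{-1}))."""
    H = np.hstack([los_unit_vectors, np.ones((len(los_unit_vectors), 1))])
    return float(np.sqrt(np.trace(np.linalg.inv(H.T @ H))))

raw = np.array([[1.0, 0.0, 0.5], [-0.5, 0.9, 0.5],
                [-0.5, -0.9, 0.5], [0.0, 0.0, 1.0]])
spread = raw / np.linalg.norm(raw, axis=1, keepdims=True)
canyon = raw * np.array([0.2, 0.2, 1.0])   # low-elevation signals blocked
canyon /= np.linalg.norm(canyon, axis=1, keepdims=True)
print(gdop(spread), gdop(canyon))  # clustered geometry yields larger DOP
```

A large DOP flags exactly the geometry-degraded situations in which the paper's framework distrusts GNSS fixes and applies drift correction.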
|
|
WeCT1 |
401 |
Sensor Fusion & SLAM 2 |
Regular Session |
|
15:00-15:05, Paper WeCT1.1 | |
Self-Localization on a 3D Map by Fusing Global and Local Features from a Monocular Camera |
|
Kikuchi, Satoshi | Meijo University |
Kato, Masaya | Meijo University |
Tasaki, Tsuyoshi | Meijo University |
Keywords: Localization, Computer Vision for Automation
Abstract: Self-localization on a 3D map by using an inexpensive monocular camera is required to realize autonomous driving. Self-localization based on a camera often uses a convolutional neural network (CNN) that can extract local features computed from nearby pixels. However, when dynamic obstacles, such as people, are present, a CNN does not work well. This study proposes a new method combining a CNN with a Vision Transformer, which excels at extracting global features that capture the relationships among patches across the whole image. Experimental results showed that, compared to the state-of-the-art method (SOTA), the accuracy improvement rate on a CG dataset with dynamic obstacles is 1.5 times higher than that without dynamic obstacles. Moreover, the self-localization error of our method is 20.1% smaller than that of SOTA on public datasets. Additionally, our robot using our method can localize itself with 7.51 cm error on average, which is more accurate than SOTA.
|
|
15:05-15:10, Paper WeCT1.2 | |
CrossBEV-PR: Cross-Modal Visual-LiDAR Place Recognition Via BEV Feature Distillation |
|
Xu, Jianbo | Shanghai Jiao Tong University |
Wu, Xinrui | Shanghai Jiao Tong University |
Xuan, Lingfeng | Shanghai Jiao Tong University |
Xiao, Yangyi | Shanghai Jiao Tong University |
Shi, Jinxuan | Shanghai Jiao Tong University |
Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: Localization, Recognition, Deep Learning for Visual Perception
Abstract: Utilizing 2D images for place recognition within 3D point cloud maps presents significant challenges in autonomous driving applications, largely due to the inherent cross-modal disparity between visual and LiDAR data. In this study, we propose a novel cross-modal visual-LiDAR place recognition method based on Bird's Eye View (BEV) feature distillation. This represents the first end-to-end framework designed to achieve cross-modal place recognition by integrating surround-view images and LiDAR point clouds. Our method effectively mitigates the modality gap between 3D and 2D data by encoding features into a unified BEV representation. Additionally, we introduce a teacher-student distillation training strategy to further enhance the network's cross-modal generalization capabilities. Extensive experiments conducted on benchmark datasets, including nuScenes and Argoverse, demonstrate that our method achieves state-of-the-art (SOTA) performance in cross-modal place recognition tasks. Furthermore, validation on the SJTU-Sanya dataset confirms the robustness and adaptability of our approach in real-world scenarios. We will publicly release our network model and implementation details at https://github.com/IRMVLab/CrossBEV-PR.
|
|
15:10-15:15, Paper WeCT1.3 | |
SparseLoc: Sparse Open-Set Landmark-Based Global Localization for Autonomous Navigation |
|
Paul, Pranjal | International Institute of Information Technology |
Bhat, Vineeth | IIIT Hyderabad |
Salian, Tejas | International Institute of Information Technology, Hyderabad |
Mohd, Omama | The University of Texas at Austin |
Jatavallabhula, Krishna Murthy | MIT |
Arulselvan, Naveen | Indian Institute of Science |
Krishna, Madhava | IIIT Hyderabad |
Keywords: Localization, Autonomous Vehicle Navigation, RGB-D Perception
Abstract: Global localization is a critical problem in autonomous navigation, enabling precise positioning without reliance on GPS. Modern techniques often depend on dense LiDAR maps, which, while precise, require extensive storage and computational resources. Alternative approaches have explored sparse maps and learned features, but suffer from poor robustness and generalization. We propose SparseLoc, a global localization framework that leverages vision-language foundation models to generate sparse, semantic-topometric maps in a zero-shot manner. Our approach combines this representation with Monte Carlo localization enhanced by a novel Late Optimization strategy for improved pose estimation. By constructing compact yet discriminative maps and refining poses through retrospective optimization, SparseLoc overcomes the limitations of existing sparse methods, offering a more efficient and robust solution. Our system achieves over 5x improvement in localization accuracy compared to existing sparse mapping techniques. Despite utilizing only 1/500th of the points used by dense methods, it achieves comparable performance, maintaining average global localization error below 5 m and 2 degrees on KITTI. We further demonstrate the practical applicability of our method through cross-sequence localization experiments and downstream navigation tasks.
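A generic Monte Carlo localization update against a sparse landmark map looks like the sketch below: weight particles by how well predicted ranges to matched landmarks explain the observations, then resample. SparseLoc's actual likelihood and its Late Optimization refinement differ; all values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mcl_update(particles, weights, observed, landmarks, sigma=2.0):
    """One measurement update of particle-filter localization with a
    sparse landmark map, followed by multinomial resampling."""
    for lm_xy, z_range in zip(landmarks, observed):
        pred = np.linalg.norm(particles[:, :2] - lm_xy, axis=1)
        weights = weights * np.exp(-0.5 * ((pred - z_range) / sigma) ** 2)
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

particles = rng.uniform(0, 100, size=(500, 3))       # x, y, yaw hypotheses
weights = np.full(500, 1.0 / 500)
landmarks = np.array([[20.0, 30.0], [70.0, 60.0]])   # matched open-set landmarks
observed = np.array([15.2, 48.7])                    # measured ranges (m)
particles, weights = mcl_update(particles, weights, observed, landmarks)
```

The retrospective pose refinement in the paper would then re-optimize the winning hypothesis against all past landmark observations.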
|
|
15:15-15:20, Paper WeCT1.4 | |
Visual Localization with Offline Google Satellite Map-Assisted for Ground Vehicles in GNSS-Denied Environment |
|
Wang, Jibo | Northeastern University |
Mao, Bairen | Northeastern University of China |
Pang, Chenglin | Northeastern University |
Liu, Shiguang | Northeastern University |
Guo, Jindi | Beijing Institute of Control and Electronic Technology |
Fang, Zheng | Northeastern University |
Keywords: Localization, Intelligent Transportation Systems, Computer Vision for Transportation
Abstract: Vehicle localization is a critical component in the planning and navigation of autonomous driving systems. Generally, traditional vehicle localization methods rely on the Global Navigation Satellite System (GNSS) for self-localization. Unfortunately, GNSS can become unreliable and may fail in urban canyons, under trees, and beneath overpasses. To address this problem, we propose a visual localization framework assisted by offline Google satellite maps in GNSS-denied environments, and we introduce a learning-based ground-to-satellite map feature matching method to mitigate the long-term cumulative drift of visual odometry. To reduce the negative impact of cross-view matching errors on localization accuracy, we propose a novel cross-view pose selection method to build two pose uncertainty models. Moreover, we combine the proposed method with classical SLAM methods to develop a vehicle localization framework. To verify the performance of the proposed method, we carried out accuracy comparison experiments against state-of-the-art fusion localization methods and feature matching methods. Experimental results indicate that the proposed method achieves the best localization performance compared with the state-of-the-art methods, and our method achieves a root mean square error of 0.290 m and 0.014 rad on KITTI-05. The implementation code of this paper will be open-source at https://github.com/NEU-REAL/visualLocalization-with-satelliteMap.
|
|
15:20-15:25, Paper WeCT1.5 | |
Improving Visual Place Recognition with Sequence-Matching Receptiveness Prediction |
|
Hussaini, Somayeh | Queensland University of Technology |
Fischer, Tobias | Queensland University of Technology |
Milford, Michael J | Queensland University of Technology |
Keywords: Localization
Abstract: In visual place recognition (VPR), filtering and sequence-based matching approaches can improve performance by integrating temporal information across image sequences, especially in challenging conditions. While these methods are commonly applied, their effects on system behavior can be unpredictable and can actually make performance worse in certain situations. In this work, we present a new supervised learning approach that learns to predict the per-frame sequence matching receptiveness (SMR) of VPR techniques, enabling the system to selectively decide when to trust the output of a sequence matching system. Our approach is agnostic to the underlying VPR technique and effectively predicts SMR, and hence significantly improves VPR performance across a large range of state-of-the-art and classical VPR techniques (namely CosPlace, MixVPR, EigenPlaces, SALAD, AP-GeM, NetVLAD and SAD), and across three benchmark VPR datasets (Nordland, Oxford RobotCar, and SFU-Mountain). We also provide insights into a complementary approach that uses the predictor to replace discarded matches, and present ablation studies including an analysis of the interactions between our SMR predictor and the selected sequence length.
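As context for what "sequence matching" does here, a minimal constant-velocity sequence matcher over a precomputed frame distance matrix is sketched below; the function names and fixed sequence length are illustrative, and the SMR predictor itself is the paper's learned component, not shown.

import numpy as np

def sequence_match_scores(dist, seq_len=5):
    """Aggregate single-frame distances along straight diagonals of the
    query-vs-reference distance matrix; a low aggregate is a good sequence
    match. Simplification: assumes both traverses move at the same speed."""
    q, r = dist.shape
    scores = np.full((q, r), np.inf)
    for i in range(q - seq_len + 1):
        for j in range(r - seq_len + 1):
            scores[i, j] = sum(dist[i + k, j + k] for k in range(seq_len))
    return scores  # best reference for query i: np.argmin(scores[i])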
|
|
15:25-15:30, Paper WeCT1.6 | |
TextInPlace: Indoor Visual Place Recognition in Repetitive Structures with Scene Text Spotting and Verification |
|
Tao, Huaqi | Southern University of Science and Technology |
Liu, Bingxi | Southern University of Science and Technology |
Chen, Calvin | University of Cambridge |
Huang, Tingjun | Southern University of Science and Technology |
He, Li | Southern University of Science and Technology |
Cui, Jinqiang | Peng Cheng Laboratory |
Zhang, Hong | Southern University of Science and Technology |
Keywords: Localization, Recognition, Deep Learning for Visual Perception
Abstract: Visual Place Recognition (VPR) is a crucial capability for long-term autonomous robots, enabling them to identify previously visited locations using visual information. However, existing methods remain limited in indoor settings due to the highly repetitive structures inherent in such environments. We observe that scene texts frequently appear in indoor spaces and can help distinguish visually similar but different places. This inspires us to propose TextInPlace, a simple yet effective VPR framework that integrates Scene Text Spotting (STS) to mitigate visual perceptual ambiguity in repetitive indoor environments. Specifically, TextInPlace adopts a dual-branch architecture within a local parameter sharing network. The VPR branch employs attention-based aggregation to extract global descriptors for coarse-grained retrieval, while the STS branch utilizes a bridging text spotter to detect and recognize scene texts. Finally, the discriminative texts are filtered to compute text similarity and re-rank the top-K retrieved images. To bridge the gap between current text-based repetitive indoor scene datasets and the typical scenarios encountered in robot navigation, we establish an indoor VPR benchmark dataset, called Maze-with-Text. Extensive experiments on both custom and public datasets demonstrate that TextInPlace achieves superior performance over existing methods that rely solely on appearance information. The dataset, code, and trained models are publicly available at https://github.com/HqiTao/TextInPlace.
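The re-ranking step can be pictured with the toy sketch below; the Jaccard overlap of spotted text strings and the fusion weight alpha are illustrative stand-ins for the paper's text-similarity computation.

def rerank_by_text(topk_ids, visual_scores, query_texts, db_texts, alpha=0.5):
    """Re-rank top-K retrievals by fusing visual similarity (higher = better)
    with scene-text overlap between the query and each candidate image."""
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if (a | b) else 0.0
    fused = [(alpha * visual_scores[i]
              + (1.0 - alpha) * jaccard(query_texts, db_texts[i]), i)
             for i in topk_ids]
    return [i for _, i in sorted(fused, reverse=True)]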
|
|
15:30-15:35, Paper WeCT1.7 | |
CaRtGS: Computational Alignment for Real-Time Gaussian Splatting SLAM |
|
Feng, Dapeng | Sun Yat-Sen University |
Chen, Zhiqiang | The University of Hong Kong |
Yin, Yizhen | Sun Yat-Sen University |
Zhong, Shipeng | Sun Yat-Sen University |
Qi, Yuhua | Sun Yat-Sen University |
Chen, Hongbo | Sun Yat-Sen University |
Keywords: Mapping, SLAM
Abstract: Simultaneous Localization and Mapping (SLAM) is pivotal in robotics, with photorealistic scene reconstruction emerging as a key challenge. To address this, we introduce Computational Alignment for Real-Time Gaussian Splatting SLAM (CaRtGS), a novel method enhancing the efficiency and quality of photorealistic scene reconstruction in real-time environments. Leveraging 3D Gaussian Splatting (3DGS), CaRtGS achieves superior rendering quality and processing speed, which is crucial for photorealistic scene reconstruction. Our approach tackles computational misalignment in Gaussian Splatting SLAM (GS-SLAM) through an adaptive strategy that enhances optimization iterations, addresses long-tail optimization, and refines densification. Experiments on Replica, TUM-RGBD, and VECtor datasets demonstrate CaRtGS's effectiveness in achieving high-fidelity rendering with fewer Gaussian primitives. This work propels SLAM towards real-time, photorealistic dense rendering, significantly advancing photorealistic scene representation. For the benefit of the research community, we release the code and accompanying videos on our project website: https://dapengfeng.github.io/cartgs.
|
|
15:35-15:40, Paper WeCT1.8 | |
MemGS: Memory-Efficient Gaussian Splatting for Real-Time SLAM |
|
Bai, Yinlong | Hunan University |
Zhang, Hongxin | School of Robotics, Hunan University |
Zhong, Sheng | Hunan University |
Niu, Junkai | Hunan University |
Li, Hai | Zhejiang University |
He, Yijia | TCL RayNeo |
Zhou, Yi | Hunan University |
Keywords: Mapping, SLAM
Abstract: Recent advancements in 3D Gaussian Splatting (3DGS) have made a significant impact on rendering and reconstruction techniques. Current research predominantly focuses on improving rendering performance and reconstruction quality using high-performance desktop GPUs, largely overlooking applications for embedded platforms like micro air vehicles (MAVs). These devices, with their limited computational resources and memory, often face a trade-off between system performance and reconstruction quality. In this paper, we improve existing methods in terms of GPU memory usage while enhancing rendering quality. Specifically, to address redundant 3D Gaussian primitives in SLAM, we propose merging them in voxel space based on geometric similarity. This reduces GPU memory usage without impacting system runtime performance. Furthermore, rendering quality is improved by initializing 3D Gaussian primitives via Patch-Grid (PG) point sampling, enabling more accurate modeling of the entire scene. Quantitative and qualitative evaluations on publicly available datasets demonstrate the effectiveness of our improvements.
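A toy version of the voxel-space merge the abstract describes: redundant Gaussian centers falling in the same voxel are collapsed into an opacity-weighted mean. The voxel size and the reduction of "geometric similarity" to shared voxel membership are simplifying assumptions, not MemGS's exact criterion.

import numpy as np

def merge_by_voxel(means, opacities, voxel_size=0.05):
    """Collapse 3D Gaussian primitives sharing a voxel into one center,
    weighted by opacity, to cut GPU memory spent on redundant primitives."""
    keys = map(tuple, np.floor(means / voxel_size).astype(np.int64))
    merged = {}
    for key, mu, o in zip(keys, means, opacities):
        if key in merged:
            m_mu, m_o = merged[key]
            merged[key] = ((m_mu * m_o + mu * o) / (m_o + o), m_o + o)
        else:
            merged[key] = (mu, o)
    return np.array([mu for mu, _ in merged.values()])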
|
|
WeCT2 |
402 |
Vehicle Intelligence |
Regular Session |
Chair: Xiao, Xuesu | George Mason University |
|
15:00-15:05, Paper WeCT2.1 | |
Experimental Evaluation of Safe Trajectory Planning for an Omnidirectional UAV |
|
Hamandi, Mahmoud | New York University Abu Dhabi |
Ali, Abdullah Mohamed | New York University Abu Dhabi |
Tzes, Anthony | New York University Abu Dhabi |
Khorrami, Farshad | New York University Tandon School of Engineering |
Keywords: Autonomous Vehicle Navigation, Aerial Systems: Applications, Motion and Path Planning
Abstract: Autonomous aerial vehicles play a critical role in search and rescue operations, where navigation through cluttered and confined environments is essential. To this end, this paper presents a novel trajectory planning framework for omnidirectional drones that dynamically adjusts tracking velocity based on the platform's proximity to obstacles, ensuring a balance between safety and efficiency in cluttered and challenging environments. The proposed approach generates a geometric path to the target location. At each waypoint, the minimum distance between the drone’s convex hull and surrounding obstacles is determined, allowing the computation of the velocity constraints. By slowing down near obstacles and accelerating in open spaces, the method enhances both safety and maneuverability. The framework is validated through real-world experiments using the OmniOcta UAV, demonstrating its ability to navigate through constrained spaces. Furthermore, we present an experimental study to investigate key sources of tracking deviations, including propeller dynamics and aerodynamic interactions near obstacles.
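The clearance-to-speed coupling can be as simple as the ramp below; the linear profile and the threshold values are illustrative choices, not the paper's exact constraint.

def velocity_limit(clearance, v_min=0.2, v_max=2.0, d_safe=0.3, d_free=2.0):
    """Map the minimum hull-to-obstacle distance at a waypoint to a tracking
    speed cap: creep near obstacles, full speed in open space."""
    if clearance <= d_safe:
        return v_min
    if clearance >= d_free:
        return v_max
    t = (clearance - d_safe) / (d_free - d_safe)  # 0..1 inside the ramp
    return v_min + t * (v_max - v_min)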
|
|
15:05-15:10, Paper WeCT2.2 | |
Simulating Automotive Radar with Lidar and Camera Inputs |
|
Song, Peili | Dalian University of Technology |
Song, Dezhen | Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) |
Yang, Yifan | Nankai University |
Lan, Enfan | Nankai University |
Liu, Jingtai | Nankai University |
Keywords: Autonomous Vehicle Navigation, Sensor Fusion, Simulation and Animation
Abstract: Low-cost millimeter-wave automotive radar has received increasing attention due to its ability to handle adverse weather and lighting conditions in autonomous driving. However, the lack of quality datasets hinders research and development. We report a new method that is able to simulate 4D millimeter-wave radar signals, including pitch, yaw, range, and Doppler velocity along with radar signal strength (RSS), using camera images, light detection and ranging (lidar) point clouds, and ego-velocity. The method is based on two new neural networks: 1) DIS-Net, which estimates the spatial distribution and number of radar signals, and 2) RSS-Net, which predicts the RSS of the signal based on appearance and geometric information. We have implemented and tested our method using open datasets from 3 different models of commercial automotive radar. The experimental results show that our method can successfully generate high-fidelity radar signals. Moreover, we have trained a popular object detection neural network with data augmented by our synthesized radar. The network outperforms the counterpart trained only on raw radar data, a promising result to facilitate future radar-based research and development.
|
|
15:10-15:15, Paper WeCT2.3 | |
GOEN: Guided Obstacle Endpoint Navigation for Real-Time Collision-Free Path Planning in Unstructured Environments |
|
Zhao, Zhicheng | Shandong University |
Zhang, Yifan | Shandong University |
Chen, Teng | Shandong University |
Rong, Xuewen | Shandong University |
Liu, Dayu | Shandong Youbaote Intelligent Robotics CO., LTD |
Li, Yibin | Shandong University |
Keywords: Autonomous Vehicle Navigation, Collision Avoidance, Legged Robots
Abstract: We present GOEN, an advanced navigation and path planning framework specifically engineered to tackle the complexities of dynamic and unstructured environments through real-time 3D pointcloud processing. Our approach integrates pointcloud downsampling, collision risk assessment, and obstacle endpoint extraction to generate intermediate waypoints. These waypoints are refined iteratively through multi-stage safety validation and cubic-spline interpolation, resulting in kinematically feasible and collision-free trajectories. The system exhibits minimal computational overhead, achieving a planning latency below 10 milliseconds, thereby demonstrating suitability for deployment in resource-constrained scenarios. Extensive empirical evaluations in simulated environments and on quadruped robots demonstrate the framework's robustness in dynamically identifying optimal navigation paths across unstructured terrains. Compared with baseline navigation methods, it shows significant advantages on standard navigation-performance metrics. Experimental validation confirms GOEN's capability to balance trajectory optimality, safety constraints, and real-time responsiveness, providing an innovative solution for autonomous navigation in complex environments.
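The cubic-spline refinement stage can be reproduced in a few lines with SciPy; the waypoints and the arc-length parameterization below are illustrative assumptions, not GOEN's exact pipeline.

import numpy as np
from scipy.interpolate import CubicSpline

waypoints = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 0.2], [3.0, 1.0]])
# Parameterize by cumulative arc length so samples are roughly uniform.
s = np.r_[0.0, np.cumsum(np.linalg.norm(np.diff(waypoints, axis=0), axis=1))]
spline = CubicSpline(s, waypoints, axis=0)
trajectory = spline(np.linspace(0.0, s[-1], 100))  # dense, C2-smooth path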
|
|
15:15-15:20, Paper WeCT2.4 | |
TRACE: A Self-Improving Framework for Robot Behavior Forecasting with Vision-Language Models |
|
Puthumanaillam, Gokul | University of Illinois Urbana-Champaign |
Padrao, Paulo | Providence College |
Fuentes, Jose | Florida International University |
Thangeda, Pranay | University of Illinois Urbana-Champaign |
Schafer, William E. | University of Illinois Urbana-Champaign |
Song, Jae Hyuk | University of Illinois Urbana-Champaign |
Jagdale, Karan | Lucid Group |
Bobadilla, Leonardo | Florida International University |
Ornik, Melkior | University of Illinois Urbana-Champaign |
Keywords: Autonomous Vehicle Navigation, Marine Robotics, Intention Recognition
Abstract: Predicting the near-term behavior of a reactive agent is crucial in many robotic scenarios, yet remains challenging when observations of that agent are sparse or intermittent. Vision-Language Models (VLMs) offer a promising avenue by integrating textual domain knowledge with visual cues, but their one-shot predictions often miss important edge cases and unusual maneuvers. Our key insight is that iterative, counterfactual exploration--where a dedicated module probes each proposed behavior hypothesis, explicitly represented as a plausible trajectory, for overlooked possibilities--can significantly enhance VLM-based behavioral forecasting. We present TRACE (Tree-of-thought Reasoning And Counterfactual Exploration), an inference framework that couples tree-of-thought generation with domain-aware feedback to refine behavior hypotheses over multiple rounds. Concretely, a VLM first proposes candidate trajectories for the agent; a counterfactual critic then suggests edge-case variations consistent with partial observations, prompting the VLM to expand or adjust its hypotheses in the next iteration. This creates a self-improving cycle where the VLM progressively internalizes edge cases from previous rounds, systematically uncovering not only typical behaviors but also rare or borderline maneuvers, ultimately yielding more robust trajectory predictions from minimal sensor data. We validate TRACE on both ground-vehicle simulations and real-world marine autonomous surface vehicles. Experimental results show that our method consistently outperforms standard VLM-driven and purely model-based baselines, capturing a broader range of feasible agent behaviors despite sparse sensing. Evaluation videos and code are available at trace-robotics.github.io.
|
|
15:20-15:25, Paper WeCT2.5 | |
VertiSelector: Automatic Curriculum Learning for Wheeled Mobility on Vertically Challenging Terrain |
|
Xu, Tong | George Mason University |
Pan, Chenhui | George Mason University |
Xiao, Xuesu | George Mason University |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Machine Learning for Robot Control
Abstract: Reinforcement Learning (RL) has the potential to enable extreme off-road mobility by circumventing complex kinodynamic modeling, planning, and control through simulated end-to-end trial-and-error learning. However, most RL methods are sample-inefficient when trained on a large number of manually designed simulation environments and struggle to generalize to the real world. To address these issues, we introduce VertiSelector (VS), an automatic curriculum learning framework designed to enhance learning efficiency and generalization by selectively sampling training terrain. VS prioritizes vertically challenging terrain with higher Temporal Difference (TD) errors when revisited, thereby allowing robots to learn at the edge of their evolving capabilities. By dynamically adjusting the sampling focus, VS significantly boosts sample efficiency and generalization within the VW-Chrono simulator built on the Chrono multi-physics engine. Furthermore, we provide simulation and physical results using VS on a Verti-4-Wheeler platform. These results demonstrate that VS can achieve a 23.08% improvement in success rate by efficiently sampling during training and robustly generalizing to the real world.
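The TD-error-driven terrain selection can be sketched as a softmax over per-terrain errors; the temperature and the softmax weighting are illustrative choices rather than VS's exact sampling rule.

import numpy as np

def sample_terrain(td_errors, temperature=1.0):
    """Sample the next training terrain with probability increasing in its
    most recent TD error, focusing learning on terrain at the edge of the
    robot's current capability."""
    z = np.asarray(td_errors, dtype=float) / temperature
    p = np.exp(z - z.max())  # numerically stable softmax
    p /= p.sum()
    return np.random.choice(len(td_errors), p=p)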
|
|
15:25-15:30, Paper WeCT2.6 | |
Independent Observers-Based Fault Diagnosis for Multiple Sensor Faults of Full-Vehicle Active Suspension Systems with Inaccurate Models (I) |
|
Yan, Shuai | Beijing Institute of Technology |
Xia, Yuanqing | Beijing Institute of Technology |
Zhai, Di-Hua | Beijing Institute of Technology |
Keywords: Failure Detection and Recovery, Robust/Adaptive Control
Abstract: In this paper, an independent-observer-based fault diagnosis method is proposed for multiple sensor faults of the full-vehicle active suspension system with an inaccurate model. To address the problem of model uncertainties brought by the linearized full-vehicle suspension model, unmodelled dynamics, parametric uncertainties, and external disturbances are first combined into an integrated uncertain term. Disturbance observers are designed to track the integrated uncertain terms online, whose estimates are employed to decrease the influence of model uncertainties on fault diagnosis. An independent fault diagnosis observer is designed for each sensor separately, where only the measurement of the matched sensor is taken as the observer input. In this way, each fault diagnosis observer works independently, and interactions between measurements of multiple faulty sensors can be decoupled to locate the sensor faults. Anomalies of each sensor are monitored by the fault diagnosis observer in a one-to-one relationship such that fault detection and isolation can be realized at the same time. In particular, the sensitivity and robustness of the fault diagnosis method are improved by injecting the estimates of the integrated uncertain terms into the fault diagnosis observers to emulate model uncertainties. The effectiveness of the proposed scheme has been validated via simulation results considering multiple sensor faults occurring at different times or simultaneously.
|
|
15:30-15:35, Paper WeCT2.7 | |
Long-Horizon Locomotion and Manipulation on a Quadrupedal Robot with Large Language Models |
|
Ouyang, Yutao | Tsinghua University |
Li, Jinhan | Tsinghua University |
Li, Yunfei | Tsinghua University |
Li, Zhongyu | University of California, Berkeley |
Yu, Chao | Tsinghua University |
Sreenath, Koushil | University of California, Berkeley |
Wu, Yi | Tsinghua University |
Keywords: Agent-Based Systems, AI-Enabled Robotics, Deep Learning Methods
Abstract: We present a large language model (LLM) based system to empower quadrupedal robots with problem-solving abilities for long-horizon tasks beyond short-term motions. Long-horizon tasks for quadrupeds are challenging since they require both a high-level understanding of the semantics of the problem for task planning and a broad range of locomotion and manipulation skills to interact with the environment. Our system builds a high-level reasoning layer with large language models, which generates hybrid discrete-continuous plans as robot code from task descriptions. It comprises multiple LLM agents: a semantic planner for sketching a plan, a parameter calculator for predicting arguments in the plan, a code generator to convert the plan into executable robot code, and a replanner that replans for handling execution failures or human interventions. At the low level, we adopt reinforcement learning to train a set of motion planning and control skills to unleash the flexibility of quadrupeds for rich environment interactions. Our system is tested on long-horizon tasks that are infeasible to complete with one single skill. Simulation and real-world experiments show that it successfully figures out multi-step strategies and demonstrates non-trivial behaviors, including building tools or notifying a human for help. Demos are available on our project page: https://sites.google.com/view/long-horizon-robot.
|
|
15:35-15:40, Paper WeCT2.8 | |
A Two-Stage Lightweight Framework for Efficient Land-Air Bimodal Robot Autonomous Navigation |
|
Li, Yongjie | Shenzhen University |
Liu, Zhou | The Hong Kong University of Science and Technology |
Yu, Wenshuai | Shenzhen University |
Lu, Zhangji | Shenzhen University |
Chenyang, Wang | Shenzhen University |
Yu, Fei | Guangming Lab |
Li, Qingquan | Shenzhen University |
Keywords: Autonomous Vehicle Navigation
Abstract: Land-air bimodal robots (LABR) are gaining attention for autonomous navigation, combining the high mobility of aerial vehicles with the long endurance of ground vehicles. However, existing LABR navigation methods are limited by suboptimal trajectories from mapping-based approaches and the excessive computational demands of learning-based methods. To address this, we propose a two-stage lightweight framework that integrates global keypoint prediction with local trajectory refinement to generate efficient and reachable trajectories. In the first stage, the Global Keypoints Prediction Network (GKPN) generates a hybrid land-air keypoint path. The GKPN includes a Sobel Perception Network (SPN) for improved obstacle detection and a Lightweight Attention Planning Network (LAPN) that improves predictive ability by capturing contextual information. In the second stage, the global path is segmented based on the predicted keypoints and refined using a mapping-based planner to create smooth, collision-free trajectories. Experiments conducted on our LABR platform show that our framework reduces network parameters by 14% and energy consumption during land-air transitions by 35% compared to existing approaches. The framework achieves real-time navigation without GPU acceleration and enables zero-shot transfer from simulation to reality during deployment.
|
|
WeCT3 |
403 |
Soft Sensors and Actuators 3 |
Regular Session |
Chair: George Thuruthel, Thomas | University College London |
|
15:00-15:05, Paper WeCT3.1 | |
An Embedded Proprioceptive Sensing Method for Soft Artificial Muscles with Tube-Fiber Structure (I) |
|
Su, Yujie | The Chinese University of Hong Kong |
Xie, Disheng | The Chinese University of Hong Kong |
Yang, Siyu | The Chinese University of Hong Kong |
Liu, MingHao | CUHK |
Tong, Kai Yu | The Chinese University of Hong Kong |
Keywords: Soft Sensors and Actuators, Soft Robot Materials and Design
Abstract: Soft artificial muscles have been used for wearable robots and unstructured environments as they are inherently compliant and flexible. However, this soft characteristic also leads to the challenge of providing feedback on contraction length and force. This study proposes an embedded proprioceptive displacement and force sensing method for tube-fiber pneumatic artificial muscles. In these actuators, the generated movement and force are due to the deformation of the tubes. The proposed sensing method fills the tube with magnetic material to convert the deformation of the tube into changes in the magnetic field, which are detected by Hall sensors. Experimental results show that the output of the Hall sensor exhibits a monotonic relationship with the actual displacement and force changes, demonstrating high repeatability and low hysteresis. These findings are consistent with the behavior expected from the mathematical model. Data-driven models were built for displacement and force estimation, achieving an accuracy of 2.8% RMSE (root mean square error) in measuring displacement and 4.56% RMSE in measuring force. In addition, the practical application of the sensing method is demonstrated by a human-subject test, showcasing its ability to estimate displacement and force in dynamic and practical scenarios. The integration of compact sensors within artificial muscles renders this sensing technique a promising solution for addressing the challenges of force and displacement sensing in soft artificial muscles, and it holds promise for future advancements in this field.
|
|
15:05-15:10, Paper WeCT3.2 | |
Biodegradable Dielectric Elastomer Actuators and Sensors |
|
Takai, Kazuma | The University of Electro-Communications |
Murakami, Kazuya | The University of Electro-Communications |
Shintake, Jun | The University of Electro-Communications |
Keywords: Soft Sensors and Actuators, Soft Robot Materials and Design
Abstract: Soft robotics, a field dedicated to the development of robots using flexible and compliant materials, has undergone significant progress in recent years. Several features of soft robots including their physical adaptability to their surroundings, shock-absorption capabilities, and minimal risk of harm to humans and objects have rendered them ideal for various applications. However, most soft robots are fabricated using synthetic elastomeric materials, contributing to environmental pollution and degradation. A potential solution to this problem involves integrating biodegradability into the designs of soft robots, facilitating their degradation through microbial activity and subsequent integration into the soil. However, to achieve this functionality, biodegradable soft robotic elements must be developed. This paper presents biodegradable dielectric elastomer actuators (DEAs) and sensors (DESs). These devices feature a soft dielectric membrane with compliant electrodes on both sides, enabling them to function as both electrostatic actuators and capacitive sensors. Natural rubber and gelatin-based elastomeric materials are employed for the dielectric membrane and electrodes, respectively. Using these materials and established fabrication processes, experimental biodegradable DEA and DES samples are fabricated and characterized. The DEA with a circular actuator configuration demonstrates a voltage-controllable areal strain of up to 15.4 ± 0.4% and presents stable operation over 1,000 actuation cycles. The DEA with a bending actuator configuration exhibits a voltage-controllable bending angle of up to 17.4° ± 1.9°. The DES demonstrates a linear response for strains up to 200%, with a gauge factor of 0.85 ± 0.017, and maintains stability over 10,000 strain cycles. The observed characteristics of the DEAs and DES align well with theoretical predictions, highlighting the potential applicability of biodegradable DEAs and DESs as promising elements for sustainable and environmentally friendly soft robots.
|
|
15:10-15:15, Paper WeCT3.3 | |
Design and Characterization of a Magnetic Silicone Peristaltic Pump with Multiphysics Modeling (I) |
|
Zhang, Wei | Kaiserslautern University of Applied Sciences, University of Stuttgart |
Gundelsweiler, Bernd | University of Stuttgart |
Urschel, Sven | Kaiserslautern University of Applied Sciences |
Keywords: Soft Sensors and Actuators, Hydraulic/Pneumatic Actuators, Biomimetics
Abstract: Peristaltic pumps are uniquely suited for transporting sensitive and critical fluids, but conventional designs using rollers have limitations, including mechanical abrasion of components and high wall shear stress, which can damage cell-loaded fluids such as blood. With the growing need for alternatives in fields like microfluidics, bioengineering, and medicine, we propose a magnetic peristaltic pump that eliminates rollers, reducing abrasion and mechanical stress. This pump is maintenance-free, operates silently due to magnetic actuation, and achieves a maximum flow rate of 130 mL/min, approximately 25% to 66% of that of commercial roller pumps with similar dimensions. To evaluate the pump’s performance, we conducted both simulations and experimental measurements, achieving good agreement. A multiphysics model was developed and validated to simulate the system’s dynamic behavior, including transient flow rate, pressure, wall shear stress, and Reynolds number. The simulation results show evenly distributed shear stress with peak values below 25 Pa during blood transport, representing an approximately 50% reduction in peak and time-integrated shear stress compared to roller pumps. In the pump experiments, a maximum flow rate of 130 mL/min, a peak back pressure of 4 kPa, and a maximum suction height of 35 cm were achieved.
|
|
15:15-15:20, Paper WeCT3.4 | |
Efficient Pneumatic Twisted and Coiled Actuators through Dual Enforced Anisotropy (I) |
|
Weissman, Eric | Arizona State University |
Ashcroft, Brianna | Arizona State University |
Nguyen, Phong | Arizona State University |
Sun, Jiefeng | Arizona State University |
Keywords: Soft Sensors and Actuators, Hydraulic/Pneumatic Actuators, Soft Robot Materials and Design
Abstract: The Cavatappi muscle is a novel soft actuator offering many desirable attributes such as large linear contraction with negligible radial expansion, compliance, and low cost. However, it has a low energy efficiency (9%) and requires high-pressure inputs (over 240 psi), limiting its effectiveness in robotics applications. This work proposes a fiber-reinforced pneumatic twisted-and-coiled actuator (FR-PTCA) that addresses these shortcomings by introducing a fiber reinforcement to increase the tube anisotropy. The FR-PTCA has 1) higher energy efficiency (over 19%) and 2) lower-pressure actuation compared to the Cavatappi muscle without sacrificing muscle strain (70% strain at 130 psi). In addition, this work also presents an analytical model of the FR-PTCA that can be used for design optimization and model-based control. The potential applications of these novel actuators are demonstrated through a continuum robot driven by the FR-PTCA.
|
|
15:20-15:25, Paper WeCT3.5 | |
Electrical Impedance Tomography Based Finger-Shaped Soft Artificial Skin |
|
Huang, Yunqi | University College London |
George Thuruthel, Thomas | University College London |
Keywords: Soft Sensors and Actuators, Soft Robot Materials and Design, Soft Robot Applications
Abstract: Obtaining dense contact information for feedback control is vital for robotic manipulation. However, existing tactile sensing technologies have a large footprint, making seamless integration with current robotic devices difficult. This paper presents a novel multi-layer Electrical Impedance Tomography (EIT) sensor architecture designed to provide distributed, high-density tactile sensing with a small form factor. Using our multilayer structure, we address a common issue in other EIT-based tactile skins that prevents electrodes from being placed distally from the sensing surface. Our innovative multi-layer design enables the development of complex-shaped soft sensing skins without any electronic components on the sensing surface, achieving very high accuracy. To demonstrate practical applications, we fabricated a finger-shaped three-dimensional (3D) skin and conducted experiments to collect real-world data. We developed a perception model for the tactile sensor by employing data-driven machine learning methods to predict press localization and force with high accuracy based on EIT signals. Our work presents a significant step towards developing whole body full soft tactile sensors with a small form factor.
|
|
15:25-15:30, Paper WeCT3.6 | |
M3D-Skin: Multi-Material 3D-Printed Tactile Sensor with Hierarchical Infill Structures for Pressure Sensing |
|
Yoshimura, Shunnosuke | The University of Tokyo |
Kawaharazuka, Kento | The University of Tokyo |
Okada, Kei | The University of Tokyo |
Keywords: Soft Sensors and Actuators, Additive Manufacturing, Force and Tactile Sensing
Abstract: Tactile sensors have a wide range of applications, from utilization in robotic grippers to human motion measurement. If tactile sensors could be fabricated and integrated more easily, their applicability would further expand. In this study, we propose a tactile sensor—M3D-skin—that can be easily fabricated with high versatility by leveraging the infill patterns of a multi-material fused deposition modeling (FDM) 3D printer as the sensing principle. This method employs conductive and non-conductive flexible filaments to create a hierarchical structure with a specific infill pattern. The flexible hierarchical structure deforms under pressure, leading to a change in electrical resistance, enabling the acquisition of tactile information. We measure the changes in characteristics of the proposed tactile sensor caused by modifications to the hierarchical structure. Additionally, we demonstrate the fabrication and use of a multi-tile sensor. Furthermore, as applications, we implement motion pattern measurement on the sole of a foot, integration with a robotic hand, and tactile-based robotic operations. Through these experiments, we validate the effectiveness of the proposed tactile sensor.
|
|
WeCT4 |
404 |
Surgical Robotics: Laparoscopy |
Regular Session |
Chair: Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore (NUS) |
|
15:00-15:05, Paper WeCT4.1 | |
Label-Supervised Surgical Instrument Segmentation Using Temporal Equivariance and Semantic Continuity |
|
Wang, Qiyuan | University of Science and Technology of China |
Liu, Yanzhe | Chinese People's Liberation Army (PLA) General Hospital |
Zhao, Shang | University of Science and Technology of China |
Liu, Rong | Chinese People's Liberation Army (PLA) General Hospital |
Zhou, S Kevin | USTC |
Keywords: Surgical Robotics: Laparoscopy, Object Detection, Segmentation and Categorization, Semantic Scene Understanding
Abstract: In robotic surgery, instrument presence labels are typically recorded alongside video streams, offering a cost-effective alternative to manual annotations for segmentation tasks. Label-supervised surgical instrument segmentation (SIS), a weakly supervised segmentation setting where only instrument presence labels are available, remains underexplored due to its inherently ill-posed nature. Temporal information plays a vital role in capturing sequential dependencies, thereby enhancing representation learning even under incomplete supervision. This paper extends a two-stage label-supervised segmentation framework by leveraging the temporal characteristics of surgical videos from three perspectives. First, a temporal equivariance constraint is introduced to enforce pixel-level consistency across adjacent frames. Second, a class-aware semantic continuity constraint is applied to preserve coherence between global and local regions over time. Third, temporally-enhanced pseudo masks are generated from consecutive frames to suppress irrelevant regions and improve segmentation accuracy. We evaluate our method on two surgical video datasets: the Cholec80 cholecystectomy benchmark and a real-world robotic left lateral segmentectomy (RLLS) dataset. Instance-level instrument annotations, sampled at regular intervals and validated by an experienced clinician, provide a reliable basis for evaluation. Experimental results demonstrate that our method consistently achieves favorable performance compared with state-of-the-art methods. These findings highlight the effectiveness of incorporating temporal constraints into label-supervised frameworks, offering a promising strategy to reduce annotation costs and advance surgical video analysis.
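The first of the three temporal constraints, pixel-level consistency across adjacent frames, reduces to a loss of roughly the shape below (a simplified PyTorch sketch; the paper may align or warp frames before comparing, which is omitted here).

import torch.nn.functional as F

def temporal_equivariance_loss(logits_t, logits_t1):
    """Penalize drift between per-pixel class distributions of two adjacent
    frames, each shaped (B, C, H, W). Assumes the frames are roughly
    aligned."""
    return F.mse_loss(logits_t.softmax(dim=1), logits_t1.softmax(dim=1))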
|
|
15:05-15:10, Paper WeCT4.2 | |
Handheld Confocal Endomicroscope System with Tremor Compensation for Retinal Imaging |
|
Lee, Myung Ho | DGIST |
Cho, Gichan | DGIST |
Im, Jintaek | DGIST |
Na, Jongyeol | Daegu Gyeongbuk Institute of Science and Technology |
Song, Cheol | DGIST |
Keywords: Surgical Robotics: Laparoscopy, Surgical Robotics: Steerable Catheters/Needles, Sensor-based Control
Abstract: Advancements in biophotonics have driven the development of miniaturized imaging probes for high-resolution in vivo imaging. Probe-based confocal laser endomicroscopy (pCLE) enables cellular-level visualization of tissues but remains challenging for retinal imaging due to the need for non-contact operation, tremor compensation, and precise focal control. This study introduces a novel handheld confocal endomicroscope system that integrates a custom-built imaging probe, an optical coherence tomography (OCT) distance sensor, and motor-assisted tremor suppression to improve imaging stability and resolution. The system employs a fiber-based common-path swept-source OCT (CPSS-OCT) sensor to maintain a stable focal distance while compensating for involuntary hand tremors using motorized stabilization. A gated recurrent unit (GRU)-based tremor prediction algorithm further enhances image stability. The imaging probe features a PZT tube-driven fiber cantilever resonance for Lissajous scanning, providing a wide field of view with minimal image distortion. In experiments using bovine eye samples, the CR score improved from 0.318 to 0.472, with a 48.43% increase in the in-focus condition when tremor compensation was activated, confirming enhanced image clarity and stability. Experimental results demonstrate that the system effectively stabilizes imaging, reduces motion artifacts, and ensures high-resolution, non-contact retinal imaging. By addressing the limitations of conventional pCLE devices, this system represents a significant advancement in ophthalmic imaging and can potentially improve retinal diagnostics and precision-guided interventions.
|
|
15:10-15:15, Paper WeCT4.3 | |
Physics-Informed LSTM for Shape and Contact Force Prediction of a Flexible Surgical Robot |
|
Ju, Feng | Nanjing University of Aeronautics and Astronautics |
Wang, Chen | Nanjing University of Aeronautics and Astronautics |
Wang, Yingying | Nanjing University of Aeronautics and Astronautics |
Wang, Yuxing | Nanjing University of Aeronautics and Astronautics |
Ding, Liping | Nanjing University of Aeronautics and Astronautics |
Keywords: Surgical Robotics: Laparoscopy, Force and Tactile Sensing, Soft Robot Applications
Abstract: Real-time morphological perception and precise end force feedback prediction of surgical robots constitute critical technical elements for ensuring safety and efficacy in complex interventional procedures such as Endoscopic Retrograde Cholangiopancreatography (ERCP). In this paper, we design a miniature flexible surgical robot (FSR) with a nested spring structure and propose a physics-informed deep learning approach to simultaneously predict both the FSR's shape and 2D contact forces at its end-effector. The physical constraints are derived from a quasi-static model of the FSR, which is capable of characterizing persistent environmental interactions. Our method eliminates the need for end-effector sensors, not only ensuring high accuracy in both shape and contact force predictions but also maintaining consistent predictive performance under continuous environmental interactions. Experimental validation of the method revealed high consistency between predicted values and reference data, achieving a 34.97% improvement in computational speed and a maximum prediction accuracy enhancement of 71.64% compared to conventional LSTM approaches.
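Physics-informed training of this kind typically minimizes a data term plus a weighted model-residual term. The generic form below is a sketch in our notation, with r the quasi-static model residual and lambda a weighting hyperparameter, not the paper's exact loss:

\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\lVert \hat{y}_i(\theta) - y_i \rVert^2 \;+\; \lambda\,\frac{1}{M}\sum_{j=1}^{M}\lVert r\!\left(\hat{y}_j(\theta)\right) \rVert^2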
|
|
15:15-15:20, Paper WeCT4.4 | |
Learning to Perform Low-Contact Autonomous Nasotracheal Intubation by Recurrent Action-Confidence Chunking with Transformer |
|
Tian, Yu | The Chinese University of Hong Kong |
Hao, Ruoyi | The Chinese University of Hong Kong |
Huang, Yiming | The Chinese University of Hong Kong |
Xie, Dihong | The Chinese University of Hong Kong |
Chan, Catherine Po Ling | The Chinese University of Hong Kong |
Chan, Jason Ying-Kuen | The Chinese University of Hong Kong |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore (NUS) |
Keywords: Surgical Robotics: Planning, Learning from Demonstration, Imitation Learning
Abstract: Nasotracheal intubation (NTI) is critical for establishing artificial airways in clinical anesthesia and critical care. Current manual methods face significant challenges, including cross-infection, especially during respiratory infection care, and insufficient control of endoluminal contact forces, increasing the risk of mucosal injuries. While existing studies have focused on automated endoscopic insertion, the automation of NTI remains unexplored despite its unique challenges: Nasotracheal tubes exhibit greater diameter and rigidity than standard endoscopes, substantially increasing insertion complexity and patient risks. We propose a novel autonomous NTI system with two key components to address these challenges. First, an autonomous NTI system is developed, incorporating a prosthesis embedded with force sensors, allowing for safety assessment and data filtering. Then, the Recurrent Action-Confidence Chunking with Transformer (RACCT) model is developed to handle complex tube-tissue interactions and partial visual observations. Experimental results demonstrate that the RACCT model outperforms the ACT model in all aspects and achieves a 66% reduction in average peak insertion force compared to manual operations while maintaining equivalent success rates. This validates the system's potential for reducing infection risks and improving procedural safety.
|
|
15:20-15:25, Paper WeCT4.5 | |
Autonomous Dissection in Robotic Cholecystectomy |
|
Oh, Ki-Hwan | University of Illinois at Chicago |
Borgioli, Leonardo | University of Illinois Chicago |
Zefran, Milos | University of Illinois at Chicago |
Valle, Valentina | Surgical Innovation and Training Lab, Department of Surgery, Col |
Giulianotti, Pier Cristoforo | Surgical Innovation and Training Lab, Department of Surgery, Col |
Keywords: Surgical Robotics: Laparoscopy, Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: Robotic surgery offers enhanced precision and adaptability, paving the way for automation in surgical interventions. Cholecystectomy, the removal of the gallbladder, is particularly well-suited for automation due to its standardized procedural steps and distinct anatomical boundaries. A key challenge in automating this procedure is dissecting with accuracy and adaptability. This paper presents a vision-based autonomous robotic dissection architecture that integrates real-time segmentation, keypoint detection, grasping and stretching the gallbladder with the left arm, and dissecting with the other arm. We introduce an improved segmentation dataset based on videos of robotic cholecystectomy performed by various surgeons, incorporating a new "liver bed" class to enhance boundary tracking after multiple rounds of dissection. Our system employs state-of-the-art segmentation models and an adaptive boundary extraction method that maintains accuracy despite tissue deformations and visual variations. Moreover, building on our previous work, we implemented an automated grasping and pulling strategy to optimize tissue tension before dissection. Ex vivo evaluations on porcine livers demonstrate that our framework significantly improves dissection precision and consistency, marking a step toward fully autonomous robotic cholecystectomy.
|
|
15:25-15:30, Paper WeCT4.6 | |
CRESSim-MPM: A Material Point Method Library for Surgical Soft Body Simulation with Cutting and Suturing |
|
Ou, Yafei | University of Alberta |
Tavakoli, Mahdi | University of Alberta |
Keywords: Surgical Robotics: Laparoscopy, Simulation and Animation, Virtual Reality and Interfaces
Abstract: A number of recent studies have focused on developing surgical simulation platforms to train machine learning (ML) agents or models with synthetic data for surgical assistance. While existing platforms excel at tasks such as rigid body manipulation and soft body deformation, they struggle to simulate more complex soft body behaviors like cutting and suturing. A key challenge lies in modeling soft body fracture and splitting using the finite-element method (FEM), which is the predominant approach in current platforms. Additionally, the two-way suture needle/thread contact inside a soft body is further complicated when using FEM. In this work, we use the material point method (MPM) for such challenging simulations and propose new rigid geometries and soft-rigid contact methods specifically designed for them. We introduce CRESSim-MPM, a GPU-accelerated MPM library that integrates multiple MPM solvers and incorporates surgical geometries for cutting and suturing, serving as a specialized physics engine for surgical applications. It is further integrated into Unity, requiring minimal modifications to existing projects for soft body simulation. We demonstrate the simulator's capabilities in real-time simulation of cutting and suturing on soft tissue and provide an initial performance evaluation of different MPM solvers when simulating varying numbers of particles. The source code is available at https://github.com/yafei-ou/CRESSim-MPM.
|
|
15:30-15:35, Paper WeCT4.7 | |
Autonomous Suturing Method for Robot-Assisted Minimally Invasive Surgery |
|
Feng, Mei | Jilin University China |
Li, Haoju | Jilin University |
Li, Yao | Jilin University |
Yang, Kun | Aviation University of Air Force |
He, Dong | Jilin University |
Lu, Xiuquan | Jilin University |
Keywords: Surgical Robotics: Planning, Task and Motion Planning, Deep Learning Methods
Abstract: Robot-assisted minimally invasive surgery is widely used because of its superior postoperative recovery outcomes. However, the workload for surgeons remains high. The development of autonomous suturing capabilities in surgical robots is poised to significantly reduce surgeon workload. In this study, we present a novel method for autonomous suturing using a minimally invasive surgical robot. We quantify the surgical suturing requirements and propose corresponding metrics for evaluating the suturing effect. We also use dynamic adjustment of stitch positions to optimize the robot's autonomous suturing scheme. Furthermore, we employ particle swarm algorithms to enhance the grasping posture of surgical instruments, enabling the robot to achieve optimal suture needle clamping. Our method matches the level of an expert operator on the parametric suturing metrics when suturing two types of wounds: gauze and egg membrane. The proposed autonomous suturing method is currently deployed on our own surgical robot. The experimental results show that its stitching effect is already close to that of surgeons using the same robot, and it maintains good consistency across multiple sets of experiments. The method can be generalized to various other surgical robots, laying the foundation for surgical robots to achieve fully autonomous surgery.
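For reference, the particle-swarm step used for grasp-posture optimization looks like the minimal sketch below; the cost function, bounds, and hyperparameters are placeholders, not the paper's settings.

import numpy as np

def pso(cost, dim, n=30, iters=100, w=0.7, c1=1.5, c2=1.5, bounds=(-1.0, 1.0)):
    """Minimal particle swarm optimizer: each particle is attracted to its
    personal best and the swarm's global best while keeping momentum."""
    lo, hi = bounds
    x = np.random.uniform(lo, hi, (n, dim))
    v = np.zeros((n, dim))
    pbest = x.copy()
    pbest_cost = np.array([cost(p) for p in x])
    for _ in range(iters):
        gbest = pbest[pbest_cost.argmin()]
        r1, r2 = np.random.rand(n, dim), np.random.rand(n, dim)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        c = np.array([cost(p) for p in x])
        improved = c < pbest_cost
        pbest[improved], pbest_cost[improved] = x[improved], c[improved]
    return pbest[pbest_cost.argmin()]

# e.g. pso(lambda q: float(np.sum(q ** 2)), dim=3) converges toward the origin.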
|
|
15:35-15:40, Paper WeCT4.8 | |
Differential Six-Axis Force and Torque Measurement in a Prototype Robotic Surgical Instrument |
|
Neykov, Daniel | Resense GmbH |
Markert, Timo | Resense GmbH |
Hellinger, Niklas | Resense GmbH |
Theissler, Andreas | Aalen University of Applied Sciences |
Atzmueller, Martin | Osnabrück University, Institute of Computer Science, Semantic In |
Matich, Sebastian | WITTENSTEIN SE |
Keywords: Surgical Robotics: Laparoscopy, Haptics and Haptic Interfaces, Medical Robots and Systems
Abstract: In robot-assisted minimally invasive surgery (RMIS), the absence of haptic feedback presents a significant challenge for surgeons in accurately gauging the forces applied during procedures. However, obtaining precise force/torque (F/T) information at the surgical site is challenging. One key obstacle is distinguishing external forces from those induced by the cable-actuated kinematics of the surgical tool. We present a novel method to eliminate this interference by employing differential F/T measurement. We utilize two miniature 6-axis F/T sensors, positioned proximally and distally, to counterbalance the undesired forces and torques generated by the cable-driven system. To demonstrate the efficacy of this approach, we developed an experimental cable-actuated forceps with two degrees of freedom. We conducted a series of dynamic tests, attaching various weight configurations to the gripper to simulate external forces ranging from 0.5 N to 1.5 N. Subsequently, we evaluated three measurement methods: raw distal sensor readings, differential compensation, and a multilayer perceptron (MLP) that processes a sliding window of inputs from both sensors and actuators. Differential compensation improves performance by 70% over the distal sensor alone, achieving a root-mean-square error (RMSE) of 0.15 N and 3 mNm across the entire dataset. The MLP yields a further improvement of 90% lower RMSE relative to the distal sensor, achieving 0.05 N and 0.5 mNm on a test subset of the data not used for training.
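The differential compensation itself reduces to a subtraction once both readings share a frame; the sketch below assumes a pre-calibrated common frame, which in practice requires transforming the proximal wrench before subtracting.

import numpy as np

def external_wrench(distal, proximal):
    """Differential F/T compensation: both 6-axis readings (Fx,Fy,Fz,Tx,Ty,Tz)
    contain the cable-actuation wrench; subtracting cancels the shared cable
    term, leaving the external contact wrench."""
    return np.asarray(distal, dtype=float) - np.asarray(proximal, dtype=float)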
|
|
WeCT5 |
407 |
AI-Based Methods |
Regular Session |
Chair: Miao, Fei | University of Connecticut |
|
15:00-15:05, Paper WeCT5.1 | |
ContextCache: Task-Aware Lifecycle Management for Memory-Efficient LLM Agent Deployment |
|
Tao, Liu | InchiTech |
Guo, Ping | Intel |
Feng, Dong | InchiTech |
Wang, Peng | Intel |
Keywords: Agent-Based Systems, AI-Based Methods
Abstract: LLM-based agents have demonstrated remarkable capabilities in multi-step reasoning and task execution across domains such as robotics and autonomous systems. However, deploying these agents on resource-constrained platforms presents a fundamental challenge: reducing inference latency while maintaining efficient memory usage. Existing caching techniques (KVCache, PrefixCache, PromptCache) improve inference speed by reusing cached context but overlook LLM dependency relationships in agent workflows, leading to excessive memory usage or redundant recomputation. To address this, we propose ContextCache, a task-aware lifecycle management framework that optimizes context fragment caching for multi-step LLM agents. ContextCache predicts the lifespan of each context fragment and dynamically allocates and releases GPU memory accordingly. We evaluate our approach on a newly constructed dataset, covering logistics coordination, assembly tasks, and health management. Experimental results demonstrate a 15% reduction in memory usage compared to state-of-the-art caching strategies, with no loss in inference efficiency, making our approach well-suited for real-world deployment in resource-constrained environments.
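The lifecycle idea can be pictured with the toy cache below: each fragment carries a predicted last-use step and is released once the workflow passes it. The lifespan predictor, the paper's actual contribution, is stubbed as an input here, and all names are illustrative.

class ContextCacheSketch:
    """Toy lifecycle-managed context cache: every fragment is stored with a
    predicted last-use step and released once the workflow moves past it."""

    def __init__(self):
        self._store = {}  # key -> (fragment, predicted_last_use_step)

    def put(self, key, fragment, predicted_last_use):
        self._store[key] = (fragment, predicted_last_use)

    def get(self, key):
        entry = self._store.get(key)
        return entry[0] if entry else None

    def advance(self, current_step):
        """Evict fragments whose predicted lifespan has ended."""
        dead = [k for k, (_, last) in self._store.items() if last < current_step]
        for k in dead:
            del self._store[k]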
|
|
15:05-15:10, Paper WeCT5.2 | |
GIPD: Global Intent Prediction and Decomposition of Cooperative Multi-Robot System in Non-Communication Environments |
|
Zhao, Yu | Shanghai Jiao Tong University |
Liu, Zhe | Shanghai Jiao Tong University |
Wei, Haoyu | Shanghai Jiao Tong University |
Wang, Kai | Shanghai Jiao Tong University |
Wang, Haitao | China State Shipbuilding Corporation 726th Research Institute |
Zhai, Duwen | China State Shipbuilding Corporation 726th Research Institute |
Jin, Kefan | Shanghai Jiao Tong University |
Shao, Haibin | Shanghai Jiao Tong University |
Keywords: Agent-Based Systems, AI-Enabled Robotics
Abstract: In complex multi-robot application scenarios, particularly in dynamically adversarial, hazardous, or disaster environments, traditional cooperation paradigms face significant challenges due to unreliable or absent communication links. Achieving efficient cooperation in the absence of communication has become a key bottleneck limiting the performance of multirobot systems. In this paper, we propose a Global Intent Prediction and Decomposition (GIPD) framework that enables robots to perform cooperative behavior without relying on communication. Each robot independently infers a globally consistent intent based solely on its local observations, ensuring implicit alignment across the system. Given the inferred global intent, robots autonomously determine their responsibilities and select the most appropriate tasks. They then base their local decision-making on the global intent, selected tasks, and individual observations, thereby facilitating effective execution and cooperation. We validate our approach using the MPE and SMAC benchmarks. Additionally, real-world experiments involving multiple ships demonstrate the effectiveness and practical applicability of the proposed GIPD method.
|
|
15:10-15:15, Paper WeCT5.3 | |
YOLO-MARL: You Only LLM Once for Multi-Agent Reinforcement Learning |
|
Zhuang, Yuan | University of Connecticut |
Shen, Yi | University of Pennsylvania |
Zhang, Zhili | University of Connecticut |
Chen, Yuxiao | Nvidia Research |
Miao, Fei | University of Connecticut |
Keywords: AI-Based Methods, Deep Learning Methods, Multi-Robot Systems
Abstract: Advancements in deep multi-agent reinforcement learning (MARL) have positioned it as a promising approach for decision-making in cooperative games. However, it still remains challenging for MARL agents to learn cooperative strategies for some game environments. Recently, large language models (LLMs) have demonstrated emergent reasoning capabilities, making them promising candidates for enhancing coordination among the agents. However, due to the model size of LLMs, it can be expensive to frequently infer LLMs for actions that agents can take. In this work, we propose You Only LLM Once for MARL (YOLO-MARL), a novel framework that leverages the high-level task planning capabilities of LLMs to improve the policy learning process of multi-agents in cooperative games. Notably, for each game environment, YOLO-MARL only requires a one-time interaction with LLMs in the proposed strategy generation, state interpretation and planning function generation modules, before the MARL policy training process. This avoids the ongoing costs and computational time associated with frequent LLM API calls during training. Moreover, trained decentralized policies based on normal-sized neural networks operate independently of the LLM. We evaluate our method across two different environments and demonstrate that YOLO-MARL outperforms traditional MARL algorithms.
|
|
15:15-15:20, Paper WeCT5.4 | |
FlowMP: Learning Motion Fields for Robot Planning with Conditional Flow Matching |
|
Nguyen, Khang | University of Texas at Arlington |
Le, An Thai | Technische Universität Darmstadt |
Pham, Canh An Tien | The University of Manchester |
Huber, Manfred | University of Texas at Arlington |
Peters, Jan | Technische Universität Darmstadt |
Vu, Minh Nhat | TU Wien, Austria |
Keywords: AI-Based Methods, Imitation Learning
Abstract: Prior flow matching methods in robotics have primarily learned velocity fields to morph one distribution of trajectories into another. In this work, we extend flow matching to capture second-order trajectory dynamics, incorporating acceleration effects either explicitly in the model or implicitly through the learning objective. Unlike diffusion models, which rely on a noisy forward process and iterative denoising steps, flow matching trains a continuous transformation (flow) that directly maps a simple prior distribution to the target trajectory distribution without any denoising procedure. By modeling trajectories with second-order dynamics, our approach ensures that the generated robot motions are smooth and physically executable, avoiding the jerky or dynamically infeasible trajectories that first-order models might produce. We empirically demonstrate that this second-order conditional flow matching yields superior performance on motion planning benchmarks, achieving smoother trajectories and higher success rates than baseline planners. These findings highlight the advantage of learning acceleration-aware motion fields, as our method outperforms existing motion planning methods in terms of trajectory quality and planning success.
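A generic way to write such a second-order conditional flow matching objective is shown below, where x_t interpolates between a prior sample x_0 and a target trajectory x_1, dots denote time derivatives, and beta weights the acceleration-matching term; this is our reading of the abstract, not necessarily the paper's exact loss:

\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\left[\,\lVert v_\theta(x_t,t) - \dot{x}_t \rVert^2 \;+\; \beta\,\lVert a_\theta(x_t,t) - \ddot{x}_t \rVert^2 \right]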
|
|
15:20-15:25, Paper WeCT5.5 | |
AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning |
|
Kong, Yangzhe | George Mason University |
Song, Daeun | George Mason University |
Liang, Jing | University of Maryland |
Manocha, Dinesh | University of Maryland |
Yao, Ziyu | George Mason University |
Xiao, Xuesu | George Mason University |
Keywords: AI-Based Methods, AI-Enabled Robotics, Deep Learning Methods
Abstract: We present AutoSpatial, an efficient approach with structured spatial grounding to enhance VLMs' spatial reasoning. By combining minimal manual supervision with large-scale auto-labeling of Visual Question-Answering (VQA) pairs, our approach tackles the challenge of VLMs' limited spatial understanding in social navigation tasks. By applying a hierarchical two-round VQA strategy during training, AutoSpatial achieves both global and detailed understanding of scenarios, demonstrating more accurate spatial perception, movement prediction, Chain of Thought (CoT) reasoning, final action, and explanation compared to other SOTA approaches. These five components are essential for comprehensive social navigation reasoning. Our approach was evaluated using both expert systems (GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet), which provided cross-validation scores, and human evaluators, who assigned relative rankings to compare model performances across four key aspects. Aided by its enhanced spatial reasoning capabilities, AutoSpatial demonstrates substantial improvements, as measured by averaged cross-validation scores from the expert systems, in: perception & prediction (up to 10.71%), reasoning (up to 16.26%), action (up to 20.50%), and explanation (up to 18.73%) compared to baseline models trained only on manually annotated data.
|
|
15:25-15:30, Paper WeCT5.6 | |
DiffGen: Robot Demonstration Generation Via Differentiable Physics Simulation, Differentiable Rendering, and Vision-Language Model |
|
Jin, Yang | Shanghai Jiao Tong University |
Lv, Jun | Shanghai Jiao Tong University |
Jiang, Shuqiang | Institute of Computing Technology, Chinese Academy of Sciences |
Lu, Cewu | Shanghai Jiao Tong University |
Keywords: AI-Based Methods, Deep Learning Methods, Data Sets for Robot Learning
Abstract: Generating robot demonstrations through simulation is widely recognized as an effective way to scale up robot data. Previous work often trained reinforcement learning agents to generate expert policies, but this approach lacks sample efficiency. Recently, a line of work has attempted to generate robot demonstrations via differentiable simulation, which is promising but heavily relies on reward design, a labor-intensive process. In this paper, we propose DiffGen, a novel framework that integrates differentiable physics simulation, differentiable rendering, and a vision-language model to enable automatic and efficient generation of robot demonstrations. Given a simulated robot manipulation scenario and a natural language instruction, DiffGen can generate realistic robot demonstrations by minimizing the distance between the embedding of the language instruction and the embedding of the simulated observation after manipulation in representation space. The embeddings are obtained from the vision-language model, and the optimization is achieved by calculating and descending gradients through the differentiable simulation, differentiable rendering, and vision-language model components. Experiments demonstrate that with DiffGen, we can efficiently and effectively generate robot data with minimal human effort or training time. Videos of the results can be accessed at https://sites.google.com/view/diffgen.
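The core optimization loop can be sketched as follows; `diff_sim`, `diff_render`, and `vlm` are hypothetical stand-ins for the differentiable simulator, renderer, and vision-language model, and the cosine-distance objective is one plausible instantiation of the embedding distance described above.

```python
# Illustrative sketch of DiffGen-style optimization (placeholder interfaces,
# not the authors' API).
import torch

def optimize_actions(diff_sim, diff_render, vlm, instruction, init_actions,
                     steps=200, lr=1e-2):
    actions = init_actions.clone().requires_grad_(True)
    text_emb = vlm.encode_text(instruction).detach()  # fixed target embedding
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        state = diff_sim(actions)          # differentiable physics rollout
        image = diff_render(state)         # differentiable rendering
        img_emb = vlm.encode_image(image)  # VLM image embedding
        # Cosine distance between instruction and observation embeddings;
        # gradients flow back through the VLM, renderer, and simulator.
        loss = 1 - torch.nn.functional.cosine_similarity(
            img_emb, text_emb, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return actions.detach()
```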
|
|
15:30-15:35, Paper WeCT5.7 | |
Graph2Scene: Versatile 3D Indoor Scene Generation with Interaction-Aware Scene Graph |
|
Chen, Minglin | Sun Yat-Sen University |
Yang, Rongkun | Sun Yat-Sen University |
Hu, Qibin | Sun Yat-Sen University |
Xue, Kaiwen | The Chinese University of Hong Kong, Shenzhen |
Zhou, Shunbo | Huawei |
Guo, Yulan | Sun Yat-Sen University |
Keywords: AI-Based Methods
Abstract: Embodied artificial intelligence requires a wide variety of large-scale simulated environments for development. Previous scene reconstruction approaches based on multiview images can produce high-fidelity 3D scenes but lack diversity. In contrast, existing prompt-based scene generation approaches can produce diverse scenes but lack fine-grained control. To bridge these two fields, the scene graph provides the key relationships within a scene while offering flexible control. However, 3D scene generation from scene graphs is challenging and under-explored. In this paper, we propose a scene graph-based 3D indoor scene generation method for efficient simulated environment creation that maintains both high diversity and fine-grained control. Specifically, we first introduce an interaction-aware scene graph to merge object nodes with hierarchical interaction relationships, which alleviates levitation and interference issues during scene generation. Then, we employ a large language model (LLM) for instruction-driven 3D layout generation with carefully designed prompts. Finally, a large 3D generation model is utilized to generate the content for each node of the interaction-aware scene graph, which is then transformed based on the corresponding bounding box in the 3D layout. The experiments demonstrate that the proposed method achieves state-of-the-art performance on 3D indoor scene generation. Additionally, the proposed method exhibits fine-grained control at the object level while providing high diversity of layouts, geometry, and textures.
|
|
15:35-15:40, Paper WeCT5.8 | |
Fast Policy: Accelerating Visuomotor Policies without Re-Training |
|
Wu, Tongshu | Tongji University |
Wang, Zheng | Tongji University |
Keywords: AI-Based Methods, Imitation Learning
Abstract: Diffusion models are increasingly employed in visuomotor policies to achieve promising behavior cloning performance. However, the slow inference caused by iterative denoising is a notorious disadvantage that greatly limits their application in resource-limited, real-time interactive robot systems. The prevailing remedy is distillation, but it still requires considerable resources to retrain a student model. We instead take a training-free view and develop a novel Fast Policy (FP), which can be regarded as a powerful and accelerated alternative to Diffusion Policy for learning visuomotor robot control. Specifically, our comprehensive study of the UNet encoder shows that its features change little during inference, prompting us to reuse encoder features in non-critical denoising steps. In addition, we design strategies based on Fourier energy to dynamically screen critical and non-critical steps according to the task. Importantly, to mitigate the performance degradation caused by repeatedly reusing features at non-critical steps, we further introduce a noise correction strategy. FP is evaluated on multiple simulation benchmarks, and comparisons with existing speed-up methods demonstrate its effectiveness and superiority, achieving state-of-the-art success rates at much faster visuomotor inference speeds. The code is available at https://github.com/xwccchong/Fast-Policy
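A minimal sketch of the training-free feature-reuse idea, assuming a diffusers-style scheduler interface and a UNet split into `encoder`/`decoder` halves (both assumptions); the Fourier criterion below is a simplified proxy for the paper's step-screening strategy, and the noise correction step is omitted:

```python
# Training-free encoder-feature reuse across denoising steps (sketch).
import torch

def high_freq_energy(x, cutoff=0.5):
    """Fraction of spectral energy above a cutoff frequency."""
    spec = torch.fft.rfft(x.flatten(1), dim=1).abs() ** 2
    k = int(cutoff * spec.shape[1])
    return spec[:, k:].sum() / (spec.sum() + 1e-8)

@torch.no_grad()
def fast_denoise(unet, scheduler, x, cond, tau=0.02):
    cached_feats = None
    for t in scheduler.timesteps:
        # A step counts as "critical" when the input carries enough
        # high-frequency content that stale encoder features would mislead
        # the decoder; otherwise the cached features are reused.
        critical = cached_feats is None or high_freq_energy(x) > tau
        if critical:
            cached_feats = unet.encoder(x, t, cond)   # recompute encoder
        eps = unet.decoder(x, t, cond, cached_feats)  # decoder always runs
        x = scheduler.step(eps, t, x).prev_sample
    return x
```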
|
|
WeCT6 |
301 |
Deep Learning in Grasping and Manipulation 3 |
Regular Session |
|
15:00-15:05, Paper WeCT6.1 | |
Physics-Informed Neural Time Fields for Prehensile Object Manipulation |
|
Ren, Hanwen | Purdue University |
Ni, Ruiqi | Purdue University |
Qureshi, Ahmed H. | Purdue University |
Keywords: Deep Learning in Grasping and Manipulation, Representation Learning
Abstract: Object manipulation skills are necessary for robots operating in various daily-life scenarios, ranging from warehouses to hospitals. They allow robots to move a given object into its desired arrangement in cluttered environments. Existing approaches to object manipulation either rely on inefficient sampling-based techniques, require expert demonstrations, or learn by trial and error, making them less suitable for practical scenarios. In this paper, we propose a novel, multimodal physics-informed neural network (PINN) for solving object manipulation tasks. Our approach efficiently learns to solve the Eikonal equation without expert data and quickly finds object manipulation trajectories in complex, cluttered environments. Our method is multimodal as it also reactively replans the robot's grasps during manipulation to achieve the desired object poses. We demonstrate our approach in both simulation and real-world scenarios and compare it against state-of-the-art baseline methods. The results indicate that our approach is effective across various objects, trains more efficiently than previous learning-based methods, and demonstrates high performance in planning time, trajectory length, and success rate. Our demonstration videos can be found at https://youtu.be/FaQLkTV9knI.
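The physics-informed core reduces to penalizing the Eikonal residual of a learned time field; a minimal sketch of that principle (the multimodal architecture and grasp replanning are not reproduced here) might look like:

```python
# Physics-informed Eikonal loss for a learned time field T(start, goal).
# The network is trained so that |grad_goal T| matches the inverse local
# speed S(goal), which is low near obstacles and high in free space.
import torch

def eikonal_loss(time_field, speed_fn, x_start, x_goal):
    x_goal = x_goal.requires_grad_(True)
    T = time_field(x_start, x_goal)  # predicted travel time, shape [B]
    grad_T = torch.autograd.grad(T.sum(), x_goal, create_graph=True)[0]
    residual = grad_T.norm(dim=-1) - 1.0 / speed_fn(x_goal)
    return (residual ** 2).mean()
```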
|
|
15:05-15:10, Paper WeCT6.2 | |
MISCGrasp: Leveraging Multiple Integrated Scales and Contrastive Learning for Enhanced Volumetric Grasping |
|
Fan, Qingyu | University of Chinese Academy of Sciences |
Cai, Yinghao | Institute of Automation, Chinese Academy of Sciences |
Li, Chao | Qiyuan Lab |
Jiao, Chunting | Chongqing University |
Zheng, Xudong | Qiyuan Lab |
Lu, Tao | Institute of Automation, Chinese Academy of Sciences |
Liang, Bin | Qiyuan Lab |
Wang, Shuo | Chinese Academy of Sciences |
Keywords: Deep Learning in Grasping and Manipulation, Grasping, Perception for Grasping and Manipulation
Abstract: Robotic grasping faces challenges in adapting to objects with varying shapes and sizes. In this paper, we introduce MISCGrasp, a volumetric grasping method that integrates multi-scale feature extraction with contrastive feature enhancement for self-adaptive grasping. We propose a query-based interaction between high-level and low-level features through the Insight Transformer, while the Empower Transformer selectively attends to the highest-level features, which synergistically strikes a balance between focusing on fine geometric details and overall geometric structures. Furthermore, MISCGrasp utilizes multi-scale contrastive learning to exploit similarities among positive grasp samples, ensuring consistency across multi-scale features. Extensive experiments in both simulated and real-world environments demonstrate that MISCGrasp outperforms baseline and variant methods in tabletop decluttering tasks.
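The multi-scale contrastive component can be illustrated with an InfoNCE-style loss that pulls together features of the same grasp computed at two scales; this is a hedged sketch, and the actual MISCGrasp formulation may differ:

```python
# InfoNCE-style cross-scale consistency loss (illustrative sketch).
import torch
import torch.nn.functional as F

def multiscale_contrastive_loss(feat_a, feat_b, temperature=0.07):
    """feat_a, feat_b: [N, D] features of the same N grasps at two scales."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    logits = a @ b.t() / temperature  # [N, N] similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)
    # Matching grasps across scales are positives; all others are negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```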
|
|
15:10-15:15, Paper WeCT6.3 | |
Exploiting Policy Idling for Dexterous Manipulation |
|
Chen, Annie | Stanford University |
Brakel, Philemon | Deepmind |
Bronars, Antonia | MIT |
Xie, Annie | Stanford University |
Huang, Sandy H. | Google DeepMind |
Groth, Oliver | University of Oxford |
Bauza Villalonga, Maria | Massachusetts Institute of Technology |
Wulfmeier, Markus | Google DeepMind |
Heess, Nicolas | Google Deepmind |
Rao, Dushyant | Google DeepMind |
Keywords: Deep Learning in Grasping and Manipulation, Imitation Learning, Learning from Experience
Abstract: Learning-based methods for dexterous manipulation have made notable progress in recent years and can now produce solutions to complex tasks. However, learned policies often still lack reliability and exhibit limited robustness to important factors of variation. One failure pattern that can be observed across many settings is that policies idle, i.e., they cease to move beyond a small region of states, often indefinitely, when they reach certain states. This policy idling is often a reflection of the training data. For instance, it can occur when the data contains small actions in areas where the robot needs to perform high-precision motions, e.g., when preparing to grasp an object or during object insertion. Prior works have tried to mitigate this phenomenon, e.g., by filtering the training data or modifying the control frequency. However, these approaches can negatively impact policy performance in other ways. As an alternative, we investigate how to leverage the detectability of idling behavior to inform exploration and policy improvement. Our approach, Pause-Induced Perturbations (PIP), applies perturbations at detected idling states, thus helping the policy escape problematic basins of attraction. On a range of challenging simulated dual-arm tasks, we find that this simple approach can already noticeably improve test-time performance, with no additional supervision or training. Furthermore, since the robot tends to idle at critical points in a movement, we also find that learning from the resulting episodes leads to better iterative policy improvement compared to prior approaches. Our perturbation strategy also leads to a 15-35% improvement in absolute success rate on a real-world insertion task that requires complex multi-finger manipulation.
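A minimal sketch of the PIP idea, with assumed environment and policy interfaces (the idling criterion here, small end-effector displacement over a window, is one simple detector, not necessarily the paper's exact rule):

```python
# Pause-Induced Perturbations, sketched with gym-style interfaces.
import numpy as np

def run_episode_with_pip(env, policy, horizon=500, window=20,
                         idle_eps=1e-3, noise_scale=0.05):
    obs = env.reset()
    recent = []  # recent end-effector positions (assumed in obs["ee_pos"])
    for _ in range(horizon):
        action = policy(obs)
        recent.append(np.asarray(obs["ee_pos"]))
        if len(recent) > window:
            recent.pop(0)
            span = np.linalg.norm(np.ptp(np.stack(recent), axis=0))
            if span < idle_eps:  # policy has stalled in a small state region
                # Perturb the action to escape the basin of attraction.
                action = action + np.random.normal(
                    0.0, noise_scale, size=action.shape)
                recent.clear()
        obs, reward, done, info = env.step(action)
        if done:
            break
```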
|
|
15:15-15:20, Paper WeCT6.4 | |
ReBot: Scaling Robot Learning with Real-To-Sim-To-Real Robotic Video Synthesis |
|
Fang, Yu | University of North Carolina at Chapel Hill |
Yang, Yue | The University of North Carolina at Chapel Hill |
Zhu, Xinghao | University of California, Berkeley |
Zheng, Kaiyuan | University of Washington |
Bertasius, Gedas | UNC Chapel Hill |
Szafir, Daniel J. | University of North Carolina at Chapel Hill |
Ding, Mingyu | University of North Carolina at Chapel Hill |
Keywords: Deep Learning in Grasping and Manipulation
Abstract: Vision-language-action (VLA) models present a promising paradigm by training policies directly on real robot datasets like Open X-Embodiment. However, the high cost of real-world data collection hinders further data scaling, thereby restricting the generalizability of VLAs. In this paper, we introduce ReBot, a novel real-to-sim-to-real approach for scaling real robot datasets and adapting VLA models to target domains, addressing the last-mile deployment challenge in robot manipulation. Specifically, ReBot replays real-world robot trajectories in simulation to diversify manipulated objects (real-to-sim), and integrates the simulated movements with inpainted real-world backgrounds to synthesize physically realistic and temporally consistent robot videos (sim-to-real). Our approach has several advantages: 1) it enjoys the benefit of real data to minimize the sim-to-real gap; 2) it leverages the scalability of simulation; and 3) it can generalize a pretrained VLA to a target domain with fully automated data pipelines. Extensive experiments in both simulation and real-world environments show that ReBot significantly enhances the performance and robustness of VLAs. For example, in SimplerEnv with the WidowX robot, ReBot improves the in-domain performance of Octo by 7.2% and OpenVLA by 21.8%, and out-of-domain generalization by 19.9% and 9.4%, respectively. For real-world evaluation with a Franka robot, ReBot increases the success rates of Octo by 17% and OpenVLA by 20%.
|
|
15:20-15:25, Paper WeCT6.5 | |
Mamba Policy: Towards Efficient 3D Diffusion Policy with Hybrid Selective State Models |
|
Cao, Jiahang | The Hong Kong University of Science and Technology (Guangzhou) |
Zhang, Qiang | The Hong Kong University of Science and Technology (Guangzhou) |
Sun, Jingkai | The Hong Kong University of Science and Technology (GZ) |
Wang, Jiaxu | Hong Kong University of Science and Technology; Hong Kong Univer |
Cheng, Hao | The Hong Kong University of Science and Technology (Guangzhou) |
Li, Yulin | Hong Kong University of Science and Technology (HKUST) |
Ma, Jun | The Hong Kong University of Science and Technology |
Wu, Kun | Syracuse University |
Xu, Zhiyuan | Midea Group |
Shao, Yecheng | Zhejiang University |
Zhao, Wen | Beijing Innovation Center of Humanoid Robotics |
Han, Gang | Beijing Innovation Center of Humanoid Robotics |
Guo, Yijie | Beijing Innovation Center of Humanoid Robotics |
Xu, Renjing | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Deep Learning in Grasping and Manipulation, Imitation Learning
Abstract: Diffusion models have been widely employed in the field of 3D manipulation due to their efficient capability to learn distributions, allowing for precise prediction of action trajectories. However, diffusion models typically rely on large-parameter UNet backbones as policy networks, which can be challenging to deploy on resource-constrained devices. Recently, the Mamba model has emerged as a promising solution for efficient modeling, offering low computational complexity and strong performance in sequence modeling. In this work, we propose the Mamba Policy, a lighter but stronger policy that reduces the parameter count by over 80% compared to the original policy network while achieving superior performance. Specifically, we introduce the XMamba Block, which effectively integrates input information with conditional features and leverages a combination of Mamba and Attention mechanisms for deep feature extraction. Extensive experiments demonstrate that the Mamba Policy excels on the Adroit, Dexart, and MetaWorld datasets, requiring significantly fewer computational resources. Additionally, we highlight the Mamba Policy's enhanced robustness in long-horizon scenarios compared to baseline methods and explore the performance of various Mamba variants within the Mamba Policy framework. Real-world experiments are also conducted to further validate its effectiveness. Our open-source project page can be found at https://andycao1125.github.io/mamba_policy/.
|
|
15:25-15:30, Paper WeCT6.6 | |
Learning Generalizable Language-Conditioned Cloth Manipulation from Long Demonstrations |
|
Zhao, Hanyi | Harbin Institute of Technology (Shenzhen) |
Zhu, Jinxuan | Harbin Institute of Technology, Shenzhen |
Yan, Zihao | Harbin Institute of Technology, Shenzhen |
Li, Yichen | Tsinghua Shenzhen International Graduate School |
Deng, Yuhong | National University of Singapore |
Wang, Xueqian | Center for Artificial Intelligence and Robotics, Graduate School |
Keywords: Deep Learning in Grasping and Manipulation, Learning from Demonstration, AI-Based Methods
Abstract: Multi-step cloth manipulation is a challenging problem for robots due to the high-dimensional state spaces and the dynamics of cloth. Despite recent significant advances in end-to-end imitation learning for multi-step cloth manipulation skills, these methods fail to generalize to unseen tasks. Our insight in tackling the challenge of generalizable multi-step cloth manipulation is decomposition. We propose a novel pipeline that autonomously learns basic skills from long demonstrations and composes learned basic skills to generalize to unseen tasks. Specifically, our method first discovers and learns basic skills from the existing long demonstration benchmark with the commonsense knowledge of a large language model (LLM). Then, leveraging a high-level LLM-based task planner, these basic skills can be composed to complete unseen tasks. Experimental results demonstrate that our method outperforms baseline methods in learning multi-step cloth manipulation skills for both seen and unseen tasks.
|
|
15:30-15:35, Paper WeCT6.7 | |
High-Precision Transformer-Based Visual Servoing for Humanoid Robots in Aligning Tiny Objects |
|
Xue, Jialong | University of Science and Technology of China |
Gao, Wei | University of Science and Technology of China |
Wang, Yu | University of Science and Technology of China |
Ji, Chao | University of Science and Technology of China |
Zhao, Dongdong | Lanzhou University |
Yan, Shi | Lanzhou University |
Zhang, Shiwu | University of Science and Technology of China |
Keywords: Deep Learning in Grasping and Manipulation, Visual Servoing
Abstract: High-precision tiny object alignment remains a common and critical challenge for humanoid robots in the real world. To address this problem, this paper proposes a vision-based framework for precisely estimating and controlling the relative position between a handheld tool and a target object for humanoid robots, e.g., a screwdriver tip and a screw head slot. By fusing images from the head and torso cameras on a robot with its head joint angles, the proposed Transformer-based visual servoing method can correct the handheld tool's positional errors effectively, especially at close distances. Experiments on M4-M8 screws demonstrate an average convergence error of 0.8-1.3 mm and a success rate of 93%-100%. Through comparative analysis, the results validate that this capability of high-precision tiny object alignment is enabled by the Distance Estimation Transformer architecture and the Multi-Perception-Head mechanism proposed in this paper.
|
|
15:35-15:40, Paper WeCT6.8 | |
RobotFingerPrint: Unified Gripper Coordinate Space for Multi-Gripper Grasp Synthesis and Transfer |
|
Khargonkar, Ninad | University of Texas at Dallas |
Casas, Luis Felipe | University of Texas at Dallas |
Prabhakaran, B | University of Texas at Dallas |
Xiang, Yu | University of Texas at Dallas |
Keywords: Deep Learning in Grasping and Manipulation, Grasping, Dexterous Manipulation
Abstract: We introduce a novel grasp representation named the Unified Gripper Coordinate Space (UGCS) for grasp synthesis and grasp transfer. Our representation leverages spherical coordinates to create a shared coordinate space across different robot grippers, enabling it to synthesize and transfer grasps for both novel objects and previously unseen grippers. The strength of this representation lies in its ability to map the palm and fingers of a gripper into the unified coordinate space. Grasp synthesis is formulated as predicting the unified spherical coordinates on object surface points via a conditional variational autoencoder. The predicted unified gripper coordinates establish exact correspondences between the gripper and object points, which are used to optimize grasp pose and joint values. Grasp transfer is facilitated through the point-to-point correspondence between any two (potentially unseen) grippers and solved via a similar optimization. Extensive simulation and real-world experiments showcase the efficacy of the unified grasp representation for grasp synthesis in generating stable and diverse grasps. Similarly, we showcase real-world grasp transfer from human demonstrations across different objects.
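The essence of the representation, mapping gripper surface points into gripper-agnostic spherical coordinates about the palm, can be sketched as follows (the palm-aligned canonical frame and the choice to drop the radius are illustrative assumptions):

```python
# Map gripper surface points into a unified spherical coordinate space.
import numpy as np

def to_unified_coords(points, palm_center):
    """points: [N, 3] gripper surface points -> [N, 2] (azimuth, polar).

    Dropping the radius yields a gripper-size-invariant coordinate, so the
    same (azimuth, polar) pair indexes corresponding regions on any gripper.
    """
    v = points - palm_center
    r = np.linalg.norm(v, axis=-1, keepdims=True)
    v = v / np.clip(r, 1e-9, None)
    azimuth = np.arctan2(v[:, 1], v[:, 0])            # angle in the palm plane
    polar = np.arccos(np.clip(v[:, 2], -1.0, 1.0))    # angle from palm normal
    return np.stack([azimuth, polar], axis=-1)
```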
|
|
WeCT7 |
307 |
Motion and Path Planning 7 |
Regular Session |
|
15:00-15:05, Paper WeCT7.1 | |
Multi-Sets Trees (MST*): Accelerated Asymptotically Optimal Motion Planning Optimization Informed by Multiple Domain Subsets |
|
Zhang, Liding | Technical University of Munich |
Wang, Sicheng | Technical University of Munich |
Cai, Kuanqi | Technical University of Munich |
Bing, Zhenshan | Technical University of Munich |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: Motion and Path Planning, Task and Motion Planning, Manipulation Planning
Abstract: Robotic motion planning faces formidable challenges in constrained environments, particularly in rapidly finding feasible solutions and converging toward the optimum. This study introduces Multi-Sets Tree (MST*), a sampling-based planner designed to accelerate path searching and solution optimization. MST* integrates estimated guided incremental local densification (GuILD) sets that are based on prior estimated solution costs before the initial solution is found. For path optimization, MST* integrates novel beacon selectors to define problem subsets, thereby guiding exploration and effectively exploiting high-potential areas. This multi-set strategy ensures balanced exploration and exploitation, enabling MST* to handle sparse free space. Moreover, MST* utilizes adaptive sampling techniques via the Lebesgue measure of domain subsets for rapid search. MST* improves search efficiency and path optimality, particularly in constrained high-dimensional environments. It extends the informed sampling concept by refining the search region and batch sampling. Experimental results demonstrate that MST* outperforms single-query planners across R^4 to R^16 benchmarks and in real-world robotic navigation tasks. A video showcasing our experimental results is available at: https://youtu.be/0MlIXk1-658.
|
|
15:05-15:10, Paper WeCT7.2 | |
Dynamic-Characteristics-Based Continuous Impact-Minimizing Rolling Locomotion for Variable-Topology Truss |
|
Kim, Hanbom | Hanyang University |
Seo, TaeWon | Hanyang University |
Yim, Mark | University of Pennsylvania |
Keywords: Motion and Path Planning, Dynamics, Kinematics
Abstract: This paper presents a Continuous Impact-Minimizing (CIM) rolling locomotion method for Variable-Topology Truss (VTT) robots, addressing limitations of conventional stepwise motion. Traditional VTT locomotion depends on discrete reference transitions, resulting in pauses, slow progress, and unintended ground impacts. Inertia-driven rotation at each step also generates impact forces on joints, raising durability concerns. CIM rolling continuously adjusts joint lengths by tracking the center of gravity in real time, enabling smoother motion and minimizing impacts. This approach allows VTTs to move directly to targets without unnecessary resets. Simulations validate the effectiveness of CIM rolling, demonstrating a 50% increase in speed and a 49% reduction in nodal impact force compared to conventional methods.
|
|
15:10-15:15, Paper WeCT7.3 | |
Continuous-Time Gradient-Proportional-Integral Flow for Provably Convergent Motion Planning with Obstacle Avoidance |
|
Chen, Jixiang | Beijing Institute of Technology |
Liu, Shenyu | Beijing Institute of Technology |
Wang, Junzheng | Beijing Institute of Technology |
Keywords: Motion and Path Planning, Nonholonomic Motion Planning, Underactuated Robots
Abstract: This paper presents a novel continuous-time gradient-proportional-integral flow (GPIF) for motion planning with obstacle avoidance. We first frame the motion planning task as a constrained optimization problem, which is relaxed to be an unconstrained optimization problem that can be locally solved via a gradient flow approach using functional analysis. To enforce constraints, the proposed GPIF augments the gradient flow dynamics with proportional and integral feedback terms. Under reasonable assumptions formulated as linear matrix inequalities, we prove that the GPIF can generate optimal control trajectories with guaranteed exponential convergence. Numerical simulations validate the algorithm’s efficacy, focusing on simple car navigation in cluttered environments. Simulations show that even after discretization for practical implementation, the GPIF method retains computational efficiency, enabling both offline planning and real-time online execution.
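On a toy equality-constrained problem, the flow can be sketched as plain Euler integration of the augmented dynamics: the state descends the objective gradient while proportional and integral feedback on the constraint violation enforce feasibility. Gains, step size, and the toy problem are illustrative, and the paper's LMI-based convergence conditions are not checked here.

```python
# Euler-discretized gradient-proportional-integral flow on a toy problem.
import numpy as np

def gpif(f_grad, g, g_jac, x0, kp=5.0, ki=2.0, dt=1e-3, steps=20000):
    x = x0.astype(float)
    z = np.zeros(g(x0).shape)  # integral of the constraint violation g(x)
    for _ in range(steps):
        J = g_jac(x)
        # Gradient term plus proportional and integral constraint feedback.
        dx = -f_grad(x) - J.T @ (kp * g(x) + ki * z)
        z = z + dt * g(x)
        x = x + dt * dx
    return x

# Toy usage: minimize ||x||^2 subject to x0 + x1 = 1.
sol = gpif(f_grad=lambda x: 2 * x,
           g=lambda x: np.array([x[0] + x[1] - 1.0]),
           g_jac=lambda x: np.array([[1.0, 1.0]]),
           x0=np.zeros(2))
# sol converges close to [0.5, 0.5], the constrained optimum.
```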
|
|
15:15-15:20, Paper WeCT7.4 | |
Physics-Informed Neural Motion Planning Via Domain Decomposition in Large Environments |
|
Liu, Yuchen | Purdue University |
Buynitsky, Alexiy | Purdue University |
Ni, Ruiqi | Purdue University |
Qureshi, Ahmed H. | Purdue University |
Keywords: Motion and Path Planning, Deep Learning Methods, Representation Learning
Abstract: Physics-informed Neural Motion Planners (PiNMPs) provide a data-efficient framework for solving the Eikonal Partial Differential Equation (PDE) and representing the cost-to-go function for motion planning. However, their scalability remains limited by spectral bias and the complex loss landscape of PDE-driven training. Domain decomposition mitigates these issues by dividing the environment into smaller subdomains, but existing methods enforce continuity only at individual spatial points. While effective for function approximation, these methods fail to capture the spatial connectivity required for motion planning, where the cost-to-go function depends on both the start and goal coordinates rather than a single query point. We propose Finite Basis Neural Time Fields (FB-NTFields), a novel neural field representation for scalable cost-to-go estimation. Instead of enforcing continuity in output space, FB-NTFields construct a latent space representation, computing the cost-to-go as a distance between the latent embeddings of start and goal coordinates. This enables global spatial coherence while integrating domain decomposition, ensuring efficient large-scale motion planning. We validate FB-NTFields in complex synthetic and real-world scenarios, demonstrating substantial improvements over existing PiNMPs. Finally, we deploy our method on a Unitree B1 quadruped robot, successfully navigating indoor environments.
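The latent-distance construction can be sketched as follows; the subdomain blending via softmax weights and the network sizes are assumptions chosen to illustrate the finite-basis idea, not the paper's exact architecture:

```python
# Latent-space cost-to-go with finite-basis subdomain blending (sketch).
import torch
import torch.nn as nn

class LatentTimeField(nn.Module):
    def __init__(self, n_subdomains=8, dim=3, latent=64):
        super().__init__()
        self.nets = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 128), nn.SiLU(),
                          nn.Linear(128, latent))
            for _ in range(n_subdomains)])
        self.centers = nn.Parameter(torch.randn(n_subdomains, dim))

    def embed(self, x):
        # Finite-basis blending: weight each subdomain net by proximity.
        w = torch.softmax(-torch.cdist(x, self.centers), dim=-1)   # [B, K]
        feats = torch.stack([net(x) for net in self.nets], dim=1)  # [B, K, L]
        return (w.unsqueeze(-1) * feats).sum(dim=1)                # [B, L]

    def cost_to_go(self, x_start, x_goal):
        # Distance in latent space keeps global spatial coherence even when
        # start and goal lie in different subdomains.
        return (self.embed(x_start) - self.embed(x_goal)).norm(dim=-1)
```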
|
|
15:20-15:25, Paper WeCT7.5 | |
TOPP-DWR: Time-Optimal Path Parameterization of Differential-Driven Wheeled Robots Considering Piecewise-Constant Angular Velocity Constraints |
|
Li, Yong | Guangzhou Shiyuan Electronic Technology Co., Ltd |
Huang, Yujun | Guangzhou Shiyuan Electronic Technology Co., Ltd |
Chen, Yi | Guangzhou Shiyuan Electronic Technology Co., Ltd |
Cheng, Hui | Sun Yat-Sen University |
Keywords: Motion and Path Planning, Nonholonomic Motion Planning, Autonomous Vehicle Navigation
Abstract: Differential-driven wheeled robots (DWR) represent the quintessential type of mobile robot and find extensive applications across the robotics field. Most high-performance control approaches for DWR explicitly utilize the linear and angular velocities of the trajectory as control references. However, existing research on time-optimal path parameterization (TOPP) for mobile robots usually neglects the angular velocity and joint velocity constraints, which can result in degraded control performance in practical applications. In this article, a systematic and practical TOPP algorithm named TOPP-DWR is proposed for DWR and other mobile robots. First, a non-uniform B-spline is adopted to represent the initial trajectory in the task space. Second, the piecewise-constant angular velocity constraint, as well as joint velocity, linear velocity, and linear acceleration constraints, are incorporated into the TOPP problem. During the construction of the optimization problem, the aforementioned constraints are uniformly represented as linear velocity constraints. To boost numerical efficiency, we introduce a slack variable to reformulate the problem as second-order-cone programming (SOCP). Subsequently, comparative experiments are conducted to validate the superiority of the proposed method. Quantitative performance indexes show that TOPP-DWR achieves TOPP while adhering to all constraints. Finally, field autonomous navigation experiments are carried out to validate the practicability of TOPP-DWR in real-world applications.
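A simplified version of such a convex time-optimal parameterization can be written with cvxpy; the program below parameterizes the squared path speed, treats the angular-velocity limit as piecewise-constant linear constraints, and minimizes traversal time. It is a hedged sketch of the general approach, not the paper's full SOCP with its slack-variable reformulation:

```python
# Convex TOPP sketch for a differential-drive robot.
import cvxpy as cp
import numpy as np

def topp_dwr_sketch(ds, dq, dtheta, v_max=1.0, w_max=1.5, a_max=0.8):
    """ds: arc-length step; dq: [N,2] path derivatives q'(s);
    dtheta: [N] heading derivatives theta'(s), piecewise constant."""
    N = dq.shape[0]
    b = cp.Variable(N, nonneg=True)   # b_k = sdot^2 at sample k
    a = cp.Variable(N - 1)            # path acceleration per segment
    c = cp.Variable(N, nonneg=True)   # c_k <= sqrt(b_k), tight at optimum
    qn = np.linalg.norm(dq, axis=1)
    cons = [b[1:] - b[:-1] == 2 * a * ds,               # kinematic coupling
            cp.multiply(qn ** 2, b) <= v_max ** 2,      # linear velocity
            cp.multiply(dtheta ** 2, b) <= w_max ** 2,  # angular velocity
            cp.abs(cp.multiply(qn[:-1], a)) <= a_max,   # linear acceleration
            c <= cp.sqrt(b),
            b[0] == 0, b[-1] == 0]                      # start/end at rest
    # Traversal time: sum of 2*ds / (sqrt(b_k) + sqrt(b_{k+1})), convex in c.
    obj = cp.sum(2 * ds * cp.inv_pos(c[:-1] + c[1:]))
    cp.Problem(cp.Minimize(obj), cons).solve()
    return b.value
```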
|
|
15:25-15:30, Paper WeCT7.6 | |
Vibration-Aware Trajectory Optimization for Mobile Robots in Wild Environments Via Physics-Informed Neural Network |
|
Xu, Aochun | University of Chinese Academy of Sciences |
Yang, Andong | Institute of Computing Technology, Chinese Academy of Sciences |
Li, Wei | Institute of Computing Technology, Chinese Academy of Sciences |
Hu, Yu | Institute of Computing Technology Chinese Academy of Sciences |
Keywords: Motion and Path Planning, Vision-Based Navigation
Abstract: The suspension system, through effective damping of vibrations and shocks, can enhance the stability of wheeled robots traversing challenging terrain. Because the suspension system decouples the rigid correspondence between terrain changes and robot vibrations, considering suspension modeling in trajectory planning offers the advantage of more accurate prediction of the robot's response to terrain. This improved predictive capability facilitates the planning of safer trajectories and may reduce tracking errors in the subsequent control process. In this work, inspired by the structure of Physics-Informed Neural Network (PINN), we propose a physics-informed planning method that considers the vibrational effects of complex nonlinear suspension systems. In addition, we design a two-stage process to accelerate training. By incorporating PINN, our method can better guarantee the physical feasibility of the planned trajectories. The proposed approach has been evaluated on a real robot platform. Compared to state-of-the-art baseline methods, our proposed approach achieves a 15.38% reduction in hazardous planning for mobile robots in wild environments.
|
|
15:30-15:35, Paper WeCT7.7 | |
Manip4Care: Robotic Manipulation of Human Limbs for Solving Assistive Tasks |
|
Koh, Yubin | Purdue University |
Qureshi, Ahmed H. | Purdue University |
Keywords: Motion and Path Planning, Physical Human-Robot Interaction, Performance Evaluation and Benchmarking
Abstract: Enabling robots to grasp and reposition human limbs can significantly enhance their ability to provide assistive care to individuals with severe mobility impairments, particularly in tasks such as robot-assisted bed bathing and dressing. However, existing assistive robotics solutions often assume that the human remains static or quasi-static, limiting their effectiveness. To address this issue, we present Manip4Care, a modular simulation pipeline that enables robotic manipulators to grasp and reposition human limbs effectively. Our approach features a physics simulator equipped with built-in techniques for grasping and repositioning while considering biomechanical and collision avoidance constraints. Our grasping method employs antipodal sampling with force closure to grasp limbs, and our repositioning system utilizes the Model Predictive Path Integral (MPPI) and vector-field-based control method to generate motion trajectories under collision avoidance and biomechanical constraints. We evaluate this approach across various limb manipulation tasks in both supine and sitting positions and compare outcomes for different age groups with differing shoulder joint limits. Additionally, we demonstrate our approach for limb manipulation using a real-world mannequin and further showcase its effectiveness in bed bathing tasks.
|
|
15:35-15:40, Paper WeCT7.8 | |
Inspection Planning Primitives with Implicit Models |
|
You, Jingyang | The Australian National University |
Kurniawati, Hanna | Australian National University |
Medagoda, Lashika | Abyss Solutions |
Keywords: Motion and Path Planning
Abstract: The aging and increasing complexity of infrastructure make efficient inspection planning ever more critical for ensuring safety. Thanks to sampling-based motion planning, many inspection planners are fast. However, they often require huge amounts of memory. This is particularly true when the structure under inspection is large and complex, consisting of many struts and pillars of various geometries and sizes. Such structures can be represented efficiently using implicit models, such as neural Signed Distance Functions (SDFs). However, most primitive computations used in sampling-based inspection planners have been designed to work efficiently with explicit environment models, which in turn requires the planner to use explicit environment models or perform frequent transformations between implicit and explicit environment models during planning. This paper proposes a set of primitive computations, called Inspection Planning Primitives with Implicit Models (IPIM), that enable sampling-based inspection planners to rely entirely on neural SDF representations during planning. Evaluation on three scenarios, including inspection of a complex real-world structure with over 92M triangular mesh faces, indicates that even a rudimentary sampling-based planner with IPIM can generate inspection trajectories of similar quality to those generated by the state-of-the-art planner, while using up to 70x less memory.
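One representative primitive, validating a straight-line motion segment directly against a neural SDF by sphere tracing, can be sketched as below (a simple conservative check in the spirit of IPIM; the paper's primitives are more sophisticated, and `sdf` is assumed to map [N, 3] points to signed distances):

```python
# Edge validation against a neural SDF via sphere tracing (sketch).
import torch

@torch.no_grad()
def edge_is_collision_free(sdf, p0, p1, robot_radius=0.05, max_steps=256):
    """Walk from p0 to p1, advancing by the SDF clearance at each step; the
    segment is safe if clearance always exceeds the robot radius."""
    direction = p1 - p0
    length = direction.norm()
    direction = direction / length
    t = torch.zeros(())
    for _ in range(max_steps):
        if t >= length:
            return True
        x = p0 + t * direction
        d = sdf(x.unsqueeze(0)).squeeze()  # signed distance at x
        if d <= robot_radius:
            return False                   # too close to the structure
        t = t + (d - robot_radius)         # safe to advance by the margin
    return False  # conservative: step budget exhausted before clearing edge
```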
|
|
WeCT8 |
308 |
Micro/Nano Robots 6 |
Regular Session |
|
15:00-15:05, Paper WeCT8.1 | |
Reinforcement Learning-Based Microrobotic Swarm Navigation and Obstacle Avoidance in Partially Observable Environments |
|
Luo, Shengming | Southeast University |
An, Xuanyu | Southeast University |
Yang, Qijun | Southeast University |
Zhang, Haoyu | Southeast University |
Zhang, Li | The Chinese University of Hong Kong |
Wang, Qianqian | Southeast University |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots, Motion Control
Abstract: Microrobotic swarms have shown promising features due to their collective and flexible behaviours, while achieving precise swarm control and autonomous navigation in complex environments remains a challenge. Here, we propose a Transformer-based reinforcement learning strategy that integrates Proximal Policy Optimization for autonomous swarm control in obstacle environments. Combined with domain randomization, this strategy enables direct transfer from simulation to the real world without fine-tuning. Experimental results demonstrate robust control performance in avoiding static obstacles and tracking a dynamic target, a scenario not encountered during training. The swarm autonomously navigates and adjusts its velocity and trajectory in obstacle environments while maintaining an intact swarm pattern. Our work presents a scalable strategy for the deployment of microrobotic swarms with adaptive navigation capability through complex, constrained environments.
|
|
15:05-15:10, Paper WeCT8.2 | |
Deep Reinforcement Learning-Based Levitation Control of Wireless Capsule Endoscope by Robotically Driven Permanent Magnet |
|
Huang, Ding | Southern University of Science and Technology |
Li, Zongze | Southern University of Science and Technology |
Hu, Chengzhi | Southern University of Science and Technology |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots, Medical Robots and Systems
Abstract: Magnetic levitation control provides a promising solution for wireless capsule endoscopy by minimizing tissue pressure and reducing patient discomfort and risks. Compared to electromagnetic actuation systems, using permanent magnets as the actuation source provides stronger magnetic fields at a lower cost. However, permanent magnet-based actuation systems are highly nonlinear and necessitate complex system modeling. In this study, we propose a deep reinforcement learning (DRL)-based control method for permanent magnet levitation. This approach utilizes DRL to learn optimal control strategies in complex and dynamic environments, without the need for detailed modeling of nonlinear physical phenomena such as magnetic interactions, manipulator dynamics, and capsule-environment interactions. A simulation environment was developed where a manipulator equipped with a permanent magnet actuates the internal magnet of a capsule. A multi-stage reward function and a recurrent neural network with memory capabilities were designed to improve control stability and accuracy. After Sim-to-Sim transfer, the proposed method successfully controlled five degrees of freedom, achieving navigation accuracies of 1.68 mm in the training environment and 9.18 mm in the testing environment. The system maintained stable performance and high accuracy while supporting dynamic tracking at speeds of up to 30 mm/s. Additionally, the method demonstrated significant resistance to disturbances.
|
|
15:10-15:15, Paper WeCT8.3 | |
Nonlinear Viscoelastic Model-Based Deformation Optimization for Robotic Micropuncture in Retinal Vein Cannulation |
|
Bo, Hu | Nankai University |
Li, Ruimin | Nankai University |
Xu, Shiyu | Nankai University |
Liu, Rongxin | Nankai University |
Wang, Zengshuo | Nankai University |
Sun, Mingzhu | Nankai University |
Zhao, Xin | Nankai University |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots, Contact Modeling
Abstract: Micropuncture is a critical step in drug injection during retinal vein cannulation (RVC) surgery. Minimizing deformation during the micropuncture process helps reduce mechanical damage. However, this goal is challenging due to the viscoelastic characteristics of retinal tissue. In this paper, a robotic micropuncture scheme for deformation optimization that incorporates a nonlinear force model is proposed. Before micropuncture, a preload strategy is utilized to ensure stable contact between the needle and the retinal vein. Secondly, a nonlinear viscoelastic (NV) model is developed to characterize the nonlinearity and relaxation behavior of the tissue. Finally, a speed optimization framework, based on the NV model and physical constraints, is adopted to minimize deformation. The effectiveness of the proposed scheme is validated through in vitro experiments conducted on open-sky porcine eyes. Stable contact is achieved via a proportional-integral-derivative controller with an average force error of 1.48 µN. The experimental results demonstrate that the NV model is more suitable for force modeling of retinal tissue. Furthermore, the optimized speed results in an average deformation of 0.5727 mm, a reduction of at least 21.02% compared to the linear model. Thanks to the proposed scheme, robotic micropuncture based on a varying-speed trajectory can reduce deformation and enhance the safety of RVC surgery.
|
|
15:15-15:20, Paper WeCT8.4 | |
Visual Anomaly Detection for Reliable Robotic Implantation of Flexible Microelectrode Array |
|
Chen, Yitong | Institute of Automation, Chinese Academy of Sciences |
Xu, Xinyao | National University of Defense Technology |
Zhu, Ping | Institute of Automation, Chinese Academy of Sciences |
Han, Xinyong | Institute of Automation Chinese Academy of Sciences |
Qin, Fangbo | Institute of Automation, Chinese Academy of Sciences |
Yu, Shan | Institute of Automation, Chinese Academy of Sciences |
Keywords: Automation at Micro-Nano Scales, Deep Learning for Visual Perception, Robotics and Automation in Life Sciences
Abstract: Flexible microelectrode (FME) implantation into brain cortex is challenging due to the deformable fiber-like structure of FME probe and the interaction with critical bio-tissue. To ensure the reliability and safety, the implantation process should be monitored carefully. This paper develops an image-based anomaly detection framework based on the microscopic cameras of the robotic FME implantation system. The unified framework is utilized at four checkpoints to check the micro-needle, FME probe, hooking result, and implantation point, respectively. Exploiting the existing object localization results, the aligned regions of interest (ROIs) are extracted from raw image and input to a pretrained vision transformer (ViT). Considering the task specifications, we propose a progressive granularity patch feature sampling method to address the sensitivity-tolerance trade-off issue at different locations. Moreover, we select a part of feature channels with higher signal-to-noise ratios from the raw general ViT features, to provide better descriptors for each specific scene. The effectiveness of the proposed methods is validated with the image datasets collected from our implantation system.
|
|
15:20-15:25, Paper WeCT8.5 | |
Using Virtual Input Rejection to Improve Control of a Platform for Characterizing the Mechanical Properties of Human Oocytes |
|
Abadie, Joel | FEMTO-ST CNRS Supmicrotech UMLP |
Hernandez, Sylvain | Université Marie Et Louis Pasteur, SUPMICROTECH, CNRS, Institut |
Piat, Emmanuel | Femto-St Institute CNRS UMR 6174 |
Keywords: Automation at Micro-Nano Scales, Optimization and Optimal Control, Medical Robots and Systems
Abstract: Mechanical characterization of human oocytes holds great promise for improving the chances of pregnancy in assisted reproduction programs. However, the development of a high-performance device comes up against the numerous biological and normative constraints of assisted reproductive technology (ART). The patented EggSense platform overcomes these difficulties, enabling mechanical characterization of human oocytes under clinical conditions. This article focuses on the significant improvements achieved on EggSense by deploying advanced control techniques based on Virtual Input Rejection COntrol (VIRCO). This approach is used to control the position of the force-sensing element in contact with the oocyte. Its implementation is fully described and experimentally validated. A comparison with a conventional controller is also provided to illustrate some of the benefits of VIRCO.
|
|
15:25-15:30, Paper WeCT8.6 | |
High-Precision Parallel Manipulation of Multi-Particle System Using Optoelectronic Tweezers |
|
Huang, Shunxiao | Beihang University |
Zhao, Jiawei | Beihang University, School of Mechanical Engineering and Automati |
Gan, Chunyuan | Beihang University |
Zeng, Zijin | Beihang University |
Xiong, Hongyi | Beihang University |
Ye, Jingwen | Beihang University |
Niu, Wenyan | Beihang University |
Wang, Ao | Beihang University |
Li, Chan | Beihang University |
Sun, Hongyan | Beihang University |
Chen, Zaiyang | The Chinese University of Hong Kong |
Guo, Yingjian | Beihang University |
Feng, Lin | Beihang University |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots, Biological Cell Manipulation
Abstract: This paper presents a multi-particle parallel manipulation optoelectronic tweezers system integrated with computer vision technology, enabling the parallel and precise manipulation of dozens of particles. This system significantly enhances manipulation efficiency while maintaining high precision. By real-time monitoring of particle motion and light patterns, the system can rapidly adjust and optimize its manipulation strategy, thereby improving the stability and reliability of multi-particle synchronization in complex environments. Extensive experimental results demonstrate the system's outstanding performance. For instance, it can quickly arrange complex patterns and letter sequences, facilitate the coordinated assembly of organoids from particle groups, and efficiently perform the precise separation and arrangement of mixed particles. The core advantage of this system lies in its high parallelism and flexibility, enabling it to handle large-scale synchronous manipulation tasks with exceptional operating accuracy. With continuous technological advancements and the broadening of application scenarios, this system is expected to have a profound impact in fields such as cell sorting, micro-device assembly, and organoid construction, providing robust support for research and technological development in these areas.
|
|
15:30-15:35, Paper WeCT8.7 | |
Enhanced Automated Cell Micromanipulation Via Programmable Magnetic Microgripper Design (I) |
|
Zhang, Youchao | Zhejiang University |
Wang, Fanghao | Zhejiang University |
Ye, Yuqian | Zhejiang University |
Guo, Xiangyu | Zhejiang University |
Wang, Xiao | Zhejiang University |
Knoll, Alois | Technical University of Munich |
Wang, Yixian | Zhejiang University |
Dai, Changsheng | Dalian University of Technology |
Ying, Yibin | Zhejiang University |
Zhou, Mingchuan | Zhejiang University |
Keywords: Biological Cell Manipulation, Micro/Nano Robots, Dexterous Manipulation
Abstract: Achieving efficient and robust grasping manipulation in microscopic scenarios is crucial for robotic applications. As a typical end-effector, the microgripper that can perform gripping-holding-releasing operations, plays an essential role in the area of micromanipulation. To facilitate the development of industrial microassembly and cell micromanipulation technologies, we propose a robotic microgripper (RM) with a two-finger structure based on magnetic programmable soft materials. First, we synthesize and characterize the magnetic programmable soft material, and fabricate the microfinger based on the soft material. The analytical model of the microgripper was developed using beam deformation theory and magnetic field modeling. By combining the deformation mechanism of flexible microfingers with visual feedback, we modeled the gripping force of the microgripper and proposed a feedback adaptive grasping strategy (FAGS). The cooperative control of two flexible microfingers via remote magnetic field actuation achieves the grasping of microobjects with various complex shapes and in different media. A micromanipulation robotic system was constructed by combining RM with a micromanipulation arm, enabling targeted grasping, transportation, and posture control of microobjects in complex scenarios.
|
|
15:35-15:40, Paper WeCT8.8 | |
A Compact 2-DOF Cross-Scale Piezoelectric Robotic Manipulator with Adjustable Force for Biological Delicate Puncture |
|
Gao, Xiang | Harbin Institute of Technology |
Deng, Jie | Harbin Institute of Technology |
Wang, Weiyi | Harbin Institute of Technology |
Chang, Qingbing | Harbin Institute of Technology |
Sun, Jianhua | Harbin Institute of Technology |
Liu, Junkao | Harbin Institute of Technology |
Zhang, Shijing | Harbin Institute of Technology |
Liu, Yingxiang | Harbin Institute of Technology |
Keywords: Biological Cell Manipulation, Automation at Micro-Nano Scales, Medical Robots and Systems, piezoelectric robotic manipulator
Abstract: Robotic micromanipulation technology plays a fundamental supporting role in the biomedical engineering field. However, it remains challenging for robotic manipulators to meet the puncture performance requirements of complex cells and curved capillaries of different sizes, because limitations of structure, driving elements, and actuation methods make it difficult to simultaneously achieve multi-DOF motion, long stroke, high displacement resolution, large puncture force, and high force resolution. To address this challenge, this work proposes the conceptual design of a piezoelectric robotic manipulator driven by a single piezoelectric actuator. A configuration design idea and a new actuation method are elaborated to achieve 2-DOF cross-scale manipulation and adjustable puncture force. Theoretical analyses and simulations are carried out to investigate the influence of key structural parameters on displacement response and puncture force, as well as to determine the parameters. A prototype is fabricated, a dedicated handheld controller is developed, and a robotic micropuncture system is constructed to conduct characteristic testing and application research. Experimental results reveal that the manipulator achieves linear and rotary strokes of 38.5 mm and 360°, displacement resolutions of 48 nm and 0.38 μrad, a puncture force range from 1.70 mN to 301.34 mN, and a force resolution of 0.13 mN. Additionally, the manipulator successfully performs delicate puncture of silicone capillaries of different sizes, as well as a curved silicone capillary, under collaborative manipulation with the piezoelectric platform. This work presents a high-performance piezoelectric robotic manipulator and verifies its feasibility for micropuncture of tiny, complex-shaped organisms.
|
|
WeCT9 |
309 |
Object Detection, Segmentation and Categorization 3 |
Regular Session |
|
15:00-15:05, Paper WeCT9.1 | |
AutoSelecter: Efficient Synthetic Nighttime Images Make Object Detector Stronger |
|
Meng, Chao | Shanghai Jiaotong University |
Wang, Mengjie | Z-One Technology Co., Ltd |
Shi, Wenxiu | Z-One Technology Co., Ltd |
Zhu, Huiping | Z-One Technology Co., Ltd |
Zhang, Song | Z-One Technology Co., Ltd |
Zhang, Rui | Z-One Technology Co., Ltd |
Yang, Ming | Shanghai Jiao Tong University |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception
Abstract: Object detection has achieved significant advancements despite the challenges posed by adverse conditions such as low-light nighttime environments, where annotated data is not only scarce but also challenging to label accurately. Instead of designing a specialized network, we focus on the creation and efficient utilization of synthetic data to address the problem. We generate synthetic data by employing an enhanced generative model that adeptly transforms daytime images into low-light nighttime ones. Furthermore, we introduce a data selection scheme, named AutoSelecter, which can be flexibly integrated into the training process of an object detector, ensuring the selection of the most effective synthetic data. By efficiently utilizing synthetic data, our strategy achieves an average improvement of 5.2% and 6.1% in AP50 on the nighttime datasets of BDD100k and Waymo, respectively, for the YOLOv7, YOLOv8, and RT-DETR object detectors. We also discovered numerous missing and mislabeled annotations in manually annotated low-light nighttime datasets, which can significantly interfere with the accuracy of nighttime evaluation results. Consequently, we also provide a manually annotated and more accurate dataset, BDD100kValNight+, for better evaluation. On this refined dataset, our strategy achieves an average improvement of 5.1% in AP50 on the three detectors. The labels of BDD100kValNight+ are available at https://github.com/wmjlincy/BDD100kValNight-dataset.
|
|
15:05-15:10, Paper WeCT9.2 | |
HFDNet: High-Frequency Divergence Attention Network for Underwater Segmentation |
|
Xie, Hongbo | Beihang University |
Zhao, Qi | Beihang University |
Liu, Binghao | Beihang University |
Wang, Chunlei | School of Electronic and Information Engineering, Beihang Univer |
Keywords: Object Detection, Segmentation and Categorization, Semantic Scene Understanding, Deep Learning Methods
Abstract: Most underwater operations are currently conducted in deep water, where illumination is usually insufficient. Under such conditions, the local texture features of different objects appear highly similar in images, making inter-class boundaries difficult to distinguish. This typically results in poor performance when semantic segmentation models trained on terrestrial images are applied to underwater scenes. Taking advantage of the general characteristic that high-frequency regions are more likely to correspond to semantic segmentation boundaries, we introduce the High-Frequency Divergence Attention Network (HFDNet), a transformer-based semantic segmentation model. HFDNet extracts the frequency distribution of the feature map by analyzing its frequency domain, and then calculates the relative spectral magnitude of each component by comparing its frequency amplitude against the average amplitude within its local neighborhood in the frequency domain. The local frequency map can be incorporated into the attention matrix as a weighting factor to spread attention into the surrounding areas, which increases the attention paid to high-frequency regions. This operation enhances the model's focus on object boundary regions and local neighborhood categories for each component. Therefore, our model can alleviate the difficulty of determining object boundaries caused by insufficient light in underwater image segmentation and enhance the ability to segment objects with similar local features under low-light conditions. We conduct comprehensive experiments on three underwater segmentation datasets: Caveseg, SUIM, and UWS. The results show that HFDNet achieves state-of-the-art (SOTA) performance on the testing datasets. The source code is available at https://github.com/cv516Buaa/HongboXie/tree/main/HFDNet.
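The frequency-weighting idea can be sketched as follows; the mapping from the frequency-domain map onto spatial attention weights is a deliberate simplification and an assumption, not the paper's exact mechanism:

```python
# High-frequency map and frequency-weighted attention (illustrative sketch).
import torch
import torch.nn.functional as F

def local_high_freq_map(feat, k=7):
    """feat: [B, C, H, W] -> [B, 1, H, W] relative spectral magnitude:
    each component's amplitude divided by its local neighborhood average."""
    spec = torch.fft.fft2(feat).abs().mean(dim=1, keepdim=True)
    local_avg = F.avg_pool2d(spec, k, stride=1, padding=k // 2)
    return spec / (local_avg + 1e-6)  # > 1 where locally high-frequency

def freq_weighted_attention(q, k, v, freq_map):
    """q, k, v: [B, N, D]; freq_map flattened to [B, N] biases the keys."""
    w = freq_map.flatten(1)                               # [B, N]
    logits = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5   # [B, N, N]
    logits = logits + torch.log(w + 1e-6).unsqueeze(1)    # bias attention
    return torch.softmax(logits, dim=-1) @ v              # toward HF keys
```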
|
|
15:10-15:15, Paper WeCT9.3 | |
Zero Shot Domain Adaptive Semantic Segmentation by Synthetic Data Generation and Progressive Adaptation |
|
Luo, Jun | Peking University |
Zhao, Zijing | Peking University |
Liu, Yang | Peking University |
Keywords: Transfer Learning, Object Detection, Segmentation and Categorization
Abstract: Deep learning-based semantic segmentation models achieve impressive results yet remain limited in handling distribution shifts between training and test data. In this paper, we present SDGPA (Synthetic Data Generation and Progressive Adaptation), a novel method that tackles zero-shot domain adaptive semantic segmentation, in which no target images are available and only a text description of the target domain's style is provided. To compensate for the lack of target domain training data, we utilize a pretrained off-the-shelf text-to-image diffusion model, which generates training images by transferring source domain images to the target style. Directly editing entire source domain images introduces noise that harms segmentation because their layout cannot be precisely maintained. To address inaccurate layouts in synthetic data, we propose a method that crops the source image, edits small patches individually, and then merges them back together, which helps improve spatial precision. Recognizing the large domain gap, SDGPA constructs an augmented intermediate domain, leveraging easier adaptation subtasks to enable more stable model adaptation to the target domain. Additionally, to mitigate the impact of noise in synthetic data, we design a progressive adaptation strategy, ensuring robust learning throughout the training process. Extensive experiments demonstrate that our method achieves state-of-the-art performance in zero-shot semantic segmentation. The code is available at https://github.com/ROUJINN/SDGPA
|
|
15:15-15:20, Paper WeCT9.4 | |
Splatter Joint: 3D Gaussian Splatting for Articulated Objects |
|
Li, Junyan | Institute of Automation Chinese Academy of Sciences |
Han, Yifan | Chinese Academy of Sciences |
Yi, Pengfei | Chinese Academy of Sciences |
Lian, Wenzhao | Google X |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, RGB-D Perception
Abstract: 3D reconstruction methods, such as 3D Gaussian Splatting (3DGS), have achieved significant advancements in recent years. However, the study of articulated objects remains limited due to their geometric and dynamic complexity. We propose Splatter Joint, a novel method that models articulated objects, particularly focusing on joints, to capture both the appearance and the geometric information from a few images taken at a single viewpoint. By integrating joint parameters into the 3DGS rendering process in a differentiable manner, we enable the prediction of joint movements while enhancing the accuracy of object appearance reconstruction. We evaluated Splatter Joint on existing and newly created datasets, demonstrating its effectiveness in modeling object appearance and geometry simultaneously.
|
|
15:20-15:25, Paper WeCT9.5 | |
Exponentially Weighted Instance-Aware Repeat Factor Sampling for Long-Tailed Object Detection Model Training in Unmanned Aerial Vehicles Surveillance Scenarios |
|
Ahmed, Taufiq | University of Oulu |
Kumar, Abhishek | University of Jyväskylä |
Alvarez Casado, Constantino | University of Oulu |
Zhang, Anlan | University of Southern California |
Hänninen, Tuomo | University of Oulu |
Loven, Lauri | University of Oulu |
Bordallo Lopez, Miguel | University of Oulu |
Tarkoma, Sasu | University of Helsinki |
Keywords: Object Detection, Segmentation and Categorization, Aerial Systems: Applications, AI-Based Methods
Abstract: Object detection models often struggle with class imbalance, where rare categories appear significantly less frequently than common ones. Existing sampling-based rebalancing strategies, such as Repeat Factor Sampling (RFS) and Instance-Aware Repeat Factor Sampling (IRFS), mitigate this issue by adjusting sample frequencies based on image and instance counts. However, these methods are based on linear adjustments, which limit their effectiveness in long-tailed distributions. This work introduces Exponentially Weighted Instance-Aware Repeat Factor Sampling (E-IRFS), an extension of IRFS that applies exponential scaling to better differentiate between rare and frequent classes. E-IRFS adjusts sampling probabilities using an exponential function applied to the geometric mean of image and instance frequencies, ensuring a more adaptive rebalancing strategy. We evaluate E-IRFS on a dataset derived from the Fireman-UAV-RGBT Dataset and four additional public datasets, using YOLOv11 object detection models to identify fire, smoke, people and lakes in emergency scenarios. The results show that E-IRFS improves detection performance by 22% over the baseline and outperforms RFS and IRFS, particularly for rare categories. The analysis also highlights that E-IRFS has a stronger effect on lightweight models with limited capacity, as these models rely more on data sampling strategies to address class imbalance. The findings demonstrate that E-IRFS improves rare object detection in resource-constrained environments, making it a suitable solution for real-time applications such as UAV-based emergency monitoring. The code is available at: https://github.com/futurians/E-IRFS.
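The sampling rule is compact enough to sketch. A minimal Python illustration of an E-IRFS-style repeat factor, where the exact exponential form and the `alpha` parameter are assumptions rather than the paper's published formula:

```python
import math

def repeat_factor(img_freq, inst_freq, t=0.001, alpha=1.0):
    """Sketch of an E-IRFS-style repeat factor. IRFS scores a class by
    the geometric mean of its image- and instance-level frequencies;
    E-IRFS replaces the linear adjustment with an exponential one.

    img_freq:  fraction of training images containing the class
    inst_freq: fraction of all instances belonging to the class
    t:         frequency threshold below which classes are repeated
    """
    f = math.sqrt(img_freq * inst_freq)                # IRFS geometric mean
    linear = max(1.0, math.sqrt(t / f))                # RFS/IRFS-style linear factor
    return max(1.0, math.exp(alpha * (linear - 1.0)))  # sharper exponential reweighting

# A rare class is oversampled far more aggressively than a frequent one:
print(repeat_factor(0.001, 0.0005))  # > 1: rare class repeated
print(repeat_factor(0.30, 0.40))     # 1.0: frequent class unchanged
```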
|
|
15:25-15:30, Paper WeCT9.6 | |
DPGLA: Bridging the Gap between Synthetic and Real Data for Unsupervised Domain Adaptation in 3D LiDAR Semantic Segmentation |
|
Li, Wanmeng | University of Padova |
Mosco, Simone | Università Degli Studi Di Padova |
Fusaro, Daniel | Department of Information Engineering (DEI), University of Padova |
Pretto, Alberto | University of Padova |
Keywords: Object Detection, Segmentation and Categorization, Range Sensing, Computer Vision for Transportation
Abstract: Annotating real-world LiDAR point clouds for use in intelligent autonomous systems is costly. To overcome this limitation, self-training-based Unsupervised Domain Adaptation (UDA) has been widely used to improve point cloud semantic segmentation by leveraging synthetic point cloud data. However, we argue that existing methods do not effectively utilize unlabeled data, as they either rely on predefined or fixed confidence thresholds, resulting in suboptimal performance. In this paper, we propose a Dynamic Pseudo-Label Filtering (DPLF) scheme to enhance real data utilization in point cloud UDA semantic segmentation. Additionally, we design a simple and efficient Prior-Guided Data Augmentation Pipeline (PG-DAP) to mitigate domain shift between synthetic and real-world point clouds. Finally, we utilize data mixing consistency loss to push the model to learn context-free representations. We implement and thoroughly evaluate our approach through extensive comparisons with state-of-the-art methods. Experiments on two challenging synthetic-to-real point cloud semantic segmentation tasks demonstrate that our approach achieves superior performance. Ablation studies confirm the effectiveness of the DPLF and PG-DAP modules. We release the code of our method in this paper.
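The dynamic filtering idea can be sketched briefly. A minimal Python example of a per-class adaptive confidence threshold; the percentile rule below is an illustrative assumption, not the paper's exact DPLF criterion:

```python
import numpy as np

def dynamic_pseudo_labels(probs, keep_ratio=0.8, ignore_index=-1):
    """Filter pseudo-labels with per-class thresholds derived from the
    current prediction statistics, instead of one fixed cutoff.

    probs: (N, C) softmax scores for N points over C classes.
    Returns per-point labels, with low-confidence points set to ignore_index.
    """
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    pseudo = np.full_like(labels, ignore_index)
    for c in np.unique(labels):
        mask = labels == c
        # Class-wise threshold adapts as the model improves during self-training
        thr = np.quantile(conf[mask], 1.0 - keep_ratio)
        pseudo[mask & (conf >= thr)] = c
    return pseudo
```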
|
|
15:30-15:35, Paper WeCT9.7 | |
Rt-RISeg: Real-Time Model-Free Robot Interactive Segmentation for Active Instance-Level Object Understanding |
|
Qian, Howard H. | Rice University |
Chen, Yiting | Rice University |
Wang, Gaotian | Rice University |
Chanrungmaneekul, Podshara | Rice University |
Hang, Kaiyu | Rice University |
Keywords: Object Detection, Segmentation and Categorization, Perception for Grasping and Manipulation
Abstract: Successful execution of dexterous robotic manipulation tasks in new environments, such as grasping, depends on the ability to proficiently segment unseen objects from the background and other objects. Previous works in unseen object instance segmentation (UOIS) train models on large-scale datasets, which often leads to overfitting on static visual features. This dependency results in poor generalization performance when confronted with out-of-distribution scenarios. To address this limitation, we rethink the task of UOIS based on the principle that vision is inherently interactive and occurs over time. We propose a novel real-time interactive perception framework, rt-RISeg, that continuously segments unseen objects by robot interactions and analysis of a designed body frame-invariant feature (BFIF). We demonstrate that the relative rotational and linear velocities of randomly sampled body frames, resulting from selected robot interactions, can be used to identify objects without any learned segmentation model. This fully self-contained segmentation pipeline generates and updates object segmentation masks throughout each robot interaction without the need to wait for an action to finish. We showcase the effectiveness of our proposed interactive perception method by achieving an average object segmentation accuracy rate 27.5% greater than state-of-the-art UOIS methods. Furthermore, although rt-RISeg is a standalone framework, we show that the autonomously generated segmentation masks can be used as prompts to vision foundation models for significantly improved performance.
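The body frame-invariant feature lends itself to a short sketch. A minimal Python example, assuming tracked frame poses at two instants; the finite-difference twist estimate is an illustrative assumption:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def spatial_twists(R0, t0, R1, t1, dt=1.0):
    """Estimate a spatial twist for each sampled body frame from its
    poses at two instants. All frames rigidly attached to the same
    object share one spatial twist, so matching twists groups frames
    into objects without any learned segmentation model.

    R0, R1: (N, 3, 3) rotations; t0, t1: (N, 3) frame origins.
    """
    # Angular velocity from the relative rotation, as a rotation vector / dt
    w = R.from_matrix(R1 @ np.swapaxes(R0, 1, 2)).as_rotvec() / dt
    p_dot = (t1 - t0) / dt
    # Spatial linear velocity v = p_dot - w x p is identical for every
    # frame on one rigid body, unlike the raw origin velocity p_dot.
    v = p_dot - np.cross(w, t0)
    return np.concatenate([w, v], axis=1)  # (N, 6) twist per frame
```

Frames whose 6-D twists agree up to noise can then be clustered into rigid-object hypotheses after each interaction.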
|
|
WeCT10 |
310 |
Recognition 2 |
Regular Session |
|
15:00-15:05, Paper WeCT10.1 | |
Efficient Underwater Object Detection with Enhanced Feature Extraction and Fusion (I) |
|
Li, Shaoming | The Chinese University of Hong Kong, Shenzhen |
Wang, Ziyi | The Chinese University of Hong Kong |
Dai, Rong | Motic Hong Kong |
Wang, Yaqing | The Chinese University of Hong Kong |
Liu, Yunhui | Chinese University of Hong Kong |
Zhong, Fangxun | The Chinese University of Hong Kong, Shenzhen |
Keywords: Recognition, AI-Based Methods, Deep Learning Methods
Abstract: Underwater object detection is critical for applications such as environmental monitoring, resource exploration, and the navigation of autonomous underwater vehicles. However, accurately detecting small objects in underwater environments remains challenging due to noisy imaging conditions, variable illumination, and complex backgrounds. To address these challenges, we propose the Adaptive Residual Attention Network (ARAN), an optimized deep learning framework designed to enhance the detection and precise identification of diminutive targets in complex aquatic settings. ARAN incorporates the proposed Fusion PANet, which refines spatial features by effectively distinguishing objects from their backgrounds. The framework integrates three novel modules: (1) Multi-Scale Feature Attention, which enhances low-level feature extraction; (2) High-Low Feature Residual Learning, which rearranges channel and batch dimensions to capture pixel-level relationships through cross-dimensional interactions; and (3) Multi-Level Feature Dynamic Aggregation, which dynamically adjusts fusion weights to facilitate progressive multi-level feature fusion and mitigate conflicts in multiscale integration, ensuring that small objects are not overshadowed. Extensive experiments on four benchmark datasets demonstrate that ARAN significantly outperforms mainstream models, achieving state-of-the-art performance. Notably, on the CSIRO dataset, ARAN attains a mean Average Precision at 50% (mAP50) of 98%, precision of 94.7%, F2-score of 94.6%, and recall of 94.7%. These results confirm our model’s superior accuracy, robustness, and efficiency in underwater object detection, highlighting its potential for practical deployment in challenging aquatic environments. We will release the code on GitHub upon acceptance of the paper.
|
|
15:05-15:10, Paper WeCT10.2 | |
A Multimodal Robust Recognition Method for Grasping Objects with Robot Flexible Grippers (I) |
|
Liang, Qiaokang | Hunan University; University of Ontario Institute of Technology |
Xiao, WenXing | Hunan University |
Long, Jianyong | Hunan University |
Zhang, Dan | The Hong Kong Polytechnic University |
Keywords: Recognition, Multifingered Hands, Force and Tactile Sensing
Abstract: Achieving robust object recognition from multimodal data is of critical importance in robotics. This paper proposes an accurate recognition method tailored to objects grasped by robot flexible grippers in scenarios involving multimodal data, limited samples, and complex environments. The method, termed the BOSS-MI-ELM algorithm, innovatively extracts features with the Bag of SFA Symbols (BOSS), fuses them via Association-based Fusion (AF), and classifies them with an Incremental Extreme Learning Machine (I-ELM) on the multimodal data, enabling an efficient recognition process. The study employs fiber Bragg gratings (FBG) and an inertial measurement unit (IMU) as information-acquisition components to construct a multimodal perception system and build a corresponding grasping dataset. Through training and testing on this dataset, empirical evidence shows that even when only 20% of the dataset is utilized, the BOSS-MI-ELM algorithm maintains a classification accuracy of 95.54%. In the presence of zero-mean Gaussian noise with varying standard deviations, as well as different degrees of partial data loss, the proposed method still maintains robust recognition performance. We further verify the method's effectiveness in recognizing grasped objects at different grasping speeds. In addition, comparative experiments are conducted on two publicly available multimodal tactile datasets. The results show that the BOSS-MI-ELM algorithm…
|
|
15:10-15:15, Paper WeCT10.3 | |
ATARS: An Aerial Traffic Atomic Activity Recognition and Temporal Segmentation Dataset |
|
Chen, Zihao | National Chengchi University |
Wu, Hsuan-Yu | National Chengchi University |
Kung, Chi-Hsi | National Tsing Hua University |
Chen, Yi-Ting | National Yang Ming Chiao Tung University |
Peng, Yan-Tsung | National Chengchi University |
Keywords: Recognition, Computer Vision for Transportation, Deep Learning Methods
Abstract: Traffic Atomic Activity, which describes traffic patterns for topological intersection dynamics, is a crucial topic for the advancement of intelligent driving systems. However, existing atomic activity datasets are collected from an egocentric view, which cannot support scenarios where the traffic activities of an entire intersection are required. Moreover, existing datasets only provide video-level atomic activity annotations, which require exhausting effort to manually trim the videos for recognition and limit their applications to untrimmed videos. To bridge this gap, we introduce the Aerial Traffic Atomic Activity Recognition and Segmentation (ATARS) dataset, the first aerial dataset designed for multi-label atomic activity analysis. We offer atomic activity labels for each frame, which accurately record the intervals of traffic activities. Moreover, we propose a novel task, Multi-label Temporal Atomic Activity Recognition, enabling the study of accurate temporal localization for atomic activity and easing the burden of manual video trimming for recognition. We conduct extensive experiments to evaluate existing state-of-the-art models on both atomic activity recognition and temporal atomic activity segmentation. The results highlight the unique challenges of our ATARS dataset, such as recognizing the activities of extremely small objects. We further provide a comprehensive discussion analyzing these challenges and offer valuable insights for future directions to improve atomic activity recognition in aerial views. Our source code and dataset are available at https://github.com/magecliff96/ATARS/.
|
|
15:15-15:20, Paper WeCT10.4 | |
Dynamic Action Localization and Recognition for Intelligent Perception of Surgical Robots |
|
Peng, Yaqin | Institute of Automation, Chinese Academy of Sciences |
Bian, Gui-Bin | Institute of Automation, Chinese Academy of Sciences |
Li, Zhen | Institute of Automation, Chinese Academy of Sciences |
Ma, Ruichen | Institute of Automation, Chinese Academy of Sciences |
Ye, Qiang | Institute of Automation, Chinese Academy of Sciences |
Keywords: Recognition, Computer Vision for Medical Robotics, Vision-Based Navigation
Abstract: Robot-assisted surgery has significantly advanced surgical precision, yet the development of autonomous surgical robots remains hindered by their limited understanding of complex surgical actions. Current systems lack the ability to effectively perceive and interpret intricate surgical relationships, which restricts their capability to assist surgeons in dynamic surgical environments. To overcome these challenges, a novel self-supervised learning method for surgical action recognition is proposed, aimed at enhancing the understanding of surgical actions. The method introduces a dynamic masking and attention-based action localization module to focus the model on the critical spatial regions where actions occur, enabling surgical view guidance for intelligent surgical robots while extracting key features. Moreover, a graph-enhanced adaptive feature selection module is employed to assign relevance to features and capture the temporal relationships between adjacent frames. An LSTM is utilized to model long-term dependencies across video sequences, while multi-view contrastive learning facilitates the extraction of discriminative features from both masked and unmasked sequences. Experimental results demonstrate a 3.4% improvement in AP and an ROC-AUC of 92.9% on the Neuro67 dataset for surgical action recognition. The method enables dynamic adjustment of the surgical view, achieving surgical visual navigation. These advancements contribute to the development of intelligent and autonomous surgical robots capable of assisting surgeons in complex and dynamic surgical settings.
|
|
15:20-15:25, Paper WeCT10.5 | |
3DWSNet: A Novel 3D Wavelet Spiking Neural Network for Event-Based Action Recognition |
|
Junkang, Fang | Beijing University of Posts and Telecommunications |
Dang, Yonghao | Beijing University of Posts and Telecommunications |
Wending, Zhao | Beijing University of Posts and Telecommunications |
Bo, Yu | Beijing University of Posts and Telecommunications |
Wang, Zehao | Beijing University of Posts and Telecommunications |
Jianqin, Yin | School of Artificial Intelligence, Beijing University of Posts and Telecommunications |
Keywords: Recognition, Deep Learning Methods, Deep Learning for Visual Perception
Abstract: In event-based vision, event cameras exploit their ability to respond asynchronously to brightness changes, capturing high-frequency details and low-frequency motion trends in dynamic scenes. Spiking neural networks (SNNs) can effectively model the low- and high-frequency characteristics of event data through their spike-based encoding mechanisms. However, current methods struggle to learn the low-frequency spatio-temporal components of event data. To address this problem, this paper proposes a 3D-wavelet-based spiking neural network (3DWSNet) built on the 3D wavelet transform, with a cascaded Wavelet Spiking Convolution (WSC) module as its core component. Specifically, the 3D wavelet transform decomposes the temporal and spatial information of a video into eight frequency-domain components, which helps the model retain high-frequency information while enriching low-frequency representations. The cascaded architecture of the WSC enhances the model's ability to capture multi-scale spatio-temporal features of actions. Extensive experiments show that our 3DWSNet clearly outperforms state-of-the-art SNNs on the CIFAR-10, CIFAR-100, DVS128 Gesture, and CIFAR10-DVS datasets. The source code will be released publicly soon.
|
|
15:25-15:30, Paper WeCT10.6 | |
Body-Hand Modality Expertized Networks with Cross-Attention for Fine-Grained Skeleton Action Recognition |
|
Cho, Seungyeon | KAIST |
Kim, Tae-Kyun | KAIST |
Keywords: Recognition, Multi-Modal Perception for HRI, Human-Robot Collaboration
Abstract: Skeleton-based Human Action Recognition (HAR) is a vital technology in robotics and human–robot interaction. However, most existing methods concentrate primarily on full-body movements and often overlook subtle hand motions that are critical for distinguishing fine-grained actions. Recent work leverages a unified graph representation that combines body, hand, and foot keypoints to capture detailed body dynamics. Yet, these models often blur fine hand details due to the disparity between body and hand action characteristics and the loss of subtle features during spatial pooling. In this paper, we propose BHaRNet (Body–Hand action Recognition Network), a novel framework that augments a typical body-expert model with a hand-expert model. Our model jointly trains both streams with an ensemble loss that fosters cooperative specialization, functioning in a manner reminiscent of a Mixture-of-Experts (MoE). Moreover, cross-attention is employed via an expertized branch method and a pooling-attention module to enable feature-level interactions and selectively fuse complementary information. Inspired by MMNet, we also demonstrate the applicability of our approach to multi-modal tasks by leveraging RGB information, where body features guide RGB learning to capture richer contextual cues. Experiments on large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Northwestern-UCLA) demonstrate that BHaRNet achieves SOTA accuracies—improving from 86.4% to 93.0% in hand-intensive actions—while maintaining fewer GFLOPs and parameters than the relevant unified methods.
|
|
15:30-15:35, Paper WeCT10.7 | |
Cross-Source-Context Indoor RGB-D Place Recognition |
|
Liang, Jing | University of Maryland |
Deng, Zhuo | Amazon |
Zhou, Zheming | Amazon.com LLC |
Sun, Min | National Tsing Hua University |
Ghasemalizadeh, Omid | Amazon Lab126 |
Kuo, Cheng-Hao | Amazon |
Sen, Arnab | Amazon |
Manocha, Dinesh | University of Maryland |
Keywords: Recognition, RGB-D Perception, Deep Learning for Visual Perception
Abstract: We extend our previous work, PoCo, and present a new algorithm, Cross-Source-Context Place Recognition (CSCPR), for RGB-D indoor place recognition that integrates global retrieval and reranking into an end-to-end model and keeps the consistency of using Context-of-Clusters (CoCs) for feature processing. Unlike prior approaches that primarily focus on the RGB domain for place recognition reranking, CSCPR is designed to handle the RGB-D data. We apply the CoCs to handle cross-sourced and cross-scaled RGB-D point clouds and introduce two novel modules for reranking: the Self-Context Cluster (SCC) and the Cross Source Context Cluster (CSCC), which enhance feature representation and match query-database pairs based on local features, respectively. We also release two new datasets, ScanNetIPR and ARKitIPR. Our experiments demonstrate that CSCPR significantly outperforms state-of-the-art models on these datasets by at least 29.27% in Recall@1 on the ScanNet-PR dataset and 43.24% in the new datasets. Code and datasets will be released.
|
|
WeCT11 |
311A |
Reinforcement Learning 7 |
Regular Session |
|
15:00-15:05, Paper WeCT11.1 | |
Using Goal-Conditioned Reinforcement Learning with Deep Imitation to Control Robot Arm in Flexible Flat Cable Assembly Task (I) |
|
Li, Jingchen | Beijing Academy of Agriculture and Forestry Sciences |
Shi, Hao-Bin | Northwestern Polytechnical University, School of Computer Science |
Hwang, Kao-Shing | Dept. Electrical Engineering, National Sun Yat-Sen University |
Wu, Huarui | Beijing Academy of Agriculture and Forestry Sciences |
Keywords: Reinforcement Learning, Imitation Learning, AI-Based Methods
Abstract: Leveraging reinforcement learning for high-precision decision-making in robot arm assembly scenes is a desired goal in the industrial community. However, tasks like Flexible Flat Cable (FFC) assembly, which require highly trained workers, pose significant challenges due to sparse rewards and limited learning conditions. In this work, we propose a goal-conditioned self-imitation reinforcement learning method for FFC assembly without relying on a specific end-effector, where both perception and behavior planning are learned through reinforcement learning. We analyze the challenges faced by robot arms in high-precision assembly scenarios and balance the breadth and depth of exploration during training. Our end-to-end model consists of hindsight and self-imitation modules, allowing the robot arm to leverage futile exploration and optimize successful trajectories. Our method does not require rule-based or manual rewards, and it enables the robot arm to quickly find feasible solutions through experience relabeling, while unnecessary explorations are avoided. We train the FFC assembly policy in a simulation environment and transfer it to the real scenario by using domain adaptation. We explore various combinations of hindsight and self-imitation learning, and discuss the results comprehensively. Experimental findings demonstrate that our model achieves fast and advanced flexible flat cable assembly, surpassing other reinforcement learning-based methods.
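The experience-relabeling step is easy to sketch. A minimal Python example of hindsight relabeling, where `reward_fn` is an assumed sparse goal-reaching reward rather than the paper's exact formulation:

```python
def hindsight_relabel(trajectory, reward_fn):
    """Relabel a failed rollout as if its final achieved state had been
    the goal, turning futile exploration into useful supervision.

    trajectory: list of (state, action, achieved_state) tuples.
    reward_fn(achieved, goal): assumed sparse reward, e.g. 1.0 on success.
    Returns a list of (state, goal, action, reward) training tuples.
    """
    new_goal = trajectory[-1][2]  # final achieved state becomes the goal
    return [
        (state, new_goal, action, reward_fn(achieved, new_goal))
        for state, action, achieved in trajectory
    ]
```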
|
|
15:05-15:10, Paper WeCT11.2 | |
Offline Imitation Learning Upon Arbitrary Demonstrations by Pre-Training Dynamics Representations |
|
Ma, Haitong | Harvard University |
Dai, Bo | Google Brain |
Ren, Zhaolin | Harvard University |
Wang, Yebin | Mitsubishi Electric Research Laboratories |
Li, Na | Harvard University |
Keywords: Representation Learning, Imitation Learning, Machine Learning for Robot Control
Abstract: Limited data has become a major bottleneck in scaling up offline imitation learning (IL). In this paper, we propose enhancing IL performance under limited expert data by introducing a pre-training stage that learns dynamics representations, derived from factorizations of the transition dynamics. We first theoretically justify that the optimal decision variable of offline IL lies in the representation space, significantly reducing the parameters to learn in the downstream IL. Moreover, the dynamics representations depend only on transition dynamics and thus can be learned from arbitrary data collected with the same dynamics, allowing the reuse of massive non-expert data and mitigating the limited data issues. We present a tractable loss function inspired by noise contrastive estimation to learn the dynamics representations at the pre-training stage. Experiments on MuJoCo demonstrate that our proposed algorithm can mimic expert policies with as few as a single trajectory. Experiments on real quadrupeds show that we can leverage pre-trained dynamics representations from simulator data to learn to walk from a few real-world demonstrations.
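The pre-training objective can be sketched compactly. A minimal PyTorch example of a noise-contrastive loss over transition embeddings, using in-batch negatives; this is an assumption for illustration, and the paper's exact estimator may differ:

```python
import torch
import torch.nn.functional as F

def dynamics_nce_loss(phi_sa, psi_next):
    """Contrastive pre-training of dynamics representations: each (s, a)
    embedding should score high against its true next-state embedding
    and low against the other next states in the batch.

    phi_sa:   (B, D) embeddings of state-action pairs
    psi_next: (B, D) embeddings of the corresponding next states
    """
    logits = phi_sa @ psi_next.t()        # (B, B) similarity matrix
    targets = torch.arange(len(logits))   # true transitions lie on the diagonal
    return F.cross_entropy(logits, targets)
```

Because the loss depends only on transition dynamics, any data collected under the same dynamics, expert or not, can be used at this stage.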
|
|
15:10-15:15, Paper WeCT11.3 | |
Context-Based Meta Reinforcement Learning for Robust and Adaptable Peg-In-Hole Assembly Tasks |
|
Shokry, Ahmed | University of Bonn |
Gomaa, Walid | Egypt Japan University of Science and Technology |
Zaenker, Tobias | Neura Robotics |
Dawood, Murad | University of Bonn |
Menon, Rohit | University of Bonn |
A.Maged, Shady | Ain Shams University |
Awad, Mohammed | Ain Shams University |
Bennewitz, Maren | University of Bonn |
Keywords: Representation Learning, Learning from Experience
Abstract: Autonomous assembly is an essential capability for industrial and service robots, with Peg-in-Hole (PiH) insertion being one of the core tasks. However, PiH assembly in unknown environments is still challenging due to uncertainty in task parameters, such as the hole position and orientation, resulting from sensor noise. Although context-based meta reinforcement learning (RL) methods have been previously presented to adapt to unknown task parameters in PiH assembly tasks, the performance depends on a sample-inefficient procedure or human demonstrations. Thus, to enhance the applicability of meta RL in real-world PiH assembly tasks, we propose to train the agent to use information from the robot’s forward kinematics and an uncalibrated camera. Furthermore, we improve the applicability by efficiently adapting the meta-trained agent to use data from a force/torque sensor. Finally, we propose an adaptation procedure for out-of-distribution tasks whose parameters are different from the training tasks. Experiments on simulated and real robots prove that our modifications enhance the sample efficiency during meta training, real-world adaptation performance, and generalization of the context-based meta RL agent in PiH assembly tasks compared to previous approaches.
|
|
15:15-15:20, Paper WeCT11.4 | |
Unsupervised Anomaly Detection Improves Imitation Learning for Autonomous Racing |
|
Geng, Yuang | University of Florida |
Zhou, Yang | University of Florida |
Zhang, Yuyang | University of Florida |
Zhang, Zhongzheng Ren | University of Florida |
Yang, Kang | University of Florida |
Ruble, Tyler | University of Florida |
Vidal, Giancarlo | University of Florida |
Ruchkin, Ivan | University of Florida |
Keywords: Representation Learning, Imitation Learning, Computer Vision for Transportation
Abstract: Imitation Learning (IL) has shown significant promise in autonomous driving, but its performance heavily depends on the quality of training data. Noisy or corrupted sensor inputs can degrade learned policies, leading to unsafe behavior. This paper presents an unsupervised anomaly detection approach to automatically filter out abnormal images from driving datasets, thereby enhancing IL performance. Our method leverages a Convolutional Autoencoder with a novel latent reference loss, which forces abnormal images to reconstruct with higher errors than normal images. This enables effective anomaly detection without requiring manually labeled data. We validate our approach on the realistic DonkeyCar autonomous racing platform, demonstrating that filtering videos significantly improves IL policies, as measured by a 25-40% reduction in cross-track error. Compared to baseline and ablation models, our method achieves superior anomaly detection across three real-world video corruptions: collision-based occlusions, transparent obstructions, and raindrop interference. The results highlight the effectiveness of unsupervised video anomaly detection in improving the robustness and performance of IL-based autonomous control.
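The filtering mechanism can be illustrated briefly. A minimal PyTorch sketch of a convolutional autoencoder used as a reconstruction-error gate; the paper's latent reference loss is not reproduced here, and the fixed threshold rule is an assumption:

```python
import torch
import torch.nn as nn

class TinyConvAE(nn.Module):
    """Minimal convolutional autoencoder for frame filtering. Trained on
    normal driving frames, it reconstructs anomalous frames poorly, so
    reconstruction error serves as an anomaly score."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 16, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, 2, 1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, 2, 1), nn.Sigmoid())

    def forward(self, x):
        return self.dec(self.enc(x))

@torch.no_grad()
def keep_normal_frames(model, frames, threshold):
    """frames: (N, 3, H, W) in [0, 1], H and W divisible by 4.
    Returns only the frames whose per-frame MSE stays below the gate."""
    err = ((model(frames) - frames) ** 2).mean(dim=(1, 2, 3))
    return frames[err < threshold]
```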
|
|
15:20-15:25, Paper WeCT11.5 | |
EASEIR: Efficient and Adaptive Safe-Set Estimation Via Implicit Representation for High-Dimensional Motion Planning |
|
Lee, Hojun | Purdue University |
Sim, Yuseop | Purdue University |
Han, Changheon | Purdue University |
Lee, Jiho | Purdue University |
Bera, Aniket | Purdue University |
Kim, Chang-Ju | Korea Institute of Machinery & Materials |
Jun, Martin | Purdue University |
Keywords: Representation Learning, Task and Motion Planning, Robot Safety
Abstract: Collision-free robotic manipulation is extremely important for all safety-critical applications of robots. Especially for large-scale automation in modern manufacturing facilities where numerous hardware and software systems collaborate in relatively structured environments, accomplishing effectiveness, efficiency, and safety in not only repetitive tasks but also their sporadic reconfigurations is ideal. Yet, existing online and offline Motion Planning (MP) algorithms do not meet such a unique combination of harsh demands, since most of the advances in MP aim for a subset of the requirements. To bridge the gap, we introduce a novel implicit neural function (EASEIR) designed for efficient offline safe set composition for robotic manipulators operating in structured environments. Addressing the challenges of managing high-dimensional configuration spaces (C-space), EASEIR leverages Implicit Neural Representations (INR) to relate coordinates of a discretized robot operation space with collision sets in C-space. EASEIR then utilizes the mapping to actively compose a collision-free set in response to arbitrary occupancy of the operation space by obstacles. The proposed method comprises three core modules: (a) Latent Key Generator (LKG) that maps the coordinates of the space to intermediate latent keys, (b) Latent Key Decoder (LKD) that reconstructs collision sets from the keys, and (c) Full Set Compositor (FSC) that generates a full collision-free set using set operations. On a 6 Degrees of Freedom (DoF) arm, EASEIR generates safe configuration sets nearly 43 times faster than the state-of-the-art analytical method while maintaining comparable accuracy (∼0.2% collision) during evaluations in a simulation environment.
|
|
15:25-15:30, Paper WeCT11.6 | |
TAR: Teacher-Aligned Representations Via Contrastive Learning for Quadrupedal Locomotion |
|
Mousa, Amr | University of Manchester |
Karavis, Neil | BAE Systems |
Caprio, Michele | The University of Manchester |
Pan, Wei | The University of Manchester |
Allmendinger, Richard | The University of Manchester |
Keywords: Representation Learning, Reinforcement Learning, Legged Robots
Abstract: Quadrupedal locomotion via Reinforcement Learning (RL) is commonly addressed using the teacher-student paradigm, where a privileged teacher guides a proprioceptive student policy. However, key challenges such as representation misalignment between the privileged teacher and the proprioceptive-only student, covariate shift due to behavioral cloning, and the lack of deployable adaptation lead to poor generalization in real-world scenarios. We propose Teacher-Aligned Representations via Contrastive Learning (TAR), a framework that leverages privileged information with self-supervised contrastive learning to bridge this gap. By aligning representations to a privileged teacher in simulation via contrastive objectives, our student policy learns structured latent spaces and exhibits robust generalization to Out-of-Distribution (OOD) scenarios, surpassing the fully privileged “Teacher”. Training is accelerated by 2× compared to state-of-the-art baselines in reaching peak performance, and generalization in OOD scenarios improves by 40% on average over existing methods. Moreover, TAR transitions seamlessly into learning during deployment without requiring privileged states, setting a new benchmark in sample-efficient, adaptive locomotion and enabling continual fine-tuning in real-world scenarios. Open-source code and videos are available at https://amrmousa.com/TARLoco/
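The alignment objective can be sketched in a few lines. A PyTorch example of contrastive teacher-student alignment, where the cosine similarity, temperature, and symmetric form are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def alignment_loss(student_z, teacher_z, tau=0.1):
    """Pull each proprioceptive student embedding toward the privileged
    teacher embedding of the same timestep (diagonal positives) and away
    from the other timesteps in the batch (in-batch negatives).

    student_z, teacher_z: (B, D) latent codes for matched timesteps.
    """
    s = F.normalize(student_z, dim=1)
    t = F.normalize(teacher_z, dim=1)
    logits = s @ t.t() / tau
    labels = torch.arange(logits.shape[0])
    # Symmetric cross-entropy: student-to-teacher and teacher-to-student
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```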
|
|
15:30-15:35, Paper WeCT11.7 | |
Refined Policy Distillation: From VLA Generalists to RL Experts |
|
Jülg, Tobias Thomas | University of Technology Nuremberg |
Burgard, Wolfram | University of Technology Nuremberg |
Walter, Florian | Technical University Munich |
Keywords: Reinforcement Learning, Imitation Learning, Transfer Learning
Abstract: Vision-Language-Action Models (VLAs) have demonstrated remarkable generalization capabilities in real-world experiments. However, their success rates are often not on par with expert policies, and they require fine-tuning when the setup changes. In this work, we introduce Refined Policy Distillation (RPD), a novel Reinforcement Learning (RL)-based policy refinement method that bridges this performance gap through a combination of on-policy RL with behavioral cloning. The core idea of RPD is to distill and refine VLAs into compact, high-performing expert policies by guiding the student policy during RL exploration using the actions of a teacher VLA, resulting in increased sample efficiency and faster convergence. We complement our method with fine-tuned versions of Octo and OpenVLA for ManiSkill3 to evaluate RPD in simulation. While this is a key requirement for applying RL, it also yields new insights beyond existing studies on VLA performance in real-world settings. Our experimental results across various manipulation tasks show that RPD enables the RL student to learn expert policies that outperform the VLA teacher in both dense and sparse reward settings, while also achieving faster convergence than the RL baseline. Our approach is even robust to changes in camera perspective and can generalize to task variations that the underlying VLA cannot solve. Our code, dataset, VLA checkpoints, and videos are available at https://refined-policy-distillation.github.io
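The core update combines the two signals named above. A minimal PyTorch sketch of an actor loss mixing critic feedback with behavioral cloning toward the VLA teacher; the weighting and squared-error form are illustrative assumptions:

```python
import torch

def rpd_actor_loss(student_actions, teacher_actions, q_values, bc_weight=0.5):
    """Guide on-policy RL exploration with a teacher VLA: maximize the
    critic's value estimate while penalizing deviation from the
    teacher's actions on the same observations.

    student_actions, teacher_actions: (B, A) action tensors
    q_values: (B,) critic values of the student's actions
    """
    rl_term = -q_values.mean()                                   # maximize Q
    bc_term = ((student_actions - teacher_actions) ** 2).mean()  # stay near teacher
    return rl_term + bc_weight * bc_term
```

Annealing `bc_weight` toward zero as training progresses would let the student eventually surpass the teacher, which matches the refinement behavior described above; the schedule itself is an assumption.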
|
|
15:35-15:40, Paper WeCT11.8 | |
Policy Learning from Large Vision-Language Model Feedback without Reward Modeling |
|
Luu, Tung | Korea Advanced Institute of Science and Technology |
Lee, Donghoon | Korea Advanced Institute of Science & Technology |
Lee, Younghwan | Korea Advanced Institute of Science and Technology |
Yoo, Chang D. | KAIST |
Keywords: Reinforcement Learning, Learning from Experience, AI-Enabled Robotics
Abstract: Offline reinforcement learning (RL) provides a powerful framework for training robotic agents using pre-collected, suboptimal datasets, eliminating the need for costly, time-consuming, and potentially hazardous online interactions. This is particularly useful in safety-critical real-world applications, where online data collection is expensive and impractical. However, existing offline RL algorithms typically require reward labeled data, which introduces an additional bottleneck: reward function design is itself costly, labor-intensive, and requires significant domain expertise. In this paper, we introduce PLARE, a novel approach that leverages large vision-language models (VLMs) to provide guidance signals for agent training. Instead of relying on manually designed reward functions, PLARE queries a VLM for preference labels on pairs of visual trajectory segments based on a language task description. The policy is then trained directly from these preference labels using a supervised contrastive preference learning objective, bypassing the need to learn explicit reward models. Through extensive experiments on robotic manipulation tasks from the MetaWorld, PLARE achieves performance on par with or surpassing existing state-of-the-art VLM-based reward generation methods. Furthermore, we demonstrate the effectiveness of PLARE in real-world manipulation tasks with a physical robot, further validating its practical applicability.
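Training directly from preference labels can be sketched as follows. A minimal PyTorch example in the spirit of contrastive preference learning, scoring each segment by the policy's summed action log-likelihood; the scoring choice and the Bradley-Terry form are assumptions:

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_preferred, logp_rejected, alpha=1.0):
    """Train the policy so that the segment the VLM preferred scores
    higher than the rejected one, with no explicit reward model.

    logp_preferred, logp_rejected: (B,) summed log pi(a|s) over each
    trajectory segment in a labeled pair.
    """
    margin = alpha * (logp_preferred - logp_rejected)
    return -F.logsigmoid(margin).mean()  # Bradley-Terry style objective
```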
|
|
WeCT12 |
311B |
Robotic Imitation Learning 3 |
Regular Session |
Co-Chair: Ma, Xiaoguang | Northeastern University |
|
15:00-15:05, Paper WeCT12.1 | |
Towards Safe Imitation Learning Via Potential Field-Guided Flow Matching |
|
Ding, Haoran | MBZUAI |
Duan, Anqing | Mohamed Bin Zayed University of Artificial Intelligence |
Sun, Zezhou | Mohamed Bin Zayed University of Artificial Intelligence |
Rozo, Leonel | Bosch Center for Artificial Intelligence |
Jaquier, Noémie | KTH Royal Institute of Technology |
Song, Dezhen | Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) |
Nakamura, Yoshihiko | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Imitation Learning, Learning from Demonstration, Deep Learning in Grasping and Manipulation
Abstract: Deep generative models, particularly diffusion and flow matching models, have recently shown remarkable potential in learning complex policies through imitation learning. However, the safety of generated motions remains overlooked, particularly in complex environments with inherent obstacles. In this work, we address this critical gap by proposing Potential Field-Guided Flow Matching Policy (PF2MP), a novel approach that simultaneously learns task policies and extracts obstacle-related information, represented as a potential field, from the same set of successful demonstrations. During inference, PF2MP modulates the flow matching vector field via the learned potential field, enabling safe motion generation. By leveraging these complementary fields, our approach achieves improved safety without compromising task success across diverse environments, such as navigation tasks and robotic manipulation scenarios. We evaluate PF2MP in both simulation and real-world settings, demonstrating its effectiveness in task space and joint space control. Experimental results demonstrate that PF2MP enhances safety, achieving a significant reduction of collisions compared to baseline policies. This work paves the way for safer motion generation in unstructured and obstacle-rich environments.
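The inference-time modulation can be sketched in a few lines. A minimal PyTorch example where the learned flow field is corrected by the negative gradient of the learned potential; the additive combination, the `scale` factor, and the two network interfaces are assumptions:

```python
import torch

def guided_velocity(x, t, flow_model, potential, scale=1.0):
    """Correct the flow matching velocity with the obstacle potential's
    gradient so integrated trajectories are pushed away from obstacles.

    x: (B, D) current sample; t: scalar time in [0, 1].
    flow_model(x, t) and potential(x) are stand-ins for the learned nets.
    """
    x = x.detach().requires_grad_(True)
    phi = potential(x).sum()                   # scalar potential over the batch
    grad_phi = torch.autograd.grad(phi, x)[0]  # repulsion direction
    return flow_model(x, t) - scale * grad_phi

# Euler integration of the guided ODE from noise to a motion, K steps:
#   for k in range(K):
#       x = x + (1.0 / K) * guided_velocity(x, k / K, flow, potential)
```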
|
|
15:05-15:10, Paper WeCT12.2 | |
Constrained Behavior Cloning for Robotic Learning |
|
Jun, Xie | Northeastern University |
Tan, Jianwei | Northeastern University |
Liang, Wensheng | Northeastern University |
Zhicheng, Wang | Northeastern University |
Ma, Xiaoguang | Northeastern University |
Keywords: Imitation Learning, Bioinspired Robot Learning, Deep Learning Methods
Abstract: Behavior cloning (BC) is a widely used method for learning from expert demonstrations due to its simplicity and efficiency. However, the reliability and stability of BC decline when facing data distribution shifts, especially in single-arm robots with limited fields of view. This study introduces a Geometrically and Historically Constrained Behavior Cloning (GHCBC) method, where an HCBC module utilizes visual and action histories to capture temporal dependencies, maximizing the use of available information, and a GCBC module incorporates high-level perceptual data, such as the relative poses of joints and end-effectors, to enhance BC performance. Experiments demonstrate that GHCBC outperforms current SOTA BC methods, achieving a 31.5% improvement in success rate in simulation and a 48.4% improvement in real-robot scenarios. To the best of our knowledge, this is the first time such geometric and historical constraints have been introduced into robotic BC, demonstrating great potential for long-term tasks in real-world environments.
|
|
15:10-15:15, Paper WeCT12.3 | |
SSFold: Learning to Fold Arbitrary Crumpled Cloth Using Graph Dynamics from Human Demonstration (I) |
|
Zhou, Changshi | Tongji University |
Xu, Haichuan | Tongji University |
Hu, Jiarui | Tongji University |
Luan, Feng | Tongji University |
Wang, Zhipeng | Tongji University |
Dong, Yanchao | Tongji University |
Zhou, Yanmin | Tongji University |
He, Bin | Tongji University |
Keywords: Learning from Demonstration, Imitation Learning, Perception for Grasping and Manipulation
Abstract: Robotic cloth manipulation poses significant challenges due to the fabric's complex dynamics and the high dimensionality of configuration spaces. Previous approaches have focused on isolated smoothing or folding tasks and relied heavily on simulations, often struggling to bridge the sim-to-real gap. This gap arises as simulated cloth dynamics fail to capture real-world properties such as elasticity, friction, and occlusions, causing accuracy loss and limited generalization. To tackle these challenges, we propose a two-stream architecture with sequential and spatial pathways, unifying smoothing and folding tasks into a single adaptable policy model. The sequential stream determines pick-and-place positions, while the spatial stream, using a connectivity dynamics model, constructs a visibility graph from partial point cloud data, enabling the model to infer the cloth's full configuration despite occlusions. To address the sim-to-real gap, we integrate real-world human demonstration data via a hand-tracking detection algorithm, enhancing real-world performance across diverse cloth configurations. Our method, validated on a UR5 robot across six distinct cloth folding tasks, consistently achieves desired folded states from arbitrary crumpled initial configurations, with success rates of 100.0%, 100.0%, 83.3%, 66.7%, 83.3%, and 66.7%. It outperforms state-of-the-art cloth manipulation techniques and generalizes to unseen fabrics with diverse colors, shapes, and stiffness.
|
|
15:15-15:20, Paper WeCT12.4 | |
LEGATO: Cross-Embodiment Imitation Using a Grasping Tool |
|
Seo, Mingyo | The University of Texas at Austin |
Park, Hyungju Andy | RAI Institute |
Yuan, Shenli | The Boston Dynamics AI Institute |
Zhu, Yuke | The University of Texas at Austin |
Sentis, Luis | The University of Texas at Austin |
Keywords: Imitation Learning, Transfer Learning, Whole-Body Motion Planning and Control
Abstract: Cross-embodiment imitation learning enables policies trained on specific embodiments to transfer across different robots, unlocking the potential for large-scale imitation learning that is both cost-effective and highly reusable. This paper presents LEGATO, a cross-embodiment imitation learning framework for visuomotor skill transfer across varied kinematic morphologies. We introduce a handheld gripper that unifies action and observation spaces, allowing tasks to be defined consistently across robots. We train visuomotor policies on task demonstrations using this gripper through imitation learning, applying transformation to a motion-invariant space for computing the training loss. Gripper motions generated by the policies are retargeted into high-degree-of-freedom whole-body motions using inverse kinematics for deployment across diverse embodiments. Our evaluations in simulation and real-robot experiments highlight the framework's effectiveness in learning and transferring visuomotor skills across various robots. More information can be found at the project page: https://ut-hcrl.github.io/LEGATO.
|
|
15:20-15:25, Paper WeCT12.5 | |
Multi-Agent Generative Adversarial Interactive Self-Imitation Learning for AUV Formation Control and Obstacle Avoidance |
|
Fang, Zheng | Ocean University of China |
Chen, Tianhao | Ocean University of China |
Shen, Tian | Ocean University of China |
Jiang, Dong | Ocean University of China |
Zhang, Zheng | Ocean University of China |
Li, Guangliang | Ocean University of China |
Keywords: Imitation Learning, Marine Robotics, Learning from Demonstration
Abstract: Multiple autonomous underwater vehicles (multi-AUVs) can cooperatively accomplish tasks that a single AUV cannot complete. Recently, multi-agent reinforcement learning has been introduced for multi-AUV control. However, designing efficient reward functions for the various tasks of multi-AUV control is difficult or even impractical. Multi-agent generative adversarial imitation learning (MAGAIL) allows multi-AUVs to learn from expert demonstrations instead of pre-defined reward functions, but suffers from the deficiency of requiring optimal demonstrations and not surpassing the provided expert demonstrations. This paper builds upon the MAGAIL algorithm by proposing multi-agent generative adversarial interactive self-imitation learning (MAGAISIL), which can facilitate AUVs to learn policies by gradually replacing the provided sub-optimal demonstrations with self-generated good trajectories selected by a human trainer. Our experimental results in three multi-AUV formation control and obstacle avoidance tasks on the Gazebo platform with our lab's AUV simulator show that AUVs trained via MAGAISIL can surpass the provided sub-optimal expert demonstrations and reach a performance close to or even better than MAGAIL with optimal demonstrations. Further results indicate that AUVs' policies trained via MAGAISIL can adapt to complex and different tasks as well as MAGAIL learning from optimal demonstrations.
|
|
15:25-15:30, Paper WeCT12.6 | |
Image-Driven Imitation Learning: Acquiring Expert Scanning Skills in Robotics Ultrasound (I) |
|
Li, Jiaming | Guangdong University of Technology |
Huang, Haohui | Guangdong University of Technology |
Guo, Cong | Guangdong University of Technology |
Lin, Qingguang | Sun Yat-Sen University Cancer Center |
Guo, Jing | Guangdong University of Technology |
Yang, Chenguang | University of Liverpool |
Keywords: Imitation Learning, Surgical Robotics: Planning, Learning from Demonstration
Abstract: A promising ultrasound (US) image acquisition requires experienced sonographers holding the probe with proper force and pose to ensure excellent acoustic coupling. To enable a robotic ultrasound system (RUSS) to acquire sonographers' skills from ultrasound image demonstrations, this paper proposes a cutting-edge framework that integrates an expert technique discrimination network and a robotic strategy generation network to learn expert scanning skills. In this framework, the expert technique discrimination network focuses on learning expert scanning techniques from the pre- and post-frame ultrasound images. Furthermore, to acquire expert scanning skills and obtain a standard image view, we design a knowledge-based algorithm grounded on inverse reinforcement learning (IRL) to generate a series of scanning policies with respect to the expert technique discrimination network. Both simulations and experiments are conducted to validate the effectiveness of the proposed framework by comparing it with MI-GPSR and PTR. The scanning success rate and trajectory tracking error of the algorithm are 68% and 12.0782 in the simulation environment, respectively, and 94% and 11.8367 in the phantom environment. The results demonstrate good performance in the task of imitating expert techniques for autonomous scanning.
|
|
15:30-15:35, Paper WeCT12.7 | |
Neuromorphic Attitude Estimation and Control |
|
Stroobants, Stein | University of Technology Delft |
De Wagter, Christophe | Delft University of Technology |
de Croon, Guido | TU Delft |
Keywords: Imitation Learning, Neurorobotics, Machine Learning for Robot Control
Abstract: The real-world application of small drones is mostly hampered by energy limitations. Neuromorphic computing promises extremely energy-efficient AI for autonomous flight but is still challenging to train and deploy on real robots. To reap the maximal benefits from neuromorphic computing, it is necessary to perform all autonomy functions end-to-end on a single neuromorphic chip, from low-level attitude control to high-level navigation. This research presents the first neuromorphic control system using a spiking neural network (SNN) to effectively map a drone's raw sensory input directly to motor commands. We apply this method to low-level attitude estimation and control for a quadrotor, deploying the SNN on a tiny Crazyflie. We propose a modular SNN, separately training and then merging estimation and control sub-networks. The SNN is trained with imitation learning, using a flight dataset of sensory-motor pairs. Post-training, the network is deployed on the Crazyflie, issuing control commands from sensor inputs at 500Hz. Furthermore, for the training procedure we augmented training data by flying a controller with additional excitation and time-shifting the target data to enhance the predictive capabilities of the SNN. On the real drone, the perception-to-control SNN tracks attitude commands with an average error of 3.0 degrees, compared to 2.7 degrees for the regular flight stack. We also show the benefits of the proposed learning modifications for reducing the average tracking error and reducing oscillations. Our work shows the feasibility of performing neuromorphic end-to-end control, laying the basis for highly energy-efficient and low-latency neuromorphic autopilots.
|
|
15:35-15:40, Paper WeCT12.8 | |
Real-Time Iteration Scheme for Diffusion Policy |
|
Duan, Yufei | KTH Royal Institute of Technology |
Yin, Hang | University of Copenhagen |
Kragic, Danica | KTH |
Keywords: Imitation Learning, Machine Learning for Robot Control, Learning from Demonstration
Abstract: Diffusion Policies have demonstrated impressive performance in robotic manipulation tasks. However, their long inference time, resulting from an extensive iterative denoising process, and the need to execute an action chunk before the next prediction to maintain consistent actions limit their applicability to latency-critical tasks or simple tasks with a short cycle time. While recent methods explored distillation or alternative policy structures to accelerate inference, these often demand additional training, which can be resource-intensive for large robotic models. In this paper, we introduce a novel approach inspired by the Real-Time Iteration (RTI) Scheme, a method from optimal control that accelerates optimization by leveraging solutions from previous time steps as initial guesses for subsequent iterations. We explore the application of this scheme in diffusion inference and propose a scaling-based method to effectively handle discrete actions, such as grasping, in robotic manipulation. The proposed scheme significantly reduces runtime computational costs without the need for distillation or policy redesign. This enables a seamless integration into many pre-trained diffusion-based models, in particular, to resource-demanding large models. We also provide theoretical conditions for the contractivity which could be useful for estimating the initial denoising step. Quantitative results from extensive simulation experiments show a substantial reduction in inference time, with comparable overall performance compared with Diffusion Policy using full-step denoising.
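The warm-starting idea carries over to diffusion inference in a few lines. A minimal PyTorch sketch where the previous action chunk is time-shifted and lightly re-noised to initialize the next denoising run; the shift-and-renoise scheme below is an illustrative assumption:

```python
import torch

def rti_warm_start(prev_actions, noise_scale, shift=1):
    """Build an initial guess for the next denoising run from the
    previous solution, RTI-style, so only a few denoising steps are
    needed instead of starting from pure noise.

    prev_actions: (H, A) previously predicted action chunk.
    shift: how many control steps have elapsed since that prediction.
    """
    guess = torch.roll(prev_actions, shifts=-shift, dims=0)  # advance in time
    guess[-shift:] = prev_actions[-1]                        # pad the tail by repetition
    return guess + noise_scale * torch.randn_like(guess)     # re-noise to a mid denoising level
```

Discrete dimensions such as a gripper open/close command would need the scaling-based handling mentioned above before re-noising; that step is omitted here.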
|
|
WeCT13 |
311C |
Deep Learning for Visual Perception 7 |
Regular Session |
|
15:00-15:05, Paper WeCT13.1 | |
CooPre: Cooperative Pretraining for V2X Cooperative Perception |
|
Zhao, Zhihao | University of California, Los Angeles |
Xiang, Hao | University of California, Los Angeles |
Xu, Chenfeng | University of California, Berkeley |
Xia, Xin | University of California, Los Angeles |
Zhou, Bolei | University of California, Los Angeles |
Ma, Jiaqi | University of California, Los Angeles |
Keywords: Deep Learning for Visual Perception, Representation Learning, Computer Vision for Transportation
Abstract: Existing Vehicle-to-Everything (V2X) cooperative perception methods rely on accurate multi-agent 3D annotations. Nevertheless, it is time-consuming and expensive to collect and annotate real-world data, especially for V2X systems. In this paper, we present a self-supervised learning framework for V2X cooperative perception, which utilizes the vast amount of unlabeled 3D V2X data to enhance perception performance. Specifically, multi-agent sensing information is aggregated to form a holistic view, and a novel proxy task is formulated to reconstruct the LiDAR point clouds across multiple connected agents to better reason about multi-agent spatial correlations. Besides, we develop a V2X bird-eye-view (BEV) guided masking strategy which effectively allows the model to pay attention to 3D features across heterogeneous V2X agents (i.e., vehicles and infrastructure) in the BEV space. Such a masking strategy effectively pretrains the 3D encoder with a multi-agent LiDAR point cloud reconstruction objective and is compatible with mainstream cooperative perception backbones. Our approach, validated through extensive experiments on representative datasets (i.e., V2X-Real, V2V4Real, and OPV2V) and multiple state-of-the-art cooperative perception methods (i.e., AttFuse, F-Cooper, and V2X-ViT), leads to a performance boost across all V2X settings. Notably, CooPre achieves a 4% mAP improvement on the V2X-Real dataset and surpasses baseline performance using only 50% of the training data, highlighting its data efficiency. Additionally, we demonstrate the framework's powerful performance in cross-domain transferability and robustness under challenging scenarios. The code will be made publicly available at https://github.com/ucla-mobility/CooPre.
|
|
15:05-15:10, Paper WeCT13.2 | |
PCGE: Boosting 3D Visual Grounding Via Progressive Comprehension and Geometric-Topology Perception Enhancement |
|
Wang, Zeyue | Bionic Vision System Laboratory, State Key Laboratory of Transducer Technology |
Xu, Xixia | Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences |
Liu, RunZe | Harbin Institute of Technology |
Zhu, Dongchen | Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences |
Li, Jiamao | Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences |
Keywords: Deep Learning for Visual Perception
Abstract: The 3D visual grounding task aims to establish correspondences between the 3D physical world and textual descriptions. Despite significant progress, it still faces several challenges: a) scene-agnostic text reasoning causes misaligned target region concentration; b) interference from regional pseudo-centers results in an inaccurate geometric center; c) multi-modal features overemphasize semantics, degrading the geometric-topological perception needed for size regression. To address these issues, we propose a Progressive Comprehension and Geometric-topology Perception Enhancement (PCGE) one-stage framework, which decouples the task into keypoint estimation and size regression under textual constraints. Specifically, to enable coarse-to-fine keypoint estimation, we propose the STAR module to focus on the approximate target region with a scene-specific reasoning mechanism, while the K2C module performs geometric calibration to alleviate pseudo-center bias. For size regression, we propose GTE to enhance geometric boundary perception during the decoding process, improving size regression by establishing topological matrices. Compared with previous methods, our approach achieves state-of-the-art performance on ScanRefer and Sr3D, with a 3.94% lead in Acc@0.50 on ScanRefer and a 3.7% lead on Sr3D.
|
|
15:10-15:15, Paper WeCT13.3 | |
Efficient Instance Motion-Aware Point Cloud Scene Prediction |
|
Fang, Yiming | National University of Defense Technology |
Chen, Xieyuanli | National University of Defense Technology |
Wang, Neng | National University of Defense Technology |
Huang, Kaihong | National University of Defense Technology |
Lu, Huimin | National University of Defense Technology |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation
Abstract: Point cloud prediction (PCP) aims to forecast future 3D point clouds of scenes by leveraging sequential historical LiDAR scans, offering a promising avenue to enhance the perceptual capabilities of autonomous systems. However, existing methods mostly adopt an end-to-end approach without explicitly modeling moving instances, limiting their effectiveness in dynamic real-world environments. In this paper, we propose IMPNet, a novel instance motion-aware network for future point cloud scene prediction. Unlike prior works, IMPNet explicitly incorporates motion and instance-level information to enhance PCP accuracy. Specifically, we extract appearance and motion features from range images and residual images using a dual-branch convolutional network and fuse them via a motion attention block. Our framework further integrates a motion head for identifying moving objects and an instance-assisted training strategy to improve instance-wise point cloud predictions. Extensive experiments on multiple datasets demonstrate that our proposed network achieves state-of-the-art performance in PCP with superior predictive accuracy and robust generalization across diverse driving scenarios. Our method has been released at https://github.com/nubot-nudt/IMPNet.
|
|
15:15-15:20, Paper WeCT13.4 | |
NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding Based on Prompt-Guided Camera and 4D mmWave Radar |
|
Guan, Runwei | University of Liverpool |
Liu, Jianan | Momoni AI |
Jia, Liye | University of Liverpool |
Zhao, Haocheng | Xi'an Jiaotong-Liverpool University |
Yao, Shanliang | YCIT |
Zhu, Xiaohui | Xi'an Jiaotong-Liverpool University |
Man, Ka Lok | Xi'an Jiaotong-Liverpool University |
Lim, Eng Gee | Xi'an Jiaotong-Liverpool University |
Smith, Jeremy S. | University of Liverpool |
Yue, Yutao | Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Deep Learning for Visual Perception, Recognition, Semantic Scene Understanding
Abstract: Recently, visual grounding and multi-sensor settings have been incorporated into perception systems for terrestrial autonomous driving and Unmanned Surface Vessels (USVs), yet the high complexity of modern learning-based visual grounding models using multiple sensors prevents such models from being deployed on USVs in real life. To this end, we design a low-power multi-task model named NanoMVG for waterway embodied perception, guiding both a camera and a 4D millimeter-wave radar to locate specific object(s) through natural language. NanoMVG can perform both box-level and mask-level visual grounding tasks simultaneously. Compared to other visual grounding models, NanoMVG achieves highly competitive performance on the WaterVG dataset, particularly in harsh environments. Moreover, real-world experiments deploying NanoMVG on an embedded edge device of a USV demonstrate its fast inference speed for real-time perception and its ultra-low power consumption for long endurance.
|
|
15:20-15:25, Paper WeCT13.5 | |
OmniPose6D: Towards Short-Term Object Pose Tracking in Dynamic Scenes from Monocular RGB |
|
Lin, Yunzhi | Georgia Institute of Technology |
Zhao, Yipu | Facebook Inc |
Chu, Fu-Jen | Facebook AI Research |
Chen, Xingyu | Meta |
Wang, Weiyao | Meta |
Tang, Hao | Meta Platforms Inc |
Vela, Patricio | Georgia Institute of Technology |
Feiszli, Matt | Meta FAIR |
Liang, Kevin | Meta |
Keywords: Deep Learning for Visual Perception, Visual Tracking, Computer Vision for Automation
Abstract: To address the challenge of short-term object pose tracking in dynamic environments with monocular RGB input, we introduce a large-scale synthetic dataset OmniPose6D, crafted to mirror the diversity of real-world conditions. We additionally present a benchmarking framework for a comprehensive comparison of pose tracking algorithms. We propose a pipeline featuring an uncertainty-aware keypoint refinement network, employing probabilistic modeling to refine pose estimation. Comparative evaluations demonstrate that our approach achieves performance superior to existing baselines on real datasets, underscoring the effectiveness of our synthetic dataset and refinement technique in enhancing tracking precision in dynamic contexts. Our contributions set a new precedent for the development and assessment of object pose tracking methodologies in complex scenes.
|
|
15:25-15:30, Paper WeCT13.6 | |
GaussianPU: Color Point Cloud Upsampling Via 3D Gaussian Splatting |
|
Guo, Zixuan | Peking University |
Xie, Yifan | Xi'an Jiaotong University |
Xie, Weijing | Sun Yat-Sen University |
Huang, Peng | Nanjing University |
Chenyang, Wang | Shenzhen University |
Ma, Fei | Guangdong Laboratory of Artificial Intelligence and Digital Economy |
Yu, F. Richard | Carleton University |
Keywords: Deep Learning for Visual Perception, Visual Learning
Abstract: Dense colored point clouds enhance visual perception and are of significant value in various robotic applications. However, existing learning-based point cloud upsampling methods are constrained by computational resources and batch processing strategies, which often require subdividing point clouds into smaller patches, leading to distortions that degrade perceptual quality. To address this challenge, we propose a novel 2D-3D hybrid colored point cloud upsampling framework (GaussianPU) based on 3D Gaussian Splatting (3DGS) for robotic perception. This approach leverages 3DGS to bridge 3D point clouds with their 2D rendered images in robot vision systems. A dual scale rendered image restoration network transforms sparse point cloud renderings into dense representations, which are then input into 3DGS along with precise robot camera poses and interpolated sparse point clouds to reconstruct dense 3D point clouds. We have made a series of enhancements to the vanilla 3DGS, enabling precise control over the number of points and significantly boosting the quality of the upsampled point cloud for robotic scene understanding. Our framework supports processing entire point clouds on a single consumer-grade GPU, eliminating the need for segmentation and thus producing high-quality, dense colored point clouds with millions of points for robot navigation and manipulation tasks. Extensive experimental results on generating million-level point cloud data validate the effectiveness of our method, substantially improving the quality of colored point clouds and demonstrating significant potential for applications involving large-scale point clouds in autonomous robotics and human-robot interaction scenarios.
|
|
15:30-15:35, Paper WeCT13.7 | |
ViewActive: Active Viewpoint Optimization from a Single Image |
|
Wu, Jiayi | University of Maryland, College Park |
Lin, Xiaomin | Johns Hopkins University |
He, Botao | University of Maryland |
Fermüller, Cornelia | University of Maryland |
Aloimonos, Yiannis | University of Maryland |
Keywords: Deep Learning for Visual Perception, Visual Learning, Cognitive Modeling
Abstract: When observing objects, humans benefit from their spatial visualization and mental rotation ability to envision potential optimal viewpoints based on the current observation. This capability is crucial for enabling robots to achieve efficient and robust scene perception during operation, as optimal viewpoints provide essential and informative features for accurately representing scenes in 2D images, thereby enhancing downstream tasks. To endow robots with this human-like active viewpoint optimization capability, we propose ViewActive, a modernized machine learning approach drawing inspiration from aspect graphs, which provides viewpoint optimization guidance based solely on the current 2D image input. Specifically, we introduce the 3D Viewpoint Quality Field (VQF), a compact and consistent representation of the viewpoint quality distribution similar to an aspect graph, composed of three general-purpose viewpoint quality metrics: self-occlusion ratio, occupancy-aware surface normal entropy, and visual entropy. We utilize pre-trained image encoders to extract robust visual and semantic features, which are then decoded into the 3D VQF, allowing our model to generalize effectively across diverse objects, including unseen categories. The lightweight ViewActive network (72 FPS on a single GPU) significantly enhances the performance of state-of-the-art object recognition pipelines and can be integrated into real-time motion planning for robotic applications. Our code and dataset are available at https://github.com/jiayi-wu-umd/ViewActive.
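As a concrete illustration of two of the three metrics above, the following minimal Python sketch computes a visual entropy (Shannon entropy of a view's intensity histogram) and a surface-normal entropy (entropy of binned visible normal directions). The function names, bin counts, and sphere binning are our illustrative assumptions, not the paper's implementation; the occupancy weighting and the self-occlusion ratio are omitted.

    import numpy as np

    def visual_entropy(gray_img, bins=32):
        # Shannon entropy of the intensity histogram of a rendered view.
        hist, _ = np.histogram(gray_img, bins=bins, range=(0.0, 1.0))
        p = hist / max(hist.sum(), 1)
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    def normal_entropy(normals, bins=8):
        # Entropy of visible surface-normal directions, quantized per component.
        q = np.clip(((normals + 1.0) * 0.5 * bins).astype(int), 0, bins - 1)
        codes = q[:, 0] * bins * bins + q[:, 1] * bins + q[:, 2]
        _, counts = np.unique(codes, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    # A flat view scores lower visual entropy than a textured one.
    print(visual_entropy(np.full((64, 64), 0.5)), visual_entropy(np.random.rand(64, 64)))
    normals = np.random.randn(500, 3)
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    print(normal_entropy(normals))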
|
|
15:35-15:40, Paper WeCT13.8 | |
Towards Physically Realizable Adversarial Attacks in Embodied Vision Navigation |
|
Meng, Chen | Beijing University of Posts and Telecommunications |
Tu, Jiawei | Beijing University of Posts and Telecommunications |
Qi, Chao | Beijing University of Posts and Telecommunications |
Dang, Yonghao | Beijing University of Posts and Telecommunications |
Zhou, Feng | Beijing University of Posts and Telecommunications |
Wei, Wei | Beijing University of Posts and Telecommunications |
Jianqin, Yin | School of Artificial Intelligence, Beijing University of Posts and Telecommunications |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Vision-Based Navigation
Abstract: The significant advancements in embodied vision navigation have raised concerns about its susceptibility to adversarial attacks exploiting deep neural networks. Investigating the adversarial robustness of embodied vision navigation is crucial, especially given the threat of 3D physical attacks that could pose risks to human safety. However, existing attack methods for embodied vision navigation often lack physical feasibility due to challenges in transferring digital perturbations into the physical world. Moreover, current physical attacks for object detection struggle to achieve both multi-view effectiveness and visual naturalness in navigation scenarios. To address this, we propose a practical attack method for embodied navigation by attaching adversarial patches to objects, where both opacity and textures are learnable. Specifically, to ensure effectiveness across varying viewpoints, we employ a multi-view optimization strategy based on object-aware sampling, which optimizes the patch's texture based on feedback from the vision-based perception model used in navigation. To make the patch inconspicuous to human observers, we introduce a two-stage opacity optimization mechanism, in which opacity is fine-tuned after texture optimization. Experimental results demonstrate that our adversarial patches decrease the navigation success rate by an average of 22.39%, outperforming previous methods in practicality, effectiveness, and naturalness. Code is available at: https://github.com/chen37058/Physical-Attacks-in-Embodied-Nav.
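A self-contained toy of the two-stage optimization described above: texture is optimized first across several views, then opacity is fine-tuned against a naturalness penalty. The alpha-compositing renderer and detector (composite, toy_detector_score) are trivial stand-ins we invented for illustration; the real method optimizes against the navigation system's perception model with object-aware multi-view sampling.

    import torch

    torch.manual_seed(0)

    def composite(bg, texture, opacity):
        # Alpha-composite a learnable patch onto a background view.
        return opacity * texture + (1.0 - opacity) * bg

    def toy_detector_score(img):
        # Stand-in for the perception model's confidence on the patched view.
        return img.mean()

    views = [torch.rand(3, 32, 32) for _ in range(4)]   # multi-view samples (toy)
    texture = torch.rand(3, 32, 32, requires_grad=True)
    opacity = torch.full((1, 32, 32), 0.9)

    # Stage 1: optimize the texture across views to suppress the detector score.
    opt_tex = torch.optim.Adam([texture], lr=0.05)
    for _ in range(100):
        loss = sum(toy_detector_score(composite(v, texture, opacity)) for v in views)
        opt_tex.zero_grad(); loss.backward(); opt_tex.step()
        texture.data.clamp_(0.0, 1.0)

    # Stage 2: fine-tune opacity; the naturalness term trades visibility
    # against attack strength (lower opacity = less conspicuous patch).
    opacity.requires_grad_(True)
    opt_op = torch.optim.Adam([opacity], lr=0.05)
    for _ in range(100):
        attack = sum(toy_detector_score(composite(v, texture.detach(), opacity)) for v in views)
        loss = attack + 0.5 * opacity.mean()
        opt_op.zero_grad(); loss.backward(); opt_op.step()
        opacity.data.clamp_(0.0, 1.0)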
|
|
WeCT14 |
311D |
Learning from Demonstration 3 |
Regular Session |
Chair: Cheng, Long | Chinese Academy of Sciences |
|
15:00-15:05, Paper WeCT14.1 | |
Neural-Lyapunov Fusion: Stable Dynamical System Learning for Robotic Motion Generation |
|
Zhang, Haoyu | Institute of Automation, Chinese Academy of Sciences |
Zhang, Yu | University of Chinese Academy of Sciences |
Zou, Yongxiang | Institute of Automation, Chinese Academy of Sciences |
Li, Houcheng | University of Chinese Academy of Sciences |
Cheng, Long | Chinese Academy of Sciences |
Keywords: Learning from Demonstration, Learning from Experience, Imitation Learning
Abstract: Point-to-point and periodic motions are ubiquitous in the world of robotics. To master these motions, Autonomous Dynamic System (ADS)-based algorithms are fundamental in the domain of Learning from Demonstration (LfD). However, these algorithms face the significant challenge of balancing precision in learning with the maintenance of system stability. This paper addresses this challenge by presenting a novel ADS algorithm that leverages neural network technology. The proposed algorithm is designed to distill essential knowledge from demonstration data, ensuring stability during the learning of both point-to-point and periodic motions. For point-to-point motions, a neural Lyapunov function is proposed to align with the provided demonstrations. In the case of periodic motions, the neural Lyapunov function is combined with transversal contraction to ensure that all generated motions converge to a stable limit cycle. The model utilizes a streamlined neural network architecture, adept at achieving dual objectives: optimizing learning accuracy while maintaining global stability. To thoroughly assess the efficacy of the proposed algorithm, rigorous evaluations are conducted using the LASA dataset and a massage robot task. The assessments are complemented by empirical validation, providing evidence of the algorithm's performance.
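A minimal sketch of the point-to-point half of this idea, under our own assumptions rather than the paper's exact formulation: a neural Lyapunov candidate V(x) = ||phi(x) - phi(x*)||^2 + eps*||x - x*||^2 is zero at the attractor and positive elsewhere by construction, and is trained so that V decreases along consecutive demonstration samples.

    import torch
    import torch.nn as nn

    class NeuralLyapunov(nn.Module):
        # Positive definite about x_star by construction.
        def __init__(self, dim, x_star, eps=1e-3):
            super().__init__()
            self.phi = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, 64))
            self.register_buffer("x_star", x_star)
            self.eps = eps

        def forward(self, x):
            d = self.phi(x) - self.phi(self.x_star)
            return (d ** 2).sum(-1) + self.eps * ((x - self.x_star) ** 2).sum(-1)

    # Toy demonstration: a spiral converging to the origin (LASA-like data).
    t = torch.linspace(0.0, 6.0, 200)
    demo = torch.stack([torch.exp(-0.5 * t) * torch.cos(t),
                        torch.exp(-0.5 * t) * torch.sin(t)], dim=-1)

    V = NeuralLyapunov(2, torch.zeros(2))
    opt = torch.optim.Adam(V.parameters(), lr=1e-3)
    for _ in range(300):
        v = V(demo)
        # Hinge loss: V must decrease along each demonstrated transition.
        loss = torch.relu(v[1:] - v[:-1] + 1e-4).mean()
        opt.zero_grad(); loss.backward(); opt.step()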
|
|
15:05-15:10, Paper WeCT14.2 | |
Learning Target-Directed Skill and Variable Impedance Control from Interactive Demonstrations for Robot-Assisted Soft Tissue Puncture Tasks (I) |
|
Zhai, Xueqian | Wuyi University |
Jiang, Li | South China University of Technology |
Wu, Hongmin | Institute of Intelligent Manufacturing, Guangdong Academy of Sciences |
Zheng, Haochen | Wuyi University |
Liu, Dong | Institute of Intelligent Manufacturing, Guangdong Academy of Sciences |
Wu, Xinyu | CAS |
Zhihao, Xu | GIIM |
Zhou, Xuefeng | Institute of Intelligent Manufacturing, Guangdong Academy of Sciences |
Keywords: Learning from Demonstration, Compliance and Impedance Control, Motion and Path Planning
Abstract: A framework is proposed in this paper for learning variable impedance in percutaneous puncture surgery, with the aim of simplifying the robotic puncture of soft tissues. The framework involves simulating the dynamic changes that occur when the human arm interacts with human tissues and transferring the resulting adaptive capabilities to the robot through learning movement trends and stiffness changes. To enhance performance during task execution, we integrate the variable impedance control framework with the interactive operation and feedback controllers. To provide flexibility for trajectory modification during operation, derivative Gaussian processes are introduced to identify the target position and obtain a model of motion trends. This control law is combined with virtual dynamics that describe puncture dynamics, enabling the robot to regulate interactions and plan its trajectory. We present experiments in which the Franka Emika Panda robot punctures tissues of varying degrees of hardness. The results demonstrate that our framework is capable of learning manipulation skills for physical interaction with humans, thereby reducing application complexity in tasks involving complex force interactions for robots. Compared to using fixed or variable impedance gain controllers, our approach effectively improves the success rate, stability, and efficiency of percutaneous puncture.
|
|
15:10-15:15, Paper WeCT14.3 | |
Simultaneously Learning of Motion, Stiffness, and Force from Human Demonstration Based on Riemannian DMP and QP Optimization (I) |
|
Liao, Zhiwei | Xi'an Jiaotong University |
Tassi, Francesco | Istituto Italiano Di Tecnologia |
Gong, Chenwei | Xi'an Jiaotong University |
Leonori, Mattia | Istituto Italiano Di Tecnologia |
Zhao, Fei | Xi'an Jiaotong University |
Jiang, Gedong | State Key Laboratory for Manufacturing Systems Engineering, Xi'an Jiaotong University |
Ajoudani, Arash | Istituto Italiano Di Tecnologia |
Keywords: Learning from Demonstration, Imitation Learning, Compliance and Impedance Control
Abstract: In this paper, we propose a motion, stiffness, and force learning framework based on an extended dynamic movement primitive (DMP) and quadratic programming (QP) optimization. The objective is to learn kinematic and dynamic operational parameters from a one-shot human demonstration, through measurement and estimation of the motion, 3-dimensional (3-D) endpoint stiffness, and applied forces of the human arm during manipulation tasks. To this end, first, the framework features an extended DMP to model the motion, stiffness, and force variations in Cartesian space and on the 2-D sphere manifold. Second, to account for collected errors and human-robot operation gaps, a QP optimization is applied to fine-tune the desired position of the controller. Finally, we validate the framework through two experiments in real scenarios on the Franka Emika Panda robot. Experimental results show that the robot can not only inherit the variation laws of motion, stiffness, and force in the human demonstration, but also exhibit certain generalization capabilities to other situations. The framework provides a reference for robots learning multiple skills via a one-shot human demonstration, and it has great potential for application in human-robot cooperation, contact-rich scenarios, and skillful operations, where the motion, stiffness, and applied forces need to be considered simultaneously.
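For readers unfamiliar with the base model being extended, a minimal 1-D discrete DMP is sketched below; the gains and basis-function heuristics are common defaults of ours, and the paper's additions (stiffness and force channels, the 2-D sphere manifold, QP fine-tuning) are not shown.

    import numpy as np

    def dmp_rollout(y0, g, w, tau=1.0, alpha=25.0, beta=6.25, alpha_x=4.0, dt=0.01, T=1.0):
        # Transformation system: tau*dz = alpha*(beta*(g - y) - z) + f(x), tau*dy = z,
        # driven by the canonical system tau*dx = -alpha_x*x.
        n = len(w)
        c = np.exp(-alpha_x * np.linspace(0.0, 1.0, n))   # basis centers along x
        h = n / c                                         # widths (common heuristic)
        y, z, x, ys = y0, 0.0, 1.0, []
        for _ in range(int(T / dt)):
            psi = np.exp(-h * (x - c) ** 2)
            f = x * (g - y0) * (psi @ w) / (psi.sum() + 1e-10)
            z += (alpha * (beta * (g - y) - z) + f) / tau * dt
            y += z / tau * dt
            x += -alpha_x * x / tau * dt
            ys.append(y)
        return np.array(ys)

    traj = dmp_rollout(y0=0.0, g=1.0, w=np.zeros(20))     # zero forcing: smooth reach to g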
|
|
15:15-15:20, Paper WeCT14.4 | |
PI2-BDMPs in Combination with Contact Force Model: A Robotic Polishing Skill Learning and Generalization Approach (I) |
|
Wang, Yu | Huazhong University of Science and Technology |
Chen, Chen | Wuhan University of Science and Technology |
Hong, Yong | Huazhong University of Science and Technology |
Zheng, Zhouyi | Huazhong University of Science and Technology |
Gao, Zhitao | Huazhong University of Science and Technology |
Peng, Fangyu | Huazhong University of Science and Technology |
Yan, Rong | Huazhong University of Science and Technology |
Tang, Xiaowei | Huazhong University of Science and Technology |
Keywords: Learning from Demonstration, Imitation Learning, Reinforcement Learning
Abstract: Robot skill learning and generalization in robotic polishing tasks play a pivotal role in efficient task planning, particularly in tasks where visual scanning is not convenient. Compared to other manipulations, skill learning of robot polishing trajectories and skill generalization between similar trajectories require high accuracy, and skill learning under a small number of basis functions is also worth studying. To address these challenges and consider both the learning accuracy and the number of basis functions of the skill model, this study introduces a novel method known as B-spline dynamic movement primitives (BDMPs) in the context of robotic polishing tasks. BDMPs demonstrate superior trajectory learning accuracy even with a small number of basis functions compared to dynamic movement primitives. In addition, to facilitate trajectory generalization in new scenarios, a trajectory skill generalization framework based on policy improvement with path integrals (PI2) is proposed, resulting in an integrated approach named PI2-BDMPs. The PI2-BDMPs approach is designed to enable the generalization of polished trajectory skills to meet new task requirements, even when provided with limited demonstration points and a small number of basis functions. To enhance the generalization performance of the skill in the contact force setting task, a PI2 approach incorporating a contact force model is proposed. To validate the effectiveness of our proposed method, real-world robot polishing experiments are conducted with an adaptive variable impedance controller. The results verify that BDMPs offer superior trajectory learning accuracy with few basis functions, and PI2-BDMPs facilitate the generalization of polished trajectory skills to new scenarios, showcasing the adaptability of our approach.
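The PI2 update at the core of the generalization framework is compact: perturb the parameters (here standing in for B-spline control points), roll out, and average the perturbations with softmax weights of the normalized negative cost. The quadratic toy cost below replaces the paper's task cost, which additionally incorporates the contact force model.

    import numpy as np

    rng = np.random.default_rng(0)

    def pi2_update(theta, cost_fn, n_rollouts=32, sigma=0.1, lam=0.05):
        # One PI2 iteration: exploration, probability weighting, parameter update.
        eps = rng.normal(0.0, sigma, size=(n_rollouts, theta.size))
        costs = np.array([cost_fn(theta + e) for e in eps])
        s = (costs - costs.min()) / max(costs.max() - costs.min(), 1e-10)
        w = np.exp(-s / lam)
        w /= w.sum()
        return theta + w @ eps

    target = np.array([0.3, -0.7, 1.1])                   # toy "desired control points"
    theta = np.zeros(3)
    for _ in range(50):
        theta = pi2_update(theta, lambda th: float(np.sum((th - target) ** 2)))
    print(theta)                                          # drifts toward the target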
|
|
15:20-15:25, Paper WeCT14.5 | |
Storm: An Experience-Based Framework for Robot Learning from Demonstration |
|
Quiroga, Natalia | Hochschule Bonn-Rhein-Sieg (H-BRS) |
Mitrevski, Alex | Hochschule Bonn-Rhein-Sieg |
Plöger, Paul G. | Hochschule Bonn Rhein Sieg |
Hassan, Teena | Bonn-Rhein-Sieg University of Applied Sciences |
Keywords: Learning from Demonstration, Learning from Experience, Developmental Robotics
Abstract: Learning from demonstration (LfD) can be used to increase the behavioural repertoire of a robot, but most demonstration-based learning techniques do not enable a robot to acquire knowledge about the limitations of its own body and use that information during learning. In this paper, we propose Storm, an LfD framework that enables acquiring trajectories in high-dimensional spaces, incorporates collision awareness, and can be adapted to different robots. Storm combines a collection of modules: (i) robot embodiment exploration using motor babbling in order to acquire knowledge about the robot's own body, stored in the form of joint-specific graphs that encode reachable points and reachability constraints, (ii) human-robot model mapping based on which human skeleton observations are mapped to the robot's embodiment, and (iii) demonstration-based trajectory learning and subsequent reproduction of the learned actions using Gaussian mixture regression. We validate various aspects of our approach experimentally: (i) exploration with different numbers of babbling points for three distinct robots, (ii) path planning performance, including in the presence of obstacles, and (iii) the acceptance of reproduced trajectories through a small-scale, real-world user study. The results demonstrate that Storm can produce versatile behaviours on different robots, and that trajectory reproductions are generally rated well by external observers, which is important for overall user acceptance.
|
|
15:25-15:30, Paper WeCT14.6 | |
Positive-Unlabeled Constraint Learning for Inferring Nonlinear Continuous Constraints Functions from Expert Demonstrations |
|
Peng, Baiyu | EPFL |
Billard, Aude | EPFL |
Keywords: Learning from Demonstration, Machine Learning for Robot Control, Transfer Learning
Abstract: Planning for diverse real-world robotic tasks necessitates knowing and specifying all constraints. However, instances exist where these constraints are either unknown or challenging to specify accurately. A possible solution is to infer the unknown constraints from expert demonstrations. This paper presents a novel two-step Positive-Unlabeled Constraint Learning (PUCL) algorithm to infer an arbitrary continuous constraint function from demonstrations, without requiring prior knowledge of the true constraint parameterization or of an environmental model, as existing works do. We treat all data in demonstrations as positive (feasible) data, and learn a control policy to generate potentially infeasible trajectories, which serve as unlabeled data. The proposed two-step learning framework first identifies reliable infeasible data using a distance metric, and second learns a binary feasibility classifier (i.e., constraint function) from the feasible demonstrations and reliable infeasible data. The proposed method can flexibly learn complex-shaped constraint boundaries and, unlike previous methods, does not mistakenly classify demonstrations as infeasible. The effectiveness of the proposed method is verified in four constrained environments, using a networked policy or a dynamical system policy. It successfully infers the continuous nonlinear constraints and outperforms other baseline methods in terms of constraint accuracy and policy safety.
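A toy rendition of the two steps, with a circular no-go region standing in for the unknown constraint; the distance threshold, the classifier, and all data below are our illustrative choices, not the paper's:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)

    # Positive (feasible) data: demonstrations avoiding an unknown circular region.
    demos = rng.uniform(-1.0, 1.0, size=(800, 2))
    demos = demos[np.linalg.norm(demos, axis=1) > 0.5]    # ground-truth constraint

    # Unlabeled data: states visited by a toy unconstrained policy.
    unlabeled = rng.uniform(-1.0, 1.0, size=(500, 2))

    # Step 1: reliable infeasible = unlabeled points far from every demonstration.
    d = np.linalg.norm(unlabeled[:, None, :] - demos[None, :, :], axis=-1).min(axis=1)
    neg = unlabeled[d > 0.2]                              # threshold is a design choice

    # Step 2: binary feasibility classifier = the learned constraint function.
    X = np.vstack([demos, neg])
    y = np.hstack([np.ones(len(demos)), np.zeros(len(neg))])
    clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000).fit(X, y)
    print(clf.predict([[0.0, 0.0], [0.9, 0.9]]))          # -> infeasible, feasible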
|
|
15:30-15:35, Paper WeCT14.7 | |
Human-Humanoid Robots Cross-Embodiment Behavior-Skill Transfer Using Decomposed Adversarial Learning from Demonstration |
|
Liu, Junjia | The Chinese University of Hong Kong |
Li, Zhuo | The Chinese University of Hong Kong |
Yu, Minghao | The Chinese University of Hong Kong |
Dong, Zhipeng | The Chinese University of Hong Kong |
Calinon, Sylvain | Idiap Research Institute |
Caldwell, Darwin G. | Istituto Italiano Di Tecnologia |
Chen, Fei | T-Stone Robotics Institute, the Chinese University of Hong Kong |
Keywords: Learning from Demonstration, Mobile Manipulation, Human and Humanoid Motion Analysis and Synthesis
Abstract: Humanoid robots are envisioned as embodied intelligent entities capable of performing diverse human-level loco-manipulation tasks. However, effectively transferring human embodiment manipulation skills to humanoid robots poses significant challenges due to their differing configurations, dynamics, and coordination requirements across their many degrees of freedom. In this work, we propose an efficient and transferable framework for transferring human whole-body loco-manipulation skills to various heterogeneous humanoid robots. We introduce a unified digital human model as a prototype embodiment to learn behavior primitives through adversarial imitation using human demonstrations. The humanoid's many degrees of freedom are decomposed into multiple functional parts, where behavior primitives are learned separately and then coordinated. Loco-manipulation skills with generalized task parameters are achieved by dynamically combining these primitives using a human-object interaction graph as guidance. For each humanoid robot, the same loco-manipulation skill can be learned via embodiment-specific kinematic motion retargeting and dynamic fine-tuning.
|
|
15:35-15:40, Paper WeCT14.8 | |
Learning Rhythmic Trajectories with Geometric Constraints for Laser-Based Skincare Procedures |
|
Duan, Anqing | Mohamed Bin Zayed University of Artificial Intelligence |
Liuchen, Wanli | The Hong Kong Polytechnic University |
Wu, Jinsong | Hong Kong Polytechnic University |
Camoriano, Raffaello | Politecnico Di Torino |
Rosasco, Lorenzo | Istituto Italiano Di Tecnologia & Massachusetts Institute of Technology |
Navarro-Alarcon, David | The Hong Kong Polytechnic University |
Keywords: Learning from Demonstration, Motion and Path Planning, Learning and Adaptive Systems, Robotic Cosmetic Dermatology
Abstract: The increasing deployment of robots has significantly enhanced the automation levels across a wide and diverse range of industries. This paper investigates the automation challenges of laser-based dermatology procedures in the beauty industry; this group of related manipulation tasks involves delivering energy from a cosmetic laser onto the skin with repetitive patterns. To automate this procedure, we propose to use a robotic manipulator and endow it with the dexterity of a skilled dermatology practitioner through a learning-from-demonstration framework. To ensure that the cosmetic laser can properly deliver the energy onto the skin surface of an individual, we develop a novel structured prediction-based imitation learning algorithm with the merit of handling geometric constraints. Notably, our proposed algorithm effectively tackles the imitation challenges associated with quasi-periodic motions, a common feature of many laser-based cosmetic tasks. The conducted real-world experiments illustrate the performance of our robotic beautician in mimicking realistic dermatological procedures; our new method is shown to replicate the rhythmic movements from the provided demonstrations.
|
|
WeCT15 |
206 |
Autonomous Vehicle Navigation 1 |
Regular Session |
Chair: Wang, Yue | Zhejiang University |
|
15:00-15:05, Paper WeCT15.1 | |
Safe Motion Planning for Multi-Vehicle Autonomous Driving in Uncertain Environment |
|
Lei, Zhezhi | National University of Singapore |
Wang, Wenxin | National University of Singapore |
Zhu, Zicheng | National University of Singapore |
Ma, Jun | The Hong Kong University of Science and Technology |
Ge, Shuzhi Sam | National University of Singapore |
Keywords: Autonomous Vehicle Navigation, Collision Avoidance, Planning under Uncertainty
Abstract: In the field of motion planning for autonomous driving systems, ensuring the safety of multi-vehicle navigation is one of the crucial topics. An unavoidable problem in practice is that noise-induced uncertainties in real-world applications highly degrade the safety of multi-vehicle navigation. It is also challenging to guarantee the required computational efficiency of motion planning algorithms in such uncertain environments. In this work, we present a novel motion planning framework to enhance the safety and computational efficiency of multi-vehicle navigation. This framework utilizes the iterative linear quadratic Gaussian (iLQG) algorithm to deal with the nonlinearity of the vehicle dynamics and overcomes the difficulties in handling inequality constraints (e.g., collision avoidance constraints). Furthermore, we propose an innovative Alternating Direction Method of Multipliers (ADMM)-based Linearized Chance Constraint (ALCC) method to address collision constraints in noisy, uncertain environments. Simulation results demonstrate that our method achieves higher safety and computational efficiency than other methods in various multi-vehicle motion planning and navigation scenarios.
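The standard deterministic reformulation that linearized chance-constraint methods build on is easy to state: for x ~ N(mu, Sigma), Pr(a^T x <= b) >= 1 - eps is equivalent to a^T mu + Phi^{-1}(1 - eps) * sqrt(a^T Sigma a) <= b. The sketch below, with illustrative numbers, shows this building block only, not the paper's full ADMM-based algorithm.

    import numpy as np
    from scipy.stats import norm

    def chance_constraint_margin(a, b, mu, Sigma, eps=0.05):
        # Margin of the deterministic surrogate; >= 0 means Pr(a^T x <= b) >= 1 - eps.
        tightening = norm.ppf(1.0 - eps) * np.sqrt(a @ Sigma @ a)
        return b - (a @ mu + tightening)

    # Half-plane a^T x <= b separating the ego vehicle from an obstacle.
    a = np.array([1.0, 0.0])
    b = 2.0
    mu = np.array([1.0, 0.0])                 # mean position
    Sigma = 0.1 * np.eye(2)                   # noise-induced position uncertainty
    print(chance_constraint_margin(a, b, mu, Sigma))   # positive -> satisfied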
|
|
15:05-15:10, Paper WeCT15.2 | |
On Robust Context-Aware Navigation for Autonomous Ground Vehicles |
|
Forte, Paolo | Örebro University |
Gupta, Himanshu | Örebro University |
Andreasson, Henrik | Örebro University |
Köckemann, Uwe | Orebro Universitet |
Lilienthal, Achim J. | Orebro University |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Robotics and Automation in Construction
Abstract: We propose a context-aware navigation framework designed to support the navigation of autonomous ground vehicles, including articulated ones. The proposed framework employs a behavior tree with novel nodes to manage the navigation tasks: planner and controller selection, path planning, path following, and recovery. It incorporates a weather detection system and configurable global path planning and controller strategy selectors implemented as behavior tree action nodes. These components are integrated into a sub-tree that supervises and manages available options and parameters for global planners and control strategies by evaluating map and real-time sensor data. The proposed approach offers three key benefits: overcoming the limitations of single planner strategies in challenging scenarios; ensuring efficient path planning by balancing between optimization and computational effort; and achieving smoother navigation by reducing path curvature and improving drivability. The performance of the proposed framework is analyzed empirically and compared against state-of-the-art navigation systems with single path planning strategies.
|
|
15:10-15:15, Paper WeCT15.3 | |
On Lie Group IMU and Linear Velocity Preintegration for Autonomous Navigation Considering the Earth Rotation Compensation |
|
Vial, Pau | Universitat De Girona |
Solà, Joan | Institut De Robòtica I Informàtica Industrial |
Palomeras, Narcis | Universitat De Girona |
Carreras, Marc | Universitat De Girona |
Keywords: Autonomous Vehicle Navigation, Kinematics, Marine Robotics, Lie Theory
Abstract: Robot localization is a fundamental task in achieving true autonomy. Recently, many graph-based navigators have been proposed that combine an inertial measurement unit (IMU) with an exteroceptive sensor, applying IMU preintegration to synchronize both sensors. IMUs are affected by biases that also have to be estimated. To increase the navigator's robustness when faults appear in the perception system, IMU preintegration can be complemented with linear velocity measurements obtained from visual odometry, leg odometry, or a Doppler Velocity Log (DVL), depending on the robotic application. Moreover, higher-grade IMUs are sensitive to the Earth rotation rate, which must be compensated in the preintegrated measurements. In this article, we propose a general-purpose preintegration methodology formulated on a compact Lie group to set motion constraints in graph simultaneous localization and mapping problems considering the Earth rotation effect. We introduce the SEn(3) group to jointly preintegrate IMU data and linear velocity measurements to preserve all the existing correlation within the preintegrated quantity. Field experiments using an autonomous underwater vehicle equipped with a DVL and a navigation-grade IMU are provided, and results are benchmarked against a commercial filter-based inertial navigation system to prove the effectiveness of our methodology.
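The Earth-rate compensation itself is simple to sketch on SO(3): before composing each incremental rotation, subtract the gyro bias and the Earth rotation rate rotated into the body frame. The snippet below is an illustration of ours, using an ENU world frame and Rodrigues' formula; the paper's joint SEn(3) treatment of linear velocity is not shown. It verifies that a stationary navigation-grade gyro, which measures the Earth rate, preintegrates to the identity after compensation.

    import numpy as np

    EARTH_RATE = 7.2921159e-5  # rad/s

    def so3_exp(w):
        # Rodrigues' formula: rotation matrix from a rotation vector.
        th = np.linalg.norm(w)
        if th < 1e-12:
            return np.eye(3)
        k = w / th
        K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
        return np.eye(3) + np.sin(th) * K + (1.0 - np.cos(th)) * (K @ K)

    def preintegrate_rotation(gyro, dt, bias, R_wb0, lat_rad):
        # Compose gyro increments, compensating bias and the Earth rate
        # (expressed in an ENU world frame, rotated into the body frame).
        w_earth_w = EARTH_RATE * np.array([0.0, np.cos(lat_rad), np.sin(lat_rad)])
        dR, R_wb = np.eye(3), R_wb0
        for w_m in gyro:
            w = w_m - bias - R_wb.T @ w_earth_w
            dR = dR @ so3_exp(w * dt)
            R_wb = R_wb0 @ dR
        return dR

    # A stationary navigation-grade gyro measures the Earth rate itself;
    # after compensation the preintegrated rotation stays at the identity.
    lat = np.deg2rad(45.0)
    w_meas = EARTH_RATE * np.array([0.0, np.cos(lat), np.sin(lat)])
    print(preintegrate_rotation([w_meas] * 200, 0.005, np.zeros(3), np.eye(3), lat))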
|
|
15:15-15:20, Paper WeCT15.4 | |
Real-Time Metric-Semantic Mapping for Autonomous Navigation in Outdoor Environments (I) |
|
Jiao, Jianhao | University College London |
Geng, Ruoyu | Hong Kong University of Science and Technology |
Li, Yuanhang | The Hong Kong University of Science and Technology |
Xin, Ren | The Hong Kong University of Science and Technology |
Yang, Bowen | The Hong Kong University of Science and Technology, Robotics Institute |
Wu, Jin | University of Science and Technology Beijing |
Wang, Lujia | The Hong Kong University of Science and Technology (Guangzhou) |
Liu, Ming | Hong Kong University of Science and Technology (Guangzhou) |
Fan, Rui | Tongji University |
Kanoulas, Dimitrios | University College London |
Keywords: Autonomous Vehicle Navigation, Mapping, Semantic Scene Understanding
Abstract: The creation of a metric-semantic map, which encodes human-prior knowledge, represents a high-level abstraction of environments. However, constructing such a map poses challenges related to the fusion of multi-modal sensor data, the attainment of real-time mapping performance, and the preservation of structural and semantic information consistency. In this paper, we introduce an online metric-semantic mapping system that utilizes LiDAR-Visual-Inertial sensing to generate a global metric-semantic mesh map of large-scale outdoor environments. Leveraging GPU acceleration, our mapping process achieves exceptional speed, with frame processing taking less than 7 ms, regardless of scenario scale. Furthermore, we seamlessly integrate the resultant map into a real-world navigation system, enabling metric-semantic-based terrain assessment and autonomous point-to-point navigation within a campus environment. Through extensive experiments conducted on both publicly available and self-collected datasets comprising 24 sequences, we demonstrate the effectiveness of our mapping and navigation methodologies.
|
|
15:20-15:25, Paper WeCT15.5 | |
LIGO: A Tightly Coupled LiDAR-Inertial-GNSS Odometry Based on a Hierarchy Fusion Framework for Global Localization with Real-Time Mapping |
|
He, Dongjiao | The University of Hong Kong |
Li, Haotian | The University of Hong Kong |
Yin, Jie | Shanghai Jiao Tong University |
Keywords: Autonomous Vehicle Navigation, Sensor Fusion, Field Robots, Hierarchy framework
Abstract: This paper introduces a method for tightly fusing sensors with diverse characteristics to maximize their complementary properties, thereby surpassing the performance of individual components. Specifically, we propose a tightly coupled LiDAR-Inertial-GNSS Odometry (LIGO) system, which synthesizes the advantages of LiDAR, IMU, and GNSS. LIGO employs an innovative hierarchical fusion approach with both front-end and back-end components to achieve synergistic performance. The front-end of LIGO utilizes a tightly coupled, EKF-based LiDAR-Inertial system for high-bandwidth localization and real-time mapping within a local-world frame. The back-end tightly integrates the filtered compact LiDAR-Inertial factors from the front-end with GNSS observations in an extensive factor graph, making it more robust to outliers and noise in GNSS observations and producing optimized, globally referenced state estimates. These optimized results are then fed back to the front-end through the EKF to ensure a drift-free trajectory. Real-world experiments validate the effectiveness of LIGO, especially when applied to UAVs, demonstrating its resilience to signal losses and LiDAR degeneracy.
|
|
15:25-15:30, Paper WeCT15.6 | |
Boolean Subtraction for Proximity Queries with Applications to Path Planning Tasks with Collisions |
|
Li, Yi | Fraunhofer-Chalmers Research Centre |
Shellshear, Evan | FCC |
Bjorkenstam, Staffan | Fraunhofer-Chalmers Research Centre |
Bohlin, Robert | Fraunhofer-Chalmers Research Centre |
Carlson, Johan | Fraunhofer-Chalmers Research Centre |
Keywords: Collision Avoidance, Motion and Path Planning, Industrial Robots
Abstract: Path planners are widely used in many different fields to determine a sequence of valid configurations for an object, such as a robot arm, between a start configuration and a goal configuration, where a valid configuration must be a collision-free one. However, sometimes configurations that are in collision should be valid. For example, during spot welding, the two electrodes of a spot welding gun fuse metal sheets together by applying an electrical current to melt the metal at the point of contact and penetrate into the metal sheets. Even though the distance between the electrodes and the metal sheets is zero, the corresponding configuration of the spot welding gun should still be considered valid by the path planner. Another example: during grasp planning, a hand should be allowed to touch a table top while picking up an item on the table. To ensure non-zero distances at contact points, we propose in this paper a novel method to efficiently compute the distances between a given geometry (represented by points, line segments, or triangles) and the geometry that remains after Boolean-subtracting multiple convex bodies, placed at the contact points, from another geometry. Because the subtraction operations are performed during the query phase, the remaining geometry after subtraction is never explicitly constructed, creating a more flexible simulation environment that is easier to maintain and update during the development process.
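A toy, point-sampled version of the query-phase idea (our illustration, with spheres as the subtracted convex bodies): the subtracted geometry is never built; instead, points inside any subtracted ball are simply excluded during the distance query, so the distance at a contact point becomes strictly positive.

    import numpy as np

    def distance_after_subtraction(query, geom_pts, sub_centers, sub_radii):
        # Minimum distance from `query` to the geometry's points, excluding any
        # point inside a subtracted ball; the subtraction happens at query time,
        # so the modified geometry is never explicitly constructed.
        keep = np.ones(len(geom_pts), dtype=bool)
        for c, r in zip(sub_centers, sub_radii):
            keep &= np.linalg.norm(geom_pts - c, axis=1) > r
        if not keep.any():
            return np.inf
        return np.linalg.norm(geom_pts[keep] - query, axis=1).min()

    # A point-sampled metal sheet; subtracting a ball at the contact point keeps
    # the electrode-to-sheet distance strictly positive, as in the welding example.
    xs, ys = np.meshgrid(np.linspace(-1, 1, 41), np.linspace(-1, 1, 41))
    sheet = np.stack([xs.ravel(), ys.ravel(), np.zeros(xs.size)], axis=1)
    tip = np.array([0.0, 0.0, 0.0])
    print(distance_after_subtraction(tip, sheet, [tip], [0.1]))   # > 0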
|
|
15:30-15:35, Paper WeCT15.7 | |
Robust Second-Order LiDAR Bundle Adjustment Algorithm Using Mean Squared Group Metric (I) |
|
Ma, Tingchen | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences |
Ou, Yongsheng | Dalian University of Technology |
Xu, Sheng | Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences |
Keywords: SLAM, Vision-Based Navigation, Autonomous Vehicle Navigation
Abstract: The bundle adjustment (BA) algorithm is a widely used nonlinear optimization technique in simultaneous localization and mapping (SLAM) systems. By leveraging the co-view relationships of landmarks from multiple perspectives, the BA method constructs a joint estimation model for both poses and landmarks, enabling the system to generate refined maps and reduce front-end localization errors. However, designing a robust LiDAR BA estimator that achieves accurate solutions remains a challenge. In this work, we first propose a novel mean square group metric (MSGM) to build the optimization objective of the LiDAR BA algorithm. This metric applies a mean square transformation to uniformly process the measurements of plane landmarks over one sampling period. The transformed metric ensures scale interpretability and does not require a time-consuming point-by-point calculation. Second, by integrating a robust kernel function, the metrics involved in the BA algorithm are reweighted, thus enhancing the robustness of the solution process. Third, based on the proposed robust LiDAR BA model, we derive an explicit second-order estimator (RSO-BA). This estimator employs analytical formulas for Hessian and gradient calculations, ensuring the precision of the BA solution. Finally, we verify the merits of the proposed RSO-BA estimator against existing implicit second-order and explicit approximate second-order estimators using publicly available datasets and physical experiments. The experimental results demonstrate that the RSO-BA estimator outperforms its counterparts in terms of registration accuracy and robustness, particularly in dynamic or complex unstructured environments.
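The robust reweighting in the second step is the standard iteratively reweighted least squares device; the self-contained sketch below uses a Huber kernel on a toy 1-D fit (our example, not the paper's MSGM residuals) to show how a gross outlier is down-weighted in the normal equations.

    import numpy as np

    def huber_weight(r, delta=1.0):
        # IRLS weight of the Huber kernel: 1 inside delta, delta/|r| outside.
        a = np.abs(r)
        return np.where(a <= delta, 1.0, delta / np.maximum(a, 1e-12))

    def robust_gauss_newton(residual_fn, jac_fn, x0, iters=20, delta=1.0):
        # Gauss-Newton on robustly reweighted normal equations.
        x = x0.astype(float)
        for _ in range(iters):
            r, J = residual_fn(x), jac_fn(x)
            w = huber_weight(r, delta)
            H = J.T @ (w[:, None] * J)
            g = J.T @ (w * r)
            x = x - np.linalg.solve(H, g)
        return x

    # Toy 1-D fit with one gross outlier: the robust estimate stays near the
    # inliers (~1.17) instead of the plain least-squares mean (3.25).
    z = np.array([1.0, 1.1, 0.9, 10.0])
    res = lambda x: x - z
    jac = lambda x: np.ones((len(z), 1))
    print(robust_gauss_newton(res, jac, np.array([0.0]), delta=0.5))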
|
|
WeCT16 |
207 |
Prosthetics and Exoskeletons 3 |
Regular Session |
|
15:00-15:05, Paper WeCT16.1 | |
Thermal Characteristic Modeling and Compensation for the Improvement of Actuator Homeostasis (I) |
|
Youn, Jimin | KAIST |
Kim, Hyeongjun | Korea Advanced Institute of Science and Technology |
Shi, Kyeongsu | Angel Robotics |
Kong, Kyoungchul | Korea Advanced Institute of Science and Technology |
Keywords: Prosthetics and Exoskeletons, Actuation and Joint Mechanisms, Force Control
Abstract: The operation of electric actuators across a wide temperature spectrum poses a formidable challenge in maintaining actuator homeostasis—the ability to generate a consistent response for a given input. This challenge arises mainly due to torque constant variations resulting from changes in magnetic flux density with temperature fluctuations. This study introduces a novel method to predict and compensate for these variations by developing a thermal model for the actuator, which allows for real-time estimation of the temperature of the inaccessible rotating magnet for effective compensation. The research seeks to advance actuator homeostasis beyond conventional methods that rely solely on the temperature of static components, such as the stator or housing. The effectiveness of the proposed algorithm is verified through comparison with the conventional open-loop torque control algorithm. In addition, the stability of the closed-loop system, focusing on temperature convergence with the proposed algorithm, is analyzed. This methodology suggests a promising path for developing drive systems that maintain actuator homeostasis in diverse conditions, addressing the root causes of system variability.
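In spirit, the compensation couples a lumped thermal model of the rotor magnet with a temperature-dependent torque constant, commanding current as tau_des / kt(T_est). The sketch below uses illustrative parameters of ours (a first-order RC rotor model and a linear NdFeB-like flux coefficient), not values from the paper.

    import numpy as np

    # Illustrative parameters of ours (not from the paper).
    KT_0, T_0 = 0.1, 25.0         # torque constant [Nm/A] at reference temperature [C]
    ALPHA = -0.0011               # ~ -0.11%/K flux loss, typical of NdFeB magnets
    R_TH, C_TH = 2.0, 50.0        # rotor thermal resistance [K/W] and capacitance [J/K]

    def kt(T):
        # Temperature-dependent torque constant.
        return KT_0 * (1.0 + ALPHA * (T - T_0))

    T_mag, T_amb, dt, tau_des = 25.0, 25.0, 0.01, 1.0
    for _ in range(60000):                    # 10 minutes of constant-torque operation
        i_cmd = tau_des / kt(T_mag)           # compensated current command
        p_loss = 0.5 * i_cmd ** 2             # stand-in for losses heating the rotor [W]
        T_mag += dt * (p_loss - (T_mag - T_amb) / R_TH) / C_TH
    print(T_mag, kt(T_mag), tau_des / kt(T_mag))   # hotter magnet -> more current needed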
|
|
15:05-15:10, Paper WeCT16.2 | |
Plantar Flexion Muscle Force Estimation with a Soft Wearable Pneumatic Sensor System (I) |
|
Kim, Taeyeon | Korea Advanced Institute of Science and Technology |
Kong, Kyoungchul | Korea Advanced Institute of Science and Technology |
Keywords: Prosthetics and Exoskeletons, Human Performance Augmentation, Rehabilitation Robotics
Abstract: In numerous fields that analyze the human body from a biomechanical perspective, muscle force has only been estimated indirectly through multiple hierarchical levels, and each methodology has clear limitations in terms of accuracy. Accordingly, this study introduces a novel, noninvasive method for estimating muscle force, leveraging the phenomenon that transverse muscle stiffness increases under greater longitudinal tension. Using a soft pneumatic sensor system manufactured for the plantar flexors, a two-step model is developed that fully describes the interaction between the air cell and the muscle. Initially, equations incorporating geometric constraints and force equilibrium are derived to calculate the muscle's deformation depth from the measured pressure. The correlation between muscle fiber deformation and the transverse reaction force is then identified. Using the proposed model, pressure measurements from the sensor system are converted to force estimates. Experimental validation demonstrates its high estimation accuracy, supporting the effectiveness of the proposed model-based approach. This methodology shows promise for diverse fields that require noninvasive and accurate plantar flexor muscle force estimation.
|
|
15:10-15:15, Paper WeCT16.3 | |
Feature Matching-Based Gait Phase Prediction for Obstacle Crossing Control of Powered Transfemoral Prosthesis |
|
Zhang, Jiaxuan | Southern University of Science and Technology |
Leng, Yuquan | Harbin Institute of Technology (Shenzhen) |
Guo, Yixuan | Southern University of Science and Technology |
Fu, Chenglong | Southern University of Science and Technology (SUSTech) |
Keywords: Prosthetics and Exoskeletons, Optimization and Optimal Control, Deep Learning Methods
Abstract: For amputees with powered transfemoral prosthetics, navigating obstacles or complex terrain remains challenging. This study addresses this issue by using an inertial sensor on the sound ankle to guide obstacle-crossing movements. A genetic algorithm computes the optimal neural network structure to predict the required angles of the thigh and knee joints. A gait progression prediction algorithm determines the actuation angle index for the prosthetic knee motor, ultimately defining the necessary thigh and knee angles and gait progression. Results show that when the standard deviation of Gaussian noise added to the thigh angle data is less than 1, the method can effectively eliminate noise interference, achieving 100% accuracy in gait phase estimation under 150 Hz, with thigh and knee angle prediction errors of 8.71% and 6.78%, respectively. These findings demonstrate the method's ability to accurately predict gait progression and joint angles, offering significant practical value for obstacle negotiation in powered transfemoral prosthetics.
|
|
15:15-15:20, Paper WeCT16.4 | |
A Deep Reinforcement Learning Based End-To-End Control Framework for Lower Limb Exoskeletons with Smooth Movement Transitions |
|
Kim, Minsu | Seoul National University |
Baek, Woo-Jeong | Seoul National University |
Park, Jaeheung | Seoul National University |
Keywords: Prosthetics and Exoskeletons, Wearable Robotics, Rehabilitation Robotics
Abstract: This paper presents an active control strategy for lower limb exoskeletons by proposing an end-to-end framework employing deep reinforcement learning (DRL) to enable smooth transitions between different movement patterns. The majority of existing methods in the exoskeleton literature employ finite state machines (FSMs), which have proven successful in predicting the control strategy for the next state on the basis of sensor data such as IMU readings, forces, etc. However, one drawback of FSMs is their inflexibility regarding sudden changes. Specifically, FSMs rely on discrete state transitions, which makes it hard to manage smooth continuous movements and increases the chance of abrupt control changes during transitions. These, in turn, raise safety concerns for the user. While learning-based control approaches have been suggested in recent years, their validation has been performed in simulation environments; therefore, real-world applicability remains an open research question to date. To address this issue, we provide the first contribution in this field that proposes an end-to-end learning framework with a Deep Deterministic Policy Gradient (DDPG) module to enable smooth transitions between movement patterns under real-world conditions. By introducing several evaluation metrics, we demonstrate that our framework outperforms existing methods in terms of adaptability and smoothness in movement transitions.
|
|
15:20-15:25, Paper WeCT16.5 | |
The Anti-Misalignment Mechanism of Bionic Knee Joint of Lower Limb Exoskeleton Based on Spherical Cross Four-Bar |
|
He, Ye | Chongqing University |
Wu, Jiaxun | Chongqing University |
Chen, Tianchi | Chongqing University |
Liu, Zhi | Chongqing University |
Zhang, Hongyuan | Chongqing University |
Xia, Rufei | Chongqing University |
Keywords: Prosthetics and Exoskeletons, Biomimetics
Abstract: To minimize discomfort and injury risk in exoskeleton users, this paper addresses the misalignment between the device and the human knee joint. The knee's spatial motion complexity, characterized by multi-planar rotation axes as flexion angle changes, cannot be accurately replicated by existing single-axis or planar multi-center designs. A novel spherical cross four-bar linkage-based knee joint structure is proposed, leveraging its kinematic properties to mimic the knee's actual spatial motion. This design undergoes optimization calculations to determine the bionic knee's specific structure. A quantitative evaluation method using pneumatic sensor pads measures internal pressure distribution, comparing human-machine misalignment across different joint mechanisms. Experimental results demonstrate that the bionic knee significantly reduces unintended interaction forces, with maximum pressure values only one-third those of single-axis knee joints. This innovation addresses critical limitations in existing exoskeleton knee designs, enhancing comfort and safety during movement.
|
|
15:25-15:30, Paper WeCT16.6 | |
Transfer Learning for Walking Speed Estimation across Novel Prosthetic Devices and Populations |
|
Maldonado-Contreras, Jairo | Georgia Institute of Technology |
Johnson, Cole | Georgia Institute of Technology |
Knight, Ian | University of Pennsylvania |
Sawant, Aarnav | Georgia Institute of Technology |
Zhou, Sixu | Georgia Institute of Technology |
Kim, Hanjun | Georgia Institute of Technology |
Herrin, Kinsey | Georgia Institute of Technology |
Young, Aaron | Georgia Tech |
Keywords: Prosthetics and Exoskeletons, Transfer Learning, Machine Learning for Robot Control
Abstract: Accurate walking speed estimation in lower-limb prostheses is crucial for delivering biomechanically appropriate assistance across varying speeds. However, training robust models requires extensive domain-specific, user-dependent (DEP) data, which is impractical for every new prosthesis user. This study presents a transfer learning framework to simplify and enhance the training process. Convolutional neural networks were pre-trained on publicly available datasets from able-bodied (AB) individuals and transfemoral amputees using the Open Source Leg (OSL) knee-ankle prosthesis, then fine-tuned with data from a transfemoral amputee using the Power Knee (PK) prosthesis. The fine-tuned models, AB-PK and OSL-PK, were trained with varying data amounts and evaluated across constant and variable walking speed trials, with performance compared to DEP models trained from scratch on PK data. Training and testing were conducted on a per-subject basis, with performance averaged across subjects (N=7). The lowest post-fine-tuning error was observed in AB-PK, with RMSE values of 0.041 m/s for constant speeds, 0.072 m/s for variable speeds, and 0.088 m/s for novel speeds not included in the original training data. Significant error reductions were observed in both fine-tuned models compared to DEP when fewer than 30 gait cycles per speed of training data were available. Notably, AB datasets appeared highly viable for this application and may even outperform OSL datasets in transfer learning for walking speed estimation, perhaps due to the much larger original training dataset. This approach highlights the potential of transfer learning across different subject populations and devices, offering insights into the data needed to achieve state-of-the-art speed estimation.
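The pretrain-then-fine-tune recipe can be sketched generically; everything below (network shape, channel counts, the frozen trunk, the checkpoint name) is an assumption for illustration, not the paper's architecture.

    import torch
    import torch.nn as nn

    class SpeedNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.trunk = nn.Sequential(              # 1-D CNN over IMU windows
                nn.Conv1d(6, 32, 5), nn.ReLU(),
                nn.Conv1d(32, 32, 5), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1), nn.Flatten())
            self.head = nn.Linear(32, 1)             # walking-speed regression head

        def forward(self, x):
            return self.head(self.trunk(x))

    model = SpeedNet()
    # model.load_state_dict(torch.load("ab_pretrained.pt"))  # hypothetical AB checkpoint
    for p in model.trunk.parameters():               # freeze the pretrained trunk
        p.requires_grad = False

    # Fine-tune only the head on a small device-specific (PK) dataset.
    x_pk = torch.randn(64, 6, 200)                   # 64 gait cycles, 6 IMU channels (toy)
    y_pk = torch.rand(64, 1) * 1.5                   # speeds in m/s (toy labels)
    opt = torch.optim.Adam(model.head.parameters(), lr=1e-3)
    for _ in range(200):
        loss = nn.functional.mse_loss(model(x_pk), y_pk)
        opt.zero_grad(); loss.backward(); opt.step()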
|
|
15:30-15:35, Paper WeCT16.7 | |
NuExo: A Wearable Exoskeleton Covering All Upper Limb ROM for Outdoor Data Collection and Teleoperation of Humanoid Robots |
|
Zhong, Rui | National University of Defense Technology |
Cheng, Chuang | National University of Defense Technology |
Xu, Junpeng | National University of Defense Technology |
Wei, Yantong | National University of Defense Technology |
Guo, Ce | National University of Defense Technology |
Zhang, Daoxun | National University of Defense Technology |
Dai, Wei | National University of Defense Technology |
Lu, Huimin | National University of Defense Technology |
Keywords: Prosthetics and Exoskeletons, Telerobotics and Teleoperation, Humanoid Robot Systems
Abstract: The evolution from motion capture and teleoperation to robot skill learning has emerged as a hotspot and critical pathway for advancing embodied intelligence. However, existing systems still face a persistent gap in simultaneously achieving four objectives: accurate tracking of full upper limb movements over extended durations (Accuracy), ergonomic adaptation to human biomechanics (Comfort), versatile data collection (e.g., force data) and compatibility with humanoid robots (Versatility), and lightweight design for outdoor daily use (Convenience). We present a wearable exoskeleton system, incorporating user-friendly immersive teleoperation and multi-modal sensing collection to bridge this gap. Owing to a novel shoulder mechanism with a synchronized linkage and timing-belt transmission, the system adapts well to compound shoulder movements and covers 100% of the natural upper limb motion range. Weighing 5.2 kg, NuExo supports backpack-type use and can be conveniently applied in daily outdoor scenarios. Furthermore, we develop a unified intuitive teleoperation framework and a comprehensive data collection system integrating multi-modal sensing for various humanoid robots. Experiments across distinct humanoid platforms and different users validate our exoskeleton's superiority in motion range and flexibility, while confirming its stability in data collection and teleoperation accuracy in dynamic scenarios.
|
|
WeCT17 |
210A |
Intelligent Transportation Systems 3 |
Regular Session |
|
15:00-15:05, Paper WeCT17.1 | |
Diverse and Adaptive Behavior Curriculum for Autonomous Driving: A Student-Teacher Framework with Multi-Agent RL |
|
Abouelazm, Ahmed | FZI Forschungszentrum Informatik |
Ratz, Johannes | Karlsruher Institut Für Technologie |
Schörner, Philip | FZI Research Center for Information Technology |
Zöllner, Johann Marius | FZI Forschungszentrum Informatik |
Keywords: Intelligent Transportation Systems, Reinforcement Learning, Simulation and Animation
Abstract: Autonomous driving faces challenges in navigating complex real-world traffic, requiring safe handling of both common and critical scenarios. End-to-end systems, which unify perception, planning, and decision-making, offer a promising alternative to modular approaches. Reinforcement learning (RL), a prominent method in end-to-end driving, enables agents to learn through trial and error in simulations. However, RL training often relies on rule-based traffic scenarios, limiting generalization. Additionally, current methods for scenario generation focus heavily on critical scenarios, neglecting a balance with routine driving behaviors. Curriculum learning, which progressively trains agents on increasingly complex tasks, is a promising approach to improving the robustness and coverage of driving behaviors in RL training. However, existing research mainly emphasizes manually designed curricula, focusing on scenery and actor placement rather than traffic behavior dynamics. This work introduces a novel student-teacher framework for automatic curriculum learning. The teacher, a graph-based multi-agent RL component, dynamically generates adaptive traffic behaviors across diverse difficulty levels. An adaptive mechanism adjusts task difficulty based on student performance, ensuring exposure to behaviors ranging from common to critical behaviors. The student, though exchangeable, is realized as a deep RL model with partial observation of the environmental state, reflecting the limited observability in the real world. Results demonstrate the teacher’s ability to generate diverse traffic behaviors. The student, trained with adaptive curricula, outperformed agents trained on rule-based traffic, achieving higher rewards and demonstrating a balanced, assertive driving style.
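The adaptive mechanism reduces, at its core, to a controller on task difficulty driven by the student's recent success rate; the thresholds, step size, and window length below are hypothetical values of ours, not the paper's.

    from collections import deque

    class AdaptiveCurriculum:
        # Adjusts a difficulty level in [0, 1] toward a target success band.
        def __init__(self, lo=0.55, hi=0.75, step=0.05, window=100):
            self.lo, self.hi, self.step = lo, hi, step
            self.history = deque(maxlen=window)
            self.difficulty = 0.0

        def report(self, success):
            self.history.append(bool(success))
            if len(self.history) == self.history.maxlen:
                rate = sum(self.history) / len(self.history)
                if rate > self.hi:                   # too easy: harder traffic behaviors
                    self.difficulty = min(1.0, self.difficulty + self.step)
                    self.history.clear()
                elif rate < self.lo:                 # too hard: back off
                    self.difficulty = max(0.0, self.difficulty - self.step)
                    self.history.clear()
            return self.difficulty

    cur = AdaptiveCurriculum()
    for episode in range(1000):
        success = episode % 10 != 0                  # toy student with 90% success
        level = cur.report(success)                  # teacher samples behaviors at `level`
    print(cur.difficulty)                            # has ratcheted upward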
|
|
15:05-15:10, Paper WeCT17.2 | |
Quantifying and Modeling Driving Style in Trajectory Forecasting |
|
Zheng, Laura | University of Maryland, College Park |
Yaghoubi Araghi, Hamidreza | University of Maryland, College Park |
Wu, Tony | University of Maryland, College Park |
Thalapanane, Sandeep | University of Maryland, College Park |
Zhou, Tianyi | University of Maryland, College Park |
Lin, Ming C. | University of Maryland at College Park |
Keywords: Intelligent Transportation Systems, Autonomous Vehicle Navigation
Abstract: Trajectory forecasting has become a popular deep learning task due to its relevance for scenario simulation for autonomous driving. Specifically, trajectory forecasting predicts the trajectory of a short-horizon future for specific human drivers in a particular traffic scenario. Robust and accurate future predictions can enable autonomous driving planners to optimize for low-risk and predictable outcomes for human drivers around them. Although some work has been done to model driving style in planning and personalized autonomous policies, a gap exists in explicitly modeling human driving styles for trajectory forecasting of human behavior. Human driving style almost certainly correlates with decision making, especially in edge-case scenarios where risk is nontrivial, as justified by the large amount of traffic psychology literature on risky driving. So far, the current real-world datasets for trajectory forecasting lack insight into the variety of represented driving styles. While the datasets may represent real-world distributions of driving styles, we posit that fringe driving style types may also be correlated with edge-case safety scenarios. In this work, we conduct analyses on existing real-world trajectory datasets for driving and dissect these works from the lens of driving styles, which are often intangible and non-standardized.
|
|
15:10-15:15, Paper WeCT17.3 | |
JAM: Keypoint-Guided Joint Prediction after Classification-Aware Marginal Proposal for Multi-Agent Interaction |
|
Lin, Fangze | Shenzhen University |
He, Ying | Shenzhen University |
Yu, Fei | Guangming Lab |
Zhang, Hong | SUSTech |
Keywords: Intelligent Transportation Systems, Long term Interaction, Deep Learning Methods
Abstract: Predicting the future motion of road participants is a critical task in autonomous driving. In this work, we address the challenge of low-quality generation of low-probability modes in multi-agent joint prediction. To tackle this issue, we propose a two-stage multi-agent interactive prediction framework named keypoint-guided joint prediction after classification-aware marginal proposal (JAM). The first stage is modeled as a marginal prediction process, which classifies queries by trajectory type to encourage the model to learn all categories of trajectories, providing comprehensive mode information for the joint prediction module. The second stage is modeled as a joint prediction process, which takes the scene context and the marginal proposals from the first stage as inputs to learn the final joint distribution. We explicitly introduce key waypoints to guide the joint prediction module in better capturing and leveraging the critical information from the initial predicted trajectories. We conduct extensive experiments on the real-world Waymo Open Motion Dataset interactive prediction benchmark. The results show that our approach achieves competitive performance. In particular, in the framework comparison experiments, the proposed JAM outperforms other prediction frameworks and achieves state-of-the-art performance in interactive trajectory prediction. The code is available at https://github.com/LinFunster/JAM to facilitate future research.
|
|
15:15-15:20, Paper WeCT17.4 | |
DualAD: Dual-Layer Planning for Reasoning in Autonomous Driving |
|
Wang, Dingrui | Technical University of Munich |
Kaufeld, Marc | Technical University of Munich |
Betz, Johannes | Technical University of Munich |
Keywords: Intelligent Transportation Systems, Motion and Path Planning
Abstract: We present a novel autonomous driving framework, DualAD, designed to imitate human reasoning during driving. DualAD comprises two layers: a rule-based motion planner at the bottom layer that handles routine driving tasks requiring minimal reasoning, and an upper layer featuring a rule-based text encoder that converts driving scenarios from absolute states into text description. This text is then processed by a large language model (LLM) to make driving decisions. The upper layer intervenes in the bottom layer's decisions when potential danger is detected, mimicking human reasoning in critical situations. Closed-loop experiments demonstrate that DualAD, using a zero-shot pre-trained model, significantly outperforms both rule-based and learning-based motion planners when interacting with reactive agents. Our experiments also highlight the effectiveness of the text encoder, which considerably enhances the model's scenario understanding. Additionally, the integrated DualAD model improves with stronger LLMs, indicating the framework's potential for further enhancement. Code and benchmarks are available at https://github.com/TUM-AVS/DualAD.
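The dual-layer arbitration can be sketched as follows; the danger heuristic, the text template, and especially query_llm are placeholders we invented for illustration (a real system would call an actual LLM and parse its reply), not the paper's implementation.

    def rule_based_planner(state):
        # Bottom layer: routine driving (keep lane, track a cruise speed).
        return {"steer": 0.0, "accel": 0.5 if state["ego_speed"] < 15.0 else 0.0}

    def encode_to_text(state):
        # Rule-based text encoder: absolute states -> natural-language description.
        return (f"Ego at {state['ego_speed']:.1f} m/s; lead vehicle "
                f"{state['gap']:.1f} m ahead, closing at {state['closing']:.1f} m/s.")

    def query_llm(prompt):
        # Placeholder: a real system would send `prompt` to an LLM and parse
        # its driving decision; here we return a canned "brake" action.
        return {"steer": 0.0, "accel": -3.0}

    def dual_ad_step(state):
        action = rule_based_planner(state)
        ttc = state["gap"] / max(state["closing"], 1e-3)   # danger heuristic
        if ttc < 3.0:                                      # upper layer intervenes
            action = query_llm(encode_to_text(state))
        return action

    print(dual_ad_step({"ego_speed": 14.0, "gap": 20.0, "closing": 8.0}))  # -> brake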
|
|
15:20-15:25, Paper WeCT17.5 | |
Learning through Retrospection: Improving Trajectory Prediction for Automated Driving with Error Feedback |
|
Hagedorn, Steffen | Universitaet Zu Lübeck, Robert Bosch GmbH |
Distelzweig, Aron | Albert-Ludwigs-Universität Freiburg |
Hallgarten, Marcel | University of Tübingen, Robert Bosch GmbH |
Condurache, Alexandru Paul | University of Luebeck, Institute for Signal Processing |
Keywords: Intelligent Transportation Systems, Deep Learning Methods, AI-Based Methods
Abstract: In automated driving, predicting trajectories of surrounding vehicles supports reasoning about scene dynamics and enables safe planning for the ego vehicle. However, existing models handle predictions as an instantaneous task of forecasting future trajectories based on observed information. As time proceeds, the next prediction is made independently of the previous one, which means that the model cannot correct its errors during inference and will repeat them. To alleviate this problem and better leverage temporal data, we propose a novel retrospection technique. Through training on closed-loop rollouts, the model learns to use aggregated feedback. Given new observations, it reflects on previous predictions and analyzes its errors to improve the quality of subsequent predictions. Thus, the model can learn to correct systematic errors during inference. Comprehensive experiments on nuScenes and Argoverse demonstrate a considerable decrease in minimum Average Displacement Error of up to 31.9% compared to the state-of-the-art baseline without retrospection. We further showcase the robustness of our technique by demonstrating a better handling of out-of-distribution scenarios with undetected road-users.
|
|
15:25-15:30, Paper WeCT17.6 | |
MADI: Malicious Agent Detection and Isolation in Mixed Autonomy Traffic Systems |
|
Hao, Wei | Nanjing University |
Liu, Huaping | Tsinghua University |
Li, Wenjie | Nanjing University |
Gou, Chang | Nanjing University |
Chen, Lijun | Nanjing University |
Keywords: Intelligent Transportation Systems, Safety in HRI, Human-Robot Teaming
Abstract: Mixed autonomy traffic systems face significant security challenges when malicious agents disrupt coordination between autonomous and human-driven vehicles. We present Malicious Agent Detection and Isolation (MADI), a framework addressing two critical forms of disruptive behavior: path order violations at coordination points and strategic congestion generation. MADI integrates dual-mechanism detection with temporal consistency analysis to identify sophisticated malicious behaviors while filtering transient anomalies that could trigger false positives. Upon detection, our framework employs adaptive isolation strategies including enlarged safety boundaries and dynamic priority adjustment. Extensive experiments in simulated highway and urban environments demonstrate that MADI achieves up to 91% detection accuracy with only 4% false positives, significantly outperforming rule-based, anomaly-based, and single-criterion methods. The framework reduces travel time impacts by 25.5% and near-collision events by 76.5% in adversarial conditions, demonstrating its effectiveness for enhancing safety and efficiency in mixed autonomy traffic.
|
|
15:30-15:35, Paper WeCT17.7 | |
CMP: Cooperative Motion Prediction with Multi-Agent Communication |
|
Wang, Zehao | University of California, Riverside |
Wang, Yuping | University of Michigan |
Wu, Zhuoyuan | Meituan |
Ma, Hengbo | UC Berkeley, Alumni |
Li, Zhaowei | North Carolina State University |
Qiu, Hang | University of California, Riverside |
Li, Jiachen | University of California, Riverside |
Keywords: Intelligent Transportation Systems, Cooperating Robots
Abstract: The confluence of the advancement of Autonomous Vehicles (AVs) and the maturity of Vehicle-to-Everything (V2X) communication has enabled the capability of cooperative connected and automated vehicles (CAVs). Building on top of cooperative perception, this paper explores the feasibility and effectiveness of cooperative motion prediction. Our method, CMP, takes LiDAR signals as model input to enhance tracking and prediction capabilities. Unlike previous work that focuses separately on either cooperative perception or motion prediction, our framework, to the best of our knowledge, is the first to address the unified problem where CAVs share information in both perception and prediction modules. Incorporated into our design is the unique capability to tolerate realistic V2X bandwidth limitations and transmission delays, while dealing with bulky perception representations. We also propose a prediction aggregation module, which unifies the predictions obtained by different CAVs and generates the final prediction. Through extensive experiments and ablation studies on the OPV2V and V2V4Real datasets, we demonstrate the effectiveness of our method in cooperative perception, tracking, and motion prediction. In particular, CMP reduces the average prediction error by 16.4% with fewer missing detections compared with the no cooperation setting and by 12.3% compared with the strongest baseline. Our work marks a significant step forward in the cooperative capabilities of CAVs, showcasing enhanced performance in complex scenarios. The code can be found on the project website: https://cmp-cooperative-prediction.github.io/.
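As a rough picture of a prediction aggregation module, the sketch below fuses per-CAV trajectory predictions for one target with a confidence-weighted average. CMP's actual module is learned, so this is only an assumed stand-in for the concept.

```python
import numpy as np

def aggregate_predictions(trajs: np.ndarray, confidences: np.ndarray) -> np.ndarray:
    """Fuse per-CAV trajectory predictions for one target agent.

    trajs:       (num_cavs, horizon, 2) predicted (x, y) waypoints
    confidences: (num_cavs,) nonnegative scores
    A confidence-weighted mean is an illustrative assumption.
    """
    w = confidences / confidences.sum()       # normalize weights
    return np.einsum("c,chd->hd", w, trajs)   # weighted average trajectory
```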
|
|
WeCT18 |
210B |
Multi-Modular Robot Systems 2 |
Regular Session |
Chair: Kong, He | Southern University of Science and Technology |
|
15:00-15:05, Paper WeCT18.1 | |
MARS-FTCP: Robust Fault-Tolerant Control and Agile Trajectory Planning for Modular Aerial Robot Systems |
|
Huang, Rui | National University of Singapore |
Zhang, Zhenyu | National University of Singapore |
Tang, Siyu | National University of Singapore |
Cai, Zhiqian | National University of Singapore |
Zhao, Lin | National University of Singapore |
Keywords: Cellular and Modular Robots, Failure Detection and Recovery, Collision Avoidance
Abstract: Modular Aerial Robot Systems (MARS) consist of multiple drone units that can self-reconfigure to adapt to various mission requirements and fault conditions. However, existing fault-tolerant control methods exhibit significant oscillations during docking and separation, impacting system stability. To address this issue, we propose a novel fault-tolerant control reallocation method that adapts to an arbitrary number of modular robots and their assembly formations. The algorithm redistributes the expected collective force and torque required for MARS to individual units according to their moment arms relative to the center of mass of MARS. Furthermore, we propose an agile trajectory planning method for MARS of arbitrary configurations, which is collision-avoiding and dynamically feasible. Our work represents the first comprehensive approach to enable fault-tolerant, collision-free flight for MARS. We validate our method through extensive simulations, demonstrating improved fault tolerance, enhanced trajectory tracking accuracy, and greater robustness in cluttered environments. The videos and source code of this work are available at https://github.com/RuiHuangNUS/MARS-FTCP/
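To illustrate the reallocation idea, the planar-thrust sketch below distributes a desired collective thrust and roll/pitch torque to units positioned at moment arms r_i from the assembly's center of mass, solving the stacked linear map with a pseudoinverse. The pseudoinverse solution and all names are assumptions; the paper's actual law may differ.

```python
import numpy as np

def reallocate_wrench(unit_positions: np.ndarray, f_des: float,
                      tau_des: np.ndarray) -> np.ndarray:
    """Redistribute a desired collective thrust/torque to individual units.

    Each unit i at position (x_i, y_i) relative to the center of mass
    produces a vertical thrust u_i, so torque = r_i x (u_i * e_z) gives
      tau_x = sum y_i u_i,  tau_y = -sum x_i u_i,  f_z = sum u_i.
    """
    n = unit_positions.shape[0]
    B = np.vstack([
        np.ones(n),               # total thrust row
        unit_positions[:, 1],     # tau_x row
        -unit_positions[:, 0],    # tau_y row
    ])
    w = np.array([f_des, tau_des[0], tau_des[1]])
    return np.linalg.pinv(B) @ w  # per-unit thrust commands

if __name__ == "__main__":
    r = np.array([[0.3, 0.0], [-0.3, 0.0], [0.0, 0.3], [0.0, -0.3]])
    print(reallocate_wrench(r, f_des=20.0, tau_des=np.array([0.5, 0.0])))
```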
|
|
15:05-15:10, Paper WeCT18.2 | |
ZBOT: A Novel Modular Robot Capable of Active Transformation from Snake to Bipedal Configuration through RL |
|
Zhou, Nanlin | Harbin Institute of Technology |
Zhao, Sikai | Harbin Institute of Technology |
Luo, Hang | Harbin Institute of Technology |
Han, Kai | Harbin Institute of Technology |
Yang, Zhiyuan | Harbin Institute of Technology |
Qi, Jian | Harbin Institute of Technology |
Zhao, Ning | Harbin Institute of Technology |
Zhao, Jie | Harbin Institute of Technology |
Zhu, Yanhe | Harbin Institute of Technology |
Keywords: Cellular and Modular Robots, Reinforcement Learning, Biologically-Inspired Robots
Abstract: In recent years, significant progress has been made in the prototype design and control methodologies of modular snake robots. However, there is still relatively little research on the potential enabled by active morphological transformation. This paper presents a novel modular snake robot capable of morphing into a bipedal configuration. The robot, ZBOT, is composed of several independent, homogeneous unit modules (named ZBot) connected in series. Each ZBot module has a dual-motor-driven 1-DoF rotational joint, which can rotate continuously, provides a large output torque, and eliminates backlash. There are four connection orientations between adjacent modules. This paper proposes an articulation configuration that enables the snake robot to actively transform from a snake form to a bipedal form. Meanwhile, through reinforcement learning (RL), movements including the stand-up gait are trained and verified in the IsaacSim/Lab simulation environment. This research advances snake robots beyond surface-dependent locomotion, endowing them with greater potential for versatile applications.
|
|
15:10-15:15, Paper WeCT18.3 | |
Observability-Driven Assignment of Heterogeneous Sensors for Multi-Target Tracking |
|
Rakhshan, Seyed Ali | Southern University of Science and Technology |
Golestani, Mehdi | Iran University of Science and Technology |
Kong, He | Southern University of Science and Technology |
Keywords: Sensor Networks, Multi-Robot Systems, Optimization and Optimal Control
Abstract: This paper addresses the challenge of assigning heterogeneous sensors (i.e., robots with varying sensing capabilities) for multi-target tracking. We classify robots into two categories: (1) sufficient sensing robots, equipped with range and bearing sensors, capable of independently tracking targets, and (2) limited sensing robots, which are equipped with only range or bearing sensors and need to at least form a pair to collaboratively track a target. Our objective is to optimize tracking quality by minimizing uncertainty in target state estimation through efficient robot-to-target assignment. By leveraging matroid theory, we propose a greedy assignment algorithm that dynamically allocates robots to targets to maximize tracking quality. The algorithm guarantees constant-factor approximation bounds of 1/3 for arbitrary tracking quality functions and 1/2 for submodular functions, while maintaining polynomial-time complexity. Extensive simulations demonstrate the algorithm’s effectiveness in accurately estimating and tracking targets over extended periods. Furthermore, numerical results confirm that the algorithm's performance is close to that of the optimal assignment, highlighting its robustness and practical applicability.
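The greedy allocation can be sketched as repeated marginal-gain selection over feasible robot-target pairs. Note that the 1/3 and 1/2 guarantees quoted in the abstract apply to the paper's matroid-constrained formulation, not to this simplified stand-in, and `gain` is an assumed user-supplied tracking-quality function.

```python
def greedy_assignment(robots: list, targets: list, gain) -> dict:
    """Greedy robot-to-target assignment by marginal tracking-quality gain.

    gain(assignment, robot, target) returns the improvement in tracking
    quality if `robot` is added to `target` given the current partial
    assignment; its definition (e.g., an observability measure) is assumed.
    """
    assignment = {}            # robot -> target
    unassigned = set(robots)
    while unassigned:
        r, t, g = max(
            ((r, t, gain(assignment, r, t))
             for r in unassigned for t in targets),
            key=lambda x: x[2],
        )
        if g <= 0:
            break              # no remaining pair improves tracking quality
        assignment[r] = t
        unassigned.remove(r)
    return assignment
```

Pairing constraints for limited-sensing robots (range-only or bearing-only units that must team up) would enter through the feasibility of candidate pairs, which is exactly what the matroid structure encodes in the paper.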
|
|
15:15-15:20, Paper WeCT18.4 | |
An Online Optimization-Based Trajectory Planning Approach for Cooperative Landing Tasks |
|
Chen, Jingshan | University of Stuttgart |
Xu, Lihan | University of Stuttgart |
Ebel, Henrik | LUT University |
Eberhard, Peter | Institute of Engineering and Computational Mechanics, University of Stuttgart |
Keywords: Task and Motion Planning, Optimization and Optimal Control, Multi-Robot Systems
Abstract: This paper presents a real-time trajectory planning scheme for a heterogeneous multi-robot system (consisting of a quadrotor and a ground mobile robot) performing a cooperative landing task, where the landing position, landing time, and coordination between the robots are determined autonomously under consideration of feasibility and user specifications. The proposed framework leverages the potential of the complementarity constraint as a decision-maker and an indicator for diverse cooperative tasks, and extends it to the collaborative landing scenario. In a potential application of the proposed methodology, a ground mobile robot may serve as a mobile charging station and coordinate in real time with a quadrotor to be charged, facilitating a safe and efficient rendezvous and landing. We verified the generated trajectories in simulation and real-world applications, demonstrating the real-time capabilities of the proposed landing planning framework.
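For readers unfamiliar with complementarity constraints acting as decision-makers, a generic discrete-time form is sketched below; the variable names and the choice of complementarity pair are illustrative, not necessarily the paper's exact formulation.

```latex
% Either the relative height h(x_k) is positive (flight phase) or the
% contact indicator \lambda_k is positive (landed), never both, so the
% optimizer itself decides when and where touchdown occurs.
\begin{aligned}
  \min_{x,\,u,\,\lambda}\ & \sum_{k=0}^{N-1} \ell(x_k, u_k)\\
  \text{s.t.}\ & x_{k+1} = f(x_k, u_k),\\
  & 0 \le h(x_k) \ \perp\ \lambda_k \ge 0, \qquad k = 0,\dots,N-1.
\end{aligned}
```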
|
|
15:20-15:25, Paper WeCT18.5 | |
Bidirectional Task-Motion Planning Based on Hierarchical Reinforcement Learning for Strategic Confrontation |
|
Wu, Qizhen | Beihang University |
Chen, Lei | Beijing Institute of Technology |
Liu, Kexin | Beihang University |
Lv, Jinhu | Beihang University |
Keywords: Task and Motion Planning, Reinforcement Learning, Multi-Robot Systems
Abstract: In swarm robotics, confrontation scenarios such as strategic confrontations require efficient decision-making that integrates discrete commands and continuous actions. Traditional task and motion planning methods separate decision-making into two layers, but their unidirectional structure fails to capture the interdependence between these layers, limiting adaptability in dynamic environments. Here, we propose a novel bidirectional approach based on hierarchical reinforcement learning, enabling dynamic interaction between the layers. This method effectively maps commands to task allocation and actions to path planning, while leveraging cross-training techniques to enhance learning across the hierarchical framework. Furthermore, we introduce a trajectory prediction model that bridges abstract task representations with actionable planning goals. In our experiments, the method achieves a confrontation win rate of over 80% and a decision time of under 0.01 seconds, outperforming existing approaches. Demonstrations through large-scale tests and real-world robot experiments further emphasize the generalization capabilities and practical applicability of our method.
|
|
15:25-15:30, Paper WeCT18.6 | |
Distributed Autonomous Safe Flight Planning for Multiple UAVs in Unknown Environments |
|
Yang, Fan | Hangzhou Dianzi University |
Lu, Qiang | Hangzhou Dianzi University |
Lin, Jianxiao | Hangzhou Dianzi University |
Zhang, Botao | Hangzhou Dianzi University |
Choi, Youngjin | Hanyang University |
Keywords: Task and Motion Planning, Distributed Robot Systems, Motion and Path Planning
Abstract: In this paper, two technologies are proposed to address the problem of flight safety for multiple unmanned aerial vehicles (UAVs) in unknown environments. The first optimizes the front-end path generated by traditional path planning methods so as to better match the dynamics of the UAVs and obtain their back-end movement trajectories. The second introduces a collision detection adjustment region such that collision avoidance can be realized for multiple UAVs by dynamically replanning each UAV's trajectory under local neighborhood communication. Finally, simulation results and real-world experimental results verify the effectiveness of the proposed technologies for the flight safety of multiple UAVs in unknown environments.
|
|
15:30-15:35, Paper WeCT18.7 | |
Predefined-Time Formation Control of NMSVs with External Disturbance Via Vector Control Lyapunov Functions-Based Method (I) |
|
Chi, Ming | Huazhong University of Science and Technology |
Zhang, Wen-Tao | Huazhong University of Science and Technology |
Liu, Zhi-Wei | Huazhong University of Science and Technology |
Xu, Jing-Zhe | Huazhong University of Science and Technology |
Yan, Huaicheng | East China University of Science and Technology |
Ge, Ming-Feng | China University of Geosciences (Wuhan) |
Keywords: Distributed Robot Systems, Multi-Robot Systems, Networked Robots
Abstract: This article investigates the problem of predefined-time distributed formation tracking for networked marine surface vehicles in the presence of external disturbances. To address this problem, we propose a novel hierarchical predefined-time formation control (HPTFC) framework that integrates a novel sliding mode surface with the vector control Lyapunov functions (VCLFs) method. Specifically, a local control layer based on VCLFs is established, which circumvents the usual requirement for positive-definite Lyapunov functions and necessitates only positive semidefinite components. This enables the search for more appropriate control algorithms in a broader solution space generated by more suitable Lyapunov functions under more general conditions. Through comprehensive theoretical analysis, we demonstrate that the proposed HPTFC framework successfully achieves predefined-time convergence. Finally, numerical simulations are presented to illustrate the effectiveness and superiority of the proposed HPTFC scheme.
|
|
15:35-15:40, Paper WeCT18.8 | |
EmbodiedAgent: A Scalable Hierarchical Approach to Overcome Practical Challenge in Multi-Robot Control |
|
Wan, Hanwen | The Chinese University of Hong Kong, Shenzhen |
Chen, Yifei | The Chinese University of Hong Kong, Shenzhen |
Deng, Yixuan | The Chinese University of Hong Kong, Shenzhen |
Wei, Zeyu | The School of Computer Science, the University of Sydney |
Li, Dongrui | The Hong Kong Polytechnic University |
Lin, Zexin | School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen |
Wu, Donghao | The Chinese University of Hong Kong, Shenzhen |
Cheng, Jiu | The Chinese University of Hong Kong, Shenzhen |
Ji, Xiaoqiang | The Chinese University of Hong Kong, Shenzhen |
Keywords: Task Planning, Agent-Based Systems, Multi-Robot Systems
Abstract: This paper introduces EmbodiedAgent, a hierarchical framework for heterogeneous multi-robot control. EmbodiedAgent addresses the critical limitation of hallucination on impractical tasks. Our approach integrates a next-action prediction paradigm with a structured memory system to decompose tasks into executable robot skills while dynamically validating actions against environmental constraints. We present MultiPlan+, a dataset of more than 18,000 annotated planning instances spanning 100 scenarios, including a subset of impractical cases to mitigate hallucination. To evaluate performance, we propose the Robot Planning Assessment Schema (RPAS), combining automated metrics with LLM-aided expert grading. Experiments demonstrate EmbodiedAgent's superiority over state-of-the-art models, achieving a 71.85% RPAS score. Real-world validation in an office service task highlights its ability to coordinate heterogeneous robots for long-horizon objectives.
|
|
WeCT19 |
210C |
Biologically-Inspired Robots 3 |
Regular Session |
Chair: Fan, Xinjian | Soochow University |
|
15:00-15:05, Paper WeCT19.1 | |
A Variable Stiffness Fin for Manta Ray-Inspired Robots with Two Motion Modals |
|
Xiang, Yang | Huazhong University of Science and Technology |
Gu, Le | Huazhong University of Science and Technology |
Ye, Kangjie | Huazhong University of Science and Technology |
Zhang, Zhenwei | Huazhong University of Science and Technology |
Gong, Zeyu | Huazhong University of Science and Technology |
Tao, Bo | Huazhong University of Science and Technology |
Keywords: Biologically-Inspired Robots, Mechanism Design
Abstract: Manta ray-inspired robots exhibit broad application prospects. In nature, manta rays possess pectoral fins of different stiffness to survive in specific environments. Some rays have stiffer fins that enable an oscillatory motion, producing large propulsion suited to a pelagic lifestyle, while others have softer fins that produce a wave-like undulation, enhancing motion efficiency for survival in benthic environments. Manta ray-inspired robots face the need to switch between maneuverability and endurance in complex environments and diverse tasks. However, existing pectoral fins in manta ray-inspired robots typically possess a simple structure with a single stiffness and motion mode. In this article, we design a variable stiffness fin capable of rigid swinging and soft bending modes, mimicking the oscillation and undulation observed in manta rays. The fin's moment of inertia and motion constraints can be altered through the rotation of the fin core, enabling it to switch between the two modes. Based on the bending principles of biological fins, a deflection enhancement for the soft mode is proposed to concentrate the deflection at the fin tip. The shapes of the two modes are modeled via geometric constraints to theoretically compare their differences. Dynamics tests and swimming validation demonstrate that the variable stiffness fin, with its two motion modes, enables switching between large propulsion and high efficiency. These capabilities help adjust the maneuverability and endurance of manta ray-inspired robots for complex environments.
|
|
15:05-15:10, Paper WeCT19.2 | |
Bio-Inspired Soft Variable-Stiffness Prehensile Tail Enabling Versatile Grasping and Enhancing Dynamic Mobility |
|
An, Jiajun | The Chinese University of Hong Kong |
Zhang, Huayu | Southeast University |
Wang, Shengzhi | The Chinese University of Hong Kong |
Li, Zelin | Purdue University |
Lin, Han | Purdue University |
Zeng, Zihan Oliver | Purdue University |
Wen, Qing | Purdue University |
Gan, Xingming | Mangdang Technology Co., Limited |
Gan, Dongming | Purdue University |
Kaur, Upinder | Purdue University |
Ma, Xin | The Chinese University of Hong Kong |
Keywords: Biologically-Inspired Robots, Mechanism Design, Soft Robot Applications
Abstract: In nature, prehensile tails serve as versatile and essential appendages for animals, facilitating both grasping and enhanced mobility. Although existing robotic tails effectively contribute to mobility across a range of behaviors, they lack versatile object-grasping capabilities. Inspired by these biological capabilities, a soft robotic prehensile tail is presented that uniquely integrates object manipulation with dynamic mobility enhancement for quadrupedal robots. This robotic tail offers a threefold stiffness variation and achieves large-angle bending (~636°) at the tail tip, thereby enabling secure and adaptable grasping. By adjusting its stiffness, the tail can conform to various shapes in a soft state and lift objects of different weights in a stiff state, demonstrating versatile grasping. The stiffened tail reliably supports the robot’s body load (e.g., when hanging on a rod) and facilitates rapid, precise dynamic adjustments. Moreover, a novel synergy is revealed whereby grasped objects increase the tail’s inertial effects, thereby enhancing the robot’s dynamic capabilities during rapid maneuvers—a unique feature that transforms manipulation tasks into mobility advantages.
|
|
15:10-15:15, Paper WeCT19.3 | |
TurBot: A Turtle-Inspired Quadruped Robot Using Topology Optimized Soft-Rigid Hybrid Legs (I) |
|
Sun, Yilun | Technical University of Munich |
Pancheri, Felix | Technical University of Munich |
Rehekampff, Christoph | Technische Universität München |
Lueth, Tim C. | Technical University of Munich |
Keywords: Biologically-Inspired Robots, Legged Robots, Soft Robot Materials and Design
Abstract: Quadruped robots are used for a wide variety of transportation and exploration tasks due to their high dexterity. Currently, many studies replace rigid-link-based legs with soft robotic legs, with the aim of improving quadruped robots' adaptability to complex environments. However, conventional soft legs still face the challenge of limited load-bearing capacity. To cope with this issue, we propose in this work a type of soft-rigid hybrid leg, which is synthesized using a multistage topology optimization method. A simplified model is also created to describe the kinematics of the synthesized soft leg. Using the realized legs, we have developed a turtle-inspired quadruped robot called TurBot. By mimicking the walking pattern of a turtle, two motion gaits (straight-line walking and turning) are designed to realize the robotic locomotion. Experiments are also conducted to evaluate the walking performance of TurBot. Results show that the realized robot can achieve stable straight-line walking and turning motions. In addition, TurBot can carry up to 500 g of extra weight while walking, which is 126% of its own body weight. Moreover, different locomotion tests have also successfully verified TurBot's ability to adapt to complex environments.
|
|
15:15-15:20, Paper WeCT19.4 | |
Application of Bionic Gait Control Method in Soft Robotic Fish (I) |
|
Liu, Sijia | Harbin Engineering University |
Hou, Yuedi | Harbin Engineering University |
Liu, Chunbao | Jilin University |
Keywords: Biologically-Inspired Robots, Soft Sensors and Actuators, Soft Robot Applications
Abstract: In this study, a hydraulic autonomous soft robotic tuna (HasorTuna) was developed from the perspective of bridging technology and physiology. HasorTuna possesses a muscle-inspired hydraulic soft actuator. A double-cylinder plunger pump driven by servos supplies alternating pressure to the driving units on both sides of the soft actuator. Furthermore, a Lagrangian dynamic robotic fish model with a continuum fishtail and a passive flexible joint was developed to explore propulsive performance, and it was validated by extensive experiments and simulations. Prompted by the driving differentiation characteristics of fish muscle, we performed experiments to study how the activation duty cycle of the driving units affects the swimming performance. The results show that low duty cycles are more suitable at low tailbeat frequencies, whereas high duty cycles gradually become advantageous as the frequency increases. This finding is not only consistent with the known habits of fish but can also be used to decrease the cost of transport of robotic fish. HasorTuna demonstrated a speed of 0.84 body lengths per second and a cost of transport of 11 J kg⁻¹ m⁻¹. Finally, the three-dimensional swimming capability of HasorTuna was verified via turning and diving–floating tests. Results from this study provide a valuable reference for bioinspired research on aquatic mechanical design and locomotion control.
|
|
15:20-15:25, Paper WeCT19.5 | |
Design and Modeling of an Integral Molding Flexible Tail for Robotic Fish (I) |
|
Tong, Ru | Institute of Automation, Chinese Academy of Sciences |
Wu, Zhengxing | Institute of Automation, Chinese Academy of Sciences |
Li, Sijie | Institute of Automation, Chinese Academy of Sciences |
Chen, Di | Institute of Automation, Chinese Academy of Sciences |
Wang, Jian | Institute of Automation, Chinese Academy of Sciences |
Tan, Min | Institute of Automation, Chinese Academy of Sciences |
Yu, Junzhi | Chinese Academy of Sciences |
Keywords: Biologically-Inspired Robots, Biomimetics, Dynamics
Abstract: Tail flexibility optimization is crucial for improving the speed and efficiency of robotic fish. However, reliable flexible tail structures and accessible modeling methods are still in the exploratory stage. This article proposes a novel integral molding flexible fish tail (IMFFT) characterized by continuous flexibility, a hollow air cavity, and an embedded skeletal structure. These features empower the tail with continuous flexible propulsion capabilities, neutral buoyancy for stable fish posture, and limited compressibility for withstanding water pressure. In addition, a predictive-network-based model for the IMFFT is developed through thrust data acquisition, dynamic analysis, and predictive network training. Specifically, the predictive network enables the prediction of pattern parameters of passive angles on the flexible tail. Deploying the IMFFT on our self-developed robotic tuna, performance tests demonstrate significant improvements, including a high swimming speed of 3.28 body lengths per second, an average speed improvement of 25.30%, an average cost of transport (COT) reduction of 24.45%, and a 57.12% reduction in roll angle fluctuation range due to the neutral buoyancy of the fish tail, which competes favorably with rigid fish tails and other flexible tail structures. This study provides novel guidance for optimizing the flexibility of underwater bioinspired robots.
|
|
15:25-15:30, Paper WeCT19.6 | |
BlueKoi: Combining a Tuna-Inspired Tail and Koi-Inspired Body Bending for Maneuverability |
|
Sha, Irene | Princeton University |
Quinn, Daniel | University of Virginia |
Nagpal, Radhika | Harvard University |
Keywords: Biologically-Inspired Robots, Biomimetics, Marine Robotics
Abstract: As marine ecosystems face rapid declines, field observations have become essential for better understanding our oceans. Fish-inspired robots are a promising solution, as they are less disruptive than propeller-based approaches in sensitive environments. However, in both fish and fish-inspired robots, there is a trade-off between speed (that favors rigid bodies) and maneuverability (that favors flexible bodies). In this work, we present BlueKoi, an untethered, fish-inspired robotic platform that leverages both a stiff tuna-inspired tail for efficient swimming and a koi-inspired rotating head for maneuvering, reaching speeds of 1.84 body lengths per second and a turn radius of 1.93 body lengths. We experimentally quantify the robot’s turn radius under varying conditions and develop a reduced-order model to both understand the turning behavior and inform future design decisions, without needing explicit measurements of hydrodynamic coefficients. Furthermore, we show that our model is not only accurate but also capable of extending simulations to account for future design modifications. By decoupling propulsion and maneuverability, BlueKoi is a scalable and modular platform that enables adaptability for diverse sensing and navigation needs.
|
|
15:30-15:35, Paper WeCT19.7 | |
JiAo: A Versatile Snake Robot with Elliptical Wheels for Multimodal Locomotion |
|
Zhao, ZiZhu | Tiangong University |
Wang, Jianming | Tiangong University |
Sumantri, Michael Albert | Tiangong University |
Zhang, Chenghui | Tiangong University |
Feng, Sihan | Tiangong University |
Xiao, Xuan | Tiangong University |
Meng, Shiyong | Tiangong University |
Keywords: Wheeled Robots, Biologically-Inspired Robots, Flexible Robotics
Abstract: This paper presents a novel snake robot, JiAo, equipped with elliptical wheels that enable both wheeled and body-based locomotion. First, the design of each module of the snake robot is described, which consists of the body link and the transmission system of the elliptical wheels. Second, distinct control systems for wheeled and body-based locomotion are proposed. Finally, the prototype has been successfully developed and various experiments have been conducted, including crossing grasslands, crossing gaps, climbing slopes, navigating pipelines and climbing cylinders. In conclusion, JiAo demonstrates its versatility by effectively performing a wide range of tasks in various challenging scenarios.
|
|
15:35-15:40, Paper WeCT19.8 | |
A Bio-Inspired Robotic Electric Ray: Design of Multimodal Locomotion with Grasping Function |
|
Mo, Yuyang | South China University of Technology |
Xie, Xing | South China University of Technology |
Hong, Zicun | South China University of Technology |
Zhuang, Huiping | South China University of Technology |
Zhong, Yong | South China University of Technology |
Keywords: Biologically-Inspired Robots, Underactuated Robots
Abstract: In nature, fish locomotion is primarily classified into the BCF (body and caudal fin) propulsion mode and the MPF (median and paired fin) propulsion mode. This paper presents a bio-inspired robotic electric ray that integrates a BCF-mode caudal fin with MPF-mode pectoral fins. The caudal fin consists of a set of wire-driven, multi-joint active segments coupled with a soft, compliant segment, while each symmetrical pectoral fin incorporates two sets of wire-driven joints and a soft fin structure. The undulatory motion of the MPF-mode pectoral fins enables the robotic ray to execute maneuvers such as forward swimming, backward swimming, and in-place turning, whereas the BCF-mode caudal fin enhances linear swimming and turning capabilities. Experimental results demonstrate that MPF-mode swimming achieves a maximum speed of 0.190 m/s (0.358 BL/s), while the cooperative propulsion of MPF and BCF modes enables speeds of up to 0.352 m/s (0.664 BL/s). Notably, the robotic electric ray’s large pectoral fins can function as grippers, allowing it to grasp and transport objects using caudal fin propulsion, thereby facilitating object manipulation tasks.
|
|
WeCT20 |
210D |
Grasping & Manipulation 3 |
Regular Session |
|
15:00-15:05, Paper WeCT20.1 | |
Aerial Grasping Via Maximizing Delta-Arm Workspace Utilization |
|
Chen, Haoran | Sun Yat-Sen University |
Deng, Weiliang | Sun Yat-Sen University |
Ye, Biyu | Sun Yat-Sen University |
Xiong, Yifan | Sun Yat-Sen University |
Pan, Zongliang | Shenzhen ePropulsion Technology Limited |
Lyu, Ximin | Sun Yat-Sen University |
Keywords: Aerial Systems: Applications, Manipulation Planning, Deep Learning in Grasping and Manipulation
Abstract: Workspace limitations restrict the operational capabilities and range of motion for systems with robotic arms. Maximizing workspace utilization has the potential to provide better solutions for aerial manipulation tasks, increasing the system's flexibility and operational efficiency. In this paper, we introduce a novel planning framework for aerial grasping that maximizes workspace utilization. We formulate an optimization problem to optimize the aerial manipulator's trajectory, incorporating task constraints to achieve efficient manipulation. To address the challenge of incorporating the delta arm's non-convex workspace into optimization constraints, we leverage a Multilayer Perceptron (MLP) to map the point positions to feasibility probabilities. Furthermore, we employ Reversible Residual Networks (RevNet) to approximate the complex forward kinematics of the delta arm, utilizing its efficient model gradients to further eliminate workspace constraints. We validate our methods in simulations and real-world experiments to demonstrate their effectiveness.
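The workspace-constraint trick described in the abstract, i.e., replacing a hard, non-convex membership test with a smooth learned probability, can be sketched as a small classifier. The architecture, layer sizes, and the penalty suggestion below are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class WorkspaceFeasibilityMLP(nn.Module):
    """Maps a 3-D end-effector point to a feasibility probability.

    Training labels (1 = reachable by the delta arm, 0 = not) would be
    generated by sampling the workspace; all sizes here are placeholders.
    """

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, p: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(p))  # feasibility probability in (0, 1)

# In the trajectory optimizer, the smooth probability can replace a hard
# membership test, e.g. by adding a differentiable penalty such as
#   relu(0.5 - feasibility(p)) ** 2
# to the cost, which keeps gradients available everywhere.
```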
|
|
15:05-15:10, Paper WeCT20.2 | |
Ag2x2: A Robust Agent-Agnostic Visual Representation Boosts Zero-Shot Learning of Bimanual Robotic Manipulation |
|
Xiong, Ziyin | Peking University |
Chen, Yinghan | Peking University |
Li, Puhao | Tsinghua University |
Zhu, Yixin | Peking University |
Liu, Tengyu | Beijing Institute for General Artificial Intelligence |
Huang, Siyuan | Beijing Institute for General Artificial Intelligence |
Keywords: Bimanual Manipulation, Deep Learning in Grasping and Manipulation, Representation Learning
Abstract: Bimanual manipulation, while fundamental to human daily activities, remains a significant challenge in robotics due to its inherent complexity. While recent advances have enabled zero-shot learning of single-arm manipulation skills through agent-agnostic visual representations derived from human videos, existing methods overlook key agent-specific information crucial for bimanual task coordination, such as the end-effector's position. We propose Ag2x2, a novel framework that advances the autonomous acquisition of bimanual manipulation skills through agent-agnostic and coordination-aware visual representations that jointly encode object and hand motion patterns. Extensive experiments demonstrate that Ag2x2 achieves a 73.5% success rate across 13 bimanual tasks from Bi-DexHands and PerAct2, including challenging tasks with deformable objects like ropes, outperforming baseline autonomous methods and surpassing the success rate of learning from expert-engineered rewards. Furthermore, we demonstrate that the learned representations can be effectively leveraged for imitation learning, establishing a scalable pipeline for skill acquisition. By eliminating the need for expert supervision while maintaining robust performance across diverse tasks, Ag2x2 represents a significant step toward scalable, autonomous robotic learning of complex bimanual skills.
|
|
15:10-15:15, Paper WeCT20.3 | |
ManiGaussian++: General Robotic Bimanual Manipulation with Hierarchical Gaussian World Model |
|
Yu, Tengbo | Tsinghua University |
Lu, Guanxing | Tsinghua Shenzhen International Graduate School, Tsinghua University |
Yang, Zaijia | Nanyang Technological University |
Deng, Haoyuan | Nanyang Technological University |
Chen, Si | Tsinghua University |
Lu, Jiwen | Tsinghua University |
Ding, Wenbo | Tsinghua University |
Hu, Guoqiang | Nanyang Technological University |
Tang, Yansong | Tsinghua University |
Wang, Ziwei | Nanyang Technological University |
Keywords: Bimanual Manipulation
Abstract: Multi-task robotic bimanual manipulation is becoming increasingly popular as it enables sophisticated tasks that require diverse dual-arm collaboration patterns. Compared to unimanual manipulation, bimanual tasks pose challenges to understanding multi-body spatiotemporal dynamics. The existing method ManiGaussian pioneered encoding spatiotemporal dynamics into the visual representation via a Gaussian world model for single-arm settings, but it ignores the interaction of multiple embodiments in dual-arm systems, resulting in a significant performance drop. In this paper, we propose ManiGaussian++, an extension of the ManiGaussian framework that improves multi-task bimanual manipulation by digesting multi-body scene dynamics through a hierarchical Gaussian world model. Specifically, we first generate task-oriented Gaussian Splatting from intermediate visual features, which aims to differentiate the acting and stabilizing arms for multi-body spatiotemporal dynamics modeling. We then build a hierarchical Gaussian world model with a leader-follower architecture, where the multi-body spatiotemporal dynamics is mined for the intermediate visual representation via future scene prediction. The leader predicts the Gaussian Splatting deformation caused by motions of the stabilizing arm, from which the follower generates the physical consequences resulting from the movement of the acting arm. As a result, our method significantly outperforms the current state-of-the-art bimanual manipulation techniques by an improvement of 20.2% in 10 simulated tasks, and achieves a 60% success rate on average in 9 challenging real-world tasks.
|
|
15:15-15:20, Paper WeCT20.4 | |
OpAC: An Optimization-Augmented Control Framework for Single and Coordinated Multi-Arm Robotic Manipulation |
|
Özcan, Melih | Middle East Technical University |
Oguz, Ozgur S. | Bilkent University |
Keywords: Bimanual Manipulation, Dual Arm Manipulation, Force Control
Abstract: Robotic manipulation demands precise control over both contact forces and motion trajectories. While force control is essential for achieving compliant interaction and high-frequency adaptation, it is limited to operations in close proximity to the manipulated object and often fails to maintain stable orientation during extended motion sequences. Conversely, optimization-based motion planning excels in generating collision-free trajectories over the robot’s configuration space but struggles with dynamic interactions where contact forces play a crucial role. To address these limitations, we propose a multi-modal control framework that combines force control and optimization-augmented motion planning to tackle complex robotic manipulation tasks in a sequential manner, enabling seamless switching between control modes based on task requirements. Our approach decomposes complex tasks into subtasks, each dynamically assigned to one of three control modes: Pure optimization for global motion planning, pure force control for precise interaction, or hybrid control for tasks requiring simultaneous trajectory tracking and force regulation. This framework is particularly advantageous for bimanual and multi-arm manipulation, where synchronous motion and coordination among arms are essential while considering both the manipulated object and environmental constraints. We demonstrate the versatility of our method through a range of long-horizon manipulation tasks, including single-arm, bimanual, and multi-arm applications, highlighting its ability to handle both free-space motion and contact-rich manipulation with robustness and precision.
|
|
15:20-15:25, Paper WeCT20.5 | |
Bunny-VisionPro: Real-Time Bimanual Dexterous Teleoperation for Imitation Learning |
|
Ding, Runyu | The University of Hong Kong |
Qin, Yuzhe | UC San Diego |
Zhu, Jiyue | University of California, San Diego |
Jia, Chengzhe | University of California, San Diego |
Yang, Shiqi | The Chinese University of Hong Kong, Shenzhen |
Yang, Ruihan | UC San Diego |
Qi, Xiaojuan | The University of Hong Kong, Hong Kong |
Wang, Xiaolong | UC San Diego |
Keywords: Bimanual Manipulation, Telerobotics and Teleoperation, Imitation Learning
Abstract: Teleoperation is a crucial tool for collecting human demonstrations, but controlling robots with bimanual dexterous hands remains a challenge. Existing teleoperation systems struggle to handle the complexity of coordinating two hands for intricate manipulations. We introduce Bunny-VisionPro, a real-time bimanual dexterous teleoperation system that leverages a VR headset. Unlike previous vision-based teleoperation systems, we design novel low-cost devices to provide haptic feedback to the operator, enhancing immersion. Our system prioritizes safety by incorporating collision and singularity avoidance while maintaining real-time performance through innovative designs. Bunny-VisionPro outperforms prior systems on a standard task suite, achieving higher success rates and reduced task completion times. Moreover, the high-quality teleoperation demonstrations improve downstream imitation learning performance, leading to better generalizability. Notably, Bunny-VisionPro enables imitation learning with challenging multi-stage, long-horizon dexterous manipulation tasks, which have rarely been addressed in previous work. Our system's ability to handle bimanual manipulations while prioritizing safety and real-time performance makes it a powerful tool for advancing dexterous manipulation and imitation learning. Our web page is available at https://dingry.github.io/projects/bunny_visionpro.
|
|
15:25-15:30, Paper WeCT20.6 | |
Hierarchical Trajectory Planning Method for Piano-Playing Robot |
|
Wang, Zirui | Harbin Institute of Technology |
Zhang, Jiayu | Harbin Institute of Technology |
Jiang, Wei | Harbin Institute of Technology |
Jiang, Tao | Harbin Institute of Technology |
Zhao, Jingdong | Harbin Institute of Technology |
Zhao, Liangliang | Harbin Institute of Technology |
Cao, Baoshi | Harbin Institute of Technology |
Qi, Le | Harbin Institute of Technology |
Yang, YuChen | Harbin Institute of Technology |
Ni, Fenglei | State Key Laboratory of Robotics and System, Harbin Institute of Technology |
Liu, Hong | Harbin Institute of Technology |
Keywords: Bimanual Manipulation, Motion and Path Planning, Humanoid Robot Systems
Abstract: Piano-playing tasks, which effectively demonstrate bimanual coordination capabilities in humanoid robots, are increasingly becoming a research focus. However, prior research has predominantly focused on Cartesian space trajectory planning without adequately addressing real-world obstacle avoidance constraints and manipulator acceleration limits. This paper proposes a hierarchical trajectory planning framework that systematically incorporates both obstacle avoidance and acceleration constraints. Firstly, discrete Cartesian path points are generated using a dynamic programming approach; secondly, joint space path points are derived considering obstacle avoidance and joint limit constraints through dynamic programming; thirdly, the joint space trajectory is interpolated using a Jacobian inverse-based method; finally, the trajectory is refined using Model Predictive Control (MPC). Experimental results demonstrate that the proposed method produces trajectories satisfying both obstacle avoidance and acceleration constraints, enabling fluent piano piece execution in real-world environments.
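The second stage (joint-space waypoints via dynamic programming) can be sketched as a shortest-chain selection over per-waypoint inverse-kinematics candidates. The cost functions and names below are assumptions, not the paper's exact formulation; obstacle-avoidance and joint-limit handling would live in `node_cost` and in pruning infeasible candidates.

```python
import numpy as np

def dp_select_configs(candidates, node_cost, trans_cost):
    """Pick one joint configuration per waypoint by dynamic programming.

    candidates[k]: list of IK solutions (joint vectors) for waypoint k,
                   with infeasible ones already removed.
    node_cost(q):  penalty for a configuration (e.g., obstacle proximity).
    trans_cost(q_prev, q): joint-space travel between consecutive waypoints.
    """
    n = len(candidates)
    cost = [np.array([node_cost(q) for q in candidates[0]])]
    back = []
    for k in range(1, n):
        ck, bk = [], []
        for q in candidates[k]:
            trans = np.array([trans_cost(qp, q) for qp in candidates[k - 1]])
            total = cost[-1] + trans
            j = int(np.argmin(total))        # best predecessor for q
            ck.append(total[j] + node_cost(q))
            bk.append(j)
        cost.append(np.array(ck))
        back.append(bk)
    # Backtrack the cheapest chain of configurations.
    idx = int(np.argmin(cost[-1]))
    path = [idx]
    for bk in reversed(back):
        idx = bk[idx]
        path.append(idx)
    path.reverse()
    return [candidates[k][i] for k, i in enumerate(path)]
```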
|
|
15:30-15:35, Paper WeCT20.7 | |
Robotic Assembly of Deformable Linear Objects Via Curriculum Reinforcement Learning |
|
Wu, Kai | South China University of Technology |
Chen, Rongkang | South China University of Technology |
Chen, Qi | South China University of Technology |
Li, Weihua | South China University of Technology |
Keywords: Assembly, Learning from Demonstration, Reinforcement Learning
Abstract: The automated assembly of flexible objects presents significant challenges. Although considerable progress has been made in the assembly of rigid objects, the methods used for rigid objects cannot be directly applied to flexible objects due to their infinite degrees of freedom. This study proposes a reinforcement learning (RL) based method for deformable cable insertion tasks executed with a universal 2-finger gripper. Firstly, a vision-based detection method is employed to monitor the cable's state in real time, while a state classifier is introduced to provide real-time reward feedback for RL training. Secondly, an adaptive curriculum learning (CL) method is proposed to adjust the initial degree of cable bending according to the success rate during training, allowing the RL agent to learn progressively from easier to more difficult tasks. The validation experiments were conducted on a Type-C cable insertion task, where the robot grips the cable portion of the electrical connector. The results indicate that our method is capable of adapting to various degrees of cable bending, successfully handling cable configurations bent up to a maximum of 40° from the straight, unbent state, with an assembly success rate of over 90%.
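A minimal sketch of the adaptive curriculum, assuming a rolling-success-rate rule with hand-picked thresholds and step size; only the 40° maximum bend comes from the abstract.

```python
class AdaptiveBendCurriculum:
    """Adjusts the initial cable bend angle from the rolling success rate.

    Raise the difficulty (larger initial bend) when the agent succeeds
    often, lower it when it struggles; all thresholds are assumptions.
    """

    def __init__(self, start_deg=5.0, max_deg=40.0, step_deg=2.5,
                 up_thresh=0.8, down_thresh=0.4, window=50):
        self.angle = start_deg
        self.max_deg, self.step = max_deg, step_deg
        self.up, self.down = up_thresh, down_thresh
        self.window = window
        self.results = []

    def report(self, success: bool) -> float:
        """Record one episode outcome; return bend angle for the next one."""
        self.results.append(success)
        if len(self.results) == self.window:
            rate = sum(self.results) / self.window
            if rate >= self.up:
                self.angle = min(self.angle + self.step, self.max_deg)
            elif rate <= self.down:
                self.angle = max(self.angle - self.step, 0.0)
            self.results.clear()  # start a fresh evaluation window
        return self.angle
```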
|
|
WeCT21 |
101 |
Force and Tactile Sensing 3 |
Regular Session |
|
15:00-15:05, Paper WeCT21.1 | |
MelumiTac: Vision-Based Tactile Sensor Using Mechanoluminescence for Dynamic Tactile and Nociceptive Perception |
|
Bae, Sunggyu | Daegu Gyeongbuk Institute of Science and Technology |
Song, Seongkyu | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Jeong, Soon Moon | Daegu Gyeongbuk Institute of Science and Technology |
Park, Kyungseo | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Keywords: Force and Tactile Sensing, Soft Robot Materials and Design
Abstract: This paper presents MelumiTac, a vision-based tactile (ViTac) sensor enhanced with mechanoluminescent (ML) materials that emit green light under dynamic tactile stimuli. The integration of an ML elastomer generates self-illumination in response to dynamic tactile stimuli, enabling direct visualization of both dynamic tactile events and nociceptive responses while simultaneously tracking deformation in real time. Experimental evaluations involving cyclic loading, in-plane motion, and piercing reveal a strong correlation between ML emission, stress rate, and localized deformation, thereby validating its multi-modal tactile sensing capabilities. Additionally, frame-by-frame analysis offers rich insights into the contact dynamics during physical interactions. These improvements, implemented within the small form factor of a conventional ViTac sensor, render the approach highly accessible. We therefore expect the proposed solution to offer practical and unique advantages to engineers developing and applying vision-based multi-modal tactile sensors.
|
|
15:05-15:10, Paper WeCT21.2 | |
VibeCheck: Using Active Acoustic Tactile Sensing for Contact-Rich Manipulation |
|
Zhang, Kaidi | Columbia University |
Kim, DoGon | Columbia University |
Chang, Eric T. | Columbia University |
Liang, Hua Hsuan | Columbia University |
He, Zhanpeng | Columbia University |
Lampo, Kathryn | Columbia University |
Wu, Philippe | Columbia University |
Kymissis, Ioannis | Columbia University |
Ciocarlie, Matei | Columbia University |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Sensor-based Control
Abstract: The acoustic response of an object can reveal a lot about its global state, for example its material properties or the extrinsic contacts it is making with the world. In this work, we build an active acoustic sensing gripper equipped with two piezoelectric fingers: one for generating signals, the other for receiving them. By sending an acoustic vibration from one finger to the other through an object, we gain insight into an object's acoustic properties and contact state. We use this system to classify objects, estimate grasping position, estimate poses of internal structures, and classify the types of extrinsic contacts an object is making with the environment. Using our contact type classification model, we tackle a standard long-horizon manipulation problem: peg insertion. We use a simple simulated transition model based on the performance of our sensor to train an imitation learning policy that is robust to imperfect predictions from the classifier. We finally demonstrate the policy on a UR5 robot with active acoustic sensing as the only feedback. Videos can be found at https://roamlab.github.io/vibecheck.
|
|
15:10-15:15, Paper WeCT21.3 | |
TwinTac: A Wide-Range, Highly Sensitive Tactile Sensor with Real-To-Sim Digital Twin Sensor Model |
|
Huang, Xiyan | ShanghaiTech University |
Xu, Zhe | ShanghaiTech University |
Xiao, Chenxi | ShanghaiTech University |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators
Abstract: Robot skill acquisition processes driven by reinforcement learning often rely on simulations to efficiently generate large-scale interaction data. However, the absence of simulation models for tactile sensors has hindered the use of tactile sensing in such skill learning processes, limiting the development of effective policies driven by tactile perception. To bridge this gap, we present TwinTac, a system that combines the design of a physical tactile sensor with its digital twin model. Our hardware sensor is designed for high sensitivity and a wide measurement range, enabling high quality sensing data essential for object interaction tasks. Building upon the hardware sensor, we develop the digital twin model using a real-to-sim approach. This involves collecting synchronized cross-domain data, including finite element method results and the physical sensor's outputs, and then training neural networks to map simulated data to real sensor responses. Through experimental evaluation, we characterized the sensitivity of the physical sensor and demonstrated the consistency of the digital twin in replicating the physical sensor’s output. Furthermore, by conducting an object classification task, we showed that simulation data generated by our digital twin sensor can effectively augment real-world data, leading to improved accuracy. These results highlight TwinTac's potential to bridge the gap in cross-domain learning tasks.
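The real-to-sim step can be sketched as a straightforward regression from synchronized FEM features to measured sensor outputs; the architecture, loss, and function name below are assumptions, not TwinTac's actual training recipe.

```python
import torch
import torch.nn as nn

def fit_real_to_sim(sim_feats: torch.Tensor, real_resps: torch.Tensor,
                    epochs: int = 500, lr: float = 1e-3) -> nn.Module:
    """Fit a network mapping FEM-simulated features to real sensor outputs.

    sim_feats:  (N, d_sim) features from finite element simulations
    real_resps: (N, d_real) synchronized physical sensor readings
    """
    model = nn.Sequential(
        nn.Linear(sim_feats.shape[1], 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, real_resps.shape[1]),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(sim_feats), real_resps)
        loss.backward()
        opt.step()
    return model  # apply to new FEM results to synthesize realistic data
```

Once fitted, the digital twin lets a simulator emit sensor-like signals, which is what allows simulated tactile data to augment real-world datasets as reported in the abstract.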
|
|
15:15-15:20, Paper WeCT21.4 | |
Self-Decoupling and Hysteresis Compensation in a Soft Multi-Axis Force Sensor for Improved Performance |
|
Peng, Cong | University of Nevada, Reno |
Shen, Yantao | University of Nevada, Reno |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators
Abstract: Conventional soft sensors often suffer from challenges such as crosstalk, hysteresis, and limited sensitivity, which hinder their performance and broader applicability. This paper presents a multi-axis piezoresistive soft force sensor with a square-column-shaped sensing structure designed to reduce the spatial footprint and mitigate partial axial coupling effects. By integrating a Wheatstone bridge-based resistive compensation strategy, the sensor achieves self-decoupling in multi-axis force measurements. Furthermore, a generalized Preisach hysteresis model is implemented to effectively compensate for hysteresis-induced nonlinearities and input-output loop effects, significantly enhancing sensing accuracy and precision. Extensive experimental validations confirm the effectiveness of the proposed self-decoupling and hysteresis compensation methodologies, demonstrating notable improvements in sensor reliability and performance. The findings of this study establish a comprehensive framework for advancing multi-dimensional soft force sensing technologies, with promising implications for high-precision engineering and biomedical applications.
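For reference, a discrete Preisach model is a weighted grid of relay hysterons; the sketch below shows only the forward model. The grid size and uniform weights are placeholder assumptions: in practice the weights are identified from measured loading/unloading loops, and compensation then amounts to inverting this map numerically.

```python
import numpy as np

class DiscretePreisach:
    """Discrete Preisach hysteresis model built from relay hysterons.

    Each hysteron switches up when the input rises past alpha and down
    when it falls past beta (beta <= alpha); the output is the weighted
    sum of all relay states, which traces hysteresis loops.
    """

    def __init__(self, n: int = 20, lo: float = 0.0, hi: float = 1.0):
        a, b = np.meshgrid(np.linspace(lo, hi, n), np.linspace(lo, hi, n))
        mask = b <= a                    # keep only valid (alpha, beta) pairs
        self.alpha, self.beta = a[mask], b[mask]
        self.weights = np.ones(self.alpha.size) / self.alpha.size
        self.state = -np.ones(self.alpha.size)   # all relays start "down"

    def step(self, u: float) -> float:
        self.state[u >= self.alpha] = 1.0         # switch relays up
        self.state[u <= self.beta] = -1.0         # switch relays down
        return float(self.weights @ self.state)   # hysteretic output
```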
|
|
15:20-15:25, Paper WeCT21.5 | |
Tac-Man: Tactile-Informed Prior-Free Manipulation of Articulated Objects |
|
Zhao, Zihang | Peking University |
Li, Yuyang | Peking University |
Li, Wanlin | Beijing Institute for General Artificial Intelligence (BIGAI) |
Qi, Zhenghao | Tsinghua University |
Ruan, Lecheng | Peking University |
Zhu, Yixin | Peking University |
Althoefer, Kaspar | Queen Mary University of London |
Keywords: Force and Tactile Sensing, Haptics and Haptic Interfaces, Soft Sensors and Actuators, Tactile Robotics
Abstract: Integrating robots into human-centric environments requires advanced skills for interacting with articulated objects, like doors and drawers. The unpredictability and diversity of these objects challenge the use of prior-based models, as ambiguities, imperfections, and unforeseen disturbances significantly reduce their reliability. We introduce a prior-free strategy, Tac-Man, focusing on stable contact during manipulation, independent of object priors. Utilizing tactile feedback, Tac-Man proficiently handles a wide range of articulated objects, even under unexpected disturbances, outperforming existing methods in both real-world and simulated settings. This demonstrates that tactile sensing alone can effectively manage articulated objects, highlighting the importance of detailed contact modeling and advancing robotic applications in human-centric environments.
|
|
15:25-15:30, Paper WeCT21.6 | |
Distributed Contact Sensing Enabled by Vibration Propagation on Robot End-Effector |
|
Tan, Wangbo | Harbin Institute of Technology, Shenzhen |
Shao, Yitian | Harbin Institute of Technology, Shenzhen |
Keywords: Force and Tactile Sensing, Mechanism Design
Abstract: To match the dexterity of human hands, a robot's end-effector needs tactile sensing. However, current tactile sensing solutions often have complex electronics, making them impractical for covering the entire area of the manipulator without interfering with robot manipulation, especially for miniature objects. Here, we present a tactile sensing design that enables a single accelerometer positioned at the base of a robot's end-effector, to locate and estimate contact forces applied across a large region of the end-effector. Inspired by human tactile sensing, where a single afferent responds to skin vibrations over a large area, we integrated a string to transmit vibrations from remote contacts to the accelerometer. We utilized lightweight machine learning models to decode tactile information from the vibration signals captured by the accelerometer. Our experimental results demonstrate that we can accurately predict remote contact locations and force amplitudes, with a precision of 1.9 mm and 0.09 N, respectively. Additionally, the vibrations can be used to identify the surface materials of contact objects, achieving 99% accuracy in discriminating between 15 different materials. Our approach could help simplify and minimize the design of robot manipulators, enabling more delicate manipulation and reducing the hardware costs and data volume required for tactile sensing.
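A lightweight pipeline consistent with the abstract's description might extract band-averaged spectra from the single accelerometer and fit a small regressor to contact location and force. The feature choice, model, and names are assumptions, not the paper's exact method.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def spectral_features(accel: np.ndarray, n_bands: int = 32) -> np.ndarray:
    """Band-averaged magnitude spectrum of one windowed vibration snippet."""
    spec = np.abs(np.fft.rfft(accel * np.hanning(accel.size)))
    bands = np.array_split(spec, n_bands)
    return np.array([b.mean() for b in bands])

def train_contact_model(X: np.ndarray, Y: np.ndarray) -> KNeighborsRegressor:
    """Fit a small regressor from vibration features to contact labels.

    X: (num_samples, n_bands) features from spectral_features
    Y: (num_samples, 2) labels, e.g. (contact_position_mm, force_N)
    """
    model = KNeighborsRegressor(n_neighbors=5)
    model.fit(X, Y)
    return model
```

A separate classifier over the same features could handle the material-identification task reported in the abstract.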
|
|
15:30-15:35, Paper WeCT21.7 | |
Flexible Electronic Device with Multifunctional Tactile Perception for Enhanced Robotic Interaction |
|
Mao, Chenhao | Zhejiang University |
Jin, Jie | Zhejiang University |
Mei, Deqing | Zhejiang University |
Wang, Yancheng | Zhejiang University |
Keywords: Force and Tactile Sensing, Haptics and Haptic Interfaces, Physical Human-Robot Interaction
Abstract: Augmented reality (AR) and virtual reality (VR) for human-machine interaction (HMI) have attracted great attention in industry. Traditional HMI devices are generally bulky, rigid, and lacking in tactile perception, which greatly limits their applications. Here, we propose a novel flexible electronic device with multifunctional tactile sensing for interaction with robots. The electronic device was designed with a pressure sensing layer and a tactile pixel array layer for separate sensing of contact force, force angle, and sliding distance; in particular, the helically patterned tactile pixel array can achieve high-resolution force angle detection. Characterization tests showed that the flexible electronic device has a wide force sensing range of 0.8 ~ 6.0 N with a sensitivity of 0.358 N⁻¹, and a force angle detection resolution of 0.84°. The device was then integrated into a robotic interaction system for experimental tests. The results showed that our device can simultaneously measure the contact force, force angle, and sliding displacement when a finger touches the device, and the tactile information can be used to control robotic movements. The displacement detection error of the robotic arm was less than 1.0 mm with a time delay of less than 200 ms, demonstrating that our device can offer effective interaction between humans and robots.
|
|
WeCT22 |
102A |
Mechanism Design 2 |
Regular Session |
Co-Chair: Xu, Jianle | Tsinghua University |
|
15:00-15:05, Paper WeCT22.1 | |
MuxHand: A Cost-Effective and Compact Dexterous Robotic Hand Using Time-Division Multiplexing Mechanism |
|
Xu, Jianle | Tsinghua University |
Li, Shoujie | Tsinghua Shenzhen International Graduate School |
Luo, Hong | Tsinghua University |
Liu, Houde | Shenzhen Graduate School, Tsinghua University |
Wang, Xueqian | Center for Artificial Intelligence and Robotics, Graduate School |
Ding, Wenbo | Tsinghua University |
Xia, Chongkun | Sun Yat-Sen University |
Keywords: Mechanism Design, Multifingered Hands, Grippers and Other End-Effectors
Abstract: The number of motors directly influences the dexterity, size, and cost of a robotic hand. In this paper, we present MuxHand, a robotic hand that utilizes a time-division multiplexing motor (TDMM) mechanism. This system enables independent control of 9 cables with just 4 motors, significantly reducing both cost and size while maintaining high dexterity. To enhance stability and smoothness during grasping and manipulation tasks, we integrate magnetic joints into the three 3D-printed fingers. These joints provide impact resistance and resetting capability. The three fingers together have a total of 30 degrees of freedom (DOF), 18 of which are passive, allowing the hand to conform closely to the surface of an object during grasping. We conduct a series of experiments to assess the performance of MuxHand, including its grasping and manipulation capabilities. The results show that the TDMM mechanism precisely controls each cable connected to the finger joints, enabling robust grasping and dexterous manipulation. Furthermore, compared to the traditional approach of assigning a motor to each active DOF, the cost is reduced by 42.06%. The maximum load of a single finger reaches 7.0 kg, the maximum load at the finger joint root is 12.0 kg, the maximum driving force at the joint root is 5.0 kg, and the maximum fingertip force is 10.0 N.
|
|
15:05-15:10, Paper WeCT22.2 | |
A Reconfigurable Manipulator with Schönflies and RCM Motions |
|
Xu, Tianye | Beihang University |
Lyu, Shengnan | Beihang University |
Ding, Xilun | Beijing Univerisity of Aeronautics and Astronautics |
Keywords: Mechanism Design, Kinematics, Motion Control
Abstract: This paper presents the design of a reconfigurable manipulator capable of performing Schönflies and Remote Center of Motion (RCM) operations. The Schönflies mode handles plane objects efficiently, like a SCARA robot, while the RCM mode enables the remote center operation, similar to a Da Vinci robot. Through kinematic reconfiguration, the manipulator achieves multimodal operations without component replacement, and the 1R1T module based on a spline lead screw mechanism ensures a compact structure. The kinematic reconfiguration is analyzed in this paper, and mapping rules from joint space to Cartesian space are respectively enabled in different operation modes. The concise kinematic expression without actuation redundancy simplifies control in both modes. Experimental results confirm the manipulator’s versatility in plane object handling and remote center operations, highlighting its effectiveness across diverse applications.
|
|
15:10-15:15, Paper WeCT22.3 | |
A Compliant Tube for Series Elastic Actuators to Generate High Output Torque |
|
Huang, Chun-Hung | National Taiwan University |
Lan, Chao-Chieh | National Taiwan University |
Keywords: Mechanism Design, Actuation and Joint Mechanisms, Compliant Joints and Mechanisms
Abstract: Series elastic actuators (SEAs) achieve output torque and stiffness control by managing the deformation of a spring arranged between the motor and the output. SEAs are ideal for tasks involving human-robot interaction and unstructured environments. The stiffness, size, and torque capacity of the spring are crucial for the performance of an SEA. To increase the torque capacity of an SEA while maintaining a compact size, this paper proposes a novel helical flexure as the spring in an SEA. The helical flexure is formed on the thin wall of a tube to generate high torque output with minimal reaction forces and moments. The resulting compliant tube can be combined with other transmission components to reduce the number of components and allow the inner passage of cables and shafts. Simulation comparisons and experimental testing verify the merits of the proposed helical flexure. An SEA prototype is fabricated to demonstrate the performance of the new helical flexure during zero-torque and high-torque motion.
|
|
15:15-15:20, Paper WeCT22.4 | |
Terradynamics of Monolithic Soft Robot Driven by Vibration Mechanism |
|
Nguyen, Viet Linh | Japan Advanced Institute of Science and Technology |
Nguyen, Thanh Khoi | Japan Advanced Institute of Science and Technology (JAIST) |
Ho, Van | Japan Advanced Institute of Science and Technology |
Keywords: Mechanism Design, Soft Robot Materials and Design, Soft Robot Applications, Terradynamics
Abstract: In this article, we present a design concept in which a monolithic soft body is combined with a vibration-driven mechanism, called Leafbot. We first report a morphological design of the robot's limbs that facilitates the forward locomotion of our vibration-driven model and enhances its capability to cope with sloped obstacles and irregular terrains. Second, the fabrication technique used to achieve such a soft monolithic structure and limb morphology is fully addressed. Third, we clarify the locomotion of the Leafbot under high-frequency excitation via analytical and empirical methods on flat, even surfaces. The maximum velocity attained in this condition is 5 body lengths per second. Finally, three model designs are constructed, each featuring a different limb pattern. We examine the terradynamic characteristics of the three patterns in three pre-defined conditions, i.e., the success rate of overcoming slopes, semi-circular obstacles, and step-field terrains characterized by the rugosity factor. This investigation aims to build a foundation for further terradynamic study of vibration-driven soft robots in more complicated and confined environments, with potential applications in inspection tasks.
|
|
15:20-15:25, Paper WeCT22.5 | |
RMCC: Rigid Multi-Joint Coupled Continuum Structure for Bionic Robots |
|
Zhou, Zida | Sun Yat-Sen University |
Wu, Ying | Sun Yat-Sen University |
Chen, Zujian | Shenzhen University |
Bi, Zetong | Sun Yat-Sen University |
Cheng, Hui | Sun Yat-Sen University |
Keywords: Mechanism Design, Flexible Robotics, Biomimetics
Abstract: Continuum robots, inspired by biological structures such as spines and tails, have attracted significant attention due to their flexibility and ability to perform complex tasks in confined and dynamic environments. However, traditional flexible continuum robots often encounter challenges such as non-linearity, hysteresis, and limited load-bearing capacity, which can compromise their precision and effectiveness in practical applications. To address these limitations, this paper presents a novel bionic continuum mechanism, the Rigid Multi-joint Coupled Continuum Structure (RMCC), which employs a rigid mechanical transmission to couple all joints and achieve their coordinated movement. Its rigid structural composition and transmission method provide high precision and load capacity. The coordinated motion of the joints endows it with the dexterity of a continuum mechanism while enabling efficient and precise control with a minimal number of motors. The modular joint design improves the system's scalability and adaptability, enabling a wide range of configurations to suit diverse robotic applications. The feasibility and effectiveness of the proposed system are validated through a series of bio-inspired experiments, including lizard-like crawling, falling-cat movement, and bird-like adaptive grasping. The experimental results confirm that the RMCC exhibits the flexibility and adaptability of animals, demonstrating its potential for diverse bionic robotics applications.
|
|
15:25-15:30, Paper WeCT22.6 | |
Computational Design of Closed Linkages for Robotic Limbs |
|
Chaikovskii, Mikhail | ITMO University |
Osipov-Sigachev, Yefim | ITMO University |
Zharkov, Kirill | ITMO University |
Borisov, Ivan | ITMO University |
Kolyubin, Sergey | ITMO University |
Keywords: Mechanism Design, Kinematics, Legged Robots
Abstract: Legged robots require low-inertia limbs capable of carrying a high payload. The design of such limbs poses the challenge of integrating optimal kinematic structures with practical design considerations. The search for optimal design parameters can take advantage of optimization algorithms. This paper introduces an open-source framework for optimizing the topology and parameters of closed linkage mechanisms, addressing the need for task-specific robotic limbs. Closed-loop structures are motivated by two main purposes: (1) decreasing robotic limb inertia by relocating actuators close to the robot's body, and (2) redistributing efforts among actuators. The framework leverages joint-based spatial graph representations, kinetostatic criteria, and multi-objective genetic algorithms to optimize mechanism topology and parameters. Focusing on kinetostatic criteria such as Jacobian metrics and inertia properties, the framework swiftly explores the design space to balance trade-offs in robot linkages. We demonstrate the framework pipeline on the task of optimizing planar robotic legs with 2 degrees of freedom. Project GitHub page: https://licaibeerlab.github.io/jmoves.github.io/
|
|
15:30-15:35, Paper WeCT22.7 | |
Design Paradigm for Human Size Manipulator with High Payload, Repeatability, and Bandwidth (I) |
|
Li, Huilai | Zhejiang University |
Wang, Zezheng | Zhejiang University |
Sun, Maowen | Zhejiang University |
Bao, Yingwei | Zhejiang University |
Ling, Zhenfei | Zhejiang University |
Jiang, Haoyi | Zhejiang University |
Ouyang, Xiaoping | Zhejiang University |
Yang, Huayong | Zhejiang University |
Keywords: Methods and Tools for Robot System Design, Mechanism Design, Parallel Robots
Abstract: Manipulators used in daily life, such as exoskeletons or humanoid robots, require comprehensive performance comparable to or even surpassing that of humans in order to handle diverse tasks. However, achieving all of these performance goals simultaneously is quite challenging. This paper presents a design paradigm that enables manipulation systems with favorable comprehensive performance: high payload, repeatability, and bandwidth. A novel 3-RRR coaxial spherical parallel mechanism (SPM) was realized to achieve high payload and bandwidth. Structural optimization was conducted to improve the torque output performance and expand the workspace. High-performance actuators, designed for high torque and bandwidth, were developed to ensure a high upper bound on overall system performance. A high-stiffness linkage transmission mechanism was employed to ensure high repeatability. The prototype designed under this paradigm is human-sized and provides a 10 kg payload with the arm fully extended, repeatability within 0.5 mm, and a position control bandwidth of 11.9 Hz.
|
|
WeCT23 |
102B |
Path Planning for Multiple Mobile Robots or Agents 3 |
Regular Session |
Chair: Shao, Xiaodong | Beihang University |
|
15:00-15:05, Paper WeCT23.1 | |
A Resource-Efficient Decentralized Sequential Planner for Spatiotemporal Wildfire Mitigation (I) |
|
John, Josy | Indian Institute of Science |
Velhal, Shridhar | Lulea Technical University |
Sundaram, Suresh | Indian Institute of Science |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Task Planning, Search and Rescue Robots
Abstract: This paper proposes a Conflict-aware Resource-Efficient Decentralized Sequential planner (CREDS) for early wildfire mitigation using multiple heterogeneous Unmanned Aerial Vehicles (UAVs). Multi-UAV wildfire management scenarios are non-stationary, with spatially clustered, dynamically spreading fires, potential pop-up fires, and partial observability due to limited UAV numbers and sensing range. The objective of CREDS is to detect and sequentially mitigate all growing fires as Single-UAV Tasks (SUTs) while adhering to the physical constraints of the UAVs. CREDS minimizes biodiversity loss through rapid UAV intervention and promotes efficient resource utilization by avoiding complex multi-UAV coordination. CREDS employs a three-phased approach, beginning with fire detection using a search algorithm, followed by local trajectory generation using the auction-based Resource-Efficient Decentralized Sequential planner (REDS), which incorporates a novel non-stationary cost function, the Deadline-Prioritized Mitigation Cost (DPMC). Finally, a conflict-aware consensus algorithm resolves conflicts to determine a global trajectory for spatiotemporal mitigation. Performance evaluation of CREDS under partial and full observability, with both heterogeneous and homogeneous UAV teams and different fires-to-UAV ratios, demonstrates a 100% success rate for ratios up to 4 and a high success rate for the critical ratio of 5, outperforming baselines. Heterogeneous UAV teams outperform homogeneous teams in handling the heterogeneous deadlines of SUT mitigation. CREDS exhibits scalability and 100% convergence, demonstrating robustness against potential deadlock assignments and an enhanced success rate compared to the baseline approaches.
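As a rough illustration of a sequential auction of this kind, the following Python sketch assigns each detected fire to the lowest-bidding UAV under a deadline constraint. The travel-time bid is a placeholder, not the paper's Deadline-Prioritized Mitigation Cost, and all names (auction_assign, speed) are hypothetical.

import math

def auction_assign(uav_positions, fire_tasks, speed=10.0):
    """Greedy sequential auction: each fire goes to the UAV with the
    lowest bid (placeholder travel-time cost, not the paper's DPMC)."""
    assignments = {}
    available = dict(enumerate(uav_positions))
    for f_id, (fx, fy, deadline) in enumerate(fire_tasks):
        bids = {}
        for u_id, (ux, uy) in available.items():
            eta = math.hypot(fx - ux, fy - uy) / speed
            if eta <= deadline:            # respect the mitigation deadline
                bids[u_id] = eta
        if bids:
            winner = min(bids, key=bids.get)
            assignments[f_id] = winner
            available[winner] = (fx, fy)   # the UAV continues from the fire site
    return assignments

print(auction_assign([(0, 0), (50, 50)], [(10, 0, 5.0), (60, 55, 3.0)]))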
|
|
15:05-15:10, Paper WeCT23.2 | |
PC²P: Multi-Agent Path Finding Via Personalized-Enhanced Communication and Crowd Perception |
|
Li, Guotao | Institute of Microelectronics of the Chinese Academy of Sciences |
Xu, Shaoyun | Institute of Microelectronics of the Chinese Academy of Sciences |
Hao, Yuexing | Institute of Microelectronics of the Chinese Academy of Sciences |
Wang, Yang | The Institute of Microelectronics of the Chinese Academy of Scie |
Sun, Yuhui | Institute of Microelectronics of the Chinese Academy of Sciences |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Reinforcement Learning, Deep Learning Methods
Abstract: Distributed Multi-Agent Path Finding (MAPF) integrated with Multi-Agent Reinforcement Learning (MARL) has emerged as a prominent research focus, enabling real-time cooperative decision-making in partially observable environments through inter-agent communication. However, due to insufficient collaborative and perceptual capabilities, existing methods are inadequate for scaling across diverse environmental conditions. To address these challenges, we propose PC²P, a novel distributed MAPF method derived from a Q-learning-based MARL framework. First, we introduce a personalized-enhanced communication mechanism based on dynamic graph topology, which determines the core aspects of whom to communicate with and what to communicate through three-stage operations: selection, generation, and aggregation. Concurrently, we incorporate local crowd perception to enrich agents' heuristic observations, strengthening the model's guidance toward effective actions via the integration of static spatial constraints and dynamic occupancy changes. To resolve extreme deadlock situations, we propose a region-based deadlock-breaking strategy that leverages expert guidance to implement efficient coordination within confined areas. Experimental results demonstrate that PC²P achieves superior performance compared to state-of-the-art distributed MAPF methods in varied environments. Ablation studies further confirm the contribution of each module to overall performance.
|
|
15:10-15:15, Paper WeCT23.3 | |
Sampling-Based Path Planning for Tethered Robot Chains |
|
Jin, Zeyuan | Arizona State University |
Xue, Xingjian | Northeastern University |
Stoffel, Josh | Arizona State University |
Yong, Sze Zheng | Northeastern University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning, Collision Avoidance
Abstract: Motivated by human chains in rescue missions, this paper proposes a scalable path planning algorithm for multiple mobile robots that are tethered to one another in a chain topology with finite-length tethers. Specifically, our approach trades off optimality for scalability and computational tractability by adding some simplifying, yet realistic constraints that can significantly reduce computation. In particular, by maintaining the existence of tether configurations that coincide with collision-free, feasible paths for the robots, we remove the need to check that the tether configurations are collision-free, which is often a bottleneck since the tethers are infinite-dimensional. Our proposed path planning framework for tethered robot chains builds upon sampling-based algorithms such as RRT*, BIT*, and ABIT*. Finally, we prove the probabilistic completeness of the approach, ensuring reliable path generation, and demonstrate the effectiveness of our approach in simulation experiments.
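A minimal sketch of the kind of chain-feasibility test such a planner can rely on, assuming straight-line tethers whose collision-freeness is maintained by the path constraints; the function chain_config_feasible is illustrative, not the authors' implementation.

import math

def chain_config_feasible(positions, tether_length):
    """Check a chain configuration: consecutive robots must be within
    tether range (straight-line tether assumed collision-free upstream)."""
    return all(
        math.dist(a, b) <= tether_length
        for a, b in zip(positions, positions[1:])
    )

print(chain_config_feasible([(0, 0), (3, 4), (6, 8)], tether_length=5.0))  # True
print(chain_config_feasible([(0, 0), (9, 0)], tether_length=5.0))          # False

A sampling-based planner can call such a predicate on every candidate vertex, which is far cheaper than collision-checking an infinite-dimensional tether shape.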
|
|
15:15-15:20, Paper WeCT23.4 | |
Attention-Based Higher-Order Reasoning for Implicit Coordination of Multi-Robot Systems |
|
Reasoner, Jonathan | University of Virginia |
Bramblett, Lauren | University of Virginia |
Bezzo, Nicola | University of Virginia |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Task and Motion Planning, Multi-Robot Systems
Abstract: This paper presents a novel theory of mind (ToM)-based approach for implicit coordination of multi-robot systems (MRS) in environments where direct communication is unavailable. The proposed approach integrates higher-order reasoning, epistemic theory, and active inference to coordinate the actions of the robots, allowing each to clarify its own intentions and make them understandable to other robots. Further, to reduce the computational overhead of higher-order reasoning, we implement a large language model (LLM)-based attention selection mechanism that focuses reasoning on a subset of robots. Simulations and physical experiments demonstrate the applicability of the proposed approach with high success rates while significantly reducing computational complexity.
|
|
15:20-15:25, Paper WeCT23.5 | |
Modular Decision-Making and Drivable Areas for Multi-Agent Autonomous Racing |
|
Toschi, Alessandro | University of Modena and Reggio Emilia |
Prignoli, Francesco | University of Modena and Reggio Emilia |
Bertogna, Marko | Unimore |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Autonomous Vehicle Navigation, Collision Avoidance
Abstract: This paper presents an interaction-aware, modular framework for local trajectory planning in autonomous driving, particularly suited for multi-agent racing scenarios. Our framework first identifies viable drivable areas (tunnels), taking into account predictions of other agents’ behaviors, and subsequently utilizes a high-level decision-making module to select the optimal corridor considering both static and moving vehicles. This decision-making module also strategically determines when to follow an opponent or initiate an overtaking maneuver, while ensuring compliance with racing regulations. A Model Predictive Control (MPC) module is then employed to compute an optimal, collision-free trajectory within the chosen corridor. The proposed modular architecture simplifies the computational complexity typically associated with MPC optimization and facilitates independent component testing. Simulations and real-world tests on various racing tracks demonstrate the efficacy of our approach, even in highly dynamic interactive scenarios with multiple simultaneous opponents; videos of these and additional experiments are available at https://atoschi.github.io/tunnels-framework/.
|
|
15:25-15:30, Paper WeCT23.6 | |
MultiNash-PF: A Particle Filtering Approach for Computing Multiple Local Generalized Nash Equilibria in Trajectory Games |
|
Bhatt, Maulik | University of California, Berkeley |
Askari, Iman | University of Kansas |
Yu, Yue | University of Minnesota |
Topcu, Ufuk | The University of Texas at Austin |
Fang, Huazhen | University of Kansas |
Mehr, Negar | University of California Berkeley |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Optimization and Optimal Control, Human-Aware Motion Planning
Abstract: Modern robotic systems frequently engage in complex multi-agent interactions, many of which are inherently multi-modal, i.e., they can lead to multiple distinct outcomes. To interact effectively, robots must recognize the possible interaction modes and adapt to the one preferred by other agents. In this work, we propose MultiNash-PF, an efficient algorithm for capturing the multimodality in multi-agent interactions. We model interaction outcomes as equilibria of a game-theoretic planner, where each equilibrium corresponds to a distinct interaction mode. Our framework formulates interactive planning as Constrained Potential Trajectory Games (CPTGs), in which local Generalized Nash Equilibria (GNEs) represent plausible interaction outcomes. We propose to integrate the potential game approach with implicit particle filtering, a sample-efficient method for non-convex trajectory optimization. We utilize implicit particle filtering to identify coarse estimates of multiple local minimizers of the game's potential function. MultiNash-PF then refines these estimates with optimization solvers, obtaining distinct local GNEs. We show through numerical simulations that MultiNash-PF reduces computation time by up to 50% compared to a baseline. We further demonstrate the effectiveness of our algorithm in real-world human-robot interaction scenarios, where it successfully accounts for the multi-modal nature of interactions and resolves potential conflicts in real time.
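To illustrate the coarse-then-refine structure, here is a sketch that samples particles, refines the best ones with a local optimizer, and keeps distinct minima. It stands in for, and is much simpler than, the paper's implicit particle filtering over game potentials; all names and the two-well toy potential are illustrative.

import numpy as np
from scipy.optimize import minimize

def multi_start_local_minima(potential, dim, n_particles=200, n_refine=10, seed=0):
    """Particle sampling + local refinement to recover multiple local minima
    of a non-convex potential (a stand-in for implicit particle filtering)."""
    rng = np.random.default_rng(seed)
    particles = rng.uniform(-3, 3, size=(n_particles, dim))
    costs = np.array([potential(p) for p in particles])
    minima = []
    for p in particles[np.argsort(costs)[:n_refine]]:   # refine the best particles
        res = minimize(potential, p)
        if res.success and not any(np.linalg.norm(res.x - m) < 0.3 for m in minima):
            minima.append(res.x)                        # keep distinct minima only
    return minima

# Two-well potential with minima near x = -1 and x = +1 (two "interaction modes").
f = lambda x: (x[0] ** 2 - 1.0) ** 2
print(multi_start_local_minima(f, dim=1))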
|
|
15:30-15:35, Paper WeCT23.7 | |
Cooperative Multi-Robot Path Finding with Removable Obstacles for Autonomous Environment Modification |
|
Kiruthika, Usha | National Institute of Technology Tiruchirappalli |
Fung, Wai-keung | Cardiff Metropolitan University |
Erraguntla, Abhinav | National Institute of Technology Tiruchirappalli |
S, Soma Vignesh | National Institute of Technology Tiruchirappalli |
S, Krupasagar Reddy | National Institute of Technology Tiruchirappalli |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Cooperating Robots, Multi-Robot Systems
Abstract: Multi-Robot Navigation Among Movable Obstacles (MR-NAMO) is a variant of the Multi-Agent Path Finding (MAPF) problem in which the environment consists of both immovable and movable obstacles. In this paper, we introduce a special case of MR-NAMO called the multi-agent path finding with removable obstacles (MAPF-RO) problem, in which the robots cooperate to remove some obstacles along the busy paths in the environment. We model the removable obstacles as pits and remove them by filling them with sandbags. Sandbags are modeled as movable obstacles that become removable once filled into a pit. Obstacles are removed or moved away from the paths while minimizing the total energy required for all robots to reach their goals. The nearest sandbag for filling a pit is identified using a kd-tree-based heuristic search, and the nearest robot to push a sandbag is identified using a directional wavefront propagation algorithm. We simulate the scenario in randomized grid environments consisting of static, movable, and removable obstacles. We find that our approach conserves energy by removing only the necessary obstacles and cooperatively shortens the paths of other agents. This method can be applied to multi-robot cooperative environment modification, enabling robots to alter their surroundings to optimize task execution for their peers while reducing overall energy expenditure.
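A minimal sketch of the nearest-sandbag lookup, assuming 2D grid coordinates and SciPy's cKDTree; the data here are illustrative, and the paper's heuristic search adds availability and path costs on top of a plain nearest-neighbor query.

import numpy as np
from scipy.spatial import cKDTree

# Sandbag grid positions; the pit queries its nearest available sandbag.
sandbags = np.array([[2, 3], [8, 1], [5, 9], [1, 7]])
tree = cKDTree(sandbags)

pit = np.array([4, 8])
dist, idx = tree.query(pit)   # nearest-neighbor lookup, O(log n) on average
print(f"fill pit at {pit.tolist()} with sandbag {idx} "
      f"at {sandbags[idx].tolist()}, dist={dist:.2f}")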
|
|
15:35-15:40, Paper WeCT23.8 | |
Decentralized Uncertainty-Aware Multi-Agent Collision Avoidance with Model Predictive Path Integral |
|
Dergachev, Stepan | HSE University |
Yakovlev, Konstantin | Federal Research Center "Computer Science and Control" of the Ru |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems
Abstract: Decentralized multi-agent navigation under uncertainty is a complex task that arises in numerous robotic applications. It requires collision avoidance strategies that account for kinematic constraints as well as sensing and action-execution noise. In this paper, we propose a novel approach that integrates the Model Predictive Path Integral (MPPI) method with a probabilistic adaptation of Optimal Reciprocal Collision Avoidance. Our method ensures safe and efficient multi-agent navigation by incorporating probabilistic safety constraints directly into the MPPI sampling process via a Second-Order Cone Programming formulation. This approach enables agents to operate independently using local noisy observations while maintaining safety guarantees. We validate our algorithm through extensive simulations with differential-drive robots and benchmark it against state-of-the-art methods, including ORCA-DD and B-UAVC. Results demonstrate that our approach outperforms them while achieving high success rates, even in densely populated environments. Additionally, validation in the Gazebo simulator confirms its practical applicability to robotic platforms. Source code is available at: http://github.com/PathPlanning/MPPI-Collision-Avoidance.
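For readers unfamiliar with MPPI, a bare-bones single-agent iteration looks roughly as follows. This sketch omits the paper's probabilistic ORCA constraints and SOCP projection, and the toy cost and parameters are illustrative.

import numpy as np

def mppi_update(u_nominal, rollout_cost, n_samples=256, sigma=0.5, lam=1.0):
    """One MPPI iteration: perturb the nominal control sequence, score the
    rollouts, and take the exponentially weighted average of perturbations."""
    rng = np.random.default_rng()
    eps = rng.normal(0.0, sigma, size=(n_samples,) + u_nominal.shape)
    costs = np.array([rollout_cost(u_nominal + e) for e in eps])
    w = np.exp(-(costs - costs.min()) / lam)   # softmax-style importance weights
    w /= w.sum()
    return u_nominal + np.tensordot(w, eps, axes=1)

# Toy example: drive a 3-step control sequence toward the target value 2.0.
cost = lambda u: float(np.sum((u - 2.0) ** 2))
u = np.zeros(3)
for _ in range(20):
    u = mppi_update(u, cost)
print(u)  # approaches [2, 2, 2]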
|
|
WeCT24 |
102C |
Computer Vision for Transportation 1 |
Regular Session |
|
15:00-15:05, Paper WeCT24.1 | |
VaLID: Verification As Late Integration of Detections for LiDAR-Camera Fusion |
|
Vats, Vanshika | University of California Santa Cruz |
Nizam, Marzia Binta | University of California Santa Cruz |
Davis, James | UC Santa Cruz |
Keywords: Computer Vision for Transportation, Intelligent Transportation Systems, Object Detection, Segmentation and Categorization
Abstract: Vehicle object detection benefits from both LiDAR and camera data, with LiDAR offering superior performance in many scenarios. Fusion of these modalities further enhances accuracy, but existing methods often introduce complexity or dataset-specific dependencies. In our study, we propose a model-adaptive late-fusion method, VaLID, which validates whether each predicted bounding box is acceptable or not. Our method verifies the higher-performing, yet overly optimistic LiDAR model detections using camera detections that are obtained from either specially trained, general, or open-vocabulary models. VaLID uses a lightweight neural verification network trained with a high recall bias to reduce the false predictions made by the LiDAR detector, while still preserving the true ones. Evaluating with multiple combinations of LiDAR and camera detectors on the KITTI dataset, we reduce false positives by an average of 63.9%, thus outperforming the individual detectors on 3D average precision (3DAP). Our approach is model-adaptive and demonstrates state-of-the-art competitive performance even when using generic camera detectors that were not trained specifically for this dataset.
|
|
15:05-15:10, Paper WeCT24.2 | |
Lightweight Temporal Transformer Decomposition for Federated Autonomous Driving |
|
Do, Tuong | AIOZ |
Nguyen, Binh | AIOZ |
Tran, Quang | AIOZ |
Tjiputra, Erman | AIOZ |
Chiu, Te-Chuan | National Tsing Hua University |
Nguyen, Anh | University of Liverpool |
Keywords: Computer Vision for Transportation, Visual Learning
Abstract: Traditional vision-based autonomous driving systems often face difficulties navigating complex environments when relying solely on single-image inputs. To overcome this limitation, incorporating temporal data, such as past image frames or steering sequences, has proven effective in enhancing robustness and adaptability in challenging scenarios. While previous high-performance methods exist, they often rely on resource-intensive fusion networks, making them impractical to train and unsuitable for federated learning. To address these challenges, we propose lightweight temporal transformer decomposition, a method that processes sequential image frames and temporal steering data by breaking down large attention maps into smaller matrices. This approach reduces model complexity, enabling efficient weight updates for convergence and real-time predictions while leveraging temporal information to enhance autonomous driving performance. Intensive experiments on three datasets demonstrate that our method outperforms recent approaches by a clear margin while achieving real-time performance. Additionally, real-robot experiments further confirm the effectiveness of our method. Our source code can be found at: https://github.com/aioz-ai/LTFed
|
|
15:10-15:15, Paper WeCT24.3 | |
MambaMap: Online Vectorized HD Map Construction Using State Space Model |
|
Yang, Ruizi | Zhejiang University |
Liu, Xiaolu | Zhejiang University |
Chen, Junbo | Udeer AI |
Zhu, Jianke | Zhejiang University |
Keywords: Computer Vision for Transportation, Deep Learning for Visual Perception, Mapping
Abstract: High-definition (HD) maps are essential for autonomous driving, as they provide precise road information for downstream tasks. Recent advances highlight the potential of temporal modeling in addressing challenges like occlusions and extended perception range. However, existing methods either fail to fully exploit temporal information or incur substantial computational overhead in handling extended sequences. To tackle these challenges, we propose MambaMap, a novel framework that efficiently fuses long-range temporal features in the state space to construct online vectorized HD maps. Specifically, MambaMap incorporates a memory bank to store and utilize information from historical frames, dynamically updating BEV features and instance queries to improve robustness against noise and occlusions. Moreover, we introduce a gating mechanism in the state space, selectively integrating dependencies of map elements with high computational efficiency. In addition, we design innovative multi-directional and spatial-temporal scanning strategies to enhance feature extraction at both BEV and instance levels. These strategies significantly boost the prediction accuracy of our approach while ensuring robust temporal consistency. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that our proposed MambaMap approach outperforms state-of-the-art methods across various splits and perception ranges. Source code will be available at https://github.com/ZiziAmy/MambaMap.
|
|
15:15-15:20, Paper WeCT24.4 | |
Knowledge Distillation for Semantic Segmentation: A Label Space Unification Approach |
|
Backhaus, Anton | University of the Bundeswehr Munich |
Luettel, Thorsten | Universität Der Bundeswehr München |
Mirko, Maehlisch | University of the Bundeswehr Munich |
Keywords: Computer Vision for Transportation, Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception
Abstract: An increasing number of datasets sharing similar domains for semantic segmentation have been published over the past few years. But despite the growing amount of overall data, it is still difficult to train bigger and better models due to inconsistency in taxonomy and/or labeling policies of different datasets. To this end, we propose a knowledge distillation approach that also serves as a label space unification method for semantic segmentation. In short, a teacher model is trained on a source dataset with a given taxonomy, then used to pseudo-label additional data for which ground truth labels of a related label space exist. By mapping the related taxonomies to the source taxonomy, we create constraints within which the model can predict pseudo-labels. Using the improved pseudo-labels we train student models that consistently outperform their teachers in two challenging domains, namely urban and off-road driving. Our ground truth-corrected pseudo-labels span over 12 and 7 public datasets with 388,230 and 18,558 images for the urban and off-road domains, respectively, creating the largest compound datasets for autonomous driving to date.
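The taxonomy constraint can be pictured as masking the teacher's prediction to the fine classes compatible with the coarse ground truth before taking the argmax. A minimal per-pixel sketch, with a hypothetical coarse-to-fine map (the mapping and class indices are illustrative, not the paper's taxonomies):

import numpy as np

# Hypothetical taxonomy map: each coarse label in the target dataset is
# compatible with a set of fine classes from the source taxonomy.
COARSE_TO_FINE = {0: [0, 1], 1: [2], 2: [3, 4]}   # e.g. "vehicle" -> {car, truck}

def constrained_pseudo_label(teacher_probs, coarse_label):
    """Argmax over teacher class probabilities, restricted to fine classes
    that map to the pixel's coarse ground-truth label."""
    allowed = COARSE_TO_FINE[coarse_label]
    masked = np.full_like(teacher_probs, -np.inf)
    masked[allowed] = teacher_probs[allowed]
    return int(np.argmax(masked))

probs = np.array([0.1, 0.3, 0.2, 0.35, 0.05])           # teacher softmax, one pixel
print(constrained_pseudo_label(probs, coarse_label=0))  # -> 1, not the global argmax 3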
|
|
15:20-15:25, Paper WeCT24.5 | |
ResLPR: A LiDAR Data Restoration Network and Benchmark for Robust Place Recognition against Weather Corruptions |
|
Kuang, Wenqing | National University of Defense Technology |
Zhao, Xiongwei | Harbin Institute of Technology |
Shen, Yehui | Northeastern University |
Wen, Congcong | New York University Abu Dhabi |
Lu, Huimin | National University of Defense Technology |
Zhou, Zongtan | National University of Defense Technology |
Chen, Xieyuanli | National University of Defense Technology |
Keywords: Deep Learning Methods, Computer Vision for Transportation
Abstract: LiDAR-based place recognition (LPR) is a key component for autonomous driving, and its resilience to environmental corruption is critical for safety in high-stakes applications. While state-of-the-art (SOTA) LPR methods perform well in clean weather, they still struggle with weather-induced corruption commonly encountered in driving scenarios. To tackle this, we propose ResLPRNet, a novel LiDAR data restoration network that largely enhances LPR performance under adverse weather by restoring corrupted LiDAR scans using a wavelet transform-based network. ResLPRNet is efficient and lightweight and can be integrated plug-and-play with pretrained LPR models without substantial additional computational cost. Given the lack of LPR datasets under adverse weather, we introduce ResLPR, the first benchmark that examines SOTA LPR methods under a wide range of LiDAR distortions induced by severe snow, fog, and rain conditions. Experiments on our proposed WeatherKITTI and WeatherNCLT datasets demonstrate the resilience and notable gains achieved by using our restoration method with multiple LPR approaches in challenging weather scenarios. Our code and benchmark are publicly available here: https://github.com/nubot-nudt/ResLPR.
|
|
15:25-15:30, Paper WeCT24.6 | |
Moving Object Segmentation Via 3D LiDAR Data: A Learning-Free Real-Time Online Alternative |
|
Yi, Zinuo | Technical University of Munich |
Neumann, Felix | Siemens AG |
v. Wichert, Georg | Siemens AG |
Burschka, Darius | Technische Universitaet Muenchen |
Keywords: Computer Vision for Transportation, Object Detection, Segmentation and Categorization, Visual Tracking
Abstract: Motion detection in 3D LiDAR is crucial for autonomous systems. While deep learning dominates Moving Object Segmentation (MOS), the potential of learning-free approaches remains underexplored. Unlike problems such as semantic segmentation, motion can be explicitly modeled, potentially enabling efficient, interpretable, and computationally lightweight solutions. Motivated by this, we introduce a novel real-time, online, learning-free MOS method. We propose the novel Join Count Feature to extract motion cues from a local window of range images, together with long-term filtering and an efficient two-step association to enhance accuracy. Compared to learning-based models, we achieve superior precision and competitive IoU for saliently moving objects on SemanticKITTI. Further evaluation on HeLiMOS demonstrates stronger generalization of the proposed method across different LiDAR sensors. These results highlight the potential of learning-free methods for motion detection in 3D LiDAR data.
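As a simplified illustration of learning-free motion cues from range images (not the paper's Join Count Feature or its long-term filtering), one can flag aligned pixels whose range changes sharply between scans; the arrays and threshold are illustrative.

import numpy as np

def motion_mask(range_prev, range_curr, thresh=0.5):
    """Naive per-pixel motion cue on aligned range images: a large range
    change between scans suggests a moving surface (valid pixels only)."""
    valid = (range_prev > 0) & (range_curr > 0)   # 0 encodes "no return"
    return valid & (np.abs(range_curr - range_prev) > thresh)

prev = np.array([[10.0, 10.0], [5.0, 0.0]])
curr = np.array([[10.1,  8.0], [5.0, 7.0]])
print(motion_mask(prev, curr))   # only the pixel that jumped from 10 to 8 is flagged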
|
|
15:30-15:35, Paper WeCT24.7 | |
Enhanced Motion Forecasting with Plug-And-Play Multimodal Large Language Models |
|
Luo, Katie | Cornell University |
Ji, Jingwei | Waymo LLC |
He, Tong | Google Deepmind |
Xu, Runsheng | UCLA |
Xie, Yichen | University of California, Berkeley |
Anguelov, Dragomir | Waymo |
Tan, Mingxing | Waymo Research |
Keywords: Computer Vision for Transportation, Computer Vision for Automation, Autonomous Vehicle Navigation
Abstract: Current autonomous driving systems rely on specialized models for perceiving and predicting motion, which demonstrate reliable performance in standard conditions. However, generalizing cost-effectively to diverse real-world scenarios remains a significant challenge. To address this, we propose Plug-and-Forecast (PnF), a plug-and-play approach that augments existing motion forecasting models with multimodal large language models (MLLMs). PnF builds on the insight that natural language provides a more effective way to describe and handle complex scenarios, enabling quick adaptation to targeted behaviors. We design prompts to extract structured scene understanding from MLLMs and distill this information into learnable embeddings to augment existing behavior prediction models. Our method leverages the zero-shot reasoning capabilities of MLLMs to achieve significant improvements in motion prediction performance, while requiring no fine-tuning—making it practical to adopt. We validate our approach on two state-of-the-art motion forecasting models using the Waymo Open Motion Dataset and the nuScenes Dataset, demonstrating consistent performance improvements across both benchmarks.
|
|
15:35-15:40, Paper WeCT24.8 | |
CRUISE: Cooperative Reconstruction and Editing in V2X Scenarios Using Gaussian Splatting |
|
Xu, Haoran | Beijing Institute of Technology |
Zhang, Saining | Nanyang Technological University |
Li, Peishuo | Nanyang Technological University |
Ye, Baijun | Tsinghua University |
Chen, Xiaoxue | Tsinghua University |
Gao, Huan-ang | Tsinghua University |
Zheng, Jv | Lightwheel.AI |
Song, Xiaowei | Tongji University |
Peng, Ziqiao | Renmin University of China |
Miao, Run | Beijing University of Technology |
Jia, Jinrang | Baidu Inc |
Shi, Yifeng | BAIDU.INC |
Yi, Guangqi | Baidu Inc |
Zhao, Hang | Tsinghua University |
Tang, Hao | Peking University |
Li, Hongyang | University of Hong Kong |
Yu, Kaicheng | Westlake University |
Zhao, Hao | Tsinghua University |
Keywords: Computer Vision for Transportation, Simulation and Animation
Abstract: Vehicle-to-everything (V2X) communication plays a crucial role in autonomous driving, enabling cooperation between vehicles and infrastructure. While simulation has significantly contributed to various autonomous driving tasks, its potential for data generation and augmentation in V2X scenarios remains underexplored. In this paper, we introduce CRUISE, a comprehensive reconstruction and synthesis framework designed for V2X driving environments. CRUISE employs decomposed Gaussian Splatting to accurately reconstruct real-world scenes while enabling flexible editing. By decomposing dynamic traffic participants into editable Gaussian representations, CRUISE allows for seamless modification and augmentation of driving scenes. Furthermore, the framework renders images from both ego-vehicle and infrastructure perspectives, enabling large-scale V2X dataset augmentation for training and evaluation. Our experimental results demonstrate that: (1) CRUISE reconstructs real-world V2X driving scenes with high fidelity; (2) using CRUISE improves 3D detection across ego-vehicle, infrastructure, and cooperative views, as well as cooperative 3D tracking in the V2X-Seq benchmark; and (3) CRUISE effectively generates challenging corner cases.
|
|
WeCT25 |
103A |
Legged Robots 3 - Control |
Regular Session |
Chair: Cheng, Hui | Sun Yat-Sen University |
|
15:00-15:05, Paper WeCT25.1 | |
Enhancing the Flexibility of a Quadruped Robot with a 2-DOF Active Spine Using Nonlinear Model Predictive Control |
|
Yang, Zeyi | Sun Yat-Sen University |
Xu, Zhiyong | Sun Yat-Sen University |
Rong, Haomin | Sun Yat-Sen University |
Mo, Shaolin | Sun Yat-Sen University |
Chen, Yuying | Sun Yat-Sen University |
Chen, Zujian | Shenzhen University |
Wang, Tao | Sun Yat-Sen University |
Cheng, Hui | Sun Yat-Sen University |
Keywords: Legged Robots, Biologically-Inspired Robots, Optimization and Optimal Control
Abstract: For quadrupeds, a flexible spine allows them to traverse space and make quick turns. From the perspective of mechanical design in quadruped robots, an active spine with 2 degrees of freedom (2-DOF) can achieve dynamic posture adjustment similar to that of biological organisms, allowing pitch and yaw control. In this work, we present a novel approach to enhance the flexibility of a quadruped robot, Yat-sen Lion II, by incorporating a 2-DOF active spine, mechanically designed as a linkage-driven parallelogram mechanism. To optimize its motion, we utilize nonlinear model predictive control (NMPC), which combines centroidal dynamics with full kinematics. By incorporating the two extra DOFs of the spinal joint into the generalized coordinates and velocities, we represent the robot as a hybrid dynamic system, capturing the intricate interplay between the legs and spine. Centroidal dynamics act as a crucial bridge between joint movements and the robot's overall momentum, enabling the controller to synchronize the quadruped's movements with dynamic spinal adjustments and adaptive gait patterns. We validate our approach through both simulation and real-world experiments. We compare the spined quadruped robot to its rigid-spine counterpart across key locomotion metrics, including in-place turning, straight-line speed, and turning radius. The results indicate that the spined quadruped robot outperforms its rigid counterpart by up to 26%, highlighting its flexibility.
|
|
15:05-15:10, Paper WeCT25.2 | |
Learning to Traverse Challenging Terrain Using Vision and Forward Kinematics |
|
Dong, Jiajun | Tianjin University |
Xu, Yanbin | School of Electrical and Information Engineering, Tianjin Univer |
Ren, Chao | Tianjin University |
Mu, Chaoxu | Tianjin University |
Dong, Feng | Tianjin University |
Keywords: Legged Robots, Reinforcement Learning
Abstract: In this letter, we propose a new visual locomotion controller for quadruped robots, aimed at enhancing their capability to traverse challenging terrain. Our approach integrates computer vision techniques with robust locomotion control to improve terrain traversal. To facilitate terrain perception, we use onboard cameras and body sensors to collect real-world visual and proprioceptive data, and utilize forward kinematics to convert joint angles into precise foot positions. This enables accurate estimation of terrain height, which serves as supervised training data for our visual motion controller. This integrated approach improves the robot's ability to anticipate and adapt to diverse terrain conditions, potentially advancing quadruped locomotion in unstructured environments. Our model is deployed on the A1 robot from Unitree. Experimental results show that our proposed method enables the robot to stably climb stairs and traverse sand, grass, snow, and uneven roads.
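The forward-kinematics step can be illustrated with a simplified planar two-link leg; a sketch assuming hypothetical link lengths (the A1's legs additionally have an abduction joint, which is omitted here):

import math

def foot_position(hip_angle, knee_angle, l1=0.2, l2=0.2):
    """Planar forward kinematics of a 2-link leg: joint angles (rad) to the
    foot position in the hip frame (x forward, z down)."""
    x = l1 * math.sin(hip_angle) + l2 * math.sin(hip_angle + knee_angle)
    z = l1 * math.cos(hip_angle) + l2 * math.cos(hip_angle + knee_angle)
    return x, z

x, z = foot_position(math.radians(30), math.radians(-60))
print(f"foot at x={x:.3f} m, z={z:.3f} m below hip")

Comparing such foot positions against the camera's view of the ground is what lets the measured contact geometry supervise the terrain-height estimate.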
|
|
15:10-15:15, Paper WeCT25.3 | |
Analysis of Compliant Torso Vibration on Passive Quadruped Walkers |
|
Xiang, Yuxuan | Japan Advanced Institute of Science and Technology |
Zheng, Yanqiu | Ritsumeikan University |
Asano, Fumihiko | Japan Advanced Institute of Science and Technology |
Tokuda, Isao T. | Ritsumeikan University |
Keywords: Legged Robots, Passive Walking, Biologically-Inspired Robots
Abstract: Quadrupedal locomotion involves coordinated interaction between limbs and torso, enabling them to achieve remarkable movement performance and adapt effectively to various environments. In previous studies, mathematical dynamic models of quadrupeds have been established to investigate the mechanisms of limb-torso interaction during walking. However, due to the strong nonlinearity within the model, analyzing how the torso's motion, especially vibrations, affects walking remains a significant challenge. In this study, the linearization and frequency analysis methods are applied to the quadruped walker to analyze its vibration characteristics, including natural frequency and vibration amplitude. Subsequently, numerical simulations are conducted to examine the relationship between torso vibration and walking performance. Furthermore, a comparison between the vibration characteristics and the simulation results reveals a potential resonance phenomenon. This finding not only validates the effectiveness of the linearization approach but also offers new insights into the interaction between the limbs and torso.
|
|
15:15-15:20, Paper WeCT25.4 | |
Unified Locomotion Transformer with Simultaneous Sim-To-Real Transfer for Quadrupeds |
|
Liu, Dikai | NVIDIA |
Zhang, Tianwei | Nanyang Technological University |
Yin, Jianxiong | NVIDIA |
See, Simon | NVIDIA |
Keywords: Legged Robots, Machine Learning for Robot Control
Abstract: Quadrupeds have advanced rapidly in their capability to traverse complex terrains. The adoption of deep Reinforcement Learning (RL), transformers, and various knowledge transfer techniques can greatly reduce the sim-to-real gap. However, the classical teacher-student framework commonly used in existing locomotion policies requires a pre-trained teacher and leverages privileged information to guide the student policy. With the adoption of large-scale models in robotics controllers, especially transformer-based ones, this knowledge distillation technique starts to show its weakness in efficiency due to the requirement of multiple supervised stages. In this paper, we propose the Unified Locomotion Transformer (ULT), a new transformer-based framework that unifies knowledge transfer and policy optimization in a single network while still taking advantage of privileged information. The policies are optimized with reinforcement learning, next state-action prediction, and action imitation, all in just one training stage, to achieve zero-shot deployment. Evaluation results demonstrate that with ULT, optimal teacher and student policies can be obtained at the same time, greatly easing the difficulty of knowledge transfer, even with complex transformer-based models.
|
|
15:20-15:25, Paper WeCT25.5 | |
Generalized Locomotion in Out-Of-Distribution Conditions with Robust Transformer |
|
Guo, Lingxiao | Shanghai Jiao Tong University |
Gao, Yue | Shanghai JiaoTong University |
Keywords: Legged Robots, Sensorimotor Learning, Reinforcement Learning
Abstract: To succeed in the real world, robots must deal with situations that differ from those seen during training. For legged robots, such out-of-distribution situations mainly comprise challenging dynamic gaps and perceptual gaps. Here we study the problem of robust locomotion in such novel situations. While previous methods usually rely on designing elaborate training and adaptation techniques, we approach the problem from a network model perspective. Our approach, the RObust Locomotion Transformer (ROLT), a variation of the transformer, achieves robustness in a variety of unseen conditions. ROLT introduces two key designs: body tokenization and consistent dropout. Body tokenization supports knowledge sharing across different limbs, which boosts the generalization ability of the network. Meanwhile, a novel dropout strategy enhances the policy's robustness to unseen perceptual noise. We conduct extensive experiments on both quadruped and hexapod robots. Results demonstrate that ROLT is more robust than existing methods. Although trained in only a few dynamic settings, the learned policy generalizes well to multiple unseen dynamic conditions. Additionally, despite training with clean observations, the model handles challenging corruption noise during testing.
|
|
15:25-15:30, Paper WeCT25.6 | |
QuietPaw: Learning Quadrupedal Locomotion with Versatile Noise Preference Alignment |
|
Zhang, Yuyou | Carnegie Mellon University |
Yao, Yihang | CMU |
Liu, Shiqi | Carnegie Mellon University |
Niu, Yaru | Carnegie Mellon University |
Lin, Changyi | Carnegie Mellon University |
Yang, Yuxiang | Google Deepmind |
Yu, Wenhao | Google |
Zhang, Tingnan | Google |
Tan, Jie | Google |
Zhao, Ding | Carnegie Mellon University |
Keywords: Legged Robots, Reinforcement Learning
Abstract: When operating in their full capacity, quadrupedal robots can produce loud footstep noise, which can be disruptive in human-centered environments like homes, offices, and hospitals. As a result, balancing locomotion performance with noise constraints is crucial for the successful real-world deployment of quadrupedal robots. However, achieving adaptive noise control is challenging due to (a) the trade-off between agility and noise minimization, (b) the need for generalization across diverse deployment conditions, and (c) the difficulty of effectively adjusting policies based on noise requirements. We propose QuietPaw, a framework incorporating our Conditional Noise-Constrained Policy (CNCP), a constrained learning-based algorithm that enables flexible, noise-aware locomotion by conditioning policy behavior on noise-reduction levels. We leverage value representation decomposition in the critics, disentangling state representations from condition-dependent representations and this allows a single versatile policy to generalize across noise levels without retraining while improving the Pareto trade-off between agility and noise reduction. We validate our approach in simulation and the real world, demonstrating that CNCP can effectively balance locomotion performance and noise constraints, achieving continuously adjustable noise reduction.
|
|
15:30-15:35, Paper WeCT25.7 | |
KD-RIEKF: Kinodynamic Right-Invariant EKF for Legged Robot State Estimation |
|
Yang, Qi | Tsinghua University |
Lan, Bin | Jianghuai Advanced Technology Center |
Chen, Bingjie | Tsinghua |
Wang, Jingjing | Jianghuai Advance Technology Center |
Cheng, Yi | Tsinghua University |
Li, Yizhe | Tsinghua University |
Liu, Houde | Shenzhen Graduate School, Tsinghua University |
Liang, Bin | Center for Artificial Intelligence and Robotics, Graduate School |
Keywords: Legged Robots, Dynamics, Kinematics
Abstract: We present KD-RIEKF, a novel state estimation framework that incorporates kinodynamic constraints into the Right-Invariant Extended Kalman Filter (RIEKF). Our framework integrates generalized momentum-based contact estimation, centroidal dynamics, and a noise-adaptive module, improving state estimation accuracy by probabilistically adjusting propagation noise to account for contact uncertainty and sensor noise. A key innovation is the inclusion of the ground reaction force (GRF) as a state variable. By using GRF-based acceleration as a measurement, our method significantly reduces estimation errors in position, velocity, and orientation. The integration of contact-force-driven adaptive noise effectively boosts estimation stability, especially when the system is turning, accelerating, or decelerating. We validated our algorithm in simulation on highly uneven terrain, showing significant improvements in z-axis position estimation compared to RIEKF. Further experiments on the Unitree Go2 robot at different speeds demonstrated that even in high-speed scenarios over 200 meters, our method reduced the relative error (RE) of position estimation by 47% and of orientation estimation by 42%, confirming its robustness and accuracy under dynamic locomotion.
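The noise-adaptation idea can be caricatured as inflating the propagation noise when the contact estimate is ambiguous. A toy sketch of that general mechanism follows; the scaling rule and constants are illustrative, not the paper's formulation.

import numpy as np

def adapt_process_noise(Q_base, contact_prob, k=5.0):
    """Scale propagation noise by contact uncertainty: when the contact
    estimate is ambiguous (prob near 0.5), inflate Q; when confident, keep it."""
    uncertainty = 1.0 - abs(2.0 * contact_prob - 1.0)   # 0 (sure) .. 1 (ambiguous)
    return Q_base * (1.0 + k * uncertainty)

Q = np.eye(3) * 1e-4
print(np.diag(adapt_process_noise(Q, contact_prob=0.95)))  # near-nominal noise
print(np.diag(adapt_process_noise(Q, contact_prob=0.50)))  # inflated noise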
|
|
WeCT26 |
103B |
Localization 3 |
Regular Session |
|
15:00-15:05, Paper WeCT26.1 | |
ORA-NET: Enhancing Image Feature Matching through Oriented Overlapping Region Alignment |
|
Cui, Te | Beijing Institute of Technology |
Wang, Meiling | Beijing Institute of Technology |
Chen, Guangyan | Beijing Institute of Technology |
Yu, Meng | Beijing Institute of Technology |
Yue, Yufeng | Beijing Institute of Technology |
Keywords: Localization, Deep Learning for Visual Perception
Abstract: Image feature matching is a fundamental task in computer vision. Existing local feature matching methods can establish robust correspondences between image pairs. However, these methods heavily rely on dense local image features, making them susceptible to significant perspective differences, characterized by rotation and scale changes. To alleviate this limitation, we introduce a novel oriented Overlapping Region Alignment method, named ORA-NET, which presents a concise and efficient approach to enhance the performance of image feature matching methods. We introduce the Multidirectional Cross-scale Feature Aggregation module to aggregate rotation-equivariant features across multiple scales and model long-range dependencies. Additionally, the Oriented Overlap Alignment module estimates scale and rotation differences within overlapping regions using a coarse-to-fine rotation correction approach. Importantly, our method serves as a plug-and-play module that can be seamlessly integrated into other correspondence matching pipelines. Experimental results demonstrate that ORA-NET significantly enhances the matching performance of existing local feature matching methods, particularly in scenarios involving substantial perspective differences.
|
|
15:05-15:10, Paper WeCT26.2 | |
GOTPR: General Outdoor Text-Based Place Recognition Using Scene Graph Retrieval with OpenStreetMap |
|
Jung, Donghwi | Seoul National University |
Kim, Keonwoo | Seoul National University |
Kim, Seong-Woo | Seoul National University |
Keywords: Localization, Deep Learning Methods
Abstract: We propose GOTPR, a robust place recognition method designed for outdoor environments where GPS signals are unavailable. Unlike existing approaches that use point cloud maps, which are large and difficult to store, GOTPR leverages scene graphs generated from text descriptions and maps for place recognition. This method improves scalability by replacing point clouds with compact data structures, allowing robots to efficiently store and utilize extensive map data. In addition, GOTPR eliminates the need for custom map creation by using publicly available OpenStreetMap data, which provides global spatial information. We evaluated its performance using the KITTI360Pose dataset with corresponding OpenStreetMap data, comparing it to existing point cloud-based place recognition methods. The results show that GOTPR achieves comparable accuracy while significantly reducing storage requirements. In city-scale tests, it completed processing within a few seconds, making it highly practical for real-world robotics applications. More information can be found at https://donghwijung.github.io/GOTPR_page/.
|
|
15:10-15:15, Paper WeCT26.3 | |
MetaSonic: Advancing Robot Localization with Directional Embedded Acoustic Signals |
|
Wang, Junling | Shanghai Jiao Tong University |
An, Zhenlin | University of Pittsburgh |
Guo, Yi | Shanghai Jiao Tong University |
Keywords: Localization, Robot Audition, Deep Learning Methods
Abstract: Indoor positioning in environments where GPS cannot be used is a fundamental technology for robot navigation and human-robot interaction. However, existing vision-based localization systems cannot work in low-visibility environments, and existing wireless or acoustic localization systems require specific transceivers, making them expensive and power-intensive — particularly challenging for micro-robots. This paper proposes a new metasurface-assisted ultrasound positioning system. The key idea is to use a low-cost passive acoustic metasurface to transform any speaker into a directional ultrasound source, with the acoustic spectrum varying based on direction. This allows any microrobot with a simple, low-cost microphone to capture such modified sound and identify the direction of the sound source. We develop a lightweight convolutional neural network-based localization algorithm that can be efficiently deployed on low-power microcontrollers. We evaluate our system in a large complex office. It achieves a direction estimation accuracy of 7.26°, improving by 42.2% compared to systems without the metasurface and matching the performance of a 4-microphone array, with a localization accuracy of 0.35 m.
|
|
15:15-15:20, Paper WeCT26.4 | |
GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization |
|
Sidorov, Gennady | Saint Petersburg National Research University of Information Tec |
Mohrat, Malik | ITMO University |
Gridusov, Denis | ITMO University |
Rakhimov, Ruslan | T-Tech |
Kolyubin, Sergey | ITMO University |
Keywords: Localization, Mapping, Visual Learning
Abstract: Visual localization methods often present a trade-off between the high efficiency of specialized approaches, such as scene coordinate regression, and the need for rich, versatile scene representations for broader robotics tasks. To bridge this gap, we explore the use of 3D Gaussian Splatting (3DGS), which enables a unified, photorealistic encoding of 3D geometry and appearance. We propose GSplatLoc, a self-contained framework that tightly integrates structure-based keypoint matching with rendering-based pose refinement. Our two-stage procedure first distills robust descriptors from the lightweight XFeat extractor into the 3DGS model, enabling coarse pose estimation via 2D-3D correspondences without external dependencies. In the second stage, the initial pose is refined by minimizing a photometric warp loss, which leverages the fast, differentiable rendering of 3DGS. Benchmarking on widely used indoor and outdoor datasets demonstrates state-of-the-art performance among neural rendering-based localization methods and highlights the framework's robustness in challenging dynamic scenes. Project page: https://gsplatloc.github.io
|
|
15:20-15:25, Paper WeCT26.5 | |
Semantic-Geometric Triple Constraint-Based Graph Matching for Robust Multi-Robot Global Localization in Large-Scale Environments |
|
Wang, Fan | Hefei Institutes of Physical Science, Chinese Academy of Science |
Zhang, Chaofan | Chinese Academy of Sciences |
Zhang, Wen | Institute of Applied Technology, Hefei Institutes of Physical Sc |
Yingwei, Xia | Chinese Academy of Sciences |
Liu, Yong | Institute of Applied Technology, Hefei Institutes of Physical Sc |
Keywords: Localization, Multi-Robot Systems, Recognition
Abstract: Vision-based multi-robot global localization in large-scale environments is a challenging task due to real-time environmental changes, which complicate data association between viewpoints. Recently, researchers have proposed methods to enhance viewpoint invariance by organizing object semantic information using graph structures. However, previous works still face limitations in accuracy and robustness in real-world scenarios, primarily due to the limited descriptive power of feature information (semantic or geometric) and susceptibility to noise. In this paper, we propose a novel semantic-geometric triple constraint-based graph matching multi-robot global localization method, called SGT-MGL. We first extract the semantic and geometric features of objects and describe the distribution of neighboring objects using topological structures. To improve object discriminability, we construct a triangular descriptor for each object based on semantic information and relative distances. Considering the complementarity between semantic and geometric information, we introduce a 3D histogram descriptor that encodes semantic, spatial angular, and relative distance information, enhancing the invariance of the triangular descriptor. To further mitigate noise, we propose a candidate point selection strategy guided by global geometric structures and employ a combined local and global graph matching approach for 6-DOF pose estimation. We extensively evaluate SGT-MGL on three public datasets, demonstrating superior accuracy and robustness. The implementation of SGT-MGL will be available.
|
|
15:25-15:30, Paper WeCT26.6 | |
VIPeR: Visual Incremental Place Recognition with Adaptive Mining and Continual Learning |
|
Ming, Yuhang | Hangzhou Dianzi University |
Xu, Minyang | Hangzhou Dianzi University |
Yang, Xingrui | CARDC |
Ye, Weicai | Zhejiang University |
Wang, Weihan | Stevens Institute of Technology |
Peng, Yong | Hangzhou Dianzi University |
Dai, Weichen | Hangzhou Dianzi University |
Kong, Wanzeng | Hangzhou Dianzi University |
Keywords: Localization, Recognition, Continual Learning
Abstract: Visual place recognition (VPR) is essential to many autonomous systems. Existing VPR methods demonstrate attractive performance at the cost of limited generalizability. When deployed in unseen environments, these methods exhibit significant performance drops. Targeting this issue, we present VIPeR, a novel approach for visual incremental place recognition with the ability to adapt to new environments while retaining the performance of previous ones. We first introduce an adaptive mining strategy that balances the performance within a single environment and the generalizability across multiple environments. Then, to prevent catastrophic forgetting in lifelong learning, we design a novel multi-stage memory bank for explicit rehearsal. Additionally, we propose a probabilistic knowledge distillation to explicitly safeguard the previously learned knowledge. We evaluate our proposed VIPeR on three large-scale datasets: Oxford Robotcar, Nordland, and TartanAir. For comparison, we first set a baseline performance with naive finetuning. Then, several more recent lifelong learning methods are compared. Our VIPeR achieves better performance in almost all aspects with the biggest improvement of 13.85% in average performance.
|
|
15:30-15:35, Paper WeCT26.7 | |
ALCDNet: Loop Closure Detection Based on Acoustic Echoes |
|
Liu, GuangYao | Zhejiang University |
Cui, Weimeng | Zhejiang University |
Jia, Naizheng | Zhejiang University |
Xi, Yuzhang | Zhejiang University |
Li, Shuyu | Zhejiang University |
Wang, Zhi | Zhejiang University |
Keywords: Localization, Sensor Fusion
Abstract: Loop closure detection is a critical component of simultaneous localization and mapping (SLAM) systems, essential for mitigating the drift that accumulates over time. Traditional approaches utilizing light detection and ranging (LiDAR) and cameras have been developed to address this challenge. However, these methods can be ineffective when there is a lack of visual cues, such as smoke, poor lighting conditions, and textureless environments. In this letter, we propose an efficient loop closure detection method that employs a speaker and microphone array to gather spatial structure information. First, our method uses a microphone array to capture echoes from finely designed signals emitted by the speaker. Second, we apply momentum contrastive learning (MoCo) to train an echo feature encoder to learn the implicit spatial features embedded in the echo signals. Finally, loop closure detection is performed by computing the cosine similarity of features output by the encoding network from echo information at different locations. Experiments conducted in typical indoor environments demonstrate that our method outperforms vision-based methods in most cases and can still achieve accurate loop closure detection in smoky environments where both LiDAR and vision-based methods fail. This makes it a viable and cost-effective complementary solution in environments with sparse texture features, unstable lighting conditions or smoke. Code will be available at https://github.com/zjuersdsd/ALCDNet.
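The final matching step above reduces to a cosine-similarity search over learned echo embeddings. A minimal sketch, assuming the trained encoder is given and using a hypothetical acceptance threshold:

```python
# Minimal sketch: loop closure by cosine similarity of learned echo
# features, as described above; the encoder itself is assumed given.
import numpy as np

def cosine_similarity(f1, f2):
    return float(np.dot(f1, f2) /
                 (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-12))

def detect_loop(query_feat, database_feats, threshold=0.85):
    """Return (best_index, best_score) if the score clears the
    threshold, else (None, best_score). Threshold is illustrative."""
    scores = [cosine_similarity(query_feat, f) for f in database_feats]
    best = int(np.argmax(scores))
    return (best if scores[best] >= threshold else None), scores[best]
```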
|
|
15:35-15:40, Paper WeCT26.8 | |
A Wearable, Reconfigurable, and Modular Magnetic Tracking System for Wireless Capsule Robots (I) |
|
Su, Shijian | Huaqiao University |
Yuan, Sishen | The Chinese University of Hong Kong |
Li, Zhen | Qilu Hospital of Shandong University |
Ma, Yan | The Chinese University of Hong Kong |
Ma, Miaomiao | Qilu Hospital of Shandong University |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: Localization, Sensor Fusion, Wearable Robotics
Abstract: Wearable magnetic tracking systems (MTSs) offer a promising technology for the long-term tracking of wireless-capsule robots within the digestive tract. However, existing wearable MTSs are fixed in size and cannot accommodate patients with diverse abdominal circumferences. To address this limitation, we propose a wearable and reconfigurable MTS. First, we design a reconfigurable sensor array inspired by the structure of bamboo slips, allowing it to conform to the abdominal surface and accommodate individuals with different abdominal circumferences. Next, we formulate a magnetic tracking optimization problem based on the magnetic dipole model and our established kinematic model of the reconfigurable sensor array. Solving the magnetic tracking problem, we achieved outstanding localization accuracy of 1.44±0.50 mm in position and 1.07±0.16° in orientation. Experimental validation demonstrates our proposed system's portability, reconfigurability, and adaptability to varying abdominal circumferences, offering valuable technological means for diagnosing and treating gastrointestinal disorders.
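The optimization rests on the point magnetic dipole model, B(r) = (mu0/4pi)(3(m·r_hat)r_hat - m)/|r|^3. Below is a sketch of that field equation and a stacked residual suitable for least-squares tracking; the state packing and sensor handling are illustrative assumptions, not the paper's kinematic model.

```python
# Sketch of the magnetic dipole model and a least-squares residual for
# capsule tracking; names and parameterization are illustrative.
import numpy as np

MU0_OVER_4PI = 1e-7  # T*m/A

def dipole_field(p_capsule, m, p_sensor):
    """Flux density of a point dipole with moment m (A*m^2) at
    p_capsule, evaluated at sensor position p_sensor."""
    r = p_sensor - p_capsule
    d = np.linalg.norm(r)
    r_hat = r / d
    return MU0_OVER_4PI * (3.0 * np.dot(m, r_hat) * r_hat - m) / d**3

def residuals(state, sensor_positions, measurements):
    """Stacked residuals; state packs capsule position and dipole
    moment (an illustrative 6-parameter choice)."""
    p, m = state[:3], state[3:]
    return np.concatenate([dipole_field(p, m, s) - z
                           for s, z in zip(sensor_positions, measurements)])
```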
|
|
WeCT27 |
103C |
Planning, Scheduling and Coordination 1 |
Regular Session |
|
15:00-15:05, Paper WeCT27.1 | |
Trajectory Optimization under Stochastic Dynamics Leveraging Maximum Mean Discrepancy |
|
Sharma, Basant | University of Tartu |
Singh, Arun Kumar | University of Tartu |
Keywords: Planning under Uncertainty, Motion and Path Planning, Collision Avoidance
Abstract: This paper addresses sampling-based trajectory optimization for risk-aware navigation under stochastic dynamics. Typically, such approaches operate by computing N_tilde perturbed rollouts around the nominal dynamics to estimate the collision risk associated with a sequence of control commands. We consider a setting where it is expensive to estimate risk using perturbed rollouts, for example, due to expensive collision-checks. We put forward two key contributions. First, we develop an algorithm that distills the statistical information from a larger set of rollouts to a reduced-set with sample size N, where N << N_tilde. Consequently, we estimate collision risk using just N rollouts instead of N_tilde. Second, we formulate a novel surrogate for the collision risk that can leverage the distilled statistical information contained in the reduced-set. We formalize both algorithmic contributions using distribution embedding in Reproducing Kernel Hilbert Space (RKHS) and Maximum Mean Discrepancy (MMD). We perform extensive benchmarking to demonstrate that our MMD-based approach leads to safer trajectories in the low-sample regime than existing baselines using Conditional Value-at-Risk (CVaR) based collision risk estimates.
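For reference, the MMD between two sample sets compares their kernel mean embeddings. A minimal sketch of the biased squared-MMD estimate with an RBF kernel follows; the random subset stands in for the paper's optimized reduced-set, and the kernel bandwidth is an illustrative choice.

```python
# Sketch of a (biased) squared-MMD estimate with an RBF kernel between
# a full rollout set X and a reduced set Y, per the idea above.
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=0.5):
    """Biased estimator of MMD^2 between samples X (n,d) and Y (m,d)."""
    return (rbf_kernel(X, X, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))        # stand-in for N_tilde rollouts
Y = X[rng.choice(500, 50, False)]    # stand-in for the reduced set
print(mmd2(X, Y))
```

In the paper the reduced-set is optimized rather than drawn at random; minimizing this kind of MMD between the full rollout set and the reduced-set is what "distilling the statistical information" amounts to in RKHS terms.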
|
|
15:05-15:10, Paper WeCT27.2 | |
An Adaptive ROS2 Node Deployment Framework in Mobile Edge-Robot Systems |
|
Yang, Xincheng | China Agricultural University |
Hu, Biao | China Agricultural University |
|
|
15:10-15:15, Paper WeCT27.3 | |
ConvoyLLM: Dynamic Multi-Lane Convoy Control Using LLMs |
|
Lu, Liping | Wuhan University of Technology |
He, Zhican | Wuhan University of Technology |
Chu, Duanfeng | Wuhan University of Technology |
Wang, Rukang | Wuhan University of Technology |
Peng, Saiqian | Wuhan University of Technology |
Zhou, Pan | Huazhong University of Science and Technology |
Keywords: Planning, Scheduling and Coordination, AI-Enabled Robotics, Intelligent Transportation Systems
Abstract: This paper proposes a novel method for multilane convoy formation control that uses large language models (LLMs) to tackle coordination challenges in dynamic highway environments. Each connected and autonomous vehicle in the convoy uses a knowledge-driven approach to make real-time adaptive decisions based on various scenarios. Our method enables vehicles to dynamically perform tasks, including obstacle avoidance, convoy joining/leaving, and escort formation switching, all while maintaining the overall convoy structure. We design an interlaced formation control strategy based on locally dynamic distributed graphs, ensuring the convoy remains stable and flexible. We conduct extensive experiments in the SUMO simulation platform across multiple traffic scenarios, and the results demonstrate that the proposed method is effective, robust, and adaptable to dynamic environments. The code is available at: https://github.com/chuduanfeng/ConvoyLLM.
|
|
15:15-15:20, Paper WeCT27.4 | |
Robots Calling the Shots: Using Multiple Ground Robots for Autonomous Tracking in Cluttered Environments |
|
Zhang, Weijian | University of Birmingham |
Street, Charlie | University of Birmingham |
Mansouri, Masoumeh | Birmingham University |
Keywords: Planning, Scheduling and Coordination, Optimization and Optimal Control, Multi-Robot Systems
Abstract: A common task in cinematography is tracking a subject or character through a scene. For complex setups, multiple cameras must track the subject simultaneously to attain sufficient coverage. Recently, researchers have considered using multiple camera-mounted autonomous mobile robots for this task. Existing work is limited to UAVs, which may be unavailable due to cost, safety requirements, or flight restrictions. Therefore, in this paper we present a tracking approach for complex and unstructured environments using differential-drive robots with gimbal-mounted cameras. Differential-drive robots pose a challenge, as their movement is more restricted than UAVs. For this, we introduce a novel hierarchical planning framework which ensures safety and visibility while maximizing shot diversity. We begin by synthesizing a set of paths using sequential greedy viewpoint planning and conflict-based search under a set of optimal viewpoint constraints. These paths then form an initial guess for joint trajectory optimization, which synthesizes stable trajectories under the motion constraints of the robots and gimbals. Empirically, we show how our approach outperforms approaches aimed at UAVs, which may synthesize infeasible trajectories when applied to differential-drive robots.
|
|
15:20-15:25, Paper WeCT27.5 | |
Team Orienteering Problem with Communication Constraints |
|
P. T. Tristão, Marco Túlio | Computer Vision and Robotics Laboratory (VeRLab), Universidade Federal de Minas Gerais |
G. Macharet, Douglas | Universidade Federal De Minas Gerais |
Keywords: Planning, Scheduling and Coordination, Path Planning for Multiple Mobile Robots or Agents, Networked Robots
Abstract: Multi-Robot Systems (MRS) are increasingly utilized in applications such as surveillance, environmental monitoring, and search and rescue, where maximizing mission rewards under budget constraints is critical. The Team Orienteering Problem (TOP) provides a framework for optimizing task coverage and resource allocation in such scenarios. However, traditional TOP formulations often overlook real-world constraints, such as limited communication ranges and the necessity of persistent connectivity among robots. These constraints are particularly relevant in environments like disaster zones and remote areas, where communication infrastructure is unreliable or absent. To address this gap, we propose a multi-objective formulation that balances task coverage, communication quality and energy expenditure under a fixed budget. Our approach accommodates teams of any size and heterogeneous vehicles with varying velocities and constant thrust. We validate our approach through extensive experiments across diverse scenarios and team configurations.
|
|
15:25-15:30, Paper WeCT27.6 | |
RRT*former: Environment-Aware Sampling-Based Motion Planning Using Transformer |
|
Feng, Mingyang | Shanghai Jiao Tong University |
Li, Shaoyuan | Shanghai Jiao Tong University |
Yin, Xiang | Shanghai Jiao Tong Univ |
Keywords: Planning, Scheduling and Coordination, Computer Vision for Transportation, Discrete Event Dynamic Automation Systems
Abstract: We investigate the sampling-based optimal path planning problem for robotics in complex and dynamic environments. Most existing sampling-based algorithms neglect environmental information or the information from previous samples. Yet, these pieces of information are highly informative, as leveraging them can provide better heuristics when sampling the next state. In this paper, we propose a novel sampling-based planning algorithm, called RRT*former, which integrates the standard RRT* algorithm with a Transformer network in a novel way. Specifically, the Transformer is used to extract features from the environment and leverage information from previous samples to better guide the sampling process. Our extensive experiments demonstrate that, compared to existing sampling-based approaches such as RRT*, Neural RRT*, and their variants, our algorithm achieves considerable improvements in both the optimality of the path and sampling efficiency. The code for our implementation is available at https://github.com/fengmingyang666/RRTformer.
|
|
15:30-15:35, Paper WeCT27.7 | |
MaxAuc: A Max-Plus-Based Auction Approach for Multi-Robot Allocations for Time-Ordered Temporal Logic Tasks |
|
Wei, Mengjie | Shanghai Jiao Tong University |
Li, Yuda | Shanghai Jiao Tong University |
Wang, Siqi | Shanghai Jiao Tong University |
Li, Shaoyuan | Shanghai Jiao Tong University |
Yin, Xiang | Shanghai Jiao Tong Univ |
Keywords: Planning, Scheduling and Coordination, Formal Methods in Robotics and Automation, Discrete Event Dynamic Automation Systems
Abstract: In this paper, we investigate a multi-robot task allocation problem where a team of heterogeneous robots operates in a discrete workspace to achieve a set of tasks expressed by linear temporal logic formulas. In contrast to existing works, we further consider inter-task-time-order constraints, which are imposed on the start or end times of each task. Solving such problems generally requires combinatorial search, which is not scalable. Inspired by the efficiency of max-plus algebra in handling time constraints, we propose a novel approach called MaxAuc, which integrates auction-based task allocation with max-plus algebra in a novel manner. Specifically, max-plus computations are performed to approximate task priorities in the auction without explicitly solving the constraint optimization problem. Our numerical results demonstrate that MaxAuc is highly scalable with respect to both the number of robots and the number of tasks, while maintaining a tolerable performance trade-off compared to the baseline’s optimal yet exhaustive solution.
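In max-plus algebra, addition is replaced by max and multiplication by +, so the matrix-vector product reads (A ⊗ x)_i = max_j(A_ij + x_j), which is why time-order constraints compose "linearly" in this algebra. A minimal sketch with illustrative numbers:

```python
# Max-plus algebra sketch: (A ⊗ x)_i = max_j (A_ij + x_j). In max-plus
# scheduling, entries are timing offsets and -inf is the "zero"
# element. Purely illustrative of the algebra named above.
import numpy as np

NEG_INF = -np.inf

def maxplus_matvec(A, x):
    return np.max(A + x[None, :], axis=1)

# Earliest end times of two tasks given precedence offsets:
A = np.array([[3.0, NEG_INF],
              [5.0, 2.0]])
x = np.array([0.0, 1.0])
print(maxplus_matvec(A, x))   # -> [3. 5.]
```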
|
|
15:35-15:40, Paper WeCT27.8 | |
Efficient Human-Aware Task Allocation for Multi-Robot Systems in Shared Environments |
|
Kazemi Eskeri, Maryam | Aalto University |
Kyrki, Ville | Aalto University |
Baumann, Dominik | Aalto University |
Kucner, Tomasz Piotr | Aalto University |
Keywords: Planning, Scheduling and Coordination, Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems
Abstract: Multi-robot systems are increasingly deployed in applications such as intralogistics or autonomous delivery, where multiple robots collaborate to complete tasks efficiently. One of the key factors enabling their efficient cooperation is Multi-Robot Task Allocation (MRTA). Algorithms solving this problem optimize task distribution among robots to minimize the overall execution time. In shared environments, apart from the relative distance between the robots and the tasks, the execution time is also significantly impacted by the delay caused by navigating around moving people. However, most existing MRTA approaches are dynamics-agnostic, relying on static maps and neglecting human motion patterns, leading to inefficiencies and delays. In this paper, we introduce Human-Aware Task Allocation (HATA). This method leverages Maps of Dynamics (MoDs), spatio-temporal queryable models designed to capture historical human movement patterns, to estimate the impact of humans on the task execution time during deployment. HATA utilizes a stochastic cost function that includes MoDs. Experimental results show that integrating MoDs enhances task allocation performance, resulting in reduced mission completion times by up to 26% compared to the dynamics-agnostic method and up to 19% compared to the baseline. This work underscores the importance of considering human dynamics in MRTA within shared environments and presents an efficient framework for deploying multi-robot systems in environments populated by humans.
|
|
WeCT28 |
104 |
Marine Robotics 7 |
Regular Session |
Co-Chair: Graf, Moritz | Technical University of Munich |
|
15:00-15:05, Paper WeCT28.1 | |
Acoustic Neural 3D Reconstruction under Pose Drift |
|
Lin, Tianxiang | Carnegie Mellon University |
Qadri, Mohamad | Carnegie Mellon University |
Zhang, Kevin | University of Maryland, College Park |
Pediredla, Adithya | Dartmouth College |
Metzler, Christopher | University of Maryland, College Park |
Kaess, Michael | Carnegie Mellon University |
Keywords: Marine Robotics, Mapping, Field Robots
Abstract: We consider the problem of optimizing neural implicit surfaces for 3D reconstruction using acoustic images collected with drifting sensor poses. The accuracy of current state-of-the-art 3D acoustic modeling algorithms is highly dependent on accurate pose estimation; small errors in sensor pose can lead to severe reconstruction artifacts. In this paper, we propose an algorithm that jointly optimizes the neural scene representation and sonar poses. Our algorithm does so by parameterizing the 6DoF poses as learnable parameters and backpropagating gradients through the neural renderer and implicit representation. We validated our algorithm on both real and simulated datasets. It produces high-fidelity 3D reconstructions even under significant pose drift.
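The joint optimization described above amounts to registering the 6DoF poses as learnable parameters alongside the network weights, so gradients from the rendering loss update both. A minimal PyTorch sketch with a placeholder renderer; the axis-angle parameterization, learning rates, and pseudo-measurement are assumptions, not the paper's implementation.

```python
# Minimal PyTorch sketch of joint scene/pose optimization: the 6DoF
# sonar poses are learnable, so the rendering loss gradient flows into
# both the implicit representation and the poses.
import torch

n_frames = 10
poses = torch.nn.Parameter(torch.zeros(n_frames, 6))      # [axis-angle | t]
scene = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 1))       # implicit surface

opt = torch.optim.Adam([{"params": scene.parameters(), "lr": 1e-3},
                        {"params": [poses], "lr": 1e-4}])

def render(scene, pose):
    # Placeholder differentiable "renderer": samples points near the
    # frame's translation; a real sonar renderer uses the full pose.
    pts = pose[3:] + 0.01 * torch.randn(32, 3)
    return scene(pts).mean()

for step in range(100):
    sim = torch.stack([render(scene, poses[i]) for i in range(n_frames)])
    loss = ((sim - 0.1) ** 2).mean()    # 0.1: stand-in measured intensity
    opt.zero_grad()
    loss.backward()
    opt.step()                          # updates the poses as well
```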
|
|
15:05-15:10, Paper WeCT28.2 | |
Online Residual Model Learning for Model Predictive Control of Autonomous Surface Vehicles in Real-World Environments |
|
Gamboa-Gonzalez, Arturo | University of Wisconsin-Madison |
Li, Chunlin | University of Toronto |
Wehner, Michael | University of Wisconsin, Madison |
Wang, Wei | University of Wisconsin-Madison |
Keywords: Marine Robotics, Model Learning for Control, Field Robots
Abstract: Model predictive control (MPC) relies on an accurate dynamics model to achieve precise and safe robot operation. In complex and dynamic aquatic environments, developing an accurate model that captures hydrodynamic details and accounts for environmental disturbances like waves, currents, and winds is challenging for aquatic robots. In this paper, we propose an online residual model learning framework for MPC, which leverages approximate models to learn complex unmodeled dynamics and environmental disturbances in dynamic aquatic environments. We integrate offline learning from previous simulation experience with online learning from the robot's real-time interactions with the environments. These three components—residual modeling, offline learning, and online learning—enable a highly sample-efficient learning process, allowing for accurate real-time inference of model dynamics in complex and dynamic conditions. We further integrate this online learning residual model into a nonlinear model predictive controller, enabling it to actively choose the optimal control actions that optimize the control performance. Extensive simulations and real-world experiments with an autonomous surface vehicle demonstrate that our residual model learning MPC significantly outperforms conventional MPCs in dynamic field environments.
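The residual-model idea can be sketched as a nominal analytic model plus a correction fitted to logged prediction errors, which the MPC then queries as its dynamics. The ridge-regression residual below is an illustrative stand-in for the paper's learned model; all names are hypothetical.

```python
# Sketch of residual dynamics for MPC: nominal model plus a linear
# correction refit online from recent transition data.
import numpy as np

class ResidualDynamics:
    def __init__(self, nominal, dim_x, dim_u, lam=1e-2):
        self.nominal = nominal                      # analytic f(x, u)
        self.W = np.zeros((dim_x, dim_x + dim_u))   # residual weights
        self.lam = lam                              # ridge regularizer

    def predict(self, x, u):
        return self.nominal(x, u) + self.W @ np.concatenate([x, u])

    def fit(self, X, U, X_next):
        """Refit the residual on logged transitions (online update)."""
        Z = np.hstack([X, U])                       # (N, dx+du)
        E = X_next - np.array([self.nominal(x, u) for x, u in zip(X, U)])
        A = Z.T @ Z + self.lam * np.eye(Z.shape[1])
        self.W = np.linalg.solve(A, Z.T @ E).T
```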
|
|
15:10-15:15, Paper WeCT28.3 | |
Active Disturbance Rejection Control for Trajectory Tracking of a Seagoing USV: Design, Simulation, and Field Experiments |
|
van der Saag, Jelmer | S&T |
Trevisan, Elia | Delft University of Technology |
Falkena, Wouter | Demcon Unmanned Systems |
Alonso-Mora, Javier | Delft University of Technology |
Keywords: Marine Robotics, Field Robots, Robust/Adaptive Control
Abstract: Unmanned Surface Vessels (USVs) face significant control challenges due to uncertain environmental disturbances like waves and currents. This paper proposes a trajectory tracking controller based on Active Disturbance Rejection Control (ADRC) implemented on the DUS V2500. A custom simulation incorporating realistic waves and current disturbances is developed to validate the controller's performance, supported by further validation through field tests in the harbour of Scheveningen, the Netherlands, and at sea. Simulation results demonstrate that ADRC significantly reduces cross-track error across all tested conditions compared to a baseline PID controller but increases control effort and energy consumption. Field trials confirm these findings while revealing a further increase in energy consumption during sea trials compared to the baseline. Videos can be found at https://autonomousrobots.nl/paper_websites/adrc-demcon.
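At the core of ADRC is an extended state observer (ESO) that estimates the lumped disturbance as an extra state and cancels it in the control law. Below is a textbook linear ESO for a double-integrator plant, with bandwidth-parameterized gains and Euler discretization; this is a generic sketch, not the controller deployed on the DUS V2500.

```python
# Third-order linear extended state observer (ESO) for a plant
# y'' = b0*u + f, where f lumps waves, currents, and model error.
import numpy as np

class LinearESO:
    def __init__(self, b0, omega_o, dt):
        self.b0, self.dt = b0, dt
        # Observer poles placed at -omega_o (bandwidth parameterization).
        self.L = np.array([3 * omega_o, 3 * omega_o**2, omega_o**3])
        self.z = np.zeros(3)               # [y_hat, ydot_hat, f_hat]

    def update(self, y_meas, u):
        e = y_meas - self.z[0]
        _, z2, z3 = self.z
        dz = np.array([z2 + self.L[0] * e,
                       z3 + self.b0 * u + self.L[1] * e,
                       self.L[2] * e])
        self.z = self.z + self.dt * dz
        return self.z

# The ADRC law then cancels the estimated disturbance:
# u = (u0 - f_hat) / b0, with u0 from a simple PD law on the setpoint.
```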
|
|
15:15-15:20, Paper WeCT28.4 | |
Model-Based External Wrench Estimation for Underwater Robots |
|
Graf, Moritz | Technical University of Munich |
Duecker, Daniel Andre | Technical University of Munich (TUM) |
Keywords: Marine Robotics, Field Robots, Force and Tactile Sensing
Abstract: Similarly to aerial drones, small-scale underwater robots are prone to external wrenches resulting from disturbances such as water currents or collisions. Estimating the external wrench acting on an underwater robot is challenging due to non-linear hydrodynamic effects and the bottleneck of being limited to onboard sensing. We build on a model-based approach for aerial wrench estimation and extend it to the underwater domain. Various modifications are applied, such as capturing hydrodynamic effects, and new sensory information is integrated, for example, via Doppler velocity log (DVL). We evaluate the performance of the proposed approach through a series of experiments. Moreover, we assess the effect of fusing various sensor configurations and their respective influence on the wrench estimate, including low-cost vs. high-end IMU and DVL. Our adapted approach from the aerial domain delivers good results in estimating external wrenches on underwater robots. While the IMU quality is found to be less important, considering the underwater domain-specific damping terms is critical.
|
|
15:20-15:25, Paper WeCT28.5 | |
ASV-Aided AUV Navigation: A Field Study on Nonlinear Estimation for Localization of Low-Cost, Scalable Systems |
|
Turrisi, Raymond | Massachusetts Institute of Technology |
Duecker, Daniel Andre | Technical University of Munich (TUM) |
Morrison, John | MIT |
Steinmetz, Fabian | Hamburg University of Technology |
Benjamin, Michael | Massachusetts Institute of Technology |
Keywords: Marine Robotics, Field Robots, Multi-Robot Systems
Abstract: In this work, we investigate the use of multiple Autonomous Surface Vehicles (ASVs) as Communication/Navigation Aids (CNAs) to enhance the navigation and state estimation of an Autonomous Underwater Vehicle (AUV). Our approach builds on recent advancements in low-cost sensors and platforms, which enable novel AUV applications across fundamental science, commercial industries, and defense. We consider six different combinations of Kalman Filter and Factor Graph localization solutions on three datasets, covering 53 minutes and 3.1 kilometers of operation. We first present the solution using the measurements from all three ASVs, before occluding measurements from two of the ASVs to assess the effect of reduced observability on localization performance.
|
|
15:25-15:30, Paper WeCT28.6 | |
All-In-One Defensive Network (ADNet): Trustworthy Segmentation of Complex Maritime Environments for Unmanned Surface Vessels (USVs) |
|
Huang, Yanhong | Wuhan University of Technology |
Duan, Yuze | University College London |
Wu, Peng | University College London |
Liu, Yuanchang | University College London |
Keywords: Marine Robotics, Object Detection, Segmentation and Categorization, Computer Vision for Transportation
Abstract: The visual perception system of unmanned surface vessels (USVs) is often subjected to various adversarial attacks (e.g., lens stains, sun glare, ship painting, etc.), impacting the safety of autonomous navigation in maritime environments. To enhance the reliability and robustness of situational awareness in complex environments, we proposed a defensive model to effectively counteract multiple attacks targeting the perception system. Specifically, we first constructed a maritime instance segmentation dataset including various adversarial attack samples, with accurate annotations for the sky, water, land, ships and obstacles. To address the degradation in perception accuracy caused by adversarial attacks, we introduced a Monte Carlo-based random fusion module (MC Fusion) to enhance the adaptability of USVs in various dynamic environments. Additionally, as USVs are always equipped with onboard PCs with limited computing resources, we incorporated the lightweight universal inverted bottleneck (UIB) module into the backbone to ensure effective feature extraction while reducing model parameters. Finally, we conducted comparative experiments under various adversarial attack scenarios. Our results demonstrate that, even in the presence of multiple adversarial attacks, our method improves ship detection accuracy by 13.9% and increases the mean accuracy of segmentation masks by over 10% compared to state-of-the-art models, enhancing the safety of USVs in navigation. The source code and datasets are available at https://github.com/huangyanh/ADNet.
|
|
15:30-15:35, Paper WeCT28.7 | |
AUV-WTN: AUV Water Tunnel Navigation Framework with Acoustic Interference and Narrow Space Constraints |
|
Zheng, Haotian | Harbin Engineering University |
Sun, Yushan | Harbin Engineering University |
Zhang, Liwen | Harbin Engineering University |
Wang, Xiaotian | Harbin Engineering University |
Ren, Jingfei | Harbin Engineering University |
Fu, Jinyu | Harbin Engineering University |
Keywords: Marine Robotics, Object Detection, Segmentation and Categorization
Abstract: In water conveyance tunnels, autonomous navigation of autonomous underwater vehicles (AUVs) is challenging under accumulated localization errors and severe acoustic interference constraints. An AUV water tunnel navigation (AUV-WTN) framework is proposed to address these challenges. AUV-WTN integrates a forward-looking sonar (FLS) image segmentation method based on the refined mask R-CNN (RM R-CNN) network with real-time trajectory planning that employs the dynamic trajectory homotopy method (DTHM). RM R-CNN is optimized to combine a mixed-frequency block (MFB) along with a weighted loss function. Additionally, precise region of interest pooling (PrRoI Pooling) is incorporated to mitigate the impact of false targets, blurred edges, and noise on segmentation accuracy. DTHM is proposed to reduce trajectory drift by dynamically updating path generation based on segmented FLS images. Experimental results demonstrate that RM R-CNN outperforms state-of-the-art methods, achieving a 10.9% improvement over Mask R-CNN in mask segmentation. Simulation and real AUV experiments indicate that the AUV-WTN framework is effective in generating precise paths and ensuring collision-free navigation in tunnel environments.
|
|
WeCT29 |
105 |
SLAM: Localization 2 |
Regular Session |
Co-Chair: Lu, Huimin | National University of Defense Technology |
|
15:00-15:05, Paper WeCT29.1 | |
Speak the Same Language: Global LiDAR Registration on BIM Using Pose Hough Transform (I) |
|
Qiao, Zhijian | Hong Kong University of Science and Technology |
Huang, Haoming | The Hong Kong University of Science and Technology |
Liu, Chuhao | Hong Kong University of Science and Technology |
Yu, Zehuan | Hong Kong University of Science and Technology |
Shen, Shaojie | Hong Kong University of Science and Technology |
Zhang, Fumin | Hong Kong University of Science and Technology |
Yin, Huan | Hong Kong University of Science and Technology |
Keywords: SLAM, Localization, Robotics and Automation in Construction
Abstract: Light detection and ranging (LiDAR) point clouds and building information modeling (BIM) represent two distinct data modalities in the fields of robot perception and construction. These modalities originate from different sources and are associated with unique reference frames. The primary goal of this study is to align these modalities within a shared reference frame using a global registration approach, effectively enabling them to "speak the same language". To achieve this, we propose a cross-modality registration method, spanning from the front end to the back end. At the front end, we extract triangle descriptors by identifying walls and intersected corners, enabling the matching of corner triplets with a complexity independent of the BIM's size. For the back-end transformation estimation, we utilize the Hough transform to map the matched triplets to the transformation space and introduce a hierarchical voting mechanism to hypothesize multiple pose candidates. The final transformation is then verified using our designed occupancy-aware scoring method. To assess the effectiveness of our approach, we conducted real-world multi-session experiments in a large-scale university building, employing two different types of LiDAR sensors. We make the collected datasets and codes publicly available to benefit the community.
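The back end's idea of mapping matched triplets into transformation space and voting can be sketched by quantizing each candidate transform and keeping the dominant bin. The sketch below is restricted to SE(2) (yaw, tx, ty) with illustrative bin sizes, and omits the paper's hierarchical voting and occupancy-aware verification.

```python
# Sketch of Hough-style voting over quantized transform hypotheses.
import numpy as np
from collections import Counter

def vote_transforms(candidate_transforms, yaw_bin=np.deg2rad(2.0), t_bin=0.5):
    """candidate_transforms: iterable of (yaw, tx, ty), one per matched
    triplet. Returns the center of the dominant bin and its vote count."""
    votes = Counter()
    for yaw, tx, ty in candidate_transforms:
        key = (int(round(yaw / yaw_bin)),
               int(round(tx / t_bin)),
               int(round(ty / t_bin)))
        votes[key] += 1
    (ky, kx, kty), n = votes.most_common(1)[0]
    return np.array([ky * yaw_bin, kx * t_bin, kty * t_bin]), n
```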
|
|
15:05-15:10, Paper WeCT29.2 | |
SLC^2-SLAM: Semantic-Guided Loop Closure Using Shared Latent Code for NeRF SLAM |
|
Ming, Yuhang | Hangzhou Dianzi University |
Ma, Di | Hangzhou Dianzi University |
Dai, Weichen | Hangzhou Dianzi University |
Yang, Han | Hangzhou Dianzi University |
Fan, Rui | Tongji University |
Zhang, Guofeng | Zhejiang University |
Kong, Wanzeng | Hangzhou Dianzi University |
Keywords: SLAM, Localization, Semantic Scene Understanding
Abstract: Targeting the notorious cumulative drift errors in NeRF SLAM, we propose a Semantic-guided Loop Closure using Shared Latent Code, dubbed SLC^2-SLAM. We argue that latent codes stored in many NeRF SLAM systems are not fully exploited, as they are only used for better reconstruction. In this paper, we propose a simple yet effective way to detect potential loops using the same latent codes as local features. To further improve the loop detection performance, we use the semantic information, which is also decoded from the same latent codes, to guide the aggregation of local features. Finally, with the potential loops detected, we close them with a graph optimization followed by bundle adjustment to refine both the estimated poses and the reconstructed scene. To evaluate the performance of our SLC^2-SLAM, we conduct extensive experiments on the Replica and ScanNet datasets. Our proposed semantic-guided loop closure significantly outperforms the pre-trained NetVLAD and ORB combined with Bag-of-Words, which are used in all the other NeRF SLAM systems with loop closure. As a result, our SLC^2-SLAM also demonstrates better tracking and reconstruction performance, especially in larger scenes with more loops, like ScanNet.
|
|
15:10-15:15, Paper WeCT29.3 | |
SGLC: Semantic Graph-Guided Coarse-Fine-Refine Full Loop Closing for LiDAR SLAM |
|
Wang, Neng | National University of Defense Technology |
Chen, Xieyuanli | National University of Defense Technology |
Shi, Chenghao | NUDT |
Zheng, Zhiqiang | National University of Defense Technology |
Yu, Hongshan | Hunan University |
Lu, Huimin | National University of Defense Technology |
Keywords: SLAM, Localization
Abstract: Loop closing in SLAM corrects accumulated errors via detection and pose estimation. Most methods focus on robust descriptors but neglect pose estimation, while existing solutions are either inaccurate or computationally expensive. To tackle this problem, we introduce SGLC, a real-time semantic graph-guided full loop closing method, with robust loop closure detection and 6-DoF pose estimation capabilities. SGLC takes into account the distinct characteristics of foreground and background points. For foreground instances, it builds a semantic graph that not only abstracts point cloud representation for fast descriptor generation and matching but also guides the subsequent loop verification and initial pose estimation. Background points, meanwhile, are exploited to provide more geometric features for scan-wise descriptor construction and stable planar information for further pose refinement. Loop pose estimation employs a coarse-fine-refine registration scheme that considers the alignment of both instance points and background points, offering high efficiency and accuracy. Extensive experiments on multiple publicly available datasets demonstrate its superiority over state-of-the-art methods. Additionally, we integrate SGLC into a SLAM system, eliminating accumulated errors and improving overall SLAM performance. The implementation of SGLC has been released at https://github.com/nubot-nudt/SGLC.
|
|
15:15-15:20, Paper WeCT29.4 | |
DDN-SLAM: Real Time Dense Dynamic Neural Implicit SLAM |
|
Li, Mingrui | Dalian University of Technology |
Guo, Zhetao | Cloudspace Technology Co., Ltd |
Deng, Tianchen | Shanghai Jiao Tong University |
Zhou, Yiming | Saarland University of Applied Science |
Ren, Yuxiang | Beijing Dianjing Ciyuan Culture Communication Co., Ltd. |
Wang, Hongyu | Dalian University of Technology |
Keywords: SLAM, Mapping, Localization
Abstract: Compared with traditional dense SLAM, NeRF-based SLAM systems exhibit superior rendering quality and scene reconstruction in static environments. However, they suffer from tracking drift and mapping errors in real-world scenes with dynamic interference. To address these issues, we propose DDN-SLAM, a real-time dense dynamic neural implicit SLAM system that integrates semantic features. To handle dynamic tracking interference, we propose a feature point segmentation method that combines semantic features with a mixed Gaussian distribution model. To avoid incorrect background removal, we propose a mapping strategy based on sparse point cloud sampling and background restoration. We introduce a dynamic semantic loss to eliminate dynamic occlusions. Experimental results show that DDN-SLAM can robustly track and produce high-quality reconstructions in dynamic environments, while appropriately preserving potentially dynamic objects. Compared with existing neural implicit SLAM systems, tracking results on dynamic datasets show an average improvement of 90% in average trajectory error (ATE) accuracy. https://github.com/DrLi-Ming/DDN-SLAM
|
|
15:20-15:25, Paper WeCT29.5 | |
GLO: General LiDAR-Only Odometry with High Efficiency and Low Drift |
|
Su, Yun | Guangzhou Shiyuan Electronic Technology Company Limited |
Shao, Shiliang | SIA |
Zhang, Ziyong | CVTE |
Xu, Pengfei | CVTE |
Cao, Yong | Guangzhou Shiyuan Electronic Technology Company Limited |
Cheng, Hui | Sun Yat-Sen University |
Keywords: SLAM, Mapping, Localization
Abstract: This study proposes GLO, a general LiDAR-only odometry method with high efficiency and low drift. First, we propose a map data structure using multilevel voxels to improve map update efficiency. Each voxel node actively maintains plane features, minimizing redundant fitting and enhancing matching efficiency. By calculating the occupancy probability of each voxel node, dynamic and unstable points in the map can be efficiently removed. Second, we introduce a weighted elastic matching algorithm that adjusts the weights of each matching constraint across multiple dimensions, such as measurement depth, fitting error, occupancy probability, and matching error, making the matching constraints elastic and enhancing accuracy. This paper also proposes a progressive optimization framework combining prediction, coarse matching, and fine matching. Coarse matching ensures matching convergence and corrects point-cloud motion distortion, while fine matching further refines accuracy. Extensive experiments on public datasets and real LiDAR data demonstrate the efficiency and accuracy of the proposed GLO method compared to state-of-the-art methods. The source code of GLO has been released at https://github.com/robosu12/GLO.
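The occupancy-probability bookkeeping mentioned above is commonly done with per-voxel log-odds updates, letting stable surface voxels saturate while voxels hit by dynamic points decay and can be pruned. A generic sketch with illustrative constants (not GLO's actual parameters):

```python
# Per-voxel log-odds occupancy update; constants are illustrative.
import math

L_HIT, L_MISS, L_MIN, L_MAX = 0.85, -0.4, -2.0, 3.5

def update_logodds(l, hit):
    """Clamp the log-odds so a voxel can recover after the scene changes."""
    return max(L_MIN, min(L_MAX, l + (L_HIT if hit else L_MISS)))

def occupancy_prob(l):
    return 1.0 - 1.0 / (1.0 + math.exp(l))
```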
|
|
15:25-15:30, Paper WeCT29.6 | |
OVSG-SLAM: Open-Vocabulary Semantic Gaussian Splatting SLAM |
|
Liu, ZheHang | Shenzhen MSU-BIT University |
Li, ShiShen | Beijing Institute of Technology |
Huang, GuiChen | Beijing Institute of Technology |
Wu, Yuwei | Beijing Institute of Technology |
Keywords: SLAM, Semantic Scene Understanding, Computer Vision for Automation
Abstract: Most conventional semantic SLAM approaches concentrate on maintaining 3D semantic consistency while overlooking their reliance on predefined semantic categories, ultimately limiting flexibility in scene understanding. We propose Open-Vocabulary Semantic Gaussian Splatting SLAM (OVSG-SLAM), an approach that integrates multi-modal perception and 3D Gaussian splatting into a semantic SLAM framework. By combining the advantages of Segment Anything (SAM) for open-vocabulary 2D scene understanding with the powerful feature extraction capabilities of vision-language models, our method eliminates the reliance on predefined closed-set categories. Although Vision-Language Models (VLMs) provide open-vocabulary reasoning, integrating them with 3D semantic SLAM poses challenges such as embedding ambiguity and computational overhead. To address these challenges, we present a feature embedding strategy called differentiable identity-aware encoding, which reduces computational cost while ensuring accurate semantic mapping. Furthermore, instead of using a traditional semantic loss, we optimize the scene representation through an identity loss. Extensive experimental evaluations on the Replica and ScanNet datasets demonstrate that the proposed method achieves state-of-the-art performance in mapping, tracking and 3D semantic segmentation tasks.
|
|
WeCT30 |
106 |
Aerial Systems: Perception and Autonomy 1 |
Regular Session |
|
15:00-15:05, Paper WeCT30.1 | |
Controlled Shaking of Trees with an Aerial Manipulator (I) |
|
Gonzalez-Morgado, Antonio | Universidad De Sevilla |
Cuniato, Eugenio | ETH Zurich |
Tognon, Marco | Inria Rennes |
Heredia, Guillermo | University of Seville |
Siegwart, Roland | ETH Zurich |
Ollero, Anibal | AICIA. G41099946 |
Keywords: Aerial Systems: Mechanics and Control, Force Control, Aerial Systems: Applications
Abstract: In recent years, the fields of application of aerial manipulators have expanded, ranging from infrastructure inspection to physical interaction with flexible elements such as branches and trees. This paper presents the controlled shaking of a tree with an aerial manipulator. Our work aims at contributing to applications like the identification of tree parameters for environmental health monitoring or the collection of samples and fruits by vibration. To this end, we propose a control strategy for the controlled shaking of flexible systems. We adopt a self-excited oscillation strategy that induces vibrations at the natural frequency of the system, at which the greatest amplification, and therefore the greatest vibrations, occur. Likewise, this work presents a simplified 1DoF model based on the Rayleigh-Ritz method to analyze the dynamic interaction between a tree and the aerial manipulator under the controlled shaking strategy. The proposed control strategy is evaluated through indoor experiments, where an aerial manipulator shakes an indoor tree made of bamboo canes. Experimental results show that the proposed model can properly estimate the amplitude and frequency of the vibration, depending on the grasping point and the control gain of the self-excited oscillation strategy.
|
|
15:05-15:10, Paper WeCT30.2 | |
Adaptive Optimal Admittance Control for Robotic Precision Grinding Based on Improved Normalized Advantage Function (I) |
|
Wu, Haotian | Huazhong University of Science and Technology |
Yang, Jianzhong | Huazhong University of Science and Technology |
Huang, Si | Huazhong University of Science and Technology |
Li, Jiahui | Huazhong University of Science and Technology |
Keywords: Aerial Systems: Mechanics and Control, Force Control, Industrial Robots
Abstract: Precision force control technology is central to robotic precision grinding, which presents challenges: optimizing parameters, managing significant contact impacts, and handling uncertain environmental conditions. This paper focuses on the robotic precision grinding of blades. To mitigate the pulse impact during initial contact, we devised a post online fastest-tracking differentiator. Furthermore, to enhance force control stability during the grinding, we introduced an adaptive optimal admittance control strategy utilizing the linear quadratic regulator (LQR) to determine optimal control parameters. However, solving the Riccati equation in LQR poses a challenge due to the unknown environment. To tackle this issue, we devised a model-based accelerated learning approach based on the Normalized Advantage Function. This method employs a dual network architecture for policy learning and enhances the reward and acceleration mechanism to improve training speed and convergence. The experiment was performed through robotic precision grinding on aero engine blades. Results indicate that our method enhances force control precision by approximately 210% compared to traditional fixed-parameter grinding methods. Post-grinding, blade roughness improved from Ra0.4 to Ra0.3, and profile deviation was reduced from [0.050, -0.070] to [0.049, -0.030].
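The LQR step referenced above solves an algebraic Riccati equation for the optimal gains. A minimal sketch with SciPy for an illustrative second-order contact model follows; the plant numbers and weights are placeholders, and the paper's point is precisely that the true environment parameters are unknown and must be learned rather than assumed.

```python
# Sketch of an LQR design step: solve the continuous-time algebraic
# Riccati equation for a mass-damper-stiffness model and recover the
# optimal feedback gains. All numbers are illustrative.
import numpy as np
from scipy.linalg import solve_continuous_are

m, d, k = 1.0, 20.0, 500.0                     # admittance M, D, K
A = np.array([[0.0, 1.0], [-k / m, -d / m]])
B = np.array([[0.0], [1.0 / m]])
Q = np.diag([100.0, 1.0])                      # state weights
R = np.array([[0.01]])                         # effort weight

P = solve_continuous_are(A, B, Q, R)
K_lqr = np.linalg.solve(R, B.T @ P)            # optimal law: u = -K_lqr @ x
print(K_lqr)
```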
|
|
15:10-15:15, Paper WeCT30.3 | |
FACT: Fast and Active Coordinate Initialization for Vision-Based Drone Swarms |
|
Li, Yuan | Zhejiang University |
Zhao, Anke | Zhejiang University |
Wang, Yingjian | Zhejiang University |
Xu, Ziyi | Zhejiang University |
Zhou, Xin | ZHEJIANG UNIVERSITY |
Xu, Chao | Zhejiang University |
Zhou, Jinni | Hong Kong University of Science and Technology (Guangzhou) |
Gao, Fei | Zhejiang University |
Keywords: Aerial Systems: Perception and Autonomy, Swarm Robotics, Vision-Based Navigation
Abstract: Coordinate initialization is the first step in accomplishing collaborative tasks within robot swarms, determining the quality of tasks. However, fast and robust coordinate initialization in vision-based drone swarms remains elusive. To this end, our paper proposes a complete system for initial relative pose estimation, including both relative state estimation and active planning. Particularly, our work fuses onboard visual-inertial odometry with vision-based observations generating bearing and range measurements, which are anonymous, partially mutual, and noisy. It is the first method based on convex optimization to initialize coordinates with vision-based observations. Additionally, we designed a lightweight module to actively control the movement of robots for observation acquisition and collision avoidance. With only stereo cameras and inertial measurement units as sensors, we validate the practicability of our system in simulations and real-world areas with obstacles and without signals from the Global Navigation Satellite System. Compared to methods based on local optimization and filters, our system can achieve the global optimum for coordinate initialization more stably and quickly, which is suitable for robots with size, weight, and power constraints. The source code is released for reference.
|
|
15:15-15:20, Paper WeCT30.4 | |
NavRL: Learning Safe Flight in Dynamic Environments |
|
Xu, Zhefan | Carnegie Mellon University |
Han, Xinming | Carnegie Mellon University |
Shen, Haoyu | Carnegie Mellon University |
Jin, Hanyu | Carnegie Mellon University |
Shimada, Kenji | Carnegie Mellon University |
Keywords: Aerial Systems: Perception and Autonomy, Reinforcement Learning, Collision Avoidance
Abstract: Safe flight in dynamic environments requires unmanned aerial vehicles (UAVs) to make effective decisions when navigating cluttered spaces with moving obstacles. Traditional approaches often decompose decision-making into hierarchical modules for prediction and planning. Although these handcrafted systems can perform well in specific settings, they might fail if environmental conditions change and often require careful parameter tuning. Additionally, their solutions could be suboptimal due to the use of inaccurate mathematical model assumptions and simplifications aimed at achieving computational efficiency. To overcome these limitations, this paper introduces the NavRL framework, a deep reinforcement learning-based navigation method built on the Proximal Policy Optimization (PPO) algorithm. NavRL utilizes our carefully designed state and action representations, allowing the learned policy to make safe decisions in the presence of both static and dynamic obstacles, with zero-shot transfer from simulation to real-world flight. Furthermore, the proposed method adopts a simple but effective safety shield for the trained policy, inspired by the concept of velocity obstacles, to mitigate potential failures associated with the black-box nature of neural networks. To accelerate the convergence, we implement the training pipeline using NVIDIA Isaac Sim, enabling parallel training with thousands of quadcopters. Simulation and physical experiments show that our method ensures safe navigation in dynamic environments and results in the fewest collisions compared to benchmarks.
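The velocity-obstacle-inspired safety shield can be pictured as a post-policy filter: if the commanded velocity falls inside an obstacle's collision cone, it is rotated out to the nearest cone boundary. Below is a 2D, single-obstacle sketch under those assumptions; the actual shield's geometry and multi-obstacle handling are not reproduced here.

```python
# Velocity-obstacle-style safety shield: project an unsafe velocity
# command to the boundary of the collision cone. 2D for brevity.
import numpy as np

def shield(v_cmd, p_rel, v_obs, r_comb):
    """p_rel: obstacle position relative to the robot; r_comb: sum of radii."""
    v_rel = v_cmd - v_obs
    d = np.linalg.norm(p_rel)
    if d <= r_comb:                          # already overlapping: retreat
        return v_obs - p_rel / d
    half = np.arcsin(r_comb / d)             # collision-cone half angle
    speed = np.linalg.norm(v_rel)
    if speed < 1e-9:
        return v_cmd
    ang = np.arccos(np.clip(p_rel @ v_rel / (d * speed), -1.0, 1.0))
    if ang >= half:
        return v_cmd                         # outside the cone: keep command
    # Rotate v_rel out to the nearer cone edge (2D cross picks the side).
    cr = p_rel[0] * v_rel[1] - p_rel[1] * v_rel[0]
    rot = (1.0 if cr >= 0 else -1.0) * (half - ang)
    c, s = np.cos(rot), np.sin(rot)
    return np.array([c * v_rel[0] - s * v_rel[1],
                     s * v_rel[0] + c * v_rel[1]]) + v_obs
```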
|
|
15:20-15:25, Paper WeCT30.5 | |
EPIC: A Lightweight LiDAR-Based UAV Exploration Framework for Large-Scale Scenarios |
|
Geng, Shuang | The University of Hong Kong |
Ning, Zelin | Sun Yat-Sen University |
Zhang, Fu | University of Hong Kong |
Zhou, Boyu | Southern University of Science and Technology |
Keywords: Aerial Systems: Perception and Autonomy, Motion and Path Planning, Aerial Systems: Applications
Abstract: Autonomous exploration is a fundamental problem for various applications of unmanned aerial vehicles (UAVs). Recently, LiDAR-based exploration has gained significant attention due to its ability to generate high-precision point cloud maps of large-scale environments. While the point clouds are inherently informative for navigation, many existing exploration methods still rely on additional, often expensive, environmental representations. This reliance stems from two main reasons: the need for frontier detection or information gain computation, which typically depends on memory-intensive occupancy grid maps, and the high computational complexity of path planning directly on point clouds, primarily due to costly collision checking. To address these limitations, we present EPIC, a lightweight LiDAR-based UAV exploration framework that directly exploits point cloud data to explore large-scale environments. EPIC introduces a novel observation map derived directly from the quality of point clouds, eliminating the need for global occupancy grid maps while preserving comprehensive exploration capabilities. We also propose an incremental topological graph construction method operating directly on point clouds, enabling real-time path planning in large-scale environments. Leveraging these components, we build a hierarchical planning framework that generates agile and energy-efficient trajectories, achieving significantly reduced memory consumption and computation time compared to most existing methods. Extensive simulations and real-world experiments demonstrate that EPIC achieves faster exploration while significantly reducing memory consumption compared to state-of-the-art methods.
|
|
15:25-15:30, Paper WeCT30.6 | |
Intent Prediction-Driven Model Predictive Control for UAV Planning and Navigation in Dynamic Environments |
|
Xu, Zhefan | Carnegie Mellon University |
Jin, Hanyu | Carnegie Mellon University |
Han, Xinming | Carnegie Mellon University |
Shen, Haoyu | Carnegie Mellon University |
Shimada, Kenji | Carnegie Mellon University |
Keywords: Aerial Systems: Perception and Autonomy, Integrated Planning and Control, RGB-D Perception
Abstract: Aerial robots can enhance construction site productivity by autonomously handling inspection and mapping tasks. However, ensuring safe navigation near human workers remains challenging. While navigation in static environments has been well studied, navigating dynamic environments remains open due to challenges in perception and planning. Payload limitations restrict the robots to using cameras with limited fields of view, resulting in unreliable perception and tracking during collision avoidance. Moreover, the rapidly changing conditions of dynamic environments can quickly make the generated optimal trajectory outdated. To address these challenges, this paper presents a comprehensive navigation framework that integrates perception, intent prediction, and planning. Our perception module detects and tracks dynamic obstacles efficiently and handles tracking loss and occlusion during collision avoidance. The proposed intent prediction module employs a Markov Decision Process (MDP) to forecast potential actions of dynamic obstacles along with their possible future trajectories. Finally, a novel intent-based planning algorithm, leveraging model predictive control (MPC), is applied to generate navigation trajectories. Simulation and physical experiments demonstrate that our method improves the safety of navigation by achieving the fewest collisions compared to benchmarks.
|
|
15:30-15:35, Paper WeCT30.7 | |
Seeing through Pixel Motion: Learning Obstacle Avoidance from Optical Flow with One Camera |
|
Hu, Yu | Shanghai Jiao Tong University |
Zhang, Yuang | Shanghai Jiao Tong University |
Song, Yunlong | University of Zurich |
Deng, Yang | Shanghai Jiao Tong University |
Yu, Feng | Shanghai Jiao Tong University |
Zhang, Linzuo | Shanghai Jiao Tong University |
Lin, Weiyao | Shanghai Jiaotong University |
Zou, Danping | Shanghai Jiao Ton University |
Yu, Wenxian | Shanghai Jiao Tong University |
Keywords: Aerial Systems: Perception and Autonomy, Collision Avoidance, Sensorimotor Learning
Abstract: Optical flow captures the motion of pixels in an image sequence over time, providing information about movement, depth, and environmental structure. Flying insects utilize this information to navigate and avoid obstacles, allowing them to execute highly agile maneuvers even in complex environments. Despite its potential, autonomous flying robots have yet to fully leverage this motion information to achieve comparable levels of agility and robustness. The main challenges are two-fold: (1) extracting accurate optical flow from visual data during high-speed flight and (2) designing a robust controller that can handle noisy optical flow estimations while ensuring robust performance in complex environments. To address these challenges, we propose a novel end-to-end system for quadrotor obstacle avoidance using monocular optical flow. We develop an efficient differentiable simulator coupled with a simplified quadrotor model, allowing our policy to be trained directly through first-order gradient optimization. Additionally, we introduce a central flow attention mechanism and an action-guided active sensing strategy that enhances the policy’s focus on task-relevant optical flow observations to enable more responsive decision-making during flight. Our system is validated both in simulation and the real world using an FPV racing drone. Despite being trained in a simple environment in simulation, our system demonstrates agile and robust flight in various unknown, cluttered environments in the real world at speeds of up to 6m/s.
|
|
15:35-15:40, Paper WeCT30.8 | |
Learning Cross-Modal Visuomotor Policies for Autonomous Drone Navigation |
|
Zhang, Yuhang | Nanyang Technological University |
Xiao, Jiaping | Nanyang Technological University |
Feroskhan, Mir | Nanyang Technological University |
Keywords: Aerial Systems: Perception and Autonomy, Reinforcement Learning, Deep Learning for Visual Perception
Abstract: Developing effective vision-based navigation algorithms adapting to various scenarios is a significant challenge for autonomous drone systems, with vast potential in diverse real-world applications. This paper proposes a novel visuomotor policy learning framework for monocular autonomous navigation, combining cross-modal contrastive learning with deep reinforcement learning (DRL) to train a visuomotor policy. Our approach first leverages contrastive learning to extract consistent, task-focused visual representations from high-dimensional RGB images, aligned with those of depth images, and then directly maps these representations to action commands with DRL. This framework enables RGB images to capture structural and spatial information similar to depth images, which remains largely invariant under changes in lighting and texture, thereby maintaining robustness across various environments. We evaluate our approach through simulated and physical experiments, showing that our visuomotor policy outperforms baseline methods in both effectiveness and resilience to unseen visual disturbances. Our findings suggest that the key to enhancing transferability in monocular RGB-based navigation lies in achieving consistent, well-aligned visual representations across scenarios, which is an aspect often lacking in traditional end-to-end approaches. Video is available at https://www.youtube.com/watch?v=cN0cI4My22E&t=25s.
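The cross-modal contrastive objective can be sketched as a symmetric InfoNCE loss that pulls paired RGB and depth embeddings together and pushes mismatched pairs apart. The encoders are assumed given and the temperature is an illustrative choice; this is a generic sketch, not the paper's exact loss.

```python
# Symmetric InfoNCE loss aligning paired RGB and depth embeddings.
import torch
import torch.nn.functional as F

def cross_modal_infonce(z_rgb, z_depth, temperature=0.1):
    """z_rgb, z_depth: (B, D) embeddings of paired frames."""
    z_rgb = F.normalize(z_rgb, dim=1)
    z_depth = F.normalize(z_depth, dim=1)
    logits = z_rgb @ z_depth.t() / temperature   # (B, B) similarities
    targets = torch.arange(z_rgb.size(0))        # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = cross_modal_infonce(torch.randn(8, 128), torch.randn(8, 128))
```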
|
|
WeDT1 |
401 |
Sensor Fusion & SLAM 3 |
Regular Session |
|
16:40-16:45, Paper WeDT1.1 | |
Ultra-Wideband Assisted Visual-Inertial Localization Correction System with Position-Unknown UWB Anchors |
|
Xing, Yu | Beijing Institute of Technology |
Li, Weixing | Beijing Institute of Technology |
Pan, Feng | Beijing Institute of Technology |
Feng, Xiaoxue | Beijing Institute of Technology |
Keywords: Localization, Sensor Fusion, Range Sensing
Abstract: Since visual-inertial odometry (VIO) suffers from localization drift over long trajectories, we utilize drift-free Ultra-Wideband (UWB) measurements to eliminate accumulated errors in VIO. Existing UWB-VIO fusion methods are mostly constrained by the accuracy of prior UWB anchor positions. However, in large-scale localization scenarios, the precise locations of UWB anchors are difficult to obtain, and the offline calibration process is complex, significantly limiting flexibility. In this paper, we first design a lightweight initialization method based on a dual sliding window structure, which can rapidly obtain initial guesses for the UWB anchor coordinates. After that, we further propose a joint estimation system to refine the anchor coordinates while estimating the correction for VIO. The system combines filter-based and optimization-based methods, and mainly consists of an initialization module and a nonlinear estimator module. The filter in the initialization module provides optimization initial values and covariances, and, in turn, the optimization results from the nonlinear estimator provide priors for the filter. Finally, the performance of our proposed approach is verified through both public datasets and real-world experiments. Our project, along with our dataset, has been open-sourced as a ROS package and ROS bag.
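Conceptually, initializing an unknown anchor reduces to least-squares trilateration: with VIO positions p_i and UWB ranges d_i, find the anchor a minimizing sum_i (||p_i - a|| - d_i)^2. A bare Gauss-Newton sketch follows, omitting the paper's dual sliding window, covariances, and filter-optimizer coupling; it assumes the initial guess is not coincident with any p_i.

```python
# Gauss-Newton trilateration of a UWB anchor from VIO positions and
# range measurements; an illustrative stand-in for the init module.
import numpy as np

def init_anchor(P, d, a0, iters=20):
    """P: (N, 3) VIO positions; d: (N,) measured ranges; a0: initial guess."""
    a = a0.astype(float)
    for _ in range(iters):
        diff = P - a                       # (N, 3)
        r = np.linalg.norm(diff, axis=1)   # predicted ranges
        J = -diff / r[:, None]             # Jacobian d r / d a
        da = np.linalg.lstsq(J, d - r, rcond=None)[0]
        a += da
    return a
```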
|
|
16:45-16:50, Paper WeDT1.2 | |
Improved 3D Point-Line Mapping Regression for Camera Relocalization |
|
Bui, Bach-Thuan | Ritsumeikan University |
Bui, Huy Hoang | Ritsumeikan University |
Fujii, Yasuyuki | Ritsumeikan University |
Tran, Dinh Tuan | Shiga University |
Lee, Joo-Ho | Ritsumeikan University |
Keywords: Localization, Representation Learning
Abstract: In this paper, we present a new approach for improving 3D point and line mapping regression for camera re-localization. Previous methods typically rely on feature matching (FM) with stored descriptors or use a single network to encode both points and lines. While FM-based methods perform well in large-scale environments, they become computationally expensive with a growing number of mapping points and lines. Conversely, approaches that learn to encode mapping features within a single network reduce memory footprint but are prone to overfitting, as they may capture unnecessary correlations between points and lines. We propose that these features should be learned independently, each with a distinct focus, to achieve optimal accuracy. To this end, we introduce a new architecture that learns to prioritize each feature independently before combining them for localization. Experimental results demonstrate that our approach significantly enhances the 3D map point and line regression performance for camera re-localization. The implementation of our method will be publicly available.
|
|
16:50-16:55, Paper WeDT1.3 | |
DPR-Splat: Depth and Pose Refinement with Sparse-View 3D Gaussian Splatting for Novel View Synthesis |
|
Hu, Lingxiang | Paris Saclay University |
Li, Zhiheng | Shandong University |
Zhu, Xingfei | Jiangnan University |
Li, Dun | Tsinghua University |
Song, Ran | Shandong University |
Keywords: Mapping, Localization, Deep Learning for Visual Perception
Abstract: Recent advances in 3D Gaussian Splatting have demonstrated impressive performance in novel view synthesis, particularly with dense image sets. However, its performance degrades significantly in sparse-view scenarios, primarily due to the challenge of obtaining accurate camera poses. Additionally, achieving scale-consistent and detailed depth maps is crucial, yet existing depth estimation models struggle to meet both requirements, further limiting view synthesis quality in sparse settings. To address these challenges, we propose DPR-Splat, an innovative and highly efficient neural reconstruction system that accurately and rapidly builds 3D Gaussian models from sparse scenes. DPR-Splat refines the coarse outputs of MASt3R by leveraging dedicated pose refinement and depth refinement modules, resulting in more precise camera poses and depth maps. By utilizing these refined outputs, it progressively expands the 3D Gaussian set to construct a more accurate scene model. Extensive experiments demonstrate that DPR-Splat significantly enhances both novel view synthesis quality and pose estimation accuracy, while also achieving remarkably fast training and rendering speeds. Our code and demonstration video are available at https://github.com/h0xg/DPR-Splat.
|
|
16:55-17:00, Paper WeDT1.4 | |
CVIRO: A Consistent and Tightly-Coupled Visual-Inertial-Ranging Odometry on Lie Groups |
|
Zhou, Yizhi | George Mason University |
Kang, ZiWei | North China Electric Power University |
Xia, Jiawei | Beijing University of Chemical Technology |
Wang, Xuan | George Mason University |
Keywords: Localization, Calibration and Identification, Visual-Inertial SLAM
Abstract: Ultra-Wideband (UWB) is widely used to mitigate drift in visual-inertial odometry (VIO) systems. Consistency is crucial for ensuring the estimation accuracy of a UWB-aided VIO system. An inconsistent estimator can degrade localization performance, where the inconsistency primarily arises from two main factors: (1) the estimator fails to preserve the correct system observability, and (2) UWB anchor positions are assumed to be known, leading to improper neglect of calibration uncertainty. In this paper, we propose a consistent and tightly-coupled visual-inertial-ranging odometry (CVIRO) system based on the Lie group. Our method incorporates the UWB anchor state into the system state, explicitly accounting for UWB calibration uncertainty and enabling the joint and consistent estimation of both robot and anchor states. Furthermore, observability consistency is ensured by leveraging the invariant error properties of the Lie group. We analytically prove that the CVIRO algorithm naturally maintains the system’s correct unobservable subspace, thereby preserving estimation consistency. Extensive simulations and experiments demonstrate that CVIRO achieves superior localization accuracy and consistency compared to existing methods.
|
|
17:00-17:05, Paper WeDT1.5 | |
Adversarial Attacks and Detection in Visual Place Recognition for Safer Robot Navigation |
|
Malone, Connor | Queensland University of Technology |
Claxton, Owen Thomas | Space Machines Company |
Shames, Iman | The Australian National University |
Milford, Michael J | Queensland University of Technology |
Keywords: Localization, Vision-Based Navigation
Abstract: Stand-alone Visual Place Recognition (VPR) systems have little defence against a well-designed adversarial attack, which can lead to disastrous consequences when deployed for robot navigation. This paper extensively analyzes the effect of two adversarial attacks common in other perception tasks and two novel VPR-specific attacks on VPR localization performance. We then propose how to close the loop between VPR, an Adversarial Attack Detector (AAD), and active navigation decisions by demonstrating the performance benefit of simulated AADs in a novel experiment paradigm -- which we detail for the robotics community to use as a system framework. In the proposed experiment paradigm, we show that adding AADs across a range of detection accuracies can improve performance over the baseline: a significant improvement -- such as a ~50% reduction in the mean along-track localization error -- can be achieved with True Positive and False Positive detection rates of only 75% and up to 25%, respectively. We examine a variety of metrics including: Along-Track Error, Percentage of Time Attacked, Percentage of Time in an `Unsafe' State, and Longest Continuous Time Under Attack. Expanding further on these results, we provide the first investigation into the efficacy of the Fast Gradient Sign Method (FGSM) adversarial attack for VPR. The analysis in this work highlights the need for AADs in real-world systems for trustworthy navigation, and informs quantitative requirements for system design.
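For reference, FGSM itself is a one-step signed-gradient perturbation. A minimal sketch against a descriptor-based VPR model follows; the model interface and loss choice here are assumptions for illustration, not the paper's exact attack formulation:

```python
import torch

def fgsm_attack(model, image, target_descriptor, eps=2/255):
    """One FGSM step against a VPR network: perturb the query image in the
    direction that increases descriptor distance to the correct place.
    Assumes `model` maps an image tensor to a global descriptor."""
    image = image.clone().detach().requires_grad_(True)
    desc = model(image)
    loss = torch.nn.functional.mse_loss(desc, target_descriptor)
    loss.backward()
    # Ascend the loss with one signed-gradient step, clip to valid pixels.
    adv = image + eps * image.grad.sign()
    return adv.clamp(0, 1).detach()
```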
|
|
17:05-17:10, Paper WeDT1.6 | |
Event-Triggered Maps of Dynamics: A Framework for Modeling Spatial Motion Patterns in Non-Stationary Environments |
|
Shi, Junyi | Aalto University |
Guo, Qingyun | Aalto University |
Kucner, Tomasz Piotr | Aalto University |
Keywords: Mapping, Representation Learning
Abstract: In this paper, we introduce an event-triggered Maps of Dynamics (ETMoD) framework for modeling spatial motion patterns in non-stationary environments. Traditional approaches often rely on fixed grid resolutions and assume gradual temporal changes, limiting their effectiveness in real-world scenarios where motion patterns exhibit abrupt variations. To address these limitations, we propose a novel framework that employs a grid-shifting mechanism to generate context-aware cells based on historical observations. Temporal patterns are modeled using Neural Stochastic Differential Equations, while a diffusion model is integrated to handle abrupt changes in motion patterns through an event-triggered mechanism. Experimental results demonstrate that our framework outperforms state-of-the-art methods, particularly in capturing abrupt changes during peak activity periods, while significantly reducing training time.
|
|
17:10-17:15, Paper WeDT1.7 | |
Online 6DoF Global Localisation in Forests Using Semantically-Guided Re-Localisation and Cross-View Factor-Graph Optimisation |
|
Carvalho de Lima, Lucas | The University of Queensland |
Griffiths, Ethan | Queensland University of Technology |
Haghighat, Maryam | Queensland University of Technology |
Denman, Simon | QUT |
Fookes, Clinton | Queensland University of Technology |
Borges, Paulo Vinicius Koerich | CSIRO |
Bruenig, Michael | The University of Queensland |
Ramezani, Milad | CSIRO |
Keywords: Localization, Field Robots
Abstract: This paper presents FGLoc6D, a novel approach for robust global localisation and online 6DoF pose estimation of ground robots in forest environments by leveraging deep semantically-guided re-localisation and cross-view factor graph optimisation. The proposed method addresses the challenges of aligning aerial and ground data for pose estimation, which is crucial for accurate point-to-point navigation in GPS-degraded environments. By integrating information from both perspectives into a factor graph framework, our approach effectively estimates the robot’s global position and orientation. Additionally, we enhance the repeatability of deep-learned keypoints for metric localisation in forests by incorporating a semantically-guided regression loss. This loss encourages greater attention to wooden structures, e.g., tree trunks, which serve as stable and distinguishable features, thereby improving the consistency of keypoints and increasing the success rate of global registration, a process we refer to as re-localisation. The re-localisation module, along with the factor-graph structure populated by odometry and ground-to-aerial factors over time, allows global localisation under dense canopies. We validate the performance of our method through extensive experiments in three forest scenarios, demonstrating its global localisation capability and superiority over alternative state-of-the-art methods in terms of accuracy and robustness in these challenging environments. Experimental results show that our proposed method can achieve drift-free localisation with bounded positioning errors, ensuring reliable and safe robot navigation through dense forests.
|
|
17:15-17:20, Paper WeDT1.8 | |
Tiny LiDARs for Manipulator Self-Awareness: Sensor Characterization and Initial Localization Experiments |
|
Caroleo, Giammarco | University of Oxford |
Albini, Alessandro | University of Oxford |
De Martini, Daniele | University of Oxford |
Barfoot, Timothy | University of Toronto |
Maiolino, Perla | University of Oxford |
Keywords: Localization, Object Detection, Segmentation and Categorization
Abstract: For several tasks, ranging from manipulation to inspection, it is beneficial for robots to localize a target object in their surroundings. In this paper, we propose an approach that utilizes coarse point clouds obtained from miniaturized VL53L5CX Time-of-Flight (ToF) sensors (tiny LiDARs) to localize a target object in the robot’s workspace. We first conduct an experimental campaign to calibrate the dependency of sensor readings on relative range and orientation to targets. We then propose a probabilistic sensor model, which we validate in an object pose estimation task using a Particle Filter (PF). The results show that the proposed sensor model improves the performance of the localization of the target object with respect to two baselines: one that assumes measurements are free from uncertainty and one in which the confidence is provided by the sensor datasheet.
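A minimal sketch of how such a calibrated probabilistic sensor model would enter a particle filter weight update is shown below; the noise coefficients are invented for illustration and are not the paper's calibration results:

```python
import numpy as np

def tof_sigma(rng, incidence):
    """Hypothetical calibrated noise model: the standard deviation of a ToF
    zone reading grows with range and incidence angle, as characterized
    experimentally in the paper (coefficients here are made up)."""
    return 0.005 + 0.02 * rng + 0.01 * np.abs(incidence)

def pf_weight_update(weights, z, pred_range, pred_incidence):
    """One measurement update for a single ToF reading z [m], given each
    particle's predicted range and incidence angle to the target object."""
    sigma = tof_sigma(pred_range, pred_incidence)
    lik = np.exp(-0.5 * ((z - pred_range) / sigma) ** 2) / sigma
    weights = weights * lik
    return weights / weights.sum()

# usage: 1000 pose particles, one 0.31 m zone reading
w = pf_weight_update(np.ones(1000) / 1000, 0.31,
                     pred_range=np.random.uniform(0.2, 0.5, 1000),
                     pred_incidence=np.random.uniform(0.0, 0.8, 1000))
```

Modeling the variance this way (rather than assuming noise-free readings or a fixed datasheet value) is what lets the filter discount unreliable, oblique or long-range zones.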
|
|
WeDT2 |
402 |
Actuation and Joint Mechanisms |
Regular Session |
|
16:40-16:45, Paper WeDT2.1 | |
Improving the Energy Efficiency by Using Quasi-Passive-Dynamics-Based Elastic Actuator |
|
Chen, Ruigang | Guangdong Technion – Israel Institute of Technology |
Lin, Tongchen | Guangdong Technion-Israel Institute of Technology |
Or, Yizhar | Technion |
Liu, Mingyi | Guangdong Technion - Israel Institution of Technology |
Keywords: Actuation and Joint Mechanisms, Mechanism Design, Motion Control
Abstract: Electric actuators are widely used in machines, but they achieve optimal energy efficiency only at certain speeds. This paper introduces the Quasi-Passive-Dynamics-based Elastic Actuator (QPD-EA), which enhances energy efficiency by leveraging passive dynamics and energy recycling. Instead of directly linking the motor to the robot arm, the motor charges a spring at its optimal energy efficiency speed, and a clutch mechanism controls energy transfer between the spring and the arm, enabling efficient movement in any desired direction. The design, modeling, prototype, and experimental validation are detailed, showing that the QPD-EA consumes 86% less energy than conventional actuators for the same motion, reducing energy use from 154 mJ to 21 mJ. This approach can significantly improve energy efficiency in applications such as pick-and-place operations and robotic arms.
|
|
16:45-16:50, Paper WeDT2.2 | |
Design of a Hyper-Redundant Manipulator with Zigzag Mechanism Doublet |
|
Wang, Yunjiang | Zhejiang University |
Yang, Keji | Zhejiang University |
Jin, Haoran | Zhejiang University |
Keywords: Actuation and Joint Mechanisms, Mechanism Design, Redundant Robots
Abstract: Continuum and snake-like hyper-redundant robots perform circular bending motions, enabling them to navigate obstacles and operate in confined spaces, making them ideal for applications such as industrial inspection and minimally invasive surgery. However, their performance often diminishes at the distal end, particularly in terms of positioning accuracy and output force during high-curvature tasks, which are considered their specialty. This paper introduces a novel hyper-redundant manipulator composed of zigzag-jointed folding links and intermediary connecting links. Each basic unit, termed a zigzag mechanism doublet (ZMD), consists of two folding links and their interacting connecting links, providing symmetric kinematic inputs and outputs. The output of one ZMD unit serves as the input for the next, enabling the entire manipulator to be actuated by the initial unit. By connecting multiple ZMD units, the manipulator approximates circular bending motion. This design outperforms traditional snake-like hyper-redundant manipulators in three aspects. First, it achieves bending motion through structural constraints, eliminating the need for tendons or other appendages to actuate multiple joints. Second, each unit extends the manipulator’s motion range, rather than distributing a limited bending range across the entire structure. Third, the ZMD chain achieves constant curvature in discrete form, enhancing the manipulator’s payload capability throughout its full motion range, even in extreme bending configurations. Experimental evaluations were conducted on a 3D-printed prototype and compared with typical articulated and continuum manipulators. The ZMD-based manipulator demonstrated a mean repeatability of 0.32 mm and a payload of 200 g, offering a promising solution for operations in constrained environments.
|
|
16:50-16:55, Paper WeDT2.3 | |
Novel Articulated Lead Screw Linear Actuator Enabled by Transforming Linkage Mechanism |
|
Unde, Jayant | Nagoya University |
Colan, Jacinto | Nagoya University |
Hasegawa, Yasuhisa | Nagoya University |
Keywords: Actuation and Joint Mechanisms, Mechanism Design
Abstract: This paper introduces a novel linear actuator with a high extension ratio, achieved through a unique planar transforming linkage mechanism (TLM) that serves as an articulated lead screw. This mechanism forms a rigid beam when extended and compactly coils when not in use, eliminating the need for an interlocking mechanism and thereby enhancing durability and reducing maintenance costs. The paper also discusses the comprehensive design of the linear actuator, including the articulated lead screw and driving transformation mechanism, leveraging the kinematic properties of the planar TLM. A proof-of-concept prototype is manufactured and assessed for its load capacity, accuracy, repeatability, and efficiency. Unlike conventional actuators constrained by a lower extension ratio, the proposed actuator’s compactness and transverse load resilience make it ideal for space-constrained applications such as medical devices, space robotics, and factory automation. Overall, the proposed linear actuator represents a significant advancement in the field of linear actuators and has the potential to enable a wide array of applications that were previously unfeasible with conventional designs.
|
|
16:55-17:00, Paper WeDT2.4 | |
Underactuated Dexterous Robotic Grasping with Reconfigurable Passive Joints |
|
Kopicki, Marek | Poznan University of Technology |
Ansary, Sainul Islam | Indian Institute of Technology Kharagpur |
Tolomei, Simone | University of Pisa |
Angelini, Franco | University of Pisa |
Garabini, Manolo | Università Di Pisa |
Skrzypczynski, Piotr | Poznan University of Technology |
Keywords: Actuation and Joint Mechanisms, Tendon/Wire Mechanism, Dexterous Manipulation
Abstract: We introduce a novel reconfigurable passive joint (RP-joint), which has been implemented and tested on an underactuated three-finger robotic gripper. The RP-joint has no actuation; instead, it is lightweight and compact. It can be easily reconfigured by applying external forces and, once tension is applied to the connected tendon, locked to perform complex dexterous manipulation tasks. Additionally, we present an approach that allows learning dexterous grasps from single examples with underactuated grippers and automatically configures the RP-joints for dexterous manipulation. This is enhanced by integrating kinaesthetic contact optimization, which improves grasp performance even further. The proposed RP-joint gripper and grasp planner have been tested on over 370 grasps executed on 42 IKEA objects and on the YCB object dataset, achieving grasping success rates of 80% and 87% on IKEA and YCB objects, respectively.
|
|
17:00-17:05, Paper WeDT2.5 | |
Effective Data-Driven Joint Friction Modeling and Compensation with Physical Consistency |
|
Dai, Rui | Istituto Italiano Di Tecnologia |
Rossini, Luca | Istituto Italiano Di Tecnologia |
Laurenzi, Arturo | Istituto Italiano Di Tecnologia |
Patrizi, Andrea | Istituto Italiano Di Tecnologia |
Tsagarakis, Nikos | Istituto Italiano Di Tecnologia |
Keywords: Actuation and Joint Mechanisms, Calibration and Identification, Machine Learning for Robot Control
Abstract: The complex nonlinear nature of friction in real-world applications renders traditional physical models inadequate for accurately capturing its characteristics. While numerous learning-based approaches have addressed this challenge, they often lack interpretability and fail to uphold the physical guarantees essential for reliable modeling. Additionally, existing structured data-driven methods, despite their efficacy in handling nonlinear systems, seldom account for the specific traits of friction or ensure passivity. To overcome these limitations, we introduce a structured Gaussian Process (GP) model that adheres to the physical consistency of joint friction torque, enabling data-driven modeling in function space that accurately captures Coulomb and viscous friction characteristics while further guaranteeing passivity. We experimentally validate our approach by deploying the friction model on a two-degree-of-freedom (2-DoF) leg prototype. Our approach exhibits robust performance in the presence of non-passive and high-noise data. Experimental results demonstrate that our joint friction model achieves enhanced data efficiency, superior friction compensation performance, and improved trajectory tracking dynamics compared to other friction models.
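The structured idea, a physics-based Coulomb-plus-viscous mean with a GP fitted only to the residual, can be sketched as follows. Coefficients are illustrative, and the paper's passivity guarantee in function space is omitted here:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Physics-based prior: Coulomb + viscous friction, with a smoothed sign()
# so the mean is well-behaved near zero velocity. Fc, Fv are made up.
Fc, Fv = 0.8, 0.15
prior = lambda v: Fc * np.tanh(v / 0.01) + Fv * v

# Synthetic noisy measurements of joint friction torque vs. velocity.
v = np.linspace(-2, 2, 200).reshape(-1, 1)
tau = prior(v).ravel() + 0.05 * np.random.randn(200)

# The GP models only the residual on top of the structured mean, so the
# learned torque keeps the Coulomb/viscous shape instead of drifting into
# physically implausible behavior far from the data.
gp = GaussianProcessRegressor(RBF(0.5) + WhiteKernel(0.05**2))
gp.fit(v, tau - prior(v).ravel())
tau_hat = prior(v).ravel() + gp.predict(v)
```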
|
|
17:05-17:10, Paper WeDT2.6 | |
Output Feedback with Feedforward Robust Control for Motion Systems Driven by Nonlinear Position-Dependent Actuators (I) |
|
Al Saaideh, Mohammad | Memorial University of Newfoundland |
Boker, Almuatazbellah | Virginia Tech |
Al Janaideh, Mohammad | University of Guelph |
Keywords: Actuation and Joint Mechanisms
Abstract: This paper introduces a control approach for a motion system driven by a class of actuators with multiple nonlinearities. The proposed approach presents a combination of a feedforward controller and an output feedback controller to achieve a tracking performance of the motion system. The feedforward controller is mainly proposed to address the actuator dynamics and provide a linearization without requiring measurements from the actuator. Subsequently, the output feedback controller is designed using the measured position to achieve a tracking objective for a desired reference signal, considering the unknown nonlinearities in the system and the error due to the open-loop compensation using feedforward control. The efficacy of the proposed control approach is validated through three applications: reluctance actuator, electrostatic microactuator, and magnetic levitation system. Both simulation and experimental results demonstrate the effectiveness of the proposed control approach in achieving the desired reference signal with minimal tracking error, considering that the actuator and system nonlinearities are unknown.
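A minimal sketch of the overall structure, open-loop feedforward inversion of a static actuator nonlinearity combined with output feedback on the measured position, is given below under a toy signed-quadratic force law; the plant, gains, and nonlinearity are invented for illustration, not the paper's design:

```python
import numpy as np

def feedforward(F_des, k=2.0):
    """Open-loop inversion of a toy actuator law F = k * i * |i|:
    returns the commanded current that (nominally) realizes F_des."""
    return np.sign(F_des) * np.sqrt(abs(F_des) / k)

def simulate(ref, dt=1e-3, m=0.1, kp=400.0, kd=40.0, k_true=2.2):
    """PD output feedback on the measured position computes a desired force;
    the feedforward inverts the (imperfectly known, k != k_true) actuator,
    and feedback absorbs the residual open-loop compensation error."""
    x = v = 0.0
    e_prev = 0.0
    for r in ref:
        e = r - x
        F_des = kp * e + kd * (e - e_prev) / dt   # uses measured position only
        e_prev = e
        i_cmd = feedforward(F_des)                # actuator linearization
        F = k_true * i_cmd * abs(i_cmd)           # true, mismatched actuator
        v += (F / m) * dt
        x += v * dt
    return x

# usage: track a 5 mm step reference for 2 s
print(simulate(np.full(2000, 0.005)))
```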
|
|
17:10-17:15, Paper WeDT2.7 | |
WAVE: Worm Gear-Based Adaptive Variable Elasticity for Decoupling Actuators from External Forces |
|
Selvamuthu, Moses Gladson | Yamagata University |
Takahashi, Tomoya | OMRON SINIC X Corporation |
Tadakuma, Riichiro | Yamagata University |
Tanaka, Kazutoshi | OMRON SINIC X Corporation |
Keywords: Actuation and Joint Mechanisms, Mechanism Design, Compliant Joints and Mechanisms
Abstract: Robotic manipulators capable of regulating both compliance and stiffness offer enhanced operational safety and versatility. Here, we introduce Worm Gear-based Adaptive Variable Elasticity (WAVE), a variable stiffness actuator (VSA) that integrates a non-backdrivable worm gear. By decoupling the driving motor from external forces using this gear, WAVE enables precise force transmission to the joint, while absorbing positional discrepancies through compliance. WAVE is protected from excessive loads by converting impact forces into elastic energy stored in a spring. In addition, the actuator achieves continuous joint stiffness modulation by changing the spring's precompression length. We demonstrate these capabilities, experimentally validate the proposed stiffness model, show that motor loads approach zero at rest—even under external loading—and present applications using a manipulator with WAVE. This outcome showcases the successful decoupling of external forces. The protective attributes of this actuator allow for extended operation in contact-intensive tasks, and for robust robotic applications in challenging environments.
|
|
17:15-17:20, Paper WeDT2.8 | |
A Compact Robotic Wrist with Embedded Torque Sensing for Peg-In-Hole Tasks |
|
Tsai, Yi-Shian | National Cheng Kung University |
Chen, Yi-Hung | National Cheng Kung University |
Lan, Chao-Chieh | National Taiwan University |
Keywords: Actuation and Joint Mechanisms, Compliant Assembly, Compliant Joints and Mechanisms
Abstract: This paper presents the design and experimental validation of a torque-controlled robotic wrist for peg-in-hole tasks. The proposed wrist features a serial pitch-yaw joint configuration that enhances dexterity while maintaining compactness. The design integrates stepper motors, harmonic geartrains, and compliant mechanisms to optimize torque output and control accuracy. A compliant pulley and a compliant cap are introduced, enabling embedded torque sensing without the need for external sensors, thereby reducing system complexity and improving response time. Experimental results demonstrate the effectiveness of the wrist in torque accuracy and misalignment correction during peg-in-hole assembly, highlighting the benefits of the compliant-driven torque sensing approach. Compared to existing robotic wrists, the proposed design achieves a higher torque density. The findings contribute to advancing robotic wrist technology, particularly in applications requiring precise force modulation, high dexterity, and adaptable compliance.
|
|
WeDT3 |
403 |
Soft Sensors and Actuators 4 |
Regular Session |
|
16:40-16:45, Paper WeDT3.1 | |
Towards the Benchmarking of Embodied Sensors for Pose Tracking in Octopus-Inspired Robotic Arms |
|
Martini, Michele | Italian Institute of Technology |
Pei, Guanran | École Polytechnique Fédérale De Lausanne |
Ansari, Yasmin | Italian Institute of Technology |
Solfiti, Emanuele | IIT - Fondazione Istituto Italiano Di Tecnologia |
Hughes, Josie | EPFL |
Mazzolai, Barbara | Istituto Italiano Di Tecnologia |
Keywords: Soft Sensors and Actuators, Soft Robot Applications, Soft Robot Materials and Design
Abstract: Proprioceptive sensing plays a crucial role in robotics, enabling closed-loop control approaches that are essential for autonomous applications. In soft robotics, the development and integration of sensors is even more challenging due to the compliant nature of soft bodies. Moreover, underwater environments pose additional difficulties, as sensors need to be properly embedded and sealed into the soft body of the robot. The novelty of this work lies in benchmarking different sensing technologies on a continuum soft robot to systematically assess their suitability as effective sensing approaches, in both air and underwater environments. This work presents two proprioceptive sensors (an FBG optical sensor and an IMU system) embedded in an octopus-inspired robotic arm, which are then tested using our proposed experimental protocol. The results underscore the system's ability to reliably and repeatably capture data and provide a valuable guideline for the community to adopt in order to test novel sensing modalities in soft robotics. These developments are pivotal in advancing the deployment of soft robotic systems in both above- and under-water settings, facilitating tasks ranging from infrastructure inspection to marine life studies.
|
|
16:45-16:50, Paper WeDT3.2 | |
Body-Temperature-Responsive Balloon Actuator for Adaptive In-Ear Microneedle Electrode Deployment |
|
Zhao, Ruizhou | The Chinese University of Hong Kong |
Yue, Wenchao | The Chinese University of Hong Kong |
Li, Entong | The Chinese University of Hong Kong |
Bai, Chengxi | NUS |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore (NUS) |
Keywords: Soft Sensors and Actuators, Soft Robot Materials and Design, Wearable Robotics
Abstract: Conventional in-ear electrodes face challenges such as inconsistent skin contact, motion artifacts, and discomfort, limiting their reliability in dynamic conditions. To overcome these limitations, this study presents a body-temperature responsive in-ear balloon actuator (BBA) based on liquid-to-vapor phase change, integrated with silver microneedle electrodes for stable electrophysiological monitoring. The dual-layer balloon structure encapsulates a phase-change fluid core, expanding at 36–37°C to ensure conformal skin contact while minimizing motion artifacts and mechanical pressure. Among 3 tested designs, the dual-material design proved optimal, achieving a peak insertion force of 0.05 N and maintaining impedance fluctuations below 5%. A wearable system was further validated through dynamic tests, demonstrating a 20% reduction in motion artifacts compared to conventional electrodes. These findings highlight the actuator’s potential for stable and comfortable wearable electrophysiological monitoring.
|
|
16:50-16:55, Paper WeDT3.3 | |
A Lightweight 3-Axis Permanent Magnetic Sponge-Based Self-Adapting Tactile Sensor |
|
Wang, Yushi | Waseda University |
Abhyankar, Devesh | Waseda University |
Iwamoto, Yuhiro | Nagoya Institute of Technology |
Cheng, Zhengxue | Shanghai Jiao Tong University |
Zhao, Ruotong | The University of Tokyo |
Sugano, Shigeki | Waseda University |
Kamezaki, Mitsuhiro | The University of Tokyo |
Keywords: Soft Sensors and Actuators, Soft Robot Materials and Design
Abstract: Tactile sensors are indispensable in robotic systems because they deliver vital contact information during environmental interactions. In our work, we leverage the variable compliance of a porous material—where different interaction forces induce varying degrees of compliance—to achieve self-adapting tactile sensing. This distinctive non-linear characteristic allows its sensitivity to be tuned over a range from 0.008 mT/N to 0.045 mT/N. After coating with a magnetic polymer, the porous material functions as a 3-axis magnetic sensing medium. Its length and width are set at 30 mm and 35 mm respectively to accommodate the printed circuit board. To preserve the overall measuring range, it is designed with a thickness of 15 mm. This thickness enables monitoring of the volumetric changes due to the enhanced compliance, which is suitable for three-dimensional shape recognition. In this work, we present the design, fabrication, experimental characterization, and applications of a lightweight 3-axis magnetic sponge sensor with overall dimensions of 30 mm (width) × 35 mm (length) × 17 mm (height) and a detection range of 60 N. Notably, the sensing material weighs only 2 g, thanks to its porous structure.
|
|
16:55-17:00, Paper WeDT3.4 | |
Design and Characterization of a Thermal-Electrostatic Dual-Modal Soft Pouch Motor |
|
Wu, Chuang | Xi'an University of Architecture and Technology |
Wang, Youzhan | University of Chinese Academy of Sciences |
Li, Xiaozheng | Shenzhen Institute of Advanced Technology, Chinese Academy of Sc |
Li, Qingbiao | The University of Macau |
Digumarti, Krishna Manaswi | Queensland University of Technology |
Cao, Chongjing | Shenzhen Institute of Advanced Technology, Chinese Academy of Sc |
Keywords: Soft Sensors and Actuators, Soft Robot Materials and Design, Soft Robot Applications
Abstract: Pouch motors continue to attract research attention owing to their simple fabrication process, low cost, and excellent energy density. Existing pouch motors based on the liquid-gas phase transition (LGPT) principle exhibit significant stroke and force outputs but suffer from slow responses. Pouch motors that rely on electrohydraulic actuation (EHA) demonstrate rapid responses and broad bandwidths, yet their stroke/force outputs remain limited. This paper presents a novel thermal-electrostatic dual-modal soft pouch motor (TES-SPM) that synergistically combines the advantages of LGPT and EHA. The output performance of the TES-SPM in both the LGPT and EHA modes is characterized by extensive experiments. The effects of key parameters including the liquid volumes and actuation voltage/current amplitudes are also investigated in experiments. In the EHA mode, the TES-SPM can exert a stroke of 2.5 mm within a rapid ~ 0.06 s, while in the LGPT mode, it is able to exhibit a maximum stroke of 22.8 mm and a blocking force of ~ 80 N. A novel folding fan-inspired actuator and accordion-inspired soft gripper based on the serially attached TES-SPM units are developed to demonstrate its potential for soft robotic applications. The TES-SPM designed in this paper is envisioned to have promising applications in industrial soft grippers and wearable assistive devices.
|
|
17:00-17:05, Paper WeDT3.5 | |
Extreme-Hydrostatic-Pressure Resilient Dielectric Elastomer Actuator for Propeller Propulsion |
|
Du, Boyuan | Tsinghua University |
Zhou, Liang | Tsinghua University |
Dong, Xuguang | Tsinghua University |
Li, Xinge | Zhejiang University |
Chen, Tong | Zhejiang University |
Li, Tiefeng | Zhejiang University |
Liu, Xin-Jun | Tsinghua University |
Zhao, Huichan | Tsinghua University |
Keywords: Soft Sensors and Actuators, Soft Robot Materials and Design
Abstract: Exploring high hydrostatic pressure environments such as the deep sea presents significant challenges to robotic devices, for they often rely on strong yet heavy and costly protective structures to shield components from being crushed by the extreme pressure. To eliminate the need for bulky protective shells around actuation devices, we report an extreme-hydrostatic-pressure-resilient rotary dielectric elastomer actuator (DEA) for propulsion in deep-sea pressure conditions. DEAs are inherently resistant to damage caused by external pressure, due to their uniform and cavity-free structure. In this study, we analyzed the material properties of the DEA’s elastomer, evaluated the rotary actuator's lifespan in high-pressure liquid conditions of up to 110 MPa, and measured its output performance under both ambient pressure and 30 MPa (equivalent to 3,000 m underwater). Our results show that the rotary actuator maintained functionality at such hydrostatic pressure, with a lifespan exceeding 300,000 cycles and a high rotational output speed of 820 rpm. The rotary actuator was subsequently used to drive a robot with a propeller in a simulated deep-sea pressure fluidic environment, demonstrating our DEA’s performance as well as its design simplicity for deep-sea applications without protective structures. While high hydrostatic pressure negatively impacted the actuator's lifespan and slightly reduced its dynamic performance, our results confirmed that the DEA is a viable solution for deep-sea exploration, laying a solid foundation for the further development of DEA-powered devices for underwater missions.
|
|
17:05-17:10, Paper WeDT3.6 | |
Soft Electrohydraulic Actuators with Intrinsic Electroadhesion |
|
Shibuya, Takumi | The University of Electro-Communications |
Kubota, Momoki | The University of Electro-Communications |
Shintake, Jun | The University of Electro-Communications |
Keywords: Soft Sensors and Actuators, Soft Robot Applications, Soft Robot Materials and Design
Abstract: This paper presents a soft electrohydraulic actuator integrated with electrically controlled adhesion. Soft electrohydraulic actuators are a type of soft actuation technology known for their versatility and promising features, enabling the creation of diverse soft robotic systems. Integrating electroadhesion functionality into this actuation technology is expected to further enhance its versatility by making it multifunctional. In the actuator proposed in this study, electroadhesion is incorporated by modifying a partial domain of the electrode to have an interdigitated shape, which generates not only actuation but also electrostatic attractive forces to nearby objects simultaneously. Additionally, the geometry of the pouch is modified from a rectangular to a non-rectangular shape to stabilize actuated deformation. The experimental results characterize the actuation performance and electroadhesion forces of the proposed actuator, and a 23% improvement in holding force was observed when the actuator was configured as a gripper, demonstrating the effectiveness of the actuator with intrinsic electroadhesion.
|
|
17:10-17:15, Paper WeDT3.7 | |
An Open-Source Snake Hole-Digging Inspired Safety-Critical Insertion Planning and Replanning Framework for Continuum Robots |
|
Ji, Guanglin | The Chinese University of Hong Kong |
Sun, Zhenglong | Chinese University of Hong Kong, Shenzhen |
Keywords: Flexible Robotics, Collision Avoidance, Biologically-Inspired Robots
Abstract: Continuum robots, with their slender, flexible structures, are increasingly utilized for navigating confined spaces via follow-the-leader (FTL) motion to minimize trauma. However, traditional FTL algorithms fail to adjust the entire robot configuration, including the tail, and struggle with navigation in dynamic environments requiring real-time re-planning. We propose a novel FTL motion planning framework inspired by snake hole-digging, enabling optimal shape configuration during insertion. Our approach outperforms the baseline FTL by 75.36% in three environments (circular, confined circular, maze) with six tests. The framework integrates control barrier functions (CBF) and quadratic programming (QP) for real-time obstacle avoidance, with control Lyapunov functions (CLF) ensuring minimal deviation. The target reaching error decreases by 48.43% when using CLF-CBF-QP compared to CBF-QP in four dynamic circular environments with obstacles from different directions. This work, with open-source code, provides a robust solution for continuum robot FTL motion planning in dynamic environments.
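For intuition, the CLF-CBF-QP building block can be sketched for a single-integrator model, with the barrier (obstacle avoidance) constraint kept hard and the Lyapunov (tracking) constraint softened by a slack so that safety always wins; the dynamics and gains below are illustrative, not the paper's continuum-robot model:

```python
import cvxpy as cp
import numpy as np

def clf_cbf_qp(u_nom, h, dh_du, V, dV_du, alpha=5.0, gamma=10.0, p_slack=1e3):
    """One step of a CLF-CBF-QP filter for a single integrator, where the
    input enters the barrier/Lyapunov derivatives linearly:
      CBF:  dh/dt >= -alpha * h      (hard, keeps h >= 0)
      CLF:  dV/dt <= -gamma * V + d  (soft, slack d penalized)"""
    u = cp.Variable(len(u_nom))
    delta = cp.Variable(nonneg=True)
    cons = [dh_du @ u >= -alpha * h,
            dV_du @ u <= -gamma * V + delta]
    cp.Problem(cp.Minimize(cp.sum_squares(u - u_nom) + p_slack * delta**2),
               cons).solve()
    return u.value

# usage with made-up numbers: stay 4 cm from an obstacle while tracking
u = clf_cbf_qp(u_nom=np.array([0.1, 0.0]), h=0.04, dh_du=np.array([1.0, 0.2]),
               V=0.5, dV_du=np.array([-0.3, 1.0]))
```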
|
|
WeDT4 |
404 |
VR and Vision-Based Planning |
Regular Session |
|
16:40-16:45, Paper WeDT4.1 | |
Effect of Haptic Feedback on Avoidance Behavior and Visual Exploration in Dynamic VR Pedestrian Environment |
|
Ishibashi, Kyosuke | The University of Tokyo |
Saito, Atsushi | The University of Tokyo |
Tun, Zin | Stanford University |
Ray, Lucas | Stanford University |
Coram, Megan | Stanford University |
Sakurai, Akihiro | University of Tokyo |
Okamura, Allison M. | Stanford University |
Yamamoto, Ko | University of Tokyo |
Keywords: Virtual Reality and Interfaces, Haptics and Haptic Interfaces, Modeling and Simulating Humans
Abstract: Human crowd simulation in virtual reality (VR) is a powerful tool with potential applications including emergency evacuation training and assessment of building layout. While haptic feedback in VR enhances immersive experience, its effect on virtual walking behavior in dense and dynamic pedestrian flows is unknown. Through a user study, we investigated how haptic feedback changes user walking motion in crowded pedestrian flows in VR. The results indicate that haptic feedback changed users' collision avoidance movements, as measured by increased walking trajectory length and change in pelvis angle. The displacements of users' lateral position and pelvis angle were also increased in the instantaneous response to a collision with a non-player character (NPC), even when the NPC was inside the field of view. Haptic feedback also enhanced users' awareness and visual exploration when an NPC approached from the side or back. Furthermore, variation in walking speed was increased by the haptic feedback. These results suggest that the haptic feedback enhances users' sensitivity to collisions in VR environments.
|
|
16:45-16:50, Paper WeDT4.2 | |
Adaptive Visual Servoing Control Barrier Function of Robotic Manipulators with Uncalibrated Camera |
|
Zhao, Jianing | Shanghai Jiao Tong University |
Feng, Mingyang | Shanghai Jiao Tong University |
Zhang, Yuepeng | Shanghai Jiao Tong University |
Wang, Siqi | Shanghai Jiao Tong University |
Yin, Xiang | Shanghai Jiao Tong Univ |
Keywords: Visual Servoing, Robot Safety, Formal Methods in Robotics and Automation
Abstract: This paper investigates the problem of safe visual servoing control of manipulators using an uncalibrated eye-in-hand camera based on control barrier functions (CBFs). Traditional CBFs are defined in the workspace, corresponding to the global coordinates of the base frame. However, when the camera's position or orientation is adjusted for a better field of view, it becomes uncalibrated, making it challenging to obtain the precise positions of the robot and obstacles using onboard sensors like a camera. To address this, we propose a novel visual servoing control barrier function (VS-CBF) for manipulators, which depends only on the image and depth data sensed by an RGB-D camera. Given an uncalibrated camera, we develop an adaptive estimator for the unknown camera parameters. Based on this estimator, we also design a kinematic visual servoing control law as a nominal controller, ensuring the convergence of the robotic system. The safe controller is then obtained by solving a quadratic programming problem that incorporates the designed VS-CBF and the nominal controller. Finally, experimental results conducted on a UR3 manipulator are presented to demonstrate the effectiveness of our approach.
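One classic ingredient of uncalibrated visual servoing is an online estimate of the image Jacobian; a Broyden-style update is sketched below for intuition. The paper's adaptive estimator is derived for its specific camera parameterization, which this generic rule is not:

```python
import numpy as np

def broyden_update(J_hat, dq, ds, lam=0.5):
    """Broyden-style online image-Jacobian estimate for an uncalibrated
    camera: correct J_hat so it explains the last observed image-feature
    change ds produced by the joint change dq."""
    err = ds - J_hat @ dq                      # prediction error in image space
    return J_hat + lam * np.outer(err, dq) / (dq @ dq + 1e-12)

# usage: 2 image features, 3 joints, one motion sample
J = np.zeros((2, 3))
J = broyden_update(J, dq=np.array([0.01, -0.02, 0.0]), ds=np.array([1.5, -0.7]))
```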
|
|
16:50-16:55, Paper WeDT4.3 | |
VERAGMIL: Virtual Environment for Scooping Granular Foods with Imitation Learning Models |
|
Ergogo, Amanuel | SANO Centre for Computational Personalized Medicine |
Dall'Alba, Diego | University of Verona |
Korzeniowski, Przemyslaw | Sano Centre for Computational Medicine |
Keywords: Virtual Reality and Interfaces, Simulation and Animation, AI-Enabled Robotics
Abstract: Robot-Assisted Feeding (RAF) systems are essential for assisting individuals with disabilities or motor impairments in eating tasks. Manipulating granular food items, such as rice and beans, poses significant challenges due to their dynamic physical properties. Learning from human demonstrations offers a promising solution, but acquiring high-quality demonstrations is complex. To address this, we present VERAGMIL, a framework that combines a high-fidelity simulator with an intuitive Virtual Reality (VR) interface for recording demonstrations and supporting different imitation learning methods. VERAGMIL provides a realistic environment for training RAF systems to handle granular materials, including robots, sensors, and various food items with distinct physical characteristics. We evaluate VERAGMIL by training three imitation learning models—BC, BC-RNN, and BCQ—on granular scooping and transporting tasks using both VR interface and 3D space mouse demonstrations, comparing them with a human-expert baseline. The models are assessed on success rate, spillage, generalization to unseen food items, and task completion time. Results show that VR-based demonstrations significantly outperform 3D space mouse data, with BCQ achieving the best overall performance, particularly in reducing spillage and approaching human performance. These findings underscore the effectiveness of our framework for training RAF systems in granular material handling. The code for our framework is publicly available at: https://github.com/AmanuelErgogo/VERAGMIL.git.
|
|
16:55-17:00, Paper WeDT4.4 | |
Language-Guided Hierarchical Planning with Scene Graphs for Tabletop Object Rearrangement |
|
Oh, Wooseok | Seoul National University |
Kee, Hogun | Seoul National University |
Oh, Songhwai | Seoul National University |
Keywords: Task and Motion Planning, AI-Based Methods, Manipulation Planning
Abstract: Spatial relationships between objects are key to achieving well-arranged scenes. In this paper, we address the robotic rearrangement task by leveraging these relationships to reach configurations that are both well-arranged and satisfying the given language goal. We propose a hierarchical planning framework that bridges the gap between abstract language inputs and concrete robotic actions. A scene graph is central to this approach, serving as both an intermediate representation and the state for high-level planning, capturing the relationships among objects effectively and reducing planning complexity. This also enables the proposed method to handle more general language goals. To achieve this, we leverage a large language model (LLM) to convert language goals into a scene graph, which becomes the goal for high-level planning. In high-level planning, we plan transitions from the current scene graph to the goal scene graph. To integrate high-level and low-level planning, we introduce a network that generates a physical configuration of objects from a scene graph. Low-level planning then verifies the high-level plan’s feasibility, ensuring it can be executed through robotic manipulation. Through experiments, we show that the proposed method handles general language goals effectively and produces human-preferred rearrangements compared to other approaches, demonstrating its applicability on real robots without requiring sim-to-real adjustments.
|
|
17:00-17:05, Paper WeDT4.5 | |
P2 Explore: Efficient Exploration in Unknown Cluttered Environment with Floor Plan Prediction |
|
Song, Kun | Shanghai Jiao Tong University |
Chen, Gaoming | Shanghai Jiao Tong University |
Tomizuka, Masayoshi | University of California |
Zhan, Wei | University of California, Berkeley |
Xiong, Zhenhua | Shanghai Jiao Tong University |
Ding, Mingyu | University of North Carolina at Chapel Hill |
Keywords: View Planning for SLAM, Mapping, SLAM
Abstract: Robot exploration aims to reconstruct unknown environments, ideally along paths that are as short as possible. Traditional methods focus on optimizing the visiting order of frontiers based on current observations, which may lead to locally minimal results. Recently, predicting the structure of the unseen environment has been shown to further improve exploration efficiency. However, in cluttered environments the randomness of obstacles weakens this predictive ability, and the resulting inaccuracy limits the achievable improvement. Therefore, we propose FPUNet, which efficiently predicts the layout of noisy indoor environments. We then extract the segmentation of rooms and construct their topological connectivity based on the predicted map. The visiting order of these predicted rooms is optimized, providing high-level guidance for exploration. FPUNet is compared with other network architectures, demonstrating state-of-the-art performance on this task. Extensive experiments in simulations show that our method can shorten the path length by 2.18% to 34.60% compared to the baselines.
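The room visiting-order step amounts to a small travelling-salesman-style problem over the predicted rooms; for the handful of rooms a floor-plan prediction yields, even brute force suffices, as in this illustrative sketch (the distance matrix and optimizer are assumptions, not the paper's):

```python
import itertools
import numpy as np

def best_visit_order(dist, start=0):
    """Brute-force the visiting order of predicted rooms. dist[a][b] is a
    pairwise room-to-room path length; a real system would derive it from
    the predicted map's room topology rather than use it directly."""
    rooms = [r for r in range(len(dist)) if r != start]
    def cost(order):
        path = [start, *order]
        return sum(dist[a][b] for a, b in zip(path, path[1:]))
    return min(itertools.permutations(rooms), key=cost)

# usage: 6 predicted rooms, random pairwise path lengths
order = best_visit_order(np.random.rand(6, 6))
```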
|
|
17:05-17:10, Paper WeDT4.6 | |
VRobotix: A Scalable and Cost-Effective Virtual-Reality-Based Robotic Manipulation Dataset Generation Framework |
|
Fang, Xinmin | University of Colorado Denver |
Li, Zheshuo | University of Colorado Denver |
Tao, Lingfeng | Kennesaw State University |
Li, Zhengxiong | University of Colorado Denver |
Keywords: Data Sets for Robot Learning, Virtual Reality and Interfaces
Abstract: Large-scale, diverse datasets are essential for training robust learning-based robotic manipulation models; however, their acquisition typically requires controlled environments and specialized hardware in research laboratories. This paper presents VRobotix, a virtual reality (VR)-based framework that enables cost-effective and scalable robotic dataset generation through immersive human-in-the-loop control within a physics-accurate robot simulation. By leveraging off-the-shelf VR headsets (e.g., Oculus Quest 3), VRobotix eliminates the need for physical robots while supporting a URDF-compatible, physics-based simulator that accommodates adaptable robotic platforms and egocentric control interfaces, including handheld controllers and body posture tracking. Benefiting from the physics-based simulation, a unique contribution of VRobotix is the replay module, which can re-generate synchronized multi-modal datasets (kinematic states, RGB-D streams) in multiple dataset formats from the replayable trajectory, supporting various robotic applications. Additionally, an imitation learning module is developed to train control policies using the data collected by VRobotix. Experiments on three initial tasks—pushing, grasping, and stacking—demonstrate a high data collection success rate, averaging 92.0%. Furthermore, policies trained on just 50 trials achieve a 100% task success rate. VRobotix reduces infrastructure costs while generating ROS-compatible datasets, democratizing scalable robotic data acquisition.
|
|
17:10-17:15, Paper WeDT4.7 | |
Engaging Mind and Body: An Immersive BCI Paradigm with Motion-Panoramic Virtual Reality |
|
Zhang, Lianchi | University of Electronic Science and Technology of China |
Lei, Mengxi | University of Electronic Science and Technology of China |
Zhang, Jingting | University of Electronic Science and Technology of China |
Huang, Zonghai | University of Electronic Science and Technology of China |
Huang, Rui | University of Electronic Science and Technology of China |
Cheng, Hong | University of Electronic Science and Technology |
Keywords: Brain-Machine Interfaces, Virtual Reality and Interfaces
Abstract: Brain-computer interface (BCI) is an important technology in developing closed-loop brain training systems for cognitive functional rehabilitation. Most existing BCI paradigms do not ensure the desired immersiveness of mind and body, thereby limiting participants' engagement in training tasks. In this paper, we propose a sensory-immersive BCI paradigm for decision-making with a novel motion-panoramic virtual-reality system, aiming for deep involvement of both mind and body in brain functional training. This paradigm integrates visual, auditory, and motion multi-sensory stimulation by using the Gait Real-time Analysis Interactive Lab system to implement the modified ultimatum game for decision making. The designed paradigm is validated through three experimental studies, including event-related potentials analysis, power spectral density analysis, and brain network analysis. They demonstrate that the designed paradigm achieves better performance in motor-cognitive interaction and multi-sensory coordination by effectively enhancing brain activation in visual, auditory, and motor processing regions, which results in more effective activation of decision-making areas such as the prefrontal cortex and facilitates efficient integration of multimodal sensory inputs. Compared to the existing paradigm, our paradigm increases the number of high-intensity functional connections in the brain regions of participants by 62.8% (from 86 to 140), and the number of effective functional connections by 90.5% (from 252 to 480).
|
|
17:15-17:20, Paper WeDT4.8 | |
The Monado SLAM Dataset for Egocentric Visual-Inertial Tracking |
|
de Mayo, Mateo | Technical University of Munich |
Cremers, Daniel | Technical University of Munich |
Pire, Taihú | French Argentine International Center for Information and System |
Keywords: Visual-Inertial SLAM, Data Sets for SLAM, Virtual Reality and Interfaces
Abstract: Humanoid robots and mixed reality headsets benefit from the use of head-mounted sensors for tracking. While advancements in visual-inertial odometry (VIO) and simultaneous localization and mapping (SLAM) have produced new and high-quality state-of-the-art tracking systems, we show that these are still unable to gracefully handle many of the challenging settings presented in the head-mounted use cases. Common scenarios like high-intensity motions, dynamic occlusions, long tracking sessions, low-textured areas, adverse lighting conditions, saturation of sensors, to name a few, continue to be covered poorly by existing datasets in the literature. In this way, systems may inadvertently overlook these essential real-world issues. To address this, we present the Monado SLAM dataset, a set of real sequences taken from multiple virtual reality headsets. We release the dataset under a permissive CC BY 4.0 license, to drive advancements in VIO/SLAM research and development.
|
|
WeDT5 |
407 |
Computer Architecture and Computational Geometry |
Regular Session |
Co-Chair: Zhong, Fangwei | Peking University
|
16:40-16:45, Paper WeDT5.1 | |
Sampling-Based Motion Planning with Discrete Configuration-Space Symmetries |
|
Cohn, Thomas | Massachusetts Institute of Technology |
Tedrake, Russ | Massachusetts Institute of Technology |
Keywords: Computational Geometry, Motion and Path Planning
Abstract: When planning motions in a configuration space that has underlying symmetries (e.g. when manipulating one or multiple symmetric objects), the ideal planning algorithm should take advantage of those symmetries to produce shorter trajectories. However, finite symmetries lead to complicated changes to the underlying topology of configuration space, preventing the use of standard algorithms. We demonstrate how the key primitives used for sampling-based planning can be efficiently implemented in spaces with finite symmetries. A rigorous theoretical analysis, building upon a study of the geometry of the configuration space, shows improvements in the sample complexity of several standard algorithms. Furthermore, a comprehensive slate of experiments demonstrates the practical improvements in both path length and runtime.
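The core primitive change can be sketched compactly: distances (and hence nearest-neighbor queries) are computed on the quotient space, i.e. to the nearest symmetric image of the target configuration. Below is a minimal example with a C4 yaw symmetry; the encoding of the group action as callables is an assumption for illustration, not the paper's implementation:

```python
import numpy as np

def quotient_distance(q1, q2, symmetries, dist):
    """Distance on the quotient space C/G for a finite symmetry group G:
    the distance from q1 to the nearest symmetric image g(q2). Swapping
    this (plus symmetry-aware interpolation) into nearest-neighbor and
    steering primitives is what lets sampling-based planners exploit
    the symmetry."""
    return min(dist(q1, g(q2)) for g in symmetries)

# usage: a square object's yaw has C4 symmetry (rotations by k * pi/2)
angdiff = lambda a, b: abs((a - b + np.pi) % (2 * np.pi) - np.pi)
C4 = [lambda q, k=k: (q + k * np.pi / 2) % (2 * np.pi) for k in range(4)]
d = quotient_distance(0.1, np.pi / 2, C4, angdiff)   # ~0.1, not ~1.47
```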
|
|
16:45-16:50, Paper WeDT5.2 | |
Skeleton‐Guided Rolling‐Contact Kinematics for Arbitrary Point Clouds Via Locally Controllable Parameterized Curve Fitting |
|
Wen, Qingmeng | Cardiff University |
Lai, Yu-Kun | Cardiff University |
Ji, Ze | Cardiff University |
Svinin, Mikhail | Ritsumeikan University |
Tafrishi, Seyed Amir | Cardiff University |
Keywords: Computational Geometry, Kinematics, Contact Modeling
Abstract: Rolling contact kinematics plays a vital role in dexterous manipulation and rolling-based locomotion. Yet, in practical applications, the environments and objects involved are often captured as discrete point clouds, creating substantial difficulties for traditional motion control and planning frameworks that rely on continuous surface representations. In this work, we propose a differential geometry-based framework that models point cloud data for continuous rolling contact using locally parameterized representations. Our approach leverages skeletonization to define a rotational reference structure for rolling interactions and applies a Fourier-based curve fitting technique to extract and represent meaningful local geometric structure. We further introduce a novel 2D manifold coordinate system tailored to arbitrary surface curves, enabling local parameterization of complex shapes. The governing kinematic equations for rolling contact are then derived, and we demonstrate the effectiveness of our method through simulations on various object examples.
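The curve-fitting ingredient can be sketched generically: a truncated Fourier series fitted by least squares to an ordered point sequence yields a smooth, locally evaluable parameterization. This is only the generic idea; the paper's locally controllable variant adds per-region control that is omitted here:

```python
import numpy as np

def fit_fourier_curve(points, n_harmonics=6):
    """Least-squares fit of a truncated Fourier series to an ordered 2D
    point sequence sampled around a closed curve, returning a callable
    parameterization t -> (x, y) for t in [0, 2*pi)."""
    t = np.linspace(0, 2 * np.pi, len(points), endpoint=False)
    # Design matrix of basis functions [1, cos t, sin t, cos 2t, sin 2t, ...]
    A = np.column_stack([np.ones_like(t)] +
                        [f(k * t) for k in range(1, n_harmonics + 1)
                         for f in (np.cos, np.sin)])
    coeffs, *_ = np.linalg.lstsq(A, points, rcond=None)   # shape (2n+1, 2)
    return lambda s: np.column_stack(
        [np.ones_like(s)] + [f(k * s) for k in range(1, n_harmonics + 1)
                             for f in (np.cos, np.sin)]) @ coeffs

# usage: noisy circle samples -> smooth evaluable curve
th = np.linspace(0, 2 * np.pi, 100, endpoint=False)
pts = np.column_stack([np.cos(th), np.sin(th)]) + 0.01 * np.random.randn(100, 2)
curve = fit_fourier_curve(pts)
xy = curve(np.array([0.0, 1.0]))   # curve points at t = 0 and t = 1
```

The smooth parameterization is what makes the rolling-contact kinematics well-defined on raw point-cloud data, since curvature and tangents can be evaluated analytically from the fitted series.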
|
|
16:50-16:55, Paper WeDT5.3 | |
RA-NeRF: Robust Neural Radiance Field Reconstruction with Accurate Camera Pose Estimation under Complex Trajectories |
|
Yan, Qingsong | Wuhan University |
Wang, Qiang | Harbin Institute of Technology, Shenzhen |
Zhao, Kaiyong | XGRIDS |
Chen, Jie | Hong Kong Baptist University |
Li, Bo | Hong Kong University of Science and Technology |
Chu, Xiaowen | The Hong Kong University of Science and Technology (Guangzhou) |
Deng, Fei | Wuhan University |
Keywords: Computational Geometry, SLAM, Automation Technologies for Smart Cities
Abstract: Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have emerged as powerful tools for 3D reconstruction and SLAM tasks. However, their performance depends heavily on accurate camera pose priors. Existing approaches attempt to address this issue by introducing external constraints but fall short of achieving satisfactory accuracy, particularly when camera trajectories are complex. In this paper, we propose a novel method, RA-NeRF, capable of predicting highly accurate camera poses even with complex camera trajectories. Following the incremental pipeline, RA-NeRF reconstructs the scene using NeRF with photometric consistency and incorporates flow-driven pose regulation to enhance robustness during initialization and localization. Additionally, RA-NeRF employs an implicit pose filter to capture the camera movement pattern and eliminate the noise for pose estimation. To validate our method, we conduct extensive experiments on the Tanks&Temples dataset for standard evaluation, as well as the NeRFBuster dataset, which presents challenging camera pose trajectories. On both datasets, RA-NeRF achieves state-of-the-art results in both camera pose estimation and visual quality, demonstrating its effectiveness and robustness in scene reconstruction under complex pose trajectories.
|
|
16:55-17:00, Paper WeDT5.4 | |
Computationally Efficient FPGA-Based Large Language Model Inference for Real-Time Decision-Making in Robotic Systems |
|
Zhang, Huaizhi | University of Essex |
Al-Hasan, Tamim M. | University of Essex |
Zhu, Xuqi | University of Essex |
Zhu, Jiacheng | University of Essex |
Si, Weiyong | University of Essex |
McDonald-Maier, Klaus | University of Essex |
Zhai, Xiaojun | University of Essex |
Keywords: Computer Architecture for Robotic and Automation, Embedded Systems for Robotic and Automation, Software-Hardware Integration for Robot Systems
Abstract: Integrating Large Language Models (LLMs) into modern robotic systems presents significant computational and energy constraint challenges, particularly for human-centred robotic applications. This paper presents a novel hardware optimization technique for deploying LLMs on resource-constrained embedded devices, achieving up to a 77% reduction in computational latency through an FPGA implementation, in comparison to other popular embedded computing devices (e.g., CPUs and GPUs). Additionally, we demonstrate our methodology by deploying a LLaMA 2-7B model on a Unitree Go2 robotic dog integrated with the proposed FPGA platform. The proposed optimization framework preserves real-time interaction capabilities while significantly reducing computational and energy overhead, facilitating efficient natural language processing for human-robot interaction in safety-critical and dynamic environments. Experimental results demonstrate that the FPGA-based LLaMA 2-7B implementation achieves up to 6.06-fold and 1.95-fold higher throughput compared to baseline CPU and GPU implementations while maintaining comparable inference accuracy. Furthermore, the proposed FPGA design surpasses existing state-of-the-art FPGA implementations, delivering a 30% improvement in computational efficiency.
|
|
17:00-17:05, Paper WeDT5.5 | |
Fully Autonomous Dual Arm Aerial Delivery Robot for Intralogistics: The euROBIN Nancy Competition Flight Dataset |
|
Suarez, Alejandro | University of Seville |
Pozas-Guerra, Jorge | GRVC Robotics Lab, Universidad De Sevilla |
Tapia, Raul | University of Seville |
Ollero, Anibal | AICIA. G41099946 |
Keywords: Aerial Systems: Applications, Logistics, Software Architecture for Robotic and Automation
Abstract: This paper presents the design, development, and validation of a fully autonomous dual-arm aerial robot capable of mapping, localizing, planning, and grasping parcels in an intra-logistics scenario. The aerial robot is intended to operate in a scenario comprising several supply points, delivery points, parcels with tags, and obstacles, generating the mission plan from the voice commands given by the user. The paper derives a transferability model of the scenario, the robot, and the task, so that the proposed system design can be generalized to different scenarios (environment transfer) and platforms (embodiment transfer). The proposed transferable system architecture allows the integration of software modules managed by the Aerial Delivery Robot Operations Manager (ADROM) through the Module Interface Instances (MII) that handle the requests and the signals involved during the execution of the operation. The performance of the developed system was evaluated as part of the euROBIN Nancy Competition, conducting more than 50 flight tests. The software modules are open source, making the flight dataset also publicly available.
|
|
17:05-17:10, Paper WeDT5.6 | |
VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Visual-Language Models |
|
Kui, Wu | Beihang University |
Xu, Shuhang | Beijing Normal University |
Chen, Hao | City University of Macau |
Wang, Churan | Peking University |
Zhoujun, Li | Beihang University |
Yizhou, Wang | Peking University |
Zhong, Fangwei | Peking University |
Keywords: AI-Enabled Robotics, Visual Servoing, Visual Tracking
Abstract: We introduce a novel self-improving framework that enhances Embodied Visual Tracking (EVT) with Visual-Language Models (VLMs) to address the limitations of current active visual tracking systems in recovering from tracking failure. Our approach combines the off-the-shelf active tracking methods with VLMs' reasoning capabilities, deploying a fast visual policy for normal tracking and activating VLM reasoning only upon failure detection. The framework features a memory-augmented self-reflection mechanism that enables the VLM to progressively improve by learning from past experiences, effectively addressing VLMs' limitations in 3D spatial reasoning. Experimental results demonstrate significant performance improvements, with our framework boosting success rates by 72% with state-of-the-art RL-based approaches and 220% with PID-based methods in challenging environments. This work represents the first integration of VLM-based reasoning to assist EVT agents in proactive failure recovery, offering substantial advances for real-world robotic applications that require continuous target monitoring in dynamic, unstructured environments.
|
|
WeDT6 |
301 |
Deep Learning in Grasping and Manipulation 4 |
Regular Session |
|
16:40-16:45, Paper WeDT6.1 | |
PD-VLA: Accelerating Vision-Language-Action Model Integrated with Action Chunking Via Parallel Decoding |
|
Song, Wenxuan | The Hong Kong University of Science and Technology (Guangzhou) |
Chen, Jiayi | Hong Kong University of Science and Technology |
Ding, Pengxiang | Westlake University |
Zhao, Han | Westlake University |
Zhao, Wei | Westlake University |
Zhong, Zhide | The Hong Kong University of Science and Technology (Guangzhou) |
Ge, Zongyuan | Monash University |
Li, Zhijun | Harbin Institute of Technology |
Wang, Donglin | Westlake University |
Wang, Lujia | The Hong Kong University of Science and Technology (Guangzhou) |
Ma, Jun | The Hong Kong University of Science and Technology |
Li, Haoang | Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Embodied Cognitive Science, Deep Learning in Grasping and Manipulation, AI-Enabled Robotics
Abstract: Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. The performance of VLA models can be improved by integrating with action chunking, a critical technique for effective control. However, action chunking linearly scales up action dimensions in VLA models with increased chunking sizes. This reduces the inference efficiency. To tackle this problem, we propose PD-VLA, the first parallel decoding framework for VLA models integrated with action chunking. Our framework reformulates autoregressive decoding as a nonlinear system solved by parallel fixed-point iterations. This approach preserves model performance with mathematical guarantees while significantly improving decoding speed. In addition, it enables training-free acceleration without architectural changes, as well as seamless synergy with existing acceleration techniques. Extensive simulations validate that our PD-VLA maintains competitive success rates while achieving 2.52× execution frequency on manipulators (with 7 degrees of freedom) compared with the fundamental VLA model. Furthermore, we experimentally identify the most effective settings for acceleration. Finally, real-world experiments validate its high applicability across different tasks.
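The parallel-decoding idea can be sketched with a Jacobi fixed-point iteration: every position of the action chunk is refreshed simultaneously from the previous iterate, and a fixed point reproduces greedy autoregressive output, usually in far fewer (conceptually batched) passes. The decoder interface below is an assumption for illustration, not PD-VLA's API:

```python
def jacobi_decode(next_token, prompt, chunk_len, max_iters=32):
    """Jacobi fixed-point iteration for parallel decoding. All positions
    are updated from the previous iterate's left-context at once; at a
    fixed point, y[i] == next_token(prompt, y[:i]) for every i, which is
    exactly the greedy autoregressive solution. Convergence is guaranteed
    within chunk_len iterations because the prefix stabilizes position by
    position."""
    y = [0] * chunk_len                     # arbitrary initial guess
    for _ in range(max_iters):
        y_new = [next_token(prompt, y[:i]) for i in range(chunk_len)]
        if y_new == y:                      # fixed point reached
            return y
        y = y_new
    return y

# usage with a toy greedy decoder (stands in for one batched forward pass)
toy = lambda prompt, left: (sum(left) + len(prompt)) % 7
print(jacobi_decode(toy, prompt="pick up the cube", chunk_len=5))
```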
|
|
16:45-16:50, Paper WeDT6.2 | |
ArtGS: 3D Gaussian Splatting for Interactive Visual-Physical Modeling and Manipulation of Articulated Objects |
|
Yu, Qiaojun | Shanghai Jiao Tong University |
Yuan, Xibin | Shanghai Jiao Tong University |
Jiang, Yu | Shanghai Jiao Tong University |
Chen, Junting | ETH Zurich |
Zheng, Dongzhe | Shanghai Jiao Tong University |
Hao, Ce | University of California, Berkeley |
You, Yang | Stanford University |
Chen, Yixing | Shanghai Jiao Tong University |
Mu, Yao | The University of Hong Kong |
Liu, Liu | Hefei University of Technology |
Lu, Cewu | Shanghai Jiao Tong University |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation
Abstract: Articulated object manipulation remains a critical challenge in robotics due to the complex kinematic constraints and the limited physical reasoning of existing methods. In this work, we introduce ArtGS, a novel framework that extends 3D Gaussian Splatting (3DGS) by integrating visual-physical modeling for articulated object understanding and interaction. ArtGS begins with multi-view RGB-D reconstruction, followed by reasoning with a vision-language model (VLM) to extract semantic and structural information, particularly the articulated bones. Through dynamic, differentiable 3DGS-based rendering, ArtGS optimizes the parameters of the articulated bones, ensuring physically consistent motion constraints and enhancing the manipulation policy. By leveraging dynamic Gaussian splatting, cross-embodiment adaptability, and closed-loop optimization, ArtGS establishes a new framework for efficient, scalable, and generalizable articulated object modeling and manipulation. Experiments conducted in both simulation and real-world environments demonstrate that ArtGS significantly outperforms previous methods in joint estimation accuracy and manipulation success rates across a variety of articulated objects. Additional images and videos are available on the project website: https://sites.google.com/view/artgs/home
|
|
16:50-16:55, Paper WeDT6.3 | |
Improving Generalization of Language-Conditioned Robot Manipulation |
|
Cui, Chenglin | Queen Mary University of London |
Zhu, Chaoran | Queen Mary University of London |
Oh, Changjae | Queen Mary University of London |
Cavallaro, Andrea | Idiap, EPFL |
Keywords: Deep Learning in Grasping and Manipulation, Recognition, Deep Learning Methods
Abstract: The control of robots for manipulation tasks generally relies on visual input. Recent advances in vision-language models (VLMs) enable the use of natural language instructions to condition visual input and control robots in a wider range of environments. However, existing methods require a large amount of data to fine-tune VLMs for operating in unseen environments. In this paper, we present a framework that learns object-arrangement tasks from just a few demonstrations. We propose a two-stage framework that divides object-arrangement tasks into a target-localization stage for picking the object and a region-determination stage for placing the object. We present an instance-level semantic fusion module that aligns the instance-level image crops with the text embedding, enabling the model to identify the target objects defined by the natural language instructions. We validate our method in both simulated and real-world robotic environments. Our method, fine-tuned with a few demonstrations, improves generalization capability and demonstrates zero-shot ability in real-robot manipulation scenarios.
|
|
16:55-17:00, Paper WeDT6.4 | |
Occupancy-Belief Planning of Plant Manipulation for Staking |
|
Li, Pusong | University of Illinois at Urbana-Champaign |
Samarakoon Mudiyanselage, Bhagya Prasangi Samarakoon | Singapore University of Technology and Design |
Muthugala Arachchige, Viraj Jagathpriya Muthugala | Singapore University of Technology and Design |
Chittoor, Prithvi Krishna | Singapore University of Technology and Design |
Elara, Mohan Rajesh | Singapore University of Technology and Design |
Nagi, Rakesh | University of Illinois, Urbana-Champaign |
Keywords: Deep Learning in Grasping and Manipulation, Manipulation Planning, Agricultural Automation
Abstract: While agricultural robotics has made great strides in recent years, manipulation of plants for tasks such as staking and harvesting remains highly challenging due to the high variability in dynamics and the deformable nature of plants. To address the challenges created by dynamics uncertainty, we develop a system applying an occupancy-belief planning concept to plant manipulation for staking. We first train a dynamics model that predicts a per-pixel probability that the plant occupies the corresponding slice in space after a drag action, using a large set of simulators. This model is then used to plan a manipulation action that maximizes the probability that the areas swept by the stake-tying tool's operating region are occupied by the plant, and minimizes the probability that the areas swept by the tool's non-operating side regions are occupied. We demonstrate our method both in simulation and with zero-shot sim-to-real transfer to a physical implementation. We show that incorporating belief through the occupancy-belief formulation allows our method to outperform the visual-foresight-style approaches it is based on, as well as other baselines and ablations, especially in the real-world case.
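A hedged sketch of how such an occupancy-belief score might rank candidate drag actions; the masks and the gain-minus-penalty form are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def action_score(p_occ, op_mask, nonop_mask):
    """Score one candidate action from a per-pixel occupancy belief.

    p_occ      : (H, W) predicted probability the plant occupies each cell
    op_mask    : (H, W) bool, cells swept by the tying tool's operating region
    nonop_mask : (H, W) bool, cells swept by the tool's non-operating sides
    """
    gain = p_occ[op_mask].sum()        # plant should fill the operating sweep
    penalty = p_occ[nonop_mask].sum()  # and stay out of the rest of the tool
    return gain - penalty

# Choosing an action then reduces to maximizing the score over candidates:
# best = max(candidates, key=lambda a: action_score(a.p_occ, a.op, a.nonop))
```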
|
|
17:00-17:05, Paper WeDT6.5 | |
GRASPLAT: Enabling Dexterous Grasping through Novel View Synthesis |
|
Bortolon, Matteo | Istituto Italiano Di Tecnologia; Fondazione Bruno Kessler; Unive |
Ferreira Duarte, Nuno | IST-ID |
Moreno, Plinio | IST-ID |
Poiesi, Fabio | Fondazione Bruno Kessler |
Santos-Victor, José | Instituto Superior Técnico - University of Lisbon |
Del Bue, Alessio | Istituto Italiano Di Tecnologia |
Keywords: Deep Learning in Grasping and Manipulation, Dexterous Manipulation, Perception for Grasping and Manipulation
Abstract: Achieving dexterous robotic grasping with multi-fingered hands remains a significant challenge. While existing methods rely on complete 3D scans to predict grasp poses, these approaches face limitations due to the difficulty of acquiring high-quality 3D data in real-world scenarios. In this paper, we introduce GRASPLAT, a novel grasping framework that leverages consistent 3D information while being trained solely on RGB images. Our key insight is that by synthesizing physically plausible images of a hand grasping an object, we can regress the corresponding hand joints for a successful grasp. To achieve this, we utilize 3D Gaussian Splatting to generate high-fidelity novel views of real hand-object interactions, enabling end-to-end training with RGB data. Unlike prior methods, our approach incorporates a photometric loss that refines grasp predictions by minimizing discrepancies between rendered and real images. We conduct extensive experiments on both synthetic and real-world grasping datasets, demonstrating that GRASPLAT improves grasp success rates by up to 36.9% over existing image-based methods. Project page: https://mbortolon97.github.io/grasplat/
|
|
17:05-17:10, Paper WeDT6.6 | |
LensDFF: Language-Enhanced Sparse Feature Distillation for Efficient Few-Shot Dexterous Manipulation |
|
Feng, Qian | Technical University of Munich |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Martinez Lema, David Sebastian | Technical University of Munich |
Chen, Zhaopeng | University of Hamburg |
Feng, Jianxiang | Technical University of Munich (TUM) |
Keywords: Deep Learning in Grasping and Manipulation, Representation Learning, Grasping
Abstract: Learning dexterous manipulation from few-shot demonstrations is a significant yet challenging problem for advanced, human-like robotic systems. Dense distilled feature fields have addressed this challenge by distilling rich semantic features from 2D visual foundation models into the 3D domain. However, their reliance on neural rendering models such as Neural Radiance Fields (NeRF) or Gaussian Splatting results in high computational costs. In contrast, previous approaches based on sparse feature fields either suffer from inefficiencies due to multi-view dependencies and extensive training or lack sufficient grasp dexterity. To overcome these limitations, we propose Language-ENhanced Sparse Distilled Feature Field (LensDFF), which efficiently distills view-consistent 2D features onto 3D points using our novel language-enhanced feature fusion strategy, thereby enabling single-view few-shot generalization. Based on LensDFF, we further introduce a few-shot dexterous manipulation framework that integrates grasp primitives into the demonstrations to generate stable and highly dexterous grasps. Moreover, we present a real2sim grasp evaluation pipeline for efficient grasp assessment and hyperparameter tuning. Through extensive simulation experiments based on the real2sim pipeline and real-world experiments, our approach achieves competitive grasping performance, outperforming state-of-the-art approaches. See our website.
|
|
17:10-17:15, Paper WeDT6.7 | |
Region-Centric 6-Dof Grasp Detection: A Data-Efficient Solution for Cluttered Scenes |
|
Chen, Siang | Tsinghua University |
Tang, Wei | Tsinghua University |
Xie, Pengwei | Tsinghua University |
Hu, Dingchang | Tsinghua University |
Yang, Wenming | Tsinghua University |
Wang, Guijin | Tsinghua University |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Grasping
Abstract: Robotic grasping, serving as the cornerstone of robot manipulation, is fundamental for embodied intelligence. Manipulation in challenging scenarios demands grasp detection algorithms with higher efficiency and generalizability. However, for general 6-Dof grasp detection, most data-driven methods directly extract scene-level features to generate grasp prediction, relying on a relatively heavy scene-level feature encoder and a significant amount of data with dense grasp labels for model training. In this letter, we propose a novel data-efficient 6-Dof grasp detection framework in cluttered scenes, named Region-Centric Grasp Detection (RCGD), consisting of an Iterative Search Module (ISM) and a Region Grasp Model (RGM). Concretely, ISM aims to retrieve potential region centers and aggregate multiple regions in a coarse-to-fine way. Then, RGM extracts aligned grasp-related embeddings and predicts grasps within these local regions. Benefiting from the region-centric paradigm and the training-free location strategy, RCGD significantly outperforms previous methods and shows minimal performance loss with even a very small portion of training data or labels. Furthermore, real-world robotic experiments in two distinct settings highlight the effectiveness of our method with a 95% success rate.
|
|
WeDT7 |
307 |
Motion and Path Planning 8 |
Regular Session |
Co-Chair: Wang, Yafei | Shanghai Jiao Tong University |
|
16:40-16:45, Paper WeDT7.1 | |
Non-Differentiable Reward Optimization for Diffusion-Based Autonomous Motion Planning |
|
Lee, Giwon | Korea Advanced Institute of Science and Technology (KAIST) |
Park, Daehee | DGIST |
Jeong, Jaewoo | KAIST |
Yoon, Kuk-Jin | KAIST |
Keywords: Motion and Path Planning, Autonomous Vehicle Navigation, Intelligent Transportation Systems
Abstract: Safe and effective motion planning is crucial for autonomous robots. Diffusion models excel at capturing complex agent interactions, a fundamental aspect of decision-making in dynamic environments. Recent studies have successfully applied diffusion models to motion planning, demonstrating their competence in handling complex scenarios and accurately predicting multi-modal future trajectories. Despite their effectiveness, diffusion models have limitations in learnable training objectives, as they focus on approximating data distributions rather than explicitly capturing the underlying decision-making dynamics. However, the crux of motion planning lies in non-differentiable downstream objectives, such as safety (collision avoidance) and effectiveness (goal-reaching), which conventional learning algorithms cannot directly optimize. In this paper, we propose a reinforcement learning-based training scheme for diffusion motion planning models, enabling them to effectively learn non-differentiable objectives that explicitly measure safety and effectiveness. Specifically, we introduce a reward-weighted dynamic thresholding algorithm that shapes a dense reward signal, facilitating more effective training and outperforming models trained with differentiable objectives. State-of-the-art performance on pedestrian datasets (CrowdNav, ETH-UCY) compared to various baselines demonstrates the versatility of our approach for safe and effective motion planning.
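The crux is that objectives such as collision avoidance are indicator-like and non-differentiable; a toy example of such a dense reward, with every weight and threshold an illustrative assumption:

```python
import numpy as np

def dense_reward(traj, obstacles, goal, r_col=0.3, w_goal=1.0, w_safe=1.0):
    """traj: (T, 2) planned positions; obstacles: (M, 2); goal: (2,).
    Safety is a hard indicator (non-differentiable), effectiveness is
    goal progress; RL can optimize both where gradient descent cannot."""
    d_min = np.min(np.linalg.norm(traj[:, None, :] - obstacles[None, :, :], axis=-1))
    safety = w_safe * float(d_min > r_col)                # 1 iff collision-free
    progress = -w_goal * np.linalg.norm(traj[-1] - goal)  # closer is better
    return safety + progress
```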
|
|
16:45-16:50, Paper WeDT7.2 | |
Study on Thunniform Robot Propulsion by Tail-Flapping Speed Change |
|
Nam Anh, Phan Huy | Korea Maritime & Ocean University |
Choi, Hyeungsik | Korea Maritime and Ocean University |
Zhang, Ruochen | Korea Maritime and Ocean University |
Lee, Seong Chae | Korea Maritime & Ocean University |
Jung, Dongwook | Korea Maritime and Ocean University |
Cha, Taehyo | Korea Maritime and Ocean University |
Kim, SooHwan | Korea Maritime and Ocean University |
Keywords: Motion and Path Planning, Kinematics, Motion Control
Abstract: Thunniform fish are renowned for their high-speed and efficient swimming, achieved through minimal body undulation and powerful tail-flapping. Inspired by this natural mechanism, this study develops a thunniform robotic fish and introduces a novel frequency profile control strategy for its tail-flapping propulsion. Unlike previous works that primarily focused on constant or low-speed tail-flapping, we propose and implement a new method that dynamically adjusts the tail-flapping speed using composite frequency profiles with varying frequencies and angle ranges within a single cycle. This approach aims to enhance surge speed and energy efficiency. The effectiveness of the proposed frequency profiles is systematically evaluated through three-dimensional incompressible viscous computational fluid dynamics (CFD) simulations. Results demonstrate that tailored multi-frequency tail-flapping profiles can significantly improve the surge performance of the robot, providing valuable insights for the design of next-generation bio-inspired underwater propulsion systems. The novelty of this work lies in the formulation and validation of a frequency profile function that enables adaptive and efficient surge motion, advancing the field of bio-inspired aquatic robotics.
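A minimal sketch of what a composite within-cycle frequency profile can look like: a fast power stroke followed by a slower return stroke, with the phase warped so the flap angle stays continuous; the split ratio and amplitude are illustrative, not the paper's values:

```python
import numpy as np

def flap_angle(t, period=1.0, split=0.4, amp=np.deg2rad(30)):
    """Composite-frequency flapping: the first `split` fraction of the
    period sweeps the first half-stroke (fast), the remainder sweeps the
    return stroke (slow). Returns the tail angle in radians."""
    s = (t % period) / period                          # normalized cycle time
    if s < split:
        phase = 0.5 * s / split                        # fast half: 0 -> 0.5
    else:
        phase = 0.5 + 0.5 * (s - split) / (1 - split)  # slow half: 0.5 -> 1
    return amp * np.sin(2 * np.pi * phase)
```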
|
|
16:50-16:55, Paper WeDT7.3 | |
Along-Edge Autonomous Driving on Curvy Roads Based on Frenet Frame: A Stable Hierarchical Planning Framework |
|
Kang, Hong-yi | Shanghai Jiao Tong University |
Lu, Junguo | Shanghai Jiaotong University |
Li, Kaixiong | Shanghai Jiao Tong University |
Zhang, Qing-Hao | Shanghai Jiao Tong University |
Wang, Yafei | Shanghai Jiao Tong University |
Keywords: Motion and Path Planning, Intelligent Transportation Systems, Optimization and Optimal Control
Abstract: Along-edge driving, where an autonomous vehicle follows road edges, is increasingly common in urban environments and particularly challenging on curvy roads due to rapidly changing curvature. This paper presents a hierarchical trajectory planning framework that integrates Cartesian and Frenet frames to optimize along-edge motion. Cartesian planners struggle with nonlinear constraints, while Frenet-based approaches simplify edge-relative motion but often neglect trajectory curvature and suffer from non-convexity in obstacle avoidance. To address these limitations, our method employs an optimization-based planner with curvature constraints for precise along-edge motion and a sampling-based planner for stable lane changes when encountering obstacles. This novel approach maintains an along-edge distance within a precision of 0.1m, reducing error by 80% (from 0.7m to 0.1m). It also ensures smooth trajectory transitions and enhances stability in complex environments. Simulations and real-world experiments validate the framework's efficiency, achieving an average planning time of 1.22ms per frame while effectively balancing accuracy, feasibility, and real-time performance.
|
|
16:55-17:00, Paper WeDT7.4 | |
Information Entropy-Assisted Hierarchical Framework for Unknown Environments Exploration |
|
Changjun, Gu | Chongqing University of Posts and Telecommunications |
Ro, Seupram | Chongqing University of Posts and Telecommunications |
Chen, Yufei | Chongqing University of Posts and Telecommunications |
Dong, Jiahua | Mohamed Bin Zayed University of Artificial Intelligence |
Gao, Xinbo | Chongqing University of Posts and Telecommunications |
Keywords: Motion and Path Planning, View Planning for SLAM, Autonomous Agents
Abstract: Autonomous exploration of unknown environments is a critical task in robotic search and rescue operations. Recently, hierarchical planning frameworks have gained significant attention for their potential to enhance exploration efficiency. However, most existing approaches struggle with efficient exploration due to two key limitations: (1) neglecting subregion environmental information and (2) inconsistency between local and global paths. To overcome these challenges, we propose an Information Entropy-assisted Hierarchical Planning (IEHP) framework for efficient autonomous exploration. Specifically, we introduce an efficient subregion arrangement method that considers total travel distance, path similarity, and information entropy. Additionally, we propose a globally consistent frontier selection method to minimize redundant local paths, improving alignment between local and global planning. We validate the feasibility and efficiency of our approach through a series of complex simulation scenarios, with experimental results demonstrating the superiority of the proposed method.
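As a concrete reference for the entropy term, a subregion's exploration value can be scored by the mean binary Shannon entropy of its occupancy cells, so that unknown cells (probability near 0.5) dominate; the mean aggregation here is an assumption, not the paper's exact formula:

```python
import numpy as np

def subregion_entropy(p_occ):
    """p_occ: array of per-cell occupancy probabilities for one subregion.
    Returns mean binary entropy in bits; higher means more unknown space."""
    p = np.clip(p_occ, 1e-6, 1 - 1e-6)          # avoid log(0)
    return float(np.mean(-p * np.log2(p) - (1 - p) * np.log2(1 - p)))
```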
|
|
17:00-17:05, Paper WeDT7.5 | |
CAP: A Connectivity-Aware Hierarchical Coverage Path Planning Algorithm for Unknown Environments Using Coverage Guidance Graph |
|
Shen, Zongyuan | Carnegie Mellon University |
Shirose, Burhanuddin | Carnegie Mellon University |
Sriganesh, Prasanna | Carnegie Mellon University |
Travers, Matthew | Carnegie Mellon University |
Keywords: Motion and Path Planning
Abstract: Efficient coverage of unknown environments requires robots to adapt their paths in real time based on on-board sensor data. In this paper, we introduce CAP, a connectivity-aware hierarchical coverage path planning algorithm for efficient coverage of unknown environments. During online operation, CAP incrementally constructs a coverage guidance graph to capture essential information about the environment. Based on the updated graph, the hierarchical planner determines an efficient path to maximize global coverage efficiency and minimize local coverage time. The performance of CAP is evaluated and compared with five baseline algorithms through high-fidelity simulations as well as robot experiments. Our results show that CAP yields significant improvements in coverage time, path length, and path overlap ratio.
|
|
17:05-17:10, Paper WeDT7.6 | |
Real-Time Optimization-Based Quadrotor Trajectory Generation with Kinodynamic Constraints in Unknown Environments |
|
Zhao, Pinhui | Shenyang Institute of Automation, Chinese Academy of Sciences, |
Li, Decai | Shenyang Institute of Automation, Chinese Academy of Sciences |
Wu, Minjiang | Shenyang Institute of Automation, Chinese Academy of Sciences |
Zhou, Yuyang | Shenyang Institute of Automation Chinese Academy of Sciences |
He, Yuqing | Shenyang Institute of Automation, Chinese Academy of Sciences |
Keywords: Motion and Path Planning, Optimization and Optimal Control
Abstract: Indoor disaster relief and rescue missions require quadrotors to fully exploit their maneuverability in real time. However, the computational complexity induced by the underactuated kinodynamics conflicts with the rapid replanning requirement. For agile trajectory planning in cluttered and unknown environments, we propose a real-time optimization-based quadrotor trajectory generation method that integrates kinodynamic constraints in both the trajectory search and trajectory optimization phases to fully exploit maneuverability. To further improve efficiency, we introduce a waypoint selection strategy that reduces the computational burden of kinodynamic trajectory optimization by transforming obstacle-avoidance constraints into waypoint constraints, thereby enabling safe trajectory optimization in real time. Specifically, kinodynamic trajectories are first searched under kinodynamic constraints, providing reliable initial values for subsequent numerical optimization. Next, a waypoint selection algorithm, based on an estimate of the trajectory variation during optimization, preserves the obstacle-avoidance properties obtained during the search phase by limiting that variation with waypoint constraints. Finally, the trajectory is segmented at the waypoints, with a fixed time interval per segment, and optimized under kinodynamic constraints, ensuring real-time optimization at the cost of time-allocation optimality. We evaluate our method in simulation and validate its performance experimentally in cluttered and unknown real-world environments.
|
|
17:10-17:15, Paper WeDT7.7 | |
Benchmarking Shortcutting Techniques for Multi-Robot-Arm Motion Planning |
|
Huang, Philip | Carnegie Mellon University |
Shaoul, Yorai | Carnegie Mellon University |
Li, Jiaoyang | Carnegie Mellon University |
Keywords: Motion and Path Planning, Multi-Robot Systems, Dual Arm Manipulation
Abstract: Generating high-quality motion plans for multiple robot arms is challenging due to the high dimensionality of the system and the potential for inter-arm collisions. Traditional motion planning methods often produce motions that are suboptimal in terms of smoothness and execution time for multi-arm systems. Post-processing via shortcutting is a common approach to improve motion quality for efficient and smooth execution. However, in multi-arm scenarios, optimizing one arm's motion must not introduce collisions with other arms. Although existing multi-arm planning works often use some form of shortcutting techniques, their exact methodology and impact on performance are often vaguely described. In this work, we present a comprehensive study quantitatively comparing existing shortcutting methods for multi-arm trajectories across diverse simulated scenarios. We carefully analyze the pros and cons of each shortcutting method and propose two simple strategies for combining these methods to achieve the best performance-runtime tradeoff. Video, code, and dataset are available at https://philip-huang.github.io/mr-shortcut/
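For context, the canonical random-shortcutting loop that the benchmarked variants build on looks roughly like this; in the multi-arm setting, collision_free must check the stacked configuration of all arms, which is precisely where the compared methods differ:

```python
import random

def shortcut(path, collision_free, iters=200):
    """path: list of joint configurations; collision_free(a, b) checks the
    straight interpolation between two configurations, including
    inter-arm collisions. Returns a shorter, smoother path."""
    path = list(path)
    for _ in range(iters):
        if len(path) < 3:
            break
        i, j = sorted(random.sample(range(len(path)), 2))
        if j - i < 2:
            continue                         # nothing to skip between i and j
        if collision_free(path[i], path[j]):
            path = path[:i + 1] + path[j:]   # splice out the detour
    return path
```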
|
|
17:15-17:20, Paper WeDT7.8 | |
STORM: Spatial-Temporal Iterative Optimization for Reliable Multicopter Trajectory Generation |
|
Jinhao, Zhang | Harbin Institute of Technology, Shenzhen |
Zhexuan, Zhou | Harbin Institute of Technology, Shenzhen |
Wenlong, Xia | Harbin Institute of Technology, Shenzhen |
Gong, Youmin | Harbin Institute of Technology, Shenzhen |
Mei, Jie | Harbin Institute of Technology, Shenzhen |
Keywords: Motion and Path Planning, Autonomous Vehicle Navigation, Performance Evaluation and Benchmarking
Abstract: Efficient and safe trajectory planning plays a critical role in the application of quadrotor unmanned aerial vehicles. The inherent trade-off between constraint compliance and computational efficiency in UAV trajectory optimization has not yet been sufficiently addressed. To enhance the performance of UAV trajectory optimization, we propose a spatial-temporal iterative optimization framework. First, B-splines are used to represent UAV trajectories, with rigorous safety assurance achieved through strict enforcement of constraints on the control points. Subsequently, a set of QP-LP subproblems is derived via spatial-temporal decoupling and constraint linearization. Finally, an iterative optimization strategy incorporating guidance gradients is employed to obtain high-performance UAV trajectories in different scenarios. Both simulation and real-world experimental results validate the efficiency and high performance of the proposed optimization framework in generating safe and fast trajectories.
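A minimal sketch of the B-spline machinery behind the control-point safety argument: a uniform cubic B-spline segment lies inside the convex hull of its four control points, so constraining the control points to a safe corridor certifies the entire segment (standard basis matrix; not the paper's code):

```python
import numpy as np

def cubic_bspline_point(ctrl, s):
    """Evaluate one uniform cubic B-spline segment at s in [0, 1].
    ctrl: (4, dim) control points; returns a point of shape (dim,)."""
    M = (1.0 / 6.0) * np.array([[ 1,  4,  1, 0],
                                [-3,  0,  3, 0],
                                [ 3, -6,  3, 0],
                                [-1,  3, -3, 1]])
    u = np.array([1.0, s, s**2, s**3])
    return u @ M @ ctrl
```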
|
|
WeDT8 |
308 |
Micro/Nano Robots 7 |
Regular Session |
Chair: Yue, Tao | Shanghai University |
|
16:40-16:45, Paper WeDT8.1 | |
Muscle-On-A-Chip: A Self-Healing Actuator Platform in Robotic Systems |
|
Yin, Hongze | Shanghai University |
Zhou, Jing | Shanghai University |
Zhang, Juan | Shanghai University |
Yang, Huiying | Shanghai University |
Wang, Jiahao | Shanghai University, School of Mechatronic Engineering and Autom |
Zhang, Yuyin | Shanghai University |
Wang, Yue | Shanghai University |
Liu, Na | Shanghai University, Shanghai, China |
Yue, Tao | Shanghai University |
Keywords: Biological Cell Manipulation, Micro/Nano Robots, Biologically-Inspired Robots
Abstract: The regulation of muscle function is central to tissue engineering and sports science. This paper presents a simple microfluidic chip platform and its control method to investigate the regulation of muscle function. C2C12 cells, employed as the model system for skeletal muscle research, were inoculated onto the microfluidic chips and induced to differentiate into fully functional muscle tubes. Programmable actuation control enables localized strain gradients within the microfluidic platform, achieving differential mechanical regimes for functional modulation of integrated muscle constructs. The system implements mechanical conditioning to recapitulate exercise-induced myocyte damage and subsequent regenerative processes through controlled deformation protocols. Our radial-strain actuators generate 19.4% maximum principal strain, while axial-strain configurations achieve 8.3% baseline deformation. Dynamic input modulation enables precise strain reduction to 7.4% and 2.2%, respectively, establishing differential mechanical regimes for simulating exercise-associated functional impairment (high-strain phase) and recovery processes (low-strain phase). This strain-programmable platform establishes a robust framework for investigating mechanobiological thresholds in functional muscle regeneration.
|
|
16:45-16:50, Paper WeDT8.2 | |
Modeling and Simulation of Single-Micropipette Cell Rotation for Imitation Learning |
|
Wang, Zefu | Nankai University |
Hua, Yuchen | Nankai University |
Gong, Huiying | Nankai University |
Zhang, Yujie | Nankai University |
Yang, Zhanli | Nankai University |
Liu, Yaowei | Nankai University |
Zhao, Xin | Nankai University |
Sun, Mingzhu | Nankai University |
Keywords: Biological Cell Manipulation, Automation at Micro-Nano Scales, Imitation Learning
Abstract: 单元格旋转在显微作中起着至关重要的作用。 在手动细胞旋转技术中,单微量移液器 单元旋转因其高效率而被广泛采用 和灵活性。但是,目前没有方法 能够实现自动化的单微量移液器单元 旋转。在这项研究中,我们开发了第一个 三维 (3D) 模拟系统 单微量移液器细胞旋转。基于此模拟 系统,我们成功实现了单微量移液器单元 旋转模仿学习 (IL)。 具体来说,我们首先分析作用在细胞上的力 在流体中,建立一个动力学模型,描述 单元对 握住微量移液器的孔口,将 微量移液器和时间。然后我们开发了 cell 通过离散化模型来旋转仿真环境 以及设计模拟的单元并保持 基于真实世界条件的微量移液器模型。 最后,我们使用 该模型,在 模拟。结果表明,仿真 系统的相对误差范围为 5.34% 至 12.21% 与实际实验相比,表明 准确度。此外,单微量移液器 Cell Rotation 任务实现了 69% 的成功率,其中 平均完成时间为 17.13 秒,非常接近 专家数据的平均时间为 17.69 秒,证实了 模拟系统的可行
|
|
16:50-16:55, Paper WeDT8.3 | |
Automatic Alignment of the Micropipette for Efficient and Precise Cell Micromanipulation |
|
Cui, Shuai | Nanyang Technological University |
Ang, Wei Tech | Nanyang Technological University |
Keywords: Biological Cell Manipulation, Automation at Micro-Nano Scales
Abstract: Precise alignment of the micropipette tip is crucial for robotic cell micromanipulation, enabling delicate procedures such as cell transfer, rotation, and immobilization. However, due to the limited field of view and depth perception under high-magnification microscopy, it poses significant challenges in accurately identifying misalignment and effectively controlling micropipette motion for adjustment. This paper comprehensively analyzes and addresses the misalignment problem, particularly that caused by the improper inclination angle of the micropipette holder. A vision-guided robotic control strategy for automatic micropipette alignment is integrated into a 5-degree-of-freedom (5-DOF) micromanipulator, enabling autonomous detection, adjustment, and positioning of the micropipette. The proposed method ensures precise trajectory tracking and compensates for geometric uncertainties introduced by fabrication or installation errors. Experimental validation demonstrates that the proposed system achieved a mean absolute error below 3 μm for positioning the tip of the micropipette at the focal plane during the procedure of adjustment. Meanwhile, the robotic method required significantly less time to stably rotate the micropipette compared to manual operation for cell manipulation. The vertical alignment error was less than 3 μm along a 250 μm micropipette tip segment. These results confirm that the proposed approach significantly enhances speed, accuracy, and repeatability in micropipette-based micromanipulation, providing a robust solution for high-throughput biological experiments and clinical applications.
|
|
16:55-17:00, Paper WeDT8.4 | |
Automated Dual-Micropipette Coordination Microinjection for Batch Zebrafish Larvae Based on Pose Estimation |
|
Wang, Can | Nankai University |
Liu, Rongxin | Nankai University |
Gong, Huiying | Nankai University |
Wang, Zengshuo | Nankai University |
Zhou, Lu | Nankai University |
Liu, Yaowei | Nankai University |
Zhao, Xin | Nankai University |
Sun, Mingzhu | Nankai University |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots, Visual Servoing
Abstract: Zebrafish are widely used in the biomedical field as an ideal model for microinjection. In automated zebrafish microinjection, posture adjustment is the first and most skill-intensive step, and assessing injection success is challenging. Constrained by these two aspects, it is difficult to further improve the efficiency and success rate of injection. In this study, we propose an automated dual-micropipette coordination microinjection system. Zebrafish are randomly arranged in our system, reducing operational difficulty; the yolk is located with a pose estimation algorithm, and injection is then performed with the dual micropipettes. By halving the posture adjustment time, the proposed system achieves the shortest injection time of 15.2 s. Moreover, the simplicity of the system and its ease of operation contribute to its clinical feasibility.
|
|
17:00-17:05, Paper WeDT8.5 | |
Learning-Based Motion Controller for Reconfigurable Microswarms |
|
Li, Yamei | The Hong Kong Polytechnic University |
Tang, Yunxi | The Chinese University of Hong Kong |
Wang, Yun | The Hong Kong Polytechnic University |
Li, Yangmin | The Hong Kong Polytechnic University |
Yang, Lidong | The Hong Kong Polytechnic University |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots
Abstract: Motion control of magnetic microswarms has attracted extensive attention due to its significance in microrobot-based biomedical applications such as targeted drug delivery. However, such reconfigurable microswarms are subject to complex interactions between individuals and with the environment, which make accurate modeling challenging. These complexities pose challenges for precise motion control, as traditional controllers often rely on precise mathematical models and manual parameter tuning, which limits their scalability and efficiency. Learning-based methods, such as Deep Reinforcement Learning (DRL), offer an alternative but require large datasets (often on the order of millions of samples) and extensive exploration, which may destabilize microswarms in physical environments through unreasonable actions during early training, resulting in a sim-to-real gap. Moreover, traditional DRL focuses on instantaneous state-action mappings, neglecting the sequential dependencies critical for accurate motion control and leading to low tracking accuracy in complex scenarios. To address these challenges, we propose a Learning from Demonstration (LfD)-based motion control framework, which inherently encodes compensatory behaviors and task-specific adaptability into neural networks, enabling adaptive performance even under unmodeled disturbances. Furthermore, the networks consider a time series of microswarm states when determining future control actions, enabling the system to learn sequential dependencies and state transitions and thereby ensure smooth and accurate motion control. Simulations and comparative experiments validate the framework's effectiveness and demonstrate superior control accuracy and adaptability to changes in the microswarm's shape.
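A hedged sketch of the sequence-aware controller idea: a window of past microswarm states feeds a recurrent network that outputs the next actuation command, trained by behavior cloning on demonstrations; all dimensions and the LSTM choice are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SwarmController(nn.Module):
    """Maps a window of swarm states (e.g., centroid, spread, shape
    descriptor) to the next magnetic-field command."""
    def __init__(self, state_dim=6, hidden=64, action_dim=3):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, action_dim)

    def forward(self, states):          # states: (batch, window, state_dim)
        out, _ = self.lstm(states)
        return self.head(out[:, -1])    # act on the last hidden state

# Behavior cloning on demonstration pairs:
# loss = torch.nn.functional.mse_loss(model(states), expert_actions)
```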
|
|
WeDT9 |
309 |
Object Detection, Segmentation and Categorization 4 |
Regular Session |
|
16:40-16:45, Paper WeDT9.1 | |
Gyrevento: Event-Based Omnidirectional Visual Gyroscope in a Manhattan World |
|
Rodrigues Da Costa, Daniel | Université De Picardie Jules Verne |
Vasseur, Pascal | Université De Picardie Jules Verne |
Morbidi, Fabio | Université De Picardie Jules Verne |
Keywords: Omnidirectional Vision, Vision-Based Navigation, Data Sets for Robotic Vision
Abstract: In this paper, we study the problem of estimating the orientation of an event omnidirectional camera mounted on a robot and observing 3D parallel lines in a man-made environment (Manhattan world). We present Gyrevento, the first event-based omnidirectional visual gyroscope. Gyrevento does not require any initialization, provides certifiably globally optimal solutions, and is scalable, since the size of the nonlinear least-squares cost function is independent of the number of lines. Thanks to the Cayley-Gibbs-Rodrigues parameterization of a 3D rotation, this cost function is a degree-four rational function in three variables, which can be efficiently minimized via off-the-shelf polynomial optimization software. Numerical simulations and real-world experiments with a robot manipulator show the effectiveness of our visual gyroscope and elucidate the impact of camera velocity on the attitude estimation error.
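For reference, the Cayley-Gibbs-Rodrigues map is rational in its three parameters, which is what makes the least-squares attitude cost a degree-four rational function; the standard formula, sketched in Python:

```python
import numpy as np

def skew(s):
    return np.array([[0, -s[2], s[1]],
                     [s[2], 0, -s[0]],
                     [-s[1], s[0], 0]])

def cgr_to_rotation(s):
    """s = tan(theta/2) * axis  ->  3x3 rotation matrix (Cayley transform).
    Every entry is a rational function of s, with no trigonometry."""
    s = np.asarray(s, dtype=float)
    n2 = s @ s
    return ((1 - n2) * np.eye(3) + 2 * np.outer(s, s) + 2 * skew(s)) / (1 + n2)
```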
|
|
16:45-16:50, Paper WeDT9.2 | |
LGDD: Local-Global Synergistic Dual-Branch 3D Object Detection Using 4D Radar |
|
Bai, Xiaokai | Zhejiang University |
Qing, Yang | Zhejiang University |
Zhou, Zili | Zhejiang University |
Zhang, Fuyi | Zhejiang University |
Zhe, Wu | Zhejiang University |
Cao, Siyuan | Zhejiang University |
Zheng, Lianqing | Tongji University |
Yu, Beinan | Zhejiang University |
Wang, Fang | Hangzhou City University |
Bai, Jie | Hangzhou City University |
Shen, Hui-liang | Zhejiang University |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, Recognition
Abstract: 4D millimeter-wave radar plays a pivotal role in autonomous driving due to its cost-effectiveness and robustness in adverse weather. However, the application of 4D radar point clouds to 3D perception tasks is hindered by their inherent sparsity and noise. To address these challenges, we propose LGDD, a novel local-global synergistic dual-branch 3D object detection framework using 4D radar. Specifically, we first introduce a point-based branch, which utilizes a voxel-attended point feature extractor (VPE) to integrate semantic segmentation with cluster voting, thereby mitigating radar noise and extracting locally clustered instance features. Then, for the conventional pillar-based branch, we design a query-based feature pre-fusion (QFP) module to address the sparsity and enhance global context representation. Additionally, we devise a proposal mask to filter out noisy points, enabling more focused clustering on regions of interest. Finally, we align the local instances with the global context through a semantics-geometry aware fusion (SGF) module to achieve comprehensive scene understanding. Extensive experiments demonstrate that LGDD achieves state-of-the-art performance on the public View-of-Delft and TJ4DRadSet datasets. Source code is available at https://github.com/shawnnnkb/LGDD.
|
|
16:50-16:55, Paper WeDT9.3 | |
STC-Tracker: Spatiotemporal-Consistent Multi-Robot Collaboration Framework for Long-Term Dynamic Object Tracking |
|
Dong, Yanchao | Tongji University |
Liu, Yuhao | Tongji University |
Li, Jinsong | Tongji University |
He, Bin | Tongji University |
Keywords: Object Detection, Segmentation and Categorization, Visual Tracking, Multi-Robot Systems
Abstract: Multi-robot cooperative tracking, as a vital subfield of multi-robot collaboration, exhibits significant potential in areas such as military reconnaissance and emergency rescue. Conventional dynamic object tracking methods often suffer from incomplete target detection, and even target loss, in complex scenes owing to variations in viewpoint or occlusion. To address these problems, this paper proposes STC-Tracker, a multi-robot collaborative tracking system aimed at extending the lifecycle of dynamic objects. On the one hand, the system restores the original appearance of objects by retracing historical point clouds from keyframes while monitoring their motion trajectories in real time. On the other hand, by estimating the motion model of each target, our system is capable of maintaining the lifecycle of specific objects even in cases of brief disappearance. Experiments are conducted on public and self-collected datasets. The results demonstrate that our algorithm outperforms state-of-the-art methods in both single-robot and multi-robot configurations while exhibiting low computational resource consumption. In addition, our algorithm supports LiDARs with different scanning patterns, including spinning and solid-state LiDARs, and is capable of real-time dynamic object tracking and global map construction.
|
|
16:55-17:00, Paper WeDT9.4 | |
Anyview: General Indoor 3D Object Detection with Variable Frames |
|
Wu, Zhenyu | Beijing University of Posts and Telecommunications |
Xu, Xiuwei | Tsinghua University |
Wang, Ziwei | Nanyang Technological University |
Xia, Chong | Tsinghua University |
Zhao, Linqing | Tsinghua University |
Lu, Jiwen | Tsinghua University |
Yan, Haibin | Beijing University of Posts and Telecommunications |
Keywords: RGB-D Perception, Recognition, Deep Learning Methods
Abstract: In this paper, we propose a novel network framework for indoor 3D object detection that handles variable numbers of input frames in practical scenarios. Existing methods only consider fixed-frame input data for a single detector, such as monocular RGB-D images or point clouds reconstructed from dense multi-view RGB-D images. In practical application scenarios such as robot navigation and manipulation, however, the raw input to a 3D detector is RGB-D images with variable frame numbers rather than a reconstructed scene point cloud, and previous approaches that only handle fixed-frame input perform poorly under variable-frame input. To make 3D object detection suitable for such practical tasks, we present a novel 3D detection framework named AnyView, which generalizes well across different numbers of input frames with a single model. Specifically, we propose a geometric learner to mine the local geometric features of each input RGB-D frame and implement local-global feature interaction through a designed spatial mixture module. Meanwhile, we further utilize a dynamic token strategy to adaptively adjust the number of extracted features for each frame, ensuring consistent global feature density and further enhancing generalization after fusion. Extensive experiments on the ScanNet dataset show our method achieves both strong generalizability and high detection accuracy with a simple, clean architecture containing a similar number of parameters to the baselines.
|
|
17:00-17:05, Paper WeDT9.5 | |
Adapting Pre-Trained Vision Models for Novel Instance Detection and Segmentation |
|
Lu, Yangxiao | The University of Texas at Dallas |
P, Jishnu Jaykumar | The University of Texas at Dallas |
Guo, Yunhui | The University of Texas at Dallas |
Ruozzi, Nicholas | The University of Texas at Dallas |
Xiang, Yu | University of Texas at Dallas |
Keywords: Object Detection, Segmentation and Categorization, Representation Learning
Abstract: Novel Instance Detection and Segmentation (NIDS) aims at detecting and segmenting novel object instances given a few examples of each instance. We propose a unified, simple, yet effective framework (NIDS-Net) comprising object proposal generation, embedding creation for both instance templates and proposal regions, and embedding matching for instance label assignment. Leveraging recent advancements in large vision methods, we utilize Grounding DINO and Segment Anything Model (SAM) to obtain object proposals with accurate bounding boxes and masks. Central to our approach is the generation of high-quality instance embeddings. We utilize foreground feature averages of patch embeddings from the DINOv2 ViT backbone, followed by refinement through a weight adapter mechanism that we introduce. We show experimentally that our weight adapter can adjust the embeddings locally within their feature space and effectively limit overfitting in the few-shot setting. Furthermore, the weight adapter optimizes weights to enhance the distinctiveness of instance embeddings during similarity computation. This methodology enables a straightforward matching strategy that results in significant performance gains. Our framework surpasses current state-of-the-art methods, demonstrating notable improvements on four detection datasets. In the segmentation tasks on seven core datasets of the BOP challenge, our method outperforms the leading published RGB methods and remains competitive with the best RGB-D method. We have also verified our method using real-world images from a Fetch robot and a RealSense camera.
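A minimal sketch of the matching stage: proposal and template embeddings are L2-normalized, and each proposal takes the label of its most similar template unless the similarity falls below a rejection threshold (the threshold value is an illustrative assumption):

```python
import numpy as np

def match_proposals(proposal_emb, template_emb, labels, tau=0.5):
    """proposal_emb: (P, d); template_emb: (K, d); labels: K instance names.
    Returns a (label-or-None, confidence) pair per proposal."""
    p = proposal_emb / np.linalg.norm(proposal_emb, axis=1, keepdims=True)
    t = template_emb / np.linalg.norm(template_emb, axis=1, keepdims=True)
    sim = p @ t.T                              # cosine similarity (P, K)
    best, conf = sim.argmax(axis=1), sim.max(axis=1)
    return [(labels[b] if c >= tau else None, c) for b, c in zip(best, conf)]
```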
|
|
17:05-17:10, Paper WeDT9.6 | |
Class-Aware PillarMix: Can Mixed Sample Data Augmentation Enhance 3D Object Detection with Radar Point Clouds? |
|
Zhang, Miao | Robert Bosch GmbH |
Abdulatif, Sherif | Robert Bosch GmbH |
Loesch, Benedikt | Robert Bosch GmbH |
Altmann, Marco | Robert Bosch GmbH |
Yang, Bin | University of Stuttgart |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, Deep Learning Methods
Abstract: Due to the significant effort required for data collection and annotation in 3D perception tasks, mixed sample data augmentation (MSDA) has been widely studied to generate diverse training samples by mixing existing data. Among these methods, MixUp is a prominent approach that generates new samples by linearly combining two existing ones, using a mix ratio sampled from a beta distribution. This simple yet powerful method has inspired numerous variations and applications in 2D and 3D data domains. Recently, many MSDA techniques have been developed for point clouds, but they mainly target LiDAR data, leaving their application to radar point clouds largely unexplored. In this paper, we examine the feasibility of applying existing MSDA methods to radar point clouds and identify several challenges in adapting these techniques. These obstacles stem from the radar's irregular angular distribution, deviations from a single-sensor polar layout in multi-radar setups, and point sparsity. To address these issues, we propose Class-Aware PillarMix (CAPMix), a novel MSDA approach that applies MixUp at the pillar level in 3D point clouds, guided by class labels. Unlike methods that apply a single mix ratio to the entire sample, CAPMix assigns an independent ratio to each pillar, boosting sample diversity. To account for the density of different classes, we use class-specific distributions: for dense objects (e.g., large vehicles), we skew ratios to favor points from another sample, while for sparse objects (e.g., pedestrians), we sample more points from the original. This class-aware mixing retains critical details and enriches each sample with new information, ultimately generating more diverse training data. Experimental results demonstrate that our method not only significantly boosts performance but also outperforms existing MSDA approaches across two datasets (Bosch Street and K-Radar). We believe that this straightforward yet effective approach will spark further investigation into MSDA techniques for radar data.
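A hedged sketch of the class-aware, per-pillar mixing idea: every pillar draws its own Beta-distributed mix ratio, with the Beta shape chosen by class density; the shape parameters, class split, and fixed-size policy below are illustrative assumptions, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)
DENSE = frozenset({"car", "truck"})             # illustrative dense classes

def capmix_ratio(cls):
    """Dense classes skew toward importing points from the other sample;
    sparse classes skew toward keeping their own points."""
    a, b = (2.0, 8.0) if cls in DENSE else (8.0, 2.0)
    return rng.beta(a, b)                       # fraction kept from original

def mix_pillar(pts_a, pts_b, cls):
    lam = capmix_ratio(cls)
    n_a = min(int(round(lam * len(pts_a))), len(pts_a))
    n_b = min(len(pts_a) - n_a, len(pts_b))     # top up from the other sample
    keep_a = pts_a[rng.choice(len(pts_a), size=n_a, replace=False)]
    keep_b = pts_b[rng.choice(len(pts_b), size=n_b, replace=False)]
    return np.concatenate([keep_a, keep_b], axis=0)
```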
|
|
17:10-17:15, Paper WeDT9.7 | |
Improved Calibration for Panoramic Annular Lens Systems with Angular Modulation |
|
Wang, Ding | Fudan University |
Wang, Junhua | Fudan University |
Yuhan, Tian | Fudan University |
Xu, Min | Fudan University |
Lingbao, Kong | Fudan University |
Keywords: Omnidirectional Vision, Recognition
Abstract: This paper addresses the challenges of calibrating Panoramic Annular Lens (PAL) systems, which exhibit unique projection characteristics due to their imaging relationship designed to compress blind zones. Traditional camera calibration methods often fail to accurately capture these properties. To resolve this limitation, we propose a novel projection model that incorporates angular modulation, enabling a more accurate representation of the PAL system's imaging process. This formulation significantly improves the model's ability to describe the relationship between object space and image space. We evaluate our approach on both synthetic and real-world datasets tailored for PAL cameras. Experimental results demonstrate that the model achieves sub-pixel accuracy, with reprojection errors typically ranging from 0.1 to 0.3 pixels on 2048×2048 images when using five distortion terms. This level of precision surpasses existing calibration models for panoramic cameras, making our method particularly suitable for high-accuracy applications. The datasets used in this study are publicly available at https://github.com/wwendy233/PALcalib.
|
|
WeDT10 |
310 |
Bioinspired Robot Learning |
Regular Session |
Co-Chair: Cheng, Hui | Sun Yat-Sen University |
|
16:40-16:45, Paper WeDT10.1 | |
Biomechanically-Inspired Bipedal Robot Locomotion Via Hybrid Gait Representation and Model-Guided Reinforcement Learning |
|
Xie, Lijie | Sun Yat-Sen University |
Rong, Haomin | Sun Yat-Sen University |
Chen, Zujian | Shenzhen University |
Zhou, Zida | Sun Yat-Sen University |
Mo, Shaolin | Sun Yat-Sen University |
Cheng, Hui | Sun Yat-Sen University |
Keywords: Bioinspired Robot Learning, Deep Learning Methods, Legged Robots
Abstract: Achieving stable and natural locomotion in bipedal robots, comparable to that of humans and animals, remains a long-standing challenge in robotics. In this work, we propose a bio-inspired low-level control framework that streamlines the generation of naturalistic gait patterns while ensuring adaptability. Our approach begins with the design of a low-dimensional gait representation that captures key characteristics of human and animal locomotion. This representation is then integrated with the Linear Inverted Pendulum Model (LIPM) to form an abstract yet effective motion descriptor. Serving as a kinematic reference within a reinforcement learning (RL) framework, this descriptor enables the training of control policies that strike a balance between biomechanical realism and adaptability. Rather than strictly adhering to predefined gait trajectories, the learned policies dynamically adjust to optimize both stability and velocity tracking. As a result, our method enables bipedal robots to exhibit smooth, biomechanically realistic locomotion while enhancing stability and adaptability. We validate the proposed framework through real-world experiments on our bipedal robot, demonstrating its ability to achieve stable and efficient locomotion.
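For reference, the LIPM that anchors the motion descriptor reduces the robot to a point mass at constant height z0, giving linear center-of-mass dynamics; a minimal forward-Euler sketch:

```python
def lipm_step(x, xd, p, z0=0.6, g=9.81, dt=0.01):
    """One integration step of the Linear Inverted Pendulum Model:
    the CoM accelerates away from the stance foot, xdd = (g / z0) * (x - p).
    x, xd: CoM position and velocity; p: stance foot position."""
    xdd = (g / z0) * (x - p)
    return x + xd * dt, xd + xdd * dt
```

Rolling this model forward under a candidate footstep plan produces the kinematic reference that, per the abstract, the learned policy tracks loosely rather than strictly.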
|
|
16:45-16:50, Paper WeDT10.2 | |
Bridge the Gap: Enhancing Quadruped Locomotion with Vertical Ground Perturbations |
|
Stasica, Maximilian | Lauflabor Locomotion Laboratory, Institute of Sports Science And |
Bick, Arne | Lauflabor Locomotion Laboratory, Institute of Sports Science And |
Bohlinger, Nico | TU Darmstadt |
Mohseni, Omid | Technische Universität Darmstadt |
Fritzsche, Max Johannes Alois | Institute of Structural Mechanics and Design, Technical Universi |
Hübler, Clemens | Institute of Structural Mechanics and Design, Technical Universi |
Peters, Jan | Technische Universität Darmstadt |
Seyfarth, Andre | TU Darmstadt |
Keywords: Bioinspired Robot Learning, Legged Robots, Machine Learning for Robot Control
Abstract: Legged robots, particularly quadrupeds, excel at navigating rough terrains, yet their performance under vertical ground perturbations, such as those from oscillating surfaces, remains underexplored. This study introduces a novel approach to enhance quadruped locomotion robustness by training the Unitree Go2 robot on an oscillating bridge—a 13.24-meter steel-and-concrete structure with a 2 Hz eigenfrequency designed to perturb locomotion. Using Reinforcement Learning (RL) with the Proximal Policy Optimization (PPO) algorithm in a MuJoCo simulation, we developed 15 distinct locomotion policies, combining five gaits (trot, pace, bound, free, default) with three training conditions: rigid bridge and two oscillating bridge setups with differing height regulation strategies (relative to bridge surface or ground). Domain randomization ensured zero-shot transfer to the real-world bridge. Our results demonstrate that policies trained on the oscillating bridge exhibit superior stability and adaptability compared to those trained on rigid surfaces. Our framework enables robust gait patterns even without prior bridge exposure. These findings highlight the potential of simulation-based RL to improve quadruped resilience against dynamic ground perturbations, offering insights for designing robots capable of traversing vibrating environments.
|
|
16:50-16:55, Paper WeDT10.3 | |
Bio-Inspired Hybrid Map: Spatial Implicit Local Frames and Topological Map for Mobile Cobot Navigation |
|
Dang, Tuan | University of Texas at Arlington |
Huber, Manfred | University of Texas at Arlington |
Keywords: Bioinspired Robot Learning, AI-Enabled Robotics
Abstract: Navigation is a fundamental capability for mobile robots, enabling them to operate autonomously in complex and dynamic environments. Conventional approaches use probabilistic models to localize robots and build maps simultaneously from sensor observations. Recent approaches employ human-inspired learning, such as imitation and reinforcement learning, to navigate robots more effectively. However, these methods suffer from high computational costs, global map inconsistency, and poor generalization to unseen environments. This paper presents a novel method inspired by how humans perceive and navigate novel environments effectively. Specifically, we first build local frames that mimic how humans represent essential spatial information in the short term. Points in these local frames are hybrid representations combining spatial information with learned features, which we call spatial-implicit local frames. We then integrate the spatial-implicit local frames into a global topological map represented as a factor graph. Lastly, we develop a novel navigation algorithm based on Rapidly-exploring Random Tree Star (RRT*) that leverages the spatial-implicit local frames and the topological map to navigate effectively. To validate our approach, we conduct extensive experiments on real-world datasets and in-lab environments. We open-source our code at https://github.com/tuantdang/simn.
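As background for the navigation algorithm, a bare-bones RRT skeleton in 2D; RRT*, which the paper builds on, additionally rewires the tree for path optimality, and the paper biases sampling with the topological map and spatial-implicit local frames rather than sampling uniformly as here:

```python
import numpy as np

rng = np.random.default_rng(1)

def rrt(start, goal, collision_free, lo, hi, step=0.5, iters=2000, tol=0.5):
    """Grow a tree from start toward uniformly sampled points until a node
    lands within tol of goal; returns the path or None."""
    nodes, parent = [np.asarray(start, float)], [0]
    goal = np.asarray(goal, float)
    for _ in range(iters):
        q = rng.uniform(lo, hi)                         # random 2D sample
        i = int(np.argmin([np.linalg.norm(n - q) for n in nodes]))
        d = q - nodes[i]
        new = nodes[i] + step * d / (np.linalg.norm(d) + 1e-9)
        if not collision_free(nodes[i], new):
            continue
        nodes.append(new)
        parent.append(i)
        if np.linalg.norm(new - goal) < tol:            # reached goal region
            path, j = [new], len(nodes) - 1
            while j != 0:
                j = parent[j]
                path.append(nodes[j])
            return path[::-1]
    return None
```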
|
|
16:55-17:00, Paper WeDT10.4 | |
Bio-Inspired Plastic Neural Networks for Zero-Shot Out-Of-Distribution Generalization in Complex Animal-Inspired Robots |
|
Leung, Binggwong | Vidyasirimedhi Institute of Science and Technology (VISTEC) |
Haomachai, Worasuchad | Vidyasirimedhi Institute of Science & Technology |
Pedersen, Joachim Winther | IT University of Copenhagen |
Risi, Sebastian | IT University of Copenhagen |
Manoonpong, Poramate | Vidyasirimedhi Institute of Science and Technology (VISTEC) |
Keywords: Bioinspired Robot Learning, Machine Learning for Robot Control, Biologically-Inspired Robots
Abstract: Artificial neural networks can be used to solve a variety of robotic tasks. However, they risk failing catastrophically when faced with out-of-distribution (OOD) situations. Several approaches have employed a type of synaptic plasticity known as Hebbian learning that can dynamically adjust weights based on local neural activity. Research has shown that synaptic plasticity can make policies more robust and help them adapt to unforeseen changes in the environment. However, networks augmented with Hebbian learning can suffer from weight divergence, resulting in network instability. Furthermore, such Hebbian networks have not yet been applied to legged locomotion in complex real robots with many degrees of freedom. In this work, we improve the Hebbian network with a weight normalization mechanism that prevents weight divergence, analyze the principal components of the Hebbian network's weights, and perform a thorough evaluation of network performance in locomotion control for real 18-DOF dung beetle-like and 16-DOF gecko-like robots. We find that the Hebbian-based plastic network can achieve zero-shot sim-to-real locomotion adaptation and generalize to unseen conditions, such as uneven terrain and morphological damage.
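A minimal sketch of a Hebbian update with the kind of weight normalization the abstract describes; the simplified A·pre·post rule and the row-wise L2 cap are illustrative choices, not necessarily the paper's exact formulation:

```python
import numpy as np

def hebbian_update(W, pre, post, A, eta=0.01):
    """W: (n_post, n_pre) weights; pre/post: activity vectors; A: learned
    per-synapse plasticity coefficients of the same shape as W.
    The normalization caps each row's norm, preventing weight divergence."""
    W = W + eta * A * np.outer(post, pre)       # local Hebbian co-activity
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, 1.0)           # shrink rows with norm > 1
```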
|
|
17:00-17:05, Paper WeDT10.5 | |
Two-Stage Learning Framework Combining Joint-Level Reinforcement Learning and Muscle-Level Adaptation for Musculoskeletal Locomotion |
|
Azoulay, Laurie | Institut National Des Sciences Appliquées De Lyon |
Kutsuzawa, Kyo | Saitama University |
Koseki, Shunsuke | Tohoku University |
Owaki, Dai | Tohoku University |
Hayashibe, Mitsuhiro | Tohoku University |
Keywords: Bioinspired Robot Learning, Biomimetics, Deep Learning Methods
Abstract: Animal musculoskeletal systems are renowned for their ability to dynamically regulate stiffness and achieve energy-efficient motion. Being inspired by the biological control structure, this study presents a hybrid control framework that utilizes two-stage learning processes for body movement planning and muscle force computation. This methodology simplifies the learning process under joint redundancy and muscle redundancy. Then it enhances the interpretability of the resultant generated behaviors. The framework incorporates a reinforcement learning (RL)-trained joint controller to optimize joint torques, in conjunction with an LSTM-based muscle controller that translates these torques into muscle activations. Two control variants are proposed: One is prioritizing energy efficiency and the other is enhancing adaptability to environmental perturbations through co-contraction control. Validation with MuJoCo physics simulations demonstrates the framework's capacity to autonomously learn and refine different gait modes without dependence on external motion datasets. The second variant demonstrates superior robustness and energy efficiency compared to conventional motor-driven models. This framework contributes to the enhancement of adaptability in complex scenarios dealing with the redundancy problem of musculoskeletal system coordination and holds potential for the development of bio-inspired locomotion control through the optimization of muscle activity composition.
|
|
17:05-17:10, Paper WeDT10.6 | |
NavHD: Low-Power Learning for Micro-Robotic Controls in the Wild |
|
Lee, Chae Young | Stanford University |
Achour, Sara | Stanford University |
Kapetanovic, Zerina | Stanford University |
Keywords: Sensorimotor Learning, Bioinspired Robot Learning, Micro/Nano Robots
Abstract: Micro-robots are emerging as powerful tools for search-and-rescue, precision agriculture, and cooperative manipulation, where their small size and low cost offer advantages over larger robots. However, enabling autonomous navigation on these robots remains challenging due to severe hardware constraints, such as limited memory, energy, and computational power. We explore a brain-inspired learning paradigm called Hyperdimensional Computing (HDC) to equip a cheap, lightweight navigation model that runs onboard micro-robots. We present NavHD, which features an adaptive HD encoder that learns spatial representations and incorporates loss-based training for both imitation learning and off-policy reinforcement learning. Our hardware implementation of NavHD uses eight ultrasound sensors and is optimized to run on an ARM Cortex-M4 core, using only 10.2 kB of memory, 900 clock cycles, and 1.1 mJ of energy per inference. Through experiments in both simulation and the real world, we demonstrate that NavHD outperforms DNN-based and prior HDC-based RL methods in obstacle avoidance by more than 2x, while achieving 2-26x better resource efficiency.
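A minimal sketch of the classic HDC encoding pattern for an eight-ultrasound input: bind each sensor's random identity hypervector with a quantized-distance level hypervector, then bundle everything with a majority vote; dimensionality, level count, and range are illustrative, not NavHD's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_SENSORS, N_LEVELS, R_MAX = 4096, 8, 16, 4.0       # illustrative sizes

SENSOR_HVS = rng.choice([-1, 1], size=(N_SENSORS, D))  # identity per channel
LEVEL_HVS = rng.choice([-1, 1], size=(N_LEVELS, D))    # distance codebook

def encode(readings):
    """readings: 8 ultrasound ranges in meters -> one bipolar hypervector."""
    q = np.clip((np.asarray(readings) / R_MAX * N_LEVELS).astype(int),
                0, N_LEVELS - 1)                # quantize each distance
    bound = SENSOR_HVS * LEVEL_HVS[q]           # bind = elementwise multiply
    return np.sign(bound.sum(axis=0))           # bundle = elementwise majority
```

Inference then reduces to comparing the encoded state against a handful of stored class hypervectors, which is what keeps the memory and energy budget small.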
|
|
17:10-17:15, Paper WeDT10.7 | |
Physics-Aware Combinatorial Assembly Sequence Planning Using Data-Free Action Masking |
|
Liu, Ruixuan | Carnegie Mellon University |
Chen, Alan | Westlake High School |
Zhao, Weiye | Carnegie Mellon University, Robotics Institute |
Liu, Changliu | Carnegie Mellon University |
Keywords: Assembly, Task Planning, Reinforcement Learning
Abstract: Combinatorial assembly uses standardized unit primitives to build objects that satisfy user specifications. This paper studies assembly sequence planning (ASP) for physical combinatorial assembly. Given the shape of the desired object, the goal is to find a sequence of actions for placing unit primitives to build the target object. In particular, we aim to ensure the planned assembly sequence is physically executable. However, ASP for combinatorial assembly is particularly challenging due to its combinatorial nature. To address the challenge, we employ deep reinforcement learning to learn a construction policy for placing unit primitives sequentially to build the desired object. Specifically, we design an online physics-aware action mask that filters out invalid actions, which effectively guides policy learning and ensures violation-free deployment. Finally, we apply the proposed method to Lego assembly with more than 250 3D structures. The experimental results demonstrate that the proposed method plans physically valid assembly sequences to build all structures, achieving a 100% success rate, whereas the best comparable baseline fails on more than 40 structures. Our implementation is available at https://github.com/intelligent-control-lab/PhysicsAwareCombinatorialASP .
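The core trick, masking physically invalid actions out of the policy's distribution before sampling, can be sketched in a few lines; the mask construction here is a stand-in for the paper's physics check, which we do not reproduce:

```python
import torch

def masked_action_distribution(logits, valid_mask):
    """Zero out invalid actions by sending their logits to -inf.

    logits:     (batch, n_actions) raw policy outputs
    valid_mask: (batch, n_actions) boolean, True where an action is
                physically executable (e.g., a brick placement that
                keeps the partial assembly stable)
    """
    masked = logits.masked_fill(~valid_mask, float("-inf"))
    return torch.distributions.Categorical(logits=masked)

logits = torch.randn(1, 6)
# Hypothetical physics check: suppose only actions 0, 2, and 5 are stable.
mask = torch.tensor([[True, False, True, False, False, True]])
dist = masked_action_distribution(logits, mask)
action = dist.sample()                 # never samples an invalid action
assert mask[0, action].item()
```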
|
|
17:15-17:20, Paper WeDT10.8 | |
MICL: Mutual Information Guided Continual Learning for LiDAR Place Recognition |
|
Liu, BinHong | Northwestern Polytechnical University |
Yang, Tao | Northwestern Polytechnical University |
Fang, YangWang | Northwestern Polytechnical University |
Yan, Zhi | École Nationale Supérieure De Techniques Avancées (ENSTA) |
Keywords: Continual Learning, Localization, Incremental Learning
Abstract: LiDAR Place Recognition (LPR) aims to identify previously visited places across different environments and times. Thanks to recent advances in Deep Neural Networks (DNNs), LPR has experienced rapid development. However, DNN-based LPR methods may suffer from Catastrophic Forgetting (CF), where they tend to forget previously learned domains and focus more on adapting to a new domain. In this paper, we propose Mutual Information-guided Continual Learning (MICL) to tackle this problem in LPR. We design a domain-sharing loss function, Mutual Information Loss (MIL), to encourage existing DNN-based LPR methods to learn and preserve knowledge that may not be useful for the current domain but is potentially beneficial for other domains. MIL overcomes CF from an information-theoretic perspective, including two aspects: 1) maximizing the preservation of information from input data in descriptors, and 2) maximizing the preservation of information in descriptors when training across different domains. Additionally, we design a simple yet effective memory sampling strategy to further alleviate CF in LPR. Furthermore, we adopt adaptive loss weighting, which reduces the need for hyperparameters and enables models to make optimal trade-offs automatically. We conducted experiments on three large-scale LiDAR datasets including Oxford, MulRan, and PNV. The experimental results demonstrate that our MICL outperforms state-of-the-art continual learning approaches. The code of MICL is publicly available at: https://github.com/npu-ius-lab/MICL
|
|
WeDT11 |
311A |
Reinforcement Learning 8 |
Regular Session |
|
16:40-16:45, Paper WeDT11.1 | |
Heterogeneous Multi-Robot Task Allocation and Scheduling Via Reinforcement Learning |
|
Dai, Weiheng | National University of Singapore |
Rai, Utkarsh | CVIT, IIIT Hyderabad |
Chiun, Jimmy | National University of Singapore |
Cao, Yuhong | National University of Singapore |
Sartoretti, Guillaume Adrien | National University of Singapore (NUS) |
Keywords: Planning, Scheduling and Coordination, Multi-Robot Systems, Reinforcement Learning
Abstract: Many multi-robot applications require allocating a team of heterogeneous agents (robots) with different abilities to cooperatively complete a given set of spatially distributed tasks as quickly as possible. We focus on tasks that can only be initiated when all required agents are present; otherwise, agents that have already arrived wait idly. Agents need not only to execute a sequence of tasks by dynamically forming and disbanding teams to satisfy the diverse ability requirements of each task, but also to account for the schedules of other agents to minimize unnecessary idle time. Conventional methods such as mixed-integer programming generally require centralized scheduling and long optimization times, which limits their potential for real-world applications. In this work, we propose a reinforcement learning framework to train a decentralized policy applicable to heterogeneous agents. To address the challenge of learning complex cooperation, we further introduce a constrained flashforward mechanism to guide and constrain the agents' exploration and help them make better predictions. Through an attention mechanism that reasons about both short-term cooperation and long-term scheduling dependencies, agents learn to reactively choose their next tasks (and subsequent coalitions) to avoid wasting abilities and to shorten the overall task completion time (makespan). We compare our method with state-of-the-art heuristic and mixed-integer programming methods, demonstrating its generalization ability and showing that it closely matches or outperforms these baselines while remaining at least two orders of magnitude faster.
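The attention mechanism described here can be pictured as scaled dot-product scoring of candidate tasks against an agent's state embedding; a generic sketch, where the dimensions and networks are illustrative rather than the paper's architecture:

```python
import math
import torch
import torch.nn as nn

class TaskAttention(nn.Module):
    """Score candidate tasks for one agent via scaled dot-product attention."""

    def __init__(self, d_agent=32, d_task=16, d_k=64):
        super().__init__()
        self.q = nn.Linear(d_agent, d_k)
        self.k = nn.Linear(d_task, d_k)

    def forward(self, agent_state, task_feats, done_mask):
        """task_feats: (n_tasks, d_task); done_mask: True for finished tasks."""
        q = self.q(agent_state)                        # (d_k,)
        k = self.k(task_feats)                         # (n_tasks, d_k)
        scores = k @ q / math.sqrt(k.shape[-1])        # (n_tasks,)
        scores = scores.masked_fill(done_mask, float("-inf"))
        return torch.softmax(scores, dim=-1)           # next-task distribution

att = TaskAttention()
probs = att(torch.randn(32), torch.randn(10, 16),
            done_mask=torch.tensor([False] * 8 + [True] * 2))
next_task = torch.argmax(probs).item()                 # greedy pick at decision time
```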
|
|
16:45-16:50, Paper WeDT11.2 | |
Safe and Efficient Multi-Agent Collision Avoidance with Physics-Informed Reinforcement Learning |
|
Feng, Pu | Beihang University |
Shi, Rongye | Beihang University |
Wang, Size | Beihang University |
Liang, Junkang | Beihang University |
Yu, Xin | Beihang University |
Li, Simin | Beihang University |
Wu, Wenjun | Beihang University |
Keywords: Reinforcement Learning, Path Planning for Multiple Mobile Robots or Agents, Collision Avoidance
Abstract: Reinforcement learning (RL) has shown great promise in addressing multi-agent collision avoidance challenges. However, existing RL-based methods often suffer from low training efficiency and poor action safety. To tackle these issues, we introduce a physics-informed reinforcement learning framework equipped with two modules: a Potential Field (PF) module and a Multi-Agent Multi-Level Safety (MAMLS) module. The PF module uses the Artificial Potential Field method to compute regularization, adaptively integrating it into the critic’s loss to enhance training efficiency. The MAMLS module formulates action safety as a constrained optimization problem, deriving safe actions by solving this optimization. Furthermore, to better address the characteristics of multi-agent collision avoidance tasks, multi-agent multi-level constraints are introduced. Simulation and real-world experiments show that our physics-informed framework offers significant improvements in both training efficiency and safety-related metrics over advanced baseline methods.
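As a rough illustration of the PF module's idea (a sketch based on our reading of the abstract, not the authors' implementation), one can evaluate a classical attractive-plus-repulsive potential at each state and add a regularizer pushing critic values toward the negated potential; the constants and the exact coupling are assumptions:

```python
import torch
import torch.nn.functional as F

def apf_potential(pos, goal, obstacles, k_att=1.0, k_rep=1.0, d0=2.0):
    """Classical artificial potential field: attractive toward the goal,
    repulsive within distance d0 of each obstacle."""
    u = 0.5 * k_att * torch.sum((pos - goal) ** 2, dim=-1)
    for obs in obstacles:
        d = torch.norm(pos - obs, dim=-1).clamp(min=1e-6)
        u = u + torch.where(
            d < d0, 0.5 * k_rep * (1.0 / d - 1.0 / d0) ** 2, torch.zeros_like(d)
        )
    return u

def critic_loss(values, td_targets, pos, goal, obstacles, lam=0.1):
    """TD regression plus a physics-informed regularizer: low potential
    (near goal, far from obstacles) should correspond to high value."""
    td_loss = F.mse_loss(values, td_targets)
    reg = F.mse_loss(values, -apf_potential(pos, goal, obstacles))
    return td_loss + lam * reg

pos = torch.randn(32, 2)
goal = torch.tensor([5.0, 5.0])
obstacles = [torch.tensor([2.0, 2.0]), torch.tensor([3.0, -1.0])]
values, td_targets = torch.randn(32), torch.randn(32)
print(critic_loss(values, td_targets, pos, goal, obstacles))
```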
|
|
16:50-16:55, Paper WeDT11.3 | |
TERL: Large-Scale Multi-Target Encirclement Using Transformer-Enhanced Reinforcement Learning |
|
Zhang, Heng | Shanghai University |
Zhao, Guoxiang | Shanghai University |
Ren, Xiaoqiang | Shanghai University |
Keywords: Reinforcement Learning, Multi-Robot Systems, Task and Motion Planning
Abstract: The pursuit-evasion (PE) problem is a critical challenge in multi-robot systems (MRS). While reinforcement learning (RL) has shown its promise in addressing PE tasks, research has primarily focused on single-target pursuit, with limited exploration of multi-target encirclement, particularly in large-scale settings. This paper proposes a Transformer-Enhanced Reinforcement Learning (TERL) framework for large-scale multi-target encirclement. By integrating a transformer-based policy network with target selection, TERL enables robots to adaptively prioritize targets and coordinate safely. Results show that TERL outperforms existing RL-based methods in terms of encirclement success rate and task completion time, while maintaining good performance in large-scale scenarios. Notably, TERL, trained on small-scale scenarios (15 pursuers, 4 targets), generalizes effectively to large-scale settings (80 pursuers, 20 targets) without retraining, achieving a 100% success rate. The code and demonstration video are available at https://github.com/ApricityZ/TERL.
|
|
16:55-17:00, Paper WeDT11.4 | |
Heterogeneous Multi-Agent Learning in Isaac Lab: Scalable Simulation for Robotic Collaboration |
|
Haight, Jacob | Utah State University |
Peterson, Isaac | Utah State University |
Allred, Christopher | Utah State University |
Harper, Mario | Utah State University |
Keywords: Multi-Robot Systems, Reinforcement Learning, Cooperating Robots
Abstract: Multi-Agent Reinforcement Learning (MARL) plays a crucial role in robotic coordination and control, yet existing simulation environments often lack the fidelity and scalability needed for real-world applications. In this work, we extend Isaac Lab to support efficient training of both homogeneous and heterogeneous multi-agent robotic policies in high-fidelity physics simulations. Our contributions include the development of diverse MARL environments tailored for robotic coordination, integration of Heterogeneous Agent Reinforcement Learning (HARL) algorithms, and a scalable GPU-accelerated framework optimized for large-scale training. We evaluate our framework using two state-of-the-art MARL algorithms—Multi Agent Reinforcement Learning with Proximal Policy Optimization (MAPPO) and Heterogeneous Agent Reinforcement Learning with Proximal Policy Optimization (HAPPO)—across six challenging robotic tasks. Our results confirm the feasibility of training heterogeneous agents in high-fidelity environments while maintaining the scalability and performance benefits of Isaac Lab. By enabling realistic multi-agent learning at scale, our work lays a foundation for advancing MARL research in physics-driven robotics. The source code and demonstration videos are available at https://some45bucks.github.io/IsaacLab_HARL/.
|
|
17:00-17:05, Paper WeDT11.5 | |
Real-World Offline Reinforcement Learning from Vision Language Model Feedback |
|
Venkataraman, Sreyas | Indian Institute of Technology, Kharagpur |
Wang, Yufei | Carnegie Mellon University |
Wang, Ziyu | Tsinghua University |
Ravie, Navin Sriram | Indian Institute of Technology Madras |
Erickson, Zackory | Carnegie Mellon University |
Held, David | Carnegie Mellon University |
Keywords: Reinforcement Learning, Learning from Demonstration, Machine Learning for Robot Control
Abstract: Offline reinforcement learning can enable policy learning from pre-collected, sub-optimal datasets without online interactions. This makes it ideal for real-world robots and safety-critical scenarios, where collecting online data or expert demonstrations is slow, costly, and risky. However, most existing offline RL works assume the dataset is already labeled with the task rewards, a process that often requires significant human effort, especially when ground-truth states are hard to ascertain (e.g., in the real world). In this paper, we build on prior work, specifically RL-VLM-F, and propose a novel system that automatically generates reward labels for offline datasets using preference feedback from a vision-language model and a text description of the task. Our method then learns a policy using offline RL with the reward-labeled dataset. We demonstrate the system's applicability to a complex real-world robot-assisted dressing task, where we first learn a reward function using a vision-language model on a sub-optimal offline dataset, and then use the learned reward with Implicit Q-learning to develop an effective dressing policy. Our method also performs well in simulation tasks involving the manipulation of rigid and deformable objects, and significantly outperforms baselines such as behavior cloning and inverse RL. In summary, we propose a new system that enables automatic reward labeling and policy learning from unlabeled, sub-optimal offline datasets. Videos can be found on our project website.
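The reward-labeling step in such pipelines typically reduces to learning a reward model from pairwise preferences with a Bradley-Terry objective; the sketch below shows that generic objective, with the VLM preference source mocked and all names illustrative:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs):               # (T, obs_dim) -> (T, 1)
        return self.net(obs)

def preference_loss(rm, seg_a, seg_b, pref_a):
    """Bradley-Terry loss: pref_a is True if the (VLM) labeler
    preferred segment A over segment B."""
    ra = rm(seg_a).sum()                  # return of segment A under learned reward
    rb = rm(seg_b).sum()
    logits = torch.stack([ra, rb]).unsqueeze(0)        # (1, 2)
    target = torch.tensor([0 if pref_a else 1])
    return nn.functional.cross_entropy(logits, target)

rm = RewardModel(obs_dim=16)
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
seg_a, seg_b = torch.randn(50, 16), torch.randn(50, 16)
loss = preference_loss(rm, seg_a, seg_b, pref_a=True)  # pretend the VLM chose A
opt.zero_grad(); loss.backward(); opt.step()
```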
|
|
17:05-17:10, Paper WeDT11.6 | |
SAC(λ): Efficient Reinforcement Learning for Sparse-Reward Autonomous Car Racing Using Imperfect Demonstrations |
|
Lee, Heeseong | Seoul National University |
Sagong, Sungpyo | Seoul National University |
Lee, Minhyeong | Seoul National University |
Lee, Jeongmin | Seoul National University |
Lee, Dongjun | Seoul National University |
Keywords: Reinforcement Learning, Autonomous Vehicle Navigation, Learning from Demonstration
Abstract: Recent advances in Reinforcement Learning (RL) have demonstrated promising results in autonomous car racing. However, two fundamental challenges remain: sparse rewards, which hinder efficient learning, and the quality of demonstrations, which directly affects the effectiveness of RL from Demonstration (RLfD) approaches. To address these issues, we propose SAC(λ), a novel RLfD algorithm tailored for sparse-reward racing tasks with imperfect demonstrations. SAC(λ) introduces two key components: (1) a discriminator-augmented Q-function, which integrates prior knowledge from demonstrations into value estimation while maintaining off-policy learning benefits, and (2) a Positive-Unlabeled (PU) learning framework with adaptive prior adjustment, which enables the agent to progressively refine its understanding of positive behaviors while mitigating overfitting. Through extensive experiments in the Assetto Corsa simulator, we demonstrate that SAC(λ) significantly accelerates training, surpasses the provided demonstrations, and achieves superior lap times over existing RL and RLfD approaches. Code and videos are available at https://heesungsung.github.io/AC-RLRacer.
|
|
17:10-17:15, Paper WeDT11.7 | |
Diffusion Policies with Value-Conditional Optimization for Offline Reinforcement Learning |
|
Ma, Yunchang | National University of Defense Technology |
Liu, Tenglong | National University of Defense Technology |
Lan, Yixing | National University of Defense Technology |
Yin, Xin | College of Intelligence Science and Technology, National University of Defense Technology |
Zhang, Changxin | National University of Defense Technology |
Zhang, Xinglong | National University of Defense Technology |
Xu, Xin | National University of Defense Technology |
Keywords: Reinforcement Learning, Model Learning for Control, Machine Learning for Robot Control
Abstract: In offline reinforcement learning, value overestimation caused by out-of-distribution (OOD) actions significantly limits policy performance. Recently, diffusion models have been leveraged for their strong distribution-matching capabilities, enforcing conservatism through behavior policy constraints. However, existing methods often apply indiscriminate regularization to redundant actions in low-quality datasets, resulting in excessive conservatism and an imbalance between the expressiveness and efficiency of diffusion modeling. To address these issues, we propose DIffusion policies with Value-conditional Optimization (DIVO), a novel approach that leverages diffusion models to generate high-quality, broadly covered in-distribution state-action samples while facilitating efficient policy improvement. Specifically, DIVO introduces a binary-weighted mechanism that utilizes the advantage values of actions in the offline dataset to guide diffusion model training. This enables a more precise alignment with the dataset’s distribution while selectively expanding the boundaries of high-advantage actions. During policy improvement, DIVO dynamically filters high-return-potential actions from the diffusion model, effectively guiding the learned policy toward better performance. This approach achieves a critical balance between conservatism and explorability in offline RL. We evaluate DIVO on the D4RL benchmark and compare it against state-of-the-art baselines. Empirical results demonstrate that DIVO achieves superior performance, delivering significant improvements in average returns across locomotion tasks and outperforming existing methods in the challenging AntMaze domain, where sparse rewards pose a major difficulty.
|
|
17:15-17:20, Paper WeDT11.8 | |
Observer-Based Multi-Agent Reinforcement Learning for Pursuit-Evasion Game with Multiple Unknown Uncertainties (I) |
|
Liu, Yangyang | Shanghai University |
Liu, Chun | Shanghai University |
Meng, Yizhen | Shanghai Aerospace Control Technology Institute |
Bin, Jiang | Nanjing Univ. of Aeronautics and Astronautics |
Wang, Xiaofan | Shanghai University |
Keywords: Reinforcement Learning, Agent-Based Systems
Abstract: This paper aims to investigate the challenging problem of a multi-agent game with multiple pursuers and a single evader in an environment with multiple unknown uncertainties. A coupled approach combining decentralized observers and reinforcement learning (RL) controllers is proposed to deal with this scenario. Firstly, decentralized observers driven by auxiliary control laws are introduced to estimate the states of uncertain systems, with their best responses obtained through the adaptive dynamic programming (ADP) method. The estimated states, which reflect the actual states of the pursuers’ systems, are concurrently transmitted to the RL controllers. Subsequently, the controllers are trained with the observer-based heterogeneous-agent proximal policy optimization (OHAPPO) algorithm, in which a novel global multi-function cost is designed. The algorithm uses advantage decomposition for policy updates as a form of credit assignment, resulting in more stable and efficient updates than traditional value decomposition. Moreover, to further enhance the performance of both observers and controllers, a sequential game is established between them, where observers’ policies are influenced by controllers’ optimal control and vice versa. Finally, the simulation results verify the effectiveness of the designed OHAPPO algorithm in the pursuit-evasion game.
|
|
WeDT12 |
311B |
Robotic Imitation Learning 4 |
Regular Session |
Chair: Bıyık, Erdem | University of Southern California |
|
16:40-16:45, Paper WeDT12.1 | |
Action Tokenizer Matters in In-Context Imitation Learning |
|
Vuong, An Dinh | MBZUAI |
Vu, Minh Nhat | TU Wien, Austria |
An, Dong | MBZUAI |
Reid, Ian | University of Adelaide |
Keywords: Imitation Learning, Learning from Demonstration, Manipulation Planning
Abstract: In-context imitation learning (ICIL) is a new paradigm that enables robots to generalize from demonstrations to unseen tasks without retraining. A well-structured action representation is key to capturing demonstration information effectively, yet action tokenization (the process of discretizing and encoding actions) remains largely unexplored in ICIL. In this work, we first systematically evaluate existing action tokenization methods in ICIL and reveal a critical limitation: while they effectively encode action trajectories, they fail to preserve temporal smoothness, which is crucial for stable robotic execution. To address this, we propose LipVQ-VAE, a variational autoencoder that enforces the Lipschitz condition in the latent action space via weight normalization. By propagating smoothness constraints from raw action inputs to a quantized latent codebook, LipVQ-VAE generates more stable and smoother actions. When integrated into ICIL, LipVQ-VAE improves performance by more than 5.3% in high-fidelity simulators, with real-world experiments further confirming its ability to produce smoother, more reliable trajectories. Code and checkpoints will be released.
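One simple way to enforce a Lipschitz bound on an encoder is to rescale each linear layer's weight matrix so its norm never exceeds 1. The sketch below uses the Frobenius norm as a cheap upper bound on the spectral norm, which may well differ from the paper's exact normalization:

```python
import torch
import torch.nn as nn

class LipschitzLinear(nn.Module):
    """Linear layer whose effective weight is renormalized in the forward
    pass so the layer is (at most) 1-Lipschitz."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.1)
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        # Divide by the weight norm whenever it exceeds 1, leaving
        # already-contractive weights untouched.
        norm = self.weight.norm()          # Frobenius upper-bounds spectral norm
        w = self.weight / torch.clamp(norm, min=1.0)
        return nn.functional.linear(x, w, self.bias)

# Tanh is 1-Lipschitz, so the whole encoder is 1-Lipschitz by composition.
enc = nn.Sequential(LipschitzLinear(7, 64), nn.Tanh(), LipschitzLinear(64, 16))
a0, a1 = torch.randn(7), torch.randn(7)
z0, z1 = enc(a0), enc(a1)
# Nearby raw actions map to nearby latents: ||z0 - z1|| <= ||a0 - a1||.
assert (z0 - z1).norm() <= (a0 - a1).norm() + 1e-5
```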
|
|
16:45-16:50, Paper WeDT12.2 | |
Robust Offline Imitation Learning through State-Level Trajectory Stitching |
|
Wang, Shuze | Beijing Institute of Technology |
Mei, Yunpeng | Beijing Institute of Technology |
Cao, Hongjie | Beijing Institute of Technology |
Yuan, Yetian | Beijing Institute of Technology |
Wang, Gang | Beijing Institute of Technology |
Jian, Sun | Beijing Institute of Technology |
Chen, Jie | Tongji University |
Keywords: Imitation Learning, Learning from Demonstration, AI-Based Methods
Abstract: Imitation learning (IL) has proven effective for enabling robots to acquire visuomotor skills through expert demonstrations. However, traditional IL methods are limited by their reliance on high-quality, often scarce, expert data, and suffer from covariate shift. To address these challenges, recent advances in offline IL have incorporated suboptimal, unlabeled datasets into training. In this paper, we propose a novel approach to enhance policy learning from mixed-quality offline datasets by leveraging task-relevant trajectory fragments and rich environmental dynamics. Specifically, we introduce a state-based search framework that stitches state-action pairs from imperfect demonstrations, generating more diverse and informative training trajectories. Experimental results on standard IL benchmarks and real-world robotic tasks show that our proposed method significantly improves both generalization and performance.
|
|
16:50-16:55, Paper WeDT12.3 | |
GABRIL: Gaze-Based Regularization for Mitigating Causal Confusion in Imitation Learning |
|
Banayeeanzade, Amin | University of Southern California |
Bahrani, Fatemeh | University of Southern California |
Zhou, Yutai | University of Southern California |
Bıyık, Erdem | University of Southern California |
Keywords: Learning from Demonstration, Imitation Learning, Representation Learning
Abstract: Imitation Learning (IL) is a widely adopted approach which enables agents to learn from human expert demonstrations by framing the task as a supervised learning problem. However, IL often suffers from causal confusion, where agents misinterpret spurious correlations as causal relationships, leading to poor performance in testing environments with distribution shift. To address this issue, we introduce GAze-Based Regularization in Imitation Learning (GABRIL), a novel method that leverages human gaze data gathered during the data collection phase to guide representation learning in IL. GABRIL utilizes a regularization loss which encourages the model to focus on causally relevant features identified through expert gaze and consequently mitigates the effects of confounding variables. We validate our approach in Atari environments and the Bench2Drive benchmark in CARLA by collecting human gaze datasets and applying our method in both domains. Experimental results show that GABRIL's improvement over behavior cloning is around 179% larger than that of other baselines in the Atari environments and 76% larger in the CARLA setup. Finally, we show that our method provides extra explainability when compared to regular IL agents. The datasets, the code, and some experiment videos are publicly available at https://liralab.usc.edu/gabril .
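A common way to implement gaze-based regularization (our hedged reading of the idea, not necessarily GABRIL's exact loss) is to add a divergence term pulling the network's spatial attention map toward the recorded human gaze heatmap:

```python
import torch
import torch.nn.functional as F

def gaze_regularized_bc_loss(action_logits, expert_actions,
                             attn_map, gaze_map, lam=1.0):
    """Behavior cloning loss plus a KL term between the model's spatial
    attention and the human gaze heatmap (both normalized to sum to 1).

    attn_map, gaze_map: (B, H, W) non-negative saliency maps.
    """
    bc = F.cross_entropy(action_logits, expert_actions)
    p = gaze_map.flatten(1)
    p = p / p.sum(dim=1, keepdim=True).clamp(min=1e-8)
    q = attn_map.flatten(1)
    q = q / q.sum(dim=1, keepdim=True).clamp(min=1e-8)
    # KL(gaze || attention): penalize attention mass missing from gazed regions.
    kl = (p * (p.clamp(min=1e-8).log() - q.clamp(min=1e-8).log())).sum(dim=1).mean()
    return bc + lam * kl

logits = torch.randn(4, 6)
acts = torch.randint(0, 6, (4,))
attn = torch.rand(4, 84, 84)
gaze = torch.rand(4, 84, 84)
print(gaze_regularized_bc_loss(logits, acts, attn, gaze))
```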
|
|
16:55-17:00, Paper WeDT12.4 | |
FoAR: Force-Aware Reactive Policy for Contact-Rich Robotic Manipulation |
|
He, Zihao | Shanghai Jiao Tong University |
Fang, Hongjie | Shanghai Jiao Tong University |
Chen, Jingjing | Shanghai Jiao Tong University |
Fang, Hao-Shu | Massachusetts Institute of Technology |
Lu, Cewu | ShangHai Jiao Tong University |
Keywords: Imitation Learning, Force and Tactile Sensing, Perception for Grasping and Manipulation
Abstract: Contact-rich tasks present significant challenges for robotic manipulation policies due to the complex dynamics of contact and the need for precise control. Vision-based policies often struggle with the skill required for such tasks, as they typically lack critical contact feedback modalities like force/torque information. To address this issue, we propose FoAR, a force-aware reactive policy that combines high-frequency force/torque sensing with visual inputs to enhance the performance in contact-rich manipulation. Built upon the RISE policy, FoAR incorporates a multimodal feature fusion mechanism guided by a future contact predictor, enabling dynamic adjustment of force/torque data usage between non-contact and contact phases. Its reactive control strategy also allows FoAR to accomplish contact-rich tasks accurately through simple position control. Experimental results demonstrate that FoAR significantly outperforms all baselines across various challenging contact-rich tasks while maintaining robust performance under unexpected dynamic disturbances. Project website: https://tonyfang.net/FoAR/.
|
|
17:00-17:05, Paper WeDT12.5 | |
Inverse Model Predictive Control: Learning Optimal Control Cost Functions for MPC (I) |
|
Zhang, Fawang | Beijing Institute of Technology |
Duan, Jingliang | University of Science and Technology Beijing |
Liu, Hui | Beijing Institute of Technology |
Nie, Shida | Beijing Institute of Technology |
Xie, Yujia | Beijing Institute of Technology |
Guo, Congshuai | Beijing Institute of Technology |
Keywords: Imitation Learning, Learning from Demonstration, Wheeled Robots
Abstract: Inverse optimal control (IOC) seeks to infer a control cost function that captures the underlying goals and preferences of expert demonstrations. While significant progress has been made in finite-horizon IOC, which focuses on learning control cost functions based on rollout trajectories rather than actual trajectories, the application of IOC to receding horizon control, also known as model predictive control (MPC), has been overlooked. MPC is more prevalent in practical settings and poses additional challenges for IOC learning since it is complicated to calculate the gradient of actual trajectories with respect to cost parameters. In light of this, we propose the inverse MPC (IMPC) method to identify the optimal cost function that effectively minimizes the discrepancy between the actual trajectory and its associated demonstration. To compute the gradient of actual trajectories with respect to cost parameters, we first establish two differential Pontryagin's Maximum Principle (PMP) conditions by differentiating the traditional PMP conditions with respect to cost parameters and initial states, respectively. We then formulate two auxiliary optimal control problems based on the derived differentiated PMP conditions, whose solutions can be directly used to determine the gradient for updating cost parameters. We validate the efficacy of the proposed method through experiments involving five simulation tasks and two real-world mobile robot control tasks. The results consistently demonstrate that IMPC outperforms existing finite-horizon IOC methods across all experiments.
|
|
17:05-17:10, Paper WeDT12.6 | |
I-CTRL: Imitation to Control Humanoid Robots through Bounded Residual Reinforcement Learning |
|
Yan, Yashuai | Vienna University of Technology |
Valls Mascaro, Esteve | Technische Universitat Wien |
Egle, Tobias | TU Wien |
Lee, Dongheui | Technische Universität Wien (TU Wien) |
Keywords: Imitation Learning, Reinforcement Learning, Humanoid and Bipedal Locomotion
Abstract: Humanoid robots have the potential to mimic human motions with high visual fidelity, yet translating these motions into practical, physical execution remains a significant challenge. Existing techniques in the graphics community often prioritize visual fidelity over physics-based feasibility, posing a significant challenge for deploying bipedal systems in practical applications. This paper addresses these issues by introducing a constrained reinforcement learning algorithm to produce physics-based, high-quality motion imitation on legged humanoid robots that enhances motion resemblance while successfully following the reference human trajectory. Our framework, Imitation to Control Humanoid Robots Through Constraint Reinforcement Learning (I-CTRL), reformulates motion imitation as a constrained refinement over non-physics-based retargeted motions. I-CTRL excels in motion imitation with simple and unique rewards that generalize across four robots. Moreover, our framework can follow large-scale motion datasets with a unique RL agent. The proposed approach signifies a crucial step forward in advancing the control of bipedal robots, emphasizing the importance of aligning visual fidelity with physical feasibility.
|
|
17:10-17:15, Paper WeDT12.7 | |
Interactive Incremental Learning of Generalizable Skills with Local Trajectory Modulation |
|
Knauer, Markus | German Aerospace Center (DLR) |
Albu-Schäffer, Alin | DLR - German Aerospace Center |
Stulp, Freek | DLR - Deutsches Zentrum Für Luft Und Raumfahrt E.V |
Silvério, João | German Aerospace Center (DLR) |
Keywords: Incremental Learning, Imitation Learning, Continual Learning
Abstract: The problem of generalization in learning from demonstration (LfD) has received considerable attention over the years, particularly within the context of movement primitives, where a number of approaches have emerged. Recently, two important approaches have gained recognition. While one leverages via-points to adapt skills locally by modulating demonstrated trajectories, another relies on so-called task-parameterized (TP) models that encode movements with respect to different coordinate systems, using a product of probabilities for generalization. While the former are well-suited to precise, local modulations, the latter aim at generalizing over large regions of the workspace and often involve multiple objects. Addressing the quality of generalization by leveraging both approaches simultaneously has received little attention. In this work, we propose an interactive imitation learning framework that simultaneously leverages local and global modulations of trajectory distributions. Building on the kernelized movement primitives (KMP) framework, we introduce novel mechanisms for skill modulation from direct human corrective feedback. Our approach particularly exploits the concept of via-points to incrementally and interactively 1) improve the model accuracy locally, 2) add new objects to the task during execution and 3) extend the skill into regions where demonstrations were not provided. We evaluate our method on a bearing ring-loading task using a torque-controlled, 7-DoF, DLR SARA robot.
|
|
17:15-17:20, Paper WeDT12.8 | |
RwoR: Generating Robot Demonstrations from Human Hand Collection for Policy Learning without Robot |
|
Heng, Liang | Peking University |
Li, Xiaoqi | Peking University |
Mao, Shangqing | Peking University |
Liu, Jiaming | Peking University |
Liu, Ruolin | Peking University |
Wei, Jingli | Peking University |
Wang, Yu-Kai | Peking University |
Yueru, Jia | Peking University |
Gu, Chenyang | Peking University |
Zhao, Rui | Tencent |
Zhang, Shanghang | Peking University |
Dong, Hao | Peking University |
Keywords: Learning from Demonstration, Imitation Learning, Deep Learning in Grasping and Manipulation
Abstract: Recent advancements in imitation learning have shown promising results in robotic manipulation, driven by the availability of high-quality training data. To improve data collection efficiency, some approaches focus on developing specialized teleoperation devices for robot control, while others directly use human hand demonstrations to obtain training data. However, the former requires both a robotic system and a skilled operator, limiting scalability, while the latter faces challenges in aligning the visual gap between human hand demonstrations and the deployed robot observations. To address this, we propose a human hand data collection system combined with our hand-to-gripper generative model, which translates human hand demonstrations into robot gripper demonstrations, effectively bridging the observation gap. Specifically, a GoPro fisheye camera is mounted on the human wrist to capture human hand demonstrations. We then train a generative model on a self-collected dataset of paired human hand and UMI gripper demonstrations, which have been processed using a tailored data pre-processing strategy to ensure alignment in both timestamps and observations. Therefore, given only human hand demonstrations, we are able to automatically extract the corresponding SE(3) actions and integrate them with high-quality generated robot demonstrations through our generation pipeline for training robotic policy model. In experiments, the robust manipulation performance demonstrates not only the quality of the generated robot demonstrations but also the efficiency and practicality of our data collection method.
|
|
WeDT13 |
311C |
Deep Learning for Visual Perception 8 |
Regular Session |
|
16:40-16:45, Paper WeDT13.1 | |
LIM: A Low-Complexity Local Feature Image Matching Network for Real-Time Embedded Applications |
|
Ying, Shanquan | Ningbo University |
Zhao, Jianfeng | Ningbo University |
Dai, Junjie | Ningbo University |
Keywords: Deep Learning for Visual Perception, Deep Learning Methods, Omnidirectional Vision
Abstract: Image matching is a fundamental task in computer vision, underpinning applications such as visual localization and structure-from-motion. While deep convolutional neural network (CNN)-based approaches have achieved high detection accuracy, their high computational cost limits their deployment on resource-constrained platforms such as mobile and embedded systems. This paper presents a lightweight image matching network that achieves a favorable trade-off between accuracy and efficiency. The proposed model further enhances robustness to large image rotations, a common challenge in aerial and robotics applications. Extensive experiments demonstrate that our method maintains competitive accuracy while significantly reducing inference time compared to existing CNN-based approaches.
|
|
16:45-16:50, Paper WeDT13.2 | |
SparseMeXt: Unlocking the Potential of Sparse Representations for HD Map Construction |
|
Jiang, Anqing | Robert Bosch |
Chai, Jinhao | Shanghai University |
Gao, Yu | Robert Bosch GmbH |
Wang, Yiru | Bosch |
Heng, Yuwen | Bosch Corporate Research |
Sun, Zhigang | Bosch Research |
Sun, Hao | National University of Singapore |
Sun, Li | University of Sheffield |
Zhao, Zezhong | Bosch XC |
Zhou, Jian | Bosch XC |
Zhu, LiJuan | Bosch CR |
Zhao, Hao | Tsinghua University |
Xu, Shugong | Xi'an Jiaotong-Liverpool University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Object Detection, Segmentation and Categorization
Abstract: Recent advancements in high-definition (HD) map construction have demonstrated the effectiveness of dense representations, which heavily rely on computationally intensive bird’s-eye view (BEV) features. While sparse representations offer a more efficient alternative by avoiding dense BEV processing, existing methods often lag behind due to the lack of tailored designs. These limitations have hindered the competitiveness of sparse representations in online HD map construction. In this work, we systematically revisit and enhance sparse representation techniques, identifying key architectural and algorithmic improvements that bridge the gap with—and ultimately surpass—dense approaches. We introduce a dedicated network architecture optimized for sparse map feature extraction, a sparse-dense segmentation auxiliary task to better leverage geometric and semantic cues, and a denoising module guided by physical priors to refine predictions. Through these enhancements, our method achieves state-of-the-art performance on the nuScenes dataset, significantly advancing HD map construction and centerline detection. Specifically, SparseMeXt-Tiny reaches a mean average precision (mAP) of 55.5% at 32 frames per second (fps), while SparseMeXt-Base attains 65.2% mAP. Scaling the backbone and decoder further, SparseMeXt-Large achieves an mAP of 68.9% at over 20 fps, establishing a new benchmark for sparse representations in HD map construction. These results underscore the untapped potential of sparse methods, challenging the conventional reliance on dense representations and redefining efficiency-performance trade-offs in the field.
|
|
16:50-16:55, Paper WeDT13.3 | |
EdgeSR: Reparameterization-Driven Fast Thermal Super-Resolution for Edge Electro-Optical Device |
|
Fu, Changhong | Tongji University |
Lu, Ziyu | Tongji University |
Li, Mengyuan | Tongji University |
Zhang, Zijie | Tongji University |
Zuo, Haobo | University of Hong Kong |
Keywords: Deep Learning for Visual Perception, AI-Based Methods, Data Sets for Robotic Vision
Abstract: Super-resolution (SR) can greatly promote the development of edge electro-optical (EO) devices. However, most existing SR models struggle to simultaneously achieve effective thermal reconstruction and real-time inference on edge EO devices with limited computing resources. To address these issues, this work proposes a novel fast thermal SR model (EdgeSR) for edge EO devices. Specifically, reparameterized scale-integrated convolutions (RepSConv) are proposed to deeply explore high-frequency features, incorporating multiscale information and enhancing the scale-awareness of the backbone during the training phase. Furthermore, an interactive reparameterization module (IRM), combining historical high-frequency with low-frequency information, is introduced to guide the extraction of high-frequency features, ultimately boosting the high-quality reconstruction of thermal images. Edge EO deployment-oriented reparameterization (EEDR) is designed to reparameterize all modules into standard convolutions that are hardware-friendly for edge EO devices and enable onboard real-time inference. Additionally, a new benchmark for thermal SR on cityscapes (CS-TSR) is built. The experimental results on this benchmark show that, compared to state-of-the-art lightweight SR networks, EdgeSR delivers superior reconstruction quality and faster inference speed on edge EO devices. In real-world applications, EdgeSR exhibits robust performance on edge EO devices, making it suitable for real-world deployment. The code and demo are available at https://github.com/vision4robotics/EdgeSR.
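Structural reparameterization of this kind generally follows the RepVGG recipe: train with parallel 3x3, 1x1, and identity branches, then fold them into a single 3x3 convolution for inference. A minimal sketch of that folding (without BatchNorm, and not EdgeSR's exact modules):

```python
import torch
import torch.nn.functional as F

def merge_branches(w3, b3, w1, b1, channels):
    """Fold parallel 3x3 conv, 1x1 conv, and identity branches into one
    3x3 kernel. Shapes: w3 (C, C, 3, 3), w1 (C, C, 1, 1)."""
    w1_padded = F.pad(w1, [1, 1, 1, 1])            # embed 1x1 at kernel center
    w_id = torch.zeros_like(w3)
    for c in range(channels):                      # identity as a Dirac kernel
        w_id[c, c, 1, 1] = 1.0
    return w3 + w1_padded + w_id, b3 + b1

C = 8
w3, b3 = torch.randn(C, C, 3, 3), torch.randn(C)
w1, b1 = torch.randn(C, C, 1, 1), torch.randn(C)
x = torch.randn(1, C, 16, 16)

# Training-time multi-branch output...
y_train = F.conv2d(x, w3, b3, padding=1) + F.conv2d(x, w1, b1) + x
# ...equals the single merged conv used at inference time.
w, b = merge_branches(w3, b3, w1, b1, C)
y_infer = F.conv2d(x, w, b, padding=1)
assert torch.allclose(y_train, y_infer, atol=1e-4)
```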
|
|
16:55-17:00, Paper WeDT13.4 | |
Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models |
|
Abdullahi Moallim Mohamud, Safaa | Kyungpook National University |
Baek, Minjin | Kyungpook National University |
Han, Dong Seog | Kyungpook National University |
Keywords: Deep Learning for Visual Perception, Deep Learning Methods, Human-Centered Robotics
Abstract: In this paper, we present a hierarchical question-answering (QA) approach for scene understanding in autonomous vehicles, balancing cost-efficiency with detailed visual interpretation. The method fine-tunes a compact vision-language model (VLM) on a custom dataset specific to the geographical area in which the vehicle operates to capture key driving-related visual elements. At the inference stage, the hierarchical QA strategy decomposes the scene understanding task into high-level and detailed sub-questions. Instead of generating lengthy descriptions, the VLM navigates a structured question tree, where answering high-level questions (e.g., "Is it possible for the ego vehicle to turn left at the intersection?") triggers more detailed sub-questions (e.g., "Is there a vehicle approaching the intersection from the opposite direction?"). To optimize inference time, questions are dynamically skipped based on previous answers, minimizing computational overhead. The extracted answers are then synthesized using handcrafted templates to ensure coherent, contextually accurate scene descriptions. We evaluate the proposed approach on the custom dataset using GPT reference-free scoring, demonstrating its competitiveness with state-of-the-art methods like GPT-4o in capturing key scene details while achieving significantly lower inference time. Moreover, qualitative results from real-time deployment highlight the proposed approach's capacity to capture key driving elements with minimal latency. The code is available at https://github.com/knu-citac/Hierarchical_QA.
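The hierarchical strategy amounts to walking a question tree whose children are visited only when the parent's answer warrants it; the sketch below shows that control flow with a stubbed VLM (the questions and the vlm_answer function are illustrative placeholders, not the paper's dataset):

```python
def vlm_answer(image, question):
    """Stub for the fine-tuned VLM; returns 'yes' or 'no'."""
    return "yes"

# Each node: (question, children asked only if the answer is 'yes').
QUESTION_TREE = [
    ("Is it possible for the ego vehicle to turn left at the intersection?", [
        ("Is there a vehicle approaching the intersection from the opposite direction?", []),
        ("Is a pedestrian waiting at the left crosswalk?", []),
    ]),
    ("Is there a traffic light visible?", [
        ("Is the traffic light red?", []),
    ]),
]

def run_tree(image, nodes):
    answers = []
    for question, children in nodes:
        ans = vlm_answer(image, question)
        answers.append((question, ans))
        if ans == "yes" and children:        # skip whole subtrees on 'no'
            answers += run_tree(image, children)
    return answers

# The collected (question, answer) pairs would then be slotted into
# handcrafted templates to form the final scene description.
for q, a in run_tree(image=None, nodes=QUESTION_TREE):
    print(f"{q} -> {a}")
```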
|
|
17:00-17:05, Paper WeDT13.5 | |
HPLaw: Heterogeneous Parallel LiDARs for Adverse Weather in V2V |
|
Liu, Yuhang | Chinese Academy of Science |
Ma, Xinyue | Tsinghua University |
Wang, Xingxia | University of Chinese Academy of Sciences |
Boyi, Sun | Institute of Automation, Chinese Academy of Sciences |
Wang, Yutong | Institute of Automation, Chinese Academy of Sciences |
Fenghua, Zhu | Chinese Academy of Sciences, Beijing |
Wang, Feiyue | Institute of Automation, Chinese Academy of Sciences |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Sensor Fusion
Abstract: Parallel LiDAR emerges as an innovative framework for next-generation intelligent LiDAR systems in autonomous driving. In parallel LiDAR research, V2V (Vehicle-to-Vehicle) cooperative perception is a promising technology which can effectively enhance perception range and accuracy through inter-agent information exchange. Currently, sensor heterogeneity remains a critical challenge in V2V. Although some work has made initial attempts to address this issue, existing studies are primarily conducted under ideal clear-weather conditions, ignoring the impact of variable weather factors in real-world applications. In fact, adverse weather has been shown to significantly degrade the performance of LiDAR systems, with the risk of cumulative degradation in V2V. To address this challenge, we first introduce OPV2V-W and V2V4Real-W as new benchmarks to study sensor heterogeneity in V2V under adverse weather. Then we propose the HPLaw architecture (Heterogeneous Parallel LiDARs for Adverse Weather), a self-knowledge distillation method designed to enhance model robustness across varying weather scenarios. HPLaw employs an efficient PF network to facilitate heterogeneous feature fusion and incorporates an SAKD module to extract weather-invariant features. Extensive experiments demonstrate that the student model in HPLaw achieves outstanding performance under all weather conditions, exhibiting remarkable robustness.
|
|
17:05-17:10, Paper WeDT13.6 | |
3D-AMTA: Occlusion-Aware Real-Time 3D Hand Pose Estimation with Auto Mask and Token-Specific Attention |
|
Zhao, Dongfang | Samsung |
Zhang, Menghe | Samsung |
Liang, Yangwen | Samsung Semiconductor Inc |
Wang, Shuangquan | Samsung |
Song, Kee-Bong | Samsung |
Kim, Donghoon | Samsung |
Keywords: Deep Learning for Visual Perception, Touch in HRI, Deep Learning Methods
Abstract: Understanding hand motion from a single RGB image is challenging due to occlusions and high articulation. This paper presents 3D-AMTA, a transformer-based framework with Auto Mask and Token-specific Attention for occlusion-aware 3D hand pose estimation (HPE). It introduces two novel architectural enhancements: an auto mask for high-occlusion scenarios and token-specific attention for fine-grained hand articulation. These modules integrate seamlessly into transformer-based architectures and enhance real-time performance in interactive systems. To enable efficient deployment on robotic and embedded platforms, we propose 3D-AMTA-Mobile, a lightweight variant optimized for on-device processing. It achieves 267 FPS on an NVIDIA RTX 2080 Ti GPU while maintaining high accuracy, making it well-suited for resource-constrained robotic applications. Extensive evaluations on FreiHAND and HO3D demonstrate that our approach consistently outperforms state-of-the-art methods in terms of accuracy, efficiency, and inference speed. These advancements contribute to robust hand perception for interactive robotics and AR-based teleoperation.
|
|
17:10-17:15, Paper WeDT13.7 | |
Depth Estimation Based on Fisheye Cameras |
|
Zhou, Yuwei | Rochester Institute of Technology |
Lu, Guoyu | University of Georgia |
Keywords: Deep Learning for Visual Perception
Abstract: Fisheye cameras, with their ultra-wide field of view, offer significant benefits for depth estimation in applications such as autonomous navigation, robotics, and immersive imaging by capturing more scene content from a single viewpoint. However, their strong radial distortion and varying spatial resolution across the image pose substantial challenges for accurate depth prediction. We present a deep learning–based framework for fisheye depth estimation that addresses these challenges while leveraging the wide coverage advantage. During training, rectified and synchronized stereo image pairs are used, with the right image and an estimated initial depth map reconstructing the left image. A refined spatial consistency loss is formulated by combining Structural Similarity Index Measure (SSIM) and L1 loss, with gradient-based weighting to emphasize disparity edges. To overcome the limitations of photometric loss in disparity learning, we normalize pixel intensities to better correlate disparity with appearance features. A fisheye-specific depth refinement module incorporates an uncertainty map derived from an inconsistency mask and a distortion distribution map, mitigating the effects of occlusion and high-distortion regions. This uncertainty map is used to weight the temporal warping loss, enhancing robustness against distortion-prone areas. During inference, only a single fisheye image is required to produce an accurate depth map. Experimental results demonstrate that our method improves reconstruction fidelity and robustness, making it well-suited for real-world fisheye-based depth estimation tasks.
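The SSIM-plus-L1 combination referenced here is the standard photometric loss from self-supervised depth estimation; a compact version, with the usual 0.85/0.15 weighting as an assumed default and a simplified pooled SSIM, looks like this:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over 3x3 average-pooled local statistics.
    x, y: (B, C, H, W) images scaled to [0, 1]."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def photometric_loss(pred, target, alpha=0.85):
    """Weighted mix of structural dissimilarity and per-pixel L1."""
    dssim = (1.0 - ssim(pred, target)) / 2.0
    l1 = (pred - target).abs()
    return (alpha * dssim + (1 - alpha) * l1).mean()

left = torch.rand(2, 3, 64, 64)
left_reconstructed = torch.rand(2, 3, 64, 64)   # e.g., warped from the right view
print(photometric_loss(left_reconstructed, left))
```

The paper additionally weights this loss by image gradients to emphasize disparity edges; that per-pixel weighting would multiply the term inside the final mean.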
|
|
17:15-17:20, Paper WeDT13.8 | |
4D-ROLLS: 4D Radar Occupancy Learning Via Lidar Supervision |
|
Liu, Ruihan | Harbin Institute of Technology |
Wu, Xiaoyi | Harbin Institute of Technology, Shenzhen |
Chen, Xijun | Harbin Institute of Technology, Shenzhen |
Hu, Liang | Harbin Institute of Technology, Shenzhen |
Lou, Yunjiang | Harbin Institute of Technology, Shenzhen |
Keywords: Deep Learning for Visual Perception, Mapping
Abstract: A comprehensive understanding of 3D scenes is essential for autonomous vehicles (AVs), and among various perception tasks, occupancy estimation plays a central role by providing a general representation of drivable and occupied space. However, most existing occupancy estimation methods rely on LiDAR or cameras, which perform poorly in degraded environments such as smoke, rain, snow, and fog. In this paper, we propose 4D-ROLLS, the first weakly supervised occupancy estimation method for 4D radar using the LiDAR point cloud as the supervisory signal. Specifically, we introduce a method for generating pseudo-LiDAR labels, including occupancy queries and LiDAR height maps, as multi-stage supervision to train the 4D radar occupancy estimation model. The model is then aligned with the occupancy map produced by LiDAR, fine-tuning its accuracy in occupancy estimation. Extensive comparative experiments validate the exceptional performance of 4D-ROLLS. Its robustness in degraded environments and effectiveness in cross-dataset training are qualitatively demonstrated. The model is also seamlessly transferred to the downstream tasks of BEV segmentation and point cloud occupancy prediction, highlighting its potential for broader applications. The lightweight network enables the 4D-ROLLS model to achieve fast inference speeds of about 30 Hz on a 4060 GPU. The code of 4D-ROLLS will be made available at https://github.com/CLASS-Lab/4D-ROLLS.
|
|
WeDT14 |
311D |
Deep Learning Methods 5 |
Regular Session |
Co-Chair: Bian, Gui-Bin | Institute of Automation, Chinese Academy of Sciences |
|
16:40-16:45, Paper WeDT14.1 | |
DRL-DCLP: A Deep Reinforcement Learning-Based Dimension-Configurable Local Planner for Robot Navigation |
|
Zhang, Wei | Eastern Institute of Technology, Ningbo |
Wang, Shanze | The Hong Kong Polytechnic University |
Tan, Mingao | Eastern Institute of Technology, Ningbo |
Yang, Zhibo | National University of Singapore |
Wang, Xianghui | Eastern Institute of Technology, Ningbo, China |
Shen, Xiaoyu | Eastern Institute of Technology, Ningbo, China |
Keywords: AI-Enabled Robotics, Motion Control, Collision Avoidance
Abstract: In this paper, we present a deep reinforcement learning-based dimension-configurable local planner (DRL-DCLP) for solving robot navigation problems. DRL-DCLP is the first neural-network local planner capable of handling rectangular differential-drive robots with varying dimension configurations without requiring post-fine-tuning. While DRL has shown excellent performance in enabling robots to navigate complex environments, it faces a significant limitation compared to conventional local planners: dimension-specificity. This constraint implies that a trained controller for a specific configuration cannot be generalized to robots with different physical dimensions, velocity ranges, or acceleration limits. To overcome this limitation, we introduce a dimension-configurable input representation and a novel learning curriculum for training the navigation agent. Extensive experiments demonstrate that DRL-DCLP facilitates successful navigation for robots with diverse dimensional configurations, achieving superior performance across various navigation tasks.
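Making a single policy dimension-configurable usually comes down to feeding the robot's physical parameters into the observation itself, so the network conditions on them; a minimal illustration, where the field names and normalization ranges are assumptions rather than the paper's representation:

```python
import numpy as np

def build_observation(lidar_scan, goal_rel, robot_cfg):
    """Concatenate task observations with the robot's own dimension
    configuration so one policy can serve many robot shapes.

    robot_cfg: dict with half-length/half-width (m), max linear
    velocity (m/s), and max angular velocity (rad/s).
    """
    cfg_vec = np.array([
        robot_cfg["half_length"] / 1.0,     # normalize by an assumed max of 1.0 m
        robot_cfg["half_width"] / 1.0,
        robot_cfg["v_max"] / 2.0,           # assumed max 2.0 m/s
        robot_cfg["w_max"] / 3.0,           # assumed max 3.0 rad/s
    ], dtype=np.float32)
    return np.concatenate([lidar_scan, goal_rel, cfg_vec])

obs = build_observation(
    lidar_scan=np.random.rand(180).astype(np.float32),
    goal_rel=np.array([3.2, -1.1], dtype=np.float32),
    robot_cfg={"half_length": 0.45, "half_width": 0.30, "v_max": 1.0, "w_max": 1.5},
)
print(obs.shape)   # same policy input regardless of which robot is deployed
```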
|
|
16:45-16:50, Paper WeDT14.2 | |
Prediction of Delay-Free Scene for Quadruped Robot Teleoperation: Integrating Delayed Data with User Commands |
|
Ha, Seunghyeon | Dongguk University |
Kim, Seongyong | Dongguk University |
Lim, Soo-Chul | Dongguk University |
Keywords: Deep Learning Methods, Telerobotics and Teleoperation, Visual Learning
Abstract: Teleoperation systems are utilized in various controllable systems, including vehicles, manipulators, and quadruped robots. However, during teleoperation, communication delays can cause users to receive delayed feedback, which reduces controllability and increases the risk faced by the remote robot. To address this issue, we propose a delay-free video generation model based on user commands that allows users to receive real-time feedback despite communication delays. Our model predicts delay-free video by integrating delayed data (video, point cloud, and robot status) from the robot with the user's real-time commands. The LiDAR point cloud data, which is part of the delayed data, is used to predict the contents of areas outside the camera frame during robot rotation. We constructed our proposed model by modifying the transformer-based video prediction model VPTR-NAR to effectively integrate these data. For our experiments, we acquired a navigation dataset from a quadruped robot, and this dataset was used to train and test our proposed model. We evaluated the model's performance by comparing it with existing video prediction models and conducting an ablation study to verify the effectiveness of its utilization of command and point cloud data.
|
|
16:50-16:55, Paper WeDT14.3 | |
Spatiotemporal Dual-Stream Network for Visual Odometry |
|
Xu, Chang | Beijing Normal University |
Zeng, Taiping | Fudan University |
Luo, Yifan | Hangzhou Institute for Advanced Study, UCAS |
Song, Fei | Shenyang Institute of Automation Chinese Academy of Sciences |
Si, Bailu | Beijing Normal University |
Keywords: Deep Learning Methods, Deep Learning for Visual Perception
Abstract: Visual Odometry (VO) empowers robots with the ability to perform self-localization within unknown environments using visual cues, yet it faces challenges in dynamic environments. In this study, we propose a novel monocular visual odometry network called Spatiotemporal Dual-stream Network (STDN-VO) with two parallel streams, i.e., a spatial stream and a temporal stream, to model spatiotemporal correlation in image sequences. Technically, the spatial stream is responsible for extracting global context information from an image, while the temporal stream is designed to effectively extract robust temporal context information from consecutive frames. The outputs of the spatial stream and the temporal stream are merged and then fed to a pose head for predicting the relative pose. Experimental results on the KITTI dataset demonstrate competitive pose estimation performance exceeding published deep learning-based methods. These results underscore the effectiveness of the proposed framework for visual odometry.
|
|
16:55-17:00, Paper WeDT14.4 | |
ConditionNET: Learning Preconditions and Effects for Execution Monitoring |
|
Sliwowski, Daniel | TU Wien |
Lee, Dongheui | Technische Universität Wien (TU Wien) |
Keywords: Deep Learning Methods, Data Sets for Robot Learning, Deep Learning for Visual Perception
Abstract: The introduction of robots into everyday scenarios necessitates algorithms capable of monitoring the execution of tasks. In this paper, we propose ConditionNET, an approach for learning the preconditions and effects of actions in a fully data-driven manner. We develop an efficient vision-language model and introduce additional optimization objectives during training to optimize for consistent feature representations. ConditionNET explicitly models the dependencies between actions, preconditions, and effects, leading to improved performance. We evaluate our model on two robotic datasets, one of which we collected for this paper, containing 406 successful and 138 failed teleoperated demonstrations of a Franka Emika Panda robot performing tasks like pouring and cleaning the counter. We show in our experiments that ConditionNET outperforms all baselines on both anomaly detection and phase prediction tasks. Furthermore, we implement an action monitoring system on a real robot to demonstrate the practical applicability of the learned preconditions and effects. Our results highlight the potential of ConditionNET for enhancing the reliability and adaptability of robots in real-world environments. The data is available on the project website: https://dsliwowski1.github.io/ConditionNET_page.
|
|
17:00-17:05, Paper WeDT14.5 | |
MgCNL: Multi-Granularity Balls for Fault Diagnosis with Noisy Labels (I) |
|
Dunkin, Fir | Southeast University |
Li, Xinde | Southeast University |
Heqing, Li | Southeast University |
Wu, Guoliang | Southeast University |
Hu, Chuanfei | University of Shanghai for Science and Technology |
Ge, Shuzhi Sam | National University of Singapore |
Keywords: Deep Learning Methods, Failure Detection and Recovery, Cognitive Modeling
Abstract: Fault diagnosis using supervised learning has achieved remarkable progress in industrial scenarios. However, noisy labels—commonly introduced by automated annotation or human error—can severely compromise model reliability. To address this, we propose MgCNL, a confidence-aware training strategy that employs multi-granularity balls to dynamically assess label trustworthiness and guide the model to focus on high-confidence samples. Furthermore, we incorporate a supervised contrastive learning scheme without negative pairs to enhance the robustness of feature representations, thereby improving generalization under unknown noise rates. Extensive experiments across diverse datasets and model architectures validate the superiority of MgCNL over state-of-the-art noise-robust methods. Beyond performance metrics, the proposed approach offers practical advantages for real-world deployment in robotics applications, such as condition monitoring, predictive maintenance, and safety-critical decision-making. By enabling more reliable perception and fault understanding from imperfect data, MgCNL contributes to advancing the human-robotics frontier—aligning with the vision of resilient, intelligent systems emphasized by IROS 2025.
|
|
17:05-17:10, Paper WeDT14.6 | |
TSAN: A New Deep Learning-Based Detection Method for Sensor Anomaly in Mobile Robots (I) |
|
He, Zhitao | Zhejiang University of Technology |
Chen, Yongyi | Zhejiang University of Technology |
Zhao, Zhao Yang | Zhejiang University of Technology |
Liu, Andong | Zhejiang University of Technology |
Zhang, Dan | Zhejiang University of Technology |
Zhang, Hui | Beihang University |
Keywords: Deep Learning Methods
Abstract: In the area of robotic systems, anomaly detection is a crucial capability for achieving long-term autonomy (LTA), as it ensures the stable operation of robots over extended periods. An adversary can launch injection attacks that interfere with various sensors, including speed and acceleration sensors, thereby inducing abnormal behavior in the robot's operation. To address this threat, this paper proposes a novel attention mechanism network, the Temporal Shuffle Attention Network (TSAN), which efficiently analyzes time-series data obtained from a mobile robot's internal sensors. TSAN combines the strengths of global temporal attention (GTA) and external attention (EA) for effective temporal feature extraction. By integrating these attention mechanisms and adding positional encoding, TSAN enhances the representation of time-domain information, facilitating the effective extraction of temporal data features. By exploiting the benefits of time-domain feature extraction, TSAN aims to enhance anomaly detection performance. The efficacy of TSAN is verified in a real experimental study on a mobile robot in the lab. It is shown that TSAN exhibits excellent detection performance for various types of anomaly scenarios. Moreover, comparisons of multi-type anomaly detection performance with other methods are carried out, demonstrating the superiority of the proposed TSAN method.
|
|
17:10-17:15, Paper WeDT14.7 | |
Online Fault Diagnosis Using Bio-Inspired Spike Neural Network (I) |
|
Xu, Lie | Zhejiang University |
Ji, Daxiong | Zhejiang University |
Keywords: Deep Learning Methods
Abstract: Data-driven fault diagnosis methods suffer from restrictive assumptions that hinder adaptability to varying work conditions, rendering offline modes insufficient. Addressing this challenge, a bio-inspired spiking neural network (bio-SNN) is proposed, featuring an innovative online learning mode. The network employs a novel spike encoding method with data compression for efficiently transforming time-series data into spike sequences. This encoding method converts 1-D series data into a 2-D spectrogram using a filter bank, incorporating a stochastic spike rate for a more flexible representation than precise spike rates. The application of a biologically plausible learning rule, specifically spike timing-dependent plasticity (STDP), enhances the adaptability of the network. Horizontal inhibition and a homeostasis mechanism are also introduced, facilitating effective online updating of synaptic weights. Experimental results on two well-established fault datasets showcase the advantages of the bio-SNN method over existing approaches, highlighting its potential for robust and adaptive fault diagnosis in practical scenarios.
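The encoding step described above can be sketched as follows; the filter bank is approximated by pooling spectrogram bins, and all parameters (band count, window size, time steps) are illustrative assumptions rather than the paper's settings.

import numpy as np
from scipy.signal import spectrogram

def encode_spikes(x, fs, n_bands=16, t_steps=100, rng=None):
    """1-D signal -> coarse filter-bank spectrogram -> stochastic spike raster."""
    rng = np.random.default_rng() if rng is None else rng
    f, t, S = spectrogram(x, fs=fs, nperseg=256)
    # pool frequency bins into n_bands coarse bands (a simple filter bank)
    bands = np.array_split(S, n_bands, axis=0)
    E = np.stack([b.mean(axis=0) for b in bands])   # (n_bands, n_frames)
    E = E / (E.max() + 1e-12)                       # normalize to [0, 1]
    # resample frames to t_steps and draw spikes stochastically (Bernoulli)
    idx = np.linspace(0, E.shape[1] - 1, t_steps).astype(int)
    rate = E[:, idx]
    return (rng.random(rate.shape) < rate).astype(np.uint8)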
|
|
WeDT15 |
206 |
Autonomous Vehicle Navigation 2 |
Regular Session |
|
16:40-16:45, Paper WeDT15.1 | |
ExpliDrive: Bridging Model Predictive Control and Transformers for Interactive Autonomous Driving |
|
Lian, Zhexi | Tongji University |
Yan, Xuerun | Tongji University |
Bi, Ruiang | Tongji University |
Wang, Haoran | Tongji University |
Hu, Jia | Tongji University |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Intelligent Transportation Systems
Abstract: Autonomous driving (AD) continues to grapple with the complexity of dynamic and interactive traffic environments, where the primary difficulty stems from insufficient modeling of inter-vehicle interactions, particularly how autonomous agents should perceive and respond to surrounding vehicles' influence. To address this, this paper proposes ExpliDrive, an explainable data-driven approach for interaction-aware autonomous driving. Its highlights lie in bridging Model Predictive Control (MPC) and Transformers. The proposed approach builds a generalized system dynamics model in which interaction effects between vehicles are explicitly modeled. Specifically, a Transformer encoder-decoder is employed to encode the interaction patterns among vehicles, and these learned effects are seamlessly embedded into the motion planning process. Hence, the proposed approach has the following features: i) it enables proactively interaction-aware autonomous driving; ii) it is data-driven yet explainable; and iii) it integrates prediction into motion planning. Open-loop evaluation demonstrates that the proposed approach achieves the lowest prediction errors, from ADE@1s (0.16 m) to ADE@5s (0.80 m). Closed-loop planning shows that the proposed approach provides significant benefits in driving success rate and flexibility.
|
|
16:45-16:50, Paper WeDT15.2 | |
Multimodal Integrated Prediction and Decision-Making with Adaptive Interaction Modality Explorations |
|
Li, Tong | Hong Kong University of Science and Technology |
Zhang, Lu | Hong Kong University of Science and Technology |
Liu, Sikang | DJI |
Shen, Shaojie | Hong Kong University of Science and Technology |
Keywords: Autonomous Vehicle Navigation, Intelligent Transportation Systems
Abstract: Navigating dense and dynamic environments poses a significant challenge for autonomous driving systems, owing to the intricate nature of multimodal interaction, wherein the actions of various traffic participants and the autonomous vehicle are complex and implicitly coupled. In this paper, we propose a novel framework, Multimodal Integrated predictioN and Decision-making (MIND), which addresses the challenges by efficiently generating joint predictions and decisions covering multiple distinctive interaction modalities. Specifically, MIND leverages learning-based scenario predictions to obtain integrated predictions and decisions with socially-consistent interaction modality and utilizes a modality-aware dynamic branching mechanism to generate scenario trees that efficiently capture the evolutions of distinctive interaction modalities with low growth of interaction uncertainty along the planning horizon. The scenario trees are seamlessly utilized by the contingency planning under interaction uncertainty to obtain clear and considerate maneuvers accounting for multimodal evolutions. Comprehensive experimental results in the closed-loop simulation based on the real-world driving dataset showcase superior performance to other strong baselines under various driving contexts.
|
|
16:50-16:55, Paper WeDT15.3 | |
Scalable Offline Metrics for Autonomous Driving |
|
Aich, Animikh | Boston University |
Kulkarni, Adwait | Boston University |
Ohn-Bar, Eshed | Boston University |
Keywords: Autonomous Vehicle Navigation, Intelligent Transportation Systems, Performance Evaluation and Benchmarking
Abstract: Real-world evaluation of perception-based planning models for robotic systems, such as autonomous vehicles, can be safely and inexpensively conducted offline, i.e., by computing model prediction error over a pre-collected validation dataset with ground-truth annotations. However, extrapolating from offline model performance to online settings remains a challenge. In these settings, seemingly minor errors can compound and result in test-time infractions or collisions. This relationship is understudied, particularly across diverse closed-loop metrics and complex urban maneuvers. In this work, we revisit this undervalued question in policy evaluation through an extensive set of experiments across diverse conditions and metrics. Based on analysis in simulation, we find an even worse correlation between offline and online settings than reported by prior studies, casting doubt on the validity of current evaluation practices and metrics for driving policies. Next, we bridge the gap between offline and online evaluation. We investigate an offline metric based on epistemic uncertainty, which aims to capture events that are likely to cause errors in closed-loop settings. The resulting metric achieves over 13% improvement in correlation compared to previous offline metrics. We further validate the generalization of our findings beyond the simulation environment in real-world settings, where even greater gains are observed.
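A minimal sketch of an epistemic-uncertainty offline metric of the kind described, here using disagreement across an ensemble of policy heads; the aggregation rule and threshold are assumptions, not the paper's exact metric.

import numpy as np

def epistemic_score(ensemble_preds):
    """ensemble_preds: (n_models, n_samples, action_dim) predictions on a
    validation set. Returns one uncertainty scalar per sample."""
    var = ensemble_preds.var(axis=0)     # per-dimension disagreement
    return var.mean(axis=-1)             # average over action dimensions

def offline_metric(ensemble_preds, threshold):
    """Fraction of validation samples flagged as likely closed-loop failures."""
    return float((epistemic_score(ensemble_preds) > threshold).mean())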
|
|
16:55-17:00, Paper WeDT15.4 | |
M2P2: A Multi-Modal Passive Perception Dataset for Off-Road Mobility in Extreme Low-Light Conditions |
|
Datar, Aniket | George Mason University |
Pokhrel, Anuj | George Mason University |
Nazeri, Mohammad | George Mason University |
Balaji Rao, Madhan | George Mason University |
Rangwala, Harsh | George Mason University |
Pan, Chenhui | George Mason University |
Zhang, Yufan | George Mason University |
Harrison, Andre | U.S. Army Research Laboratory |
Wigness, Maggie | U.S. Army Research Laboratory |
Osteen, Philip | U.S. Army Research Laboratory |
Ye, Jinwei | George Mason University |
Xiao, Xuesu | George Mason University |
Keywords: Data Sets for Robotic Vision, Data Sets for Robot Learning, Vision-Based Navigation
Abstract: Long-duration, off-road, autonomous missions require robots to continuously perceive their surroundings regardless of the ambient lighting conditions. Most existing autonomy systems heavily rely on active sensing, e.g., LiDAR, RADAR, and Time-of-Flight sensors, or use (stereo) visible light imaging sensors, e.g., color cameras, to perceive environment geometry and semantics. In scenarios where fully passive perception is required and lighting conditions are degraded to an extent that visible light cameras fail to perceive, most downstream mobility tasks such as obstacle avoidance become impossible. To address such a challenge, this paper presents a Multi-Modal Passive Perception dataset, M2P2, to enable off-road mobility in low-light to no-light conditions. We design a multi-modal sensor suite including thermal, event, and stereo RGB cameras, GPS, two Inertial Measurement Units (IMUs), as well as a high-resolution LiDAR for ground truth, with a multi-sensor calibration procedure that can efficiently transform multi-modal perceptual streams into a common coordinate system. Our 10-hour, 32 km dataset also includes mobility data such as robot odometry and actions and covers well-lit, low-light, and no-light conditions, along with paved, on-trail, and off-trail terrain. Our results demonstrate that off-road mobility and scene understanding under degraded visual environments are possible through passive perception alone in extreme low-light conditions.
|
|
17:00-17:05, Paper WeDT15.5 | |
Agile Plane Transition of a Hexapod Climbing Robot |
|
Gong, Chengzhang | Zhejiang University |
Fan, Li | Huzhou Institute of Zhejiang University, Zhejiang University |
Xu, Chao | Zhejiang University |
Wang, Dacheng | Huzhou Transportation Technology Development CO., LTD |
Keywords: Climbing Robots, Legged Robots, Motion and Path Planning
Abstract: Traversing adjacent planes is an important ability for legged climbing robots. While many robots can achieve autonomous ground-to-wall transitions, most are limited to scenarios where the angle between the planes is fixed at a specific value. In some cases, however, the robot needs to traverse planes with a wide variety of angles. To enhance the adaptability of the robot in such diverse scenarios, we analyze the plane transition process and propose a universal methodology for hexapod climbing robots with a two-stage workflow. In the first stage, we plan a body trajectory within a reachable map, without considering the configuration of the legs. This low-dimensional map can be efficiently sampled and explored to identify feasible transitions. In the second stage, we use motion prediction to generate landing points, as well as swing and stance trajectories for each leg. By tracking these trajectories, the robot can autonomously transition from one plane to another. Guided by this methodology, we design a hexapod climbing robot capable of autonomously traversing planes with angles ranging from 30° to 270°. For further validation, we build a physical prototype of the robot and conduct a series of plane transition experiments. The results demonstrate the feasibility of both our methodology and the robot.
|
|
17:05-17:10, Paper WeDT15.6 | |
Adaptive Large-Scale Novel View Image Synthesis for Autonomous Driving Datasets |
|
Xue, Yiheng | Southern University of Science and Technology |
Lyu, Zhijun | Southern University of Science and Technology |
Ma, Rui | Guangxi Normal University |
Xie, Yuezhen | Southern University of Science and Technology |
Hao, Qi | Southern University of Science and Technology |
Keywords: Mapping, RGB-D Perception, Data Sets for Robotic Vision
Abstract: Novel view image synthesis for large-scale outdoor traffic scenes presents significant challenges, including inaccurate depth measurements, moving objects, wide-angle rendering requirements, and the increased demand for memory and computational resources. In this paper, we propose an adaptive pipeline that constructs high-fidelity 3D surfel models and synthesizes realistic novel views in real time. Our contributions are threefold: 1) developing depth-refinement and moving-object-removal techniques to robustly reconstruct surfel-based scene geometry, while minimizing computational overhead; 2) developing a self-adaptive rendering mechanism which adjusts surfel geometry for large-scale scenes within constrained memory; 3) developing a hyper-parameter tuning approach for optimal surfel construction and rendering performance. An optional GAN-based inpainting module fills missing backgrounds (e.g., sky). Experiments on the KITTI dataset and CARLA simulator show that our method achieves image quality comparable to SOTA NeRF and 3D Gaussian Splatting techniques with significantly improved computational efficiency. This makes our approach particularly well-suited for large-scale traffic scenarios. Our simulation datasets with ground-truth data and source code are available at https://github.com/Billy1203/SurfelMapping.
|
|
WeDT16 |
207 |
Computer Vision for Automation and Manufacturing |
Regular Session |
Chair: Tang, Guangzhi | Maastricht University |
|
16:40-16:45, Paper WeDT16.1 | |
Context-Aware Sparse Spatiotemporal Learning for Event-Based Vision |
|
Wang, Shenqi | Delft University of Technology |
Tang, Guangzhi | Maastricht University |
Keywords: Computer Vision for Automation, Deep Learning Methods
Abstract: Event-based cameras have emerged as a promising paradigm for robot perception, offering high temporal resolution, high dynamic range, and robustness to motion blur. However, existing deep learning-based event processing methods often fail to fully leverage the sparse nature of event data, complicating their integration into resource-constrained edge applications. While neuromorphic computing provides an energy-efficient alternative, spiking neural networks struggle to match the performance of state-of-the-art models in complex event-based vision tasks, such as object detection and optical flow estimation. Moreover, achieving high activation sparsity in neural networks is still difficult and often demands careful manual tuning of sparsity-inducing loss terms. Here, we propose Context-aware Sparse Spatiotemporal Learning (CSSL), a novel framework that introduces context-aware thresholding to dynamically regulate neuron activations based on the input distribution, naturally reducing activation density without explicit sparsity constraints. Applied to event-based object detection and optical flow estimation, CSSL achieves comparable or superior performance to state-of-the-art methods while maintaining extremely high neuronal sparsity. Our experimental results highlight CSSL's crucial role in enabling efficient event-based vision for neuromorphic processing.
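A rough sketch of context-aware thresholding as we read the abstract: the activation cutoff is derived from the statistics of the current input rather than being a fixed constant. The NCHW layout and the learned sharpness parameter k are illustrative assumptions.

import torch
import torch.nn as nn

class ContextAwareThreshold(nn.Module):
    def __init__(self, k_init=1.0):
        super().__init__()
        self.k = nn.Parameter(torch.tensor(k_init))  # learned sharpness

    def forward(self, x):
        # per-sample threshold from input statistics (assumes NCHW maps)
        mu = x.mean(dim=(1, 2, 3), keepdim=True)
        sd = x.std(dim=(1, 2, 3), keepdim=True)
        thresh = mu + self.k * sd
        # activations below the dynamic threshold are zeroed (sparse output)
        return torch.where(x > thresh, x, torch.zeros_like(x))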
|
|
16:45-16:50, Paper WeDT16.2 | |
IGaussian: Real-Time Camera Pose Estimation Via Feed-Forward 3D Gaussian Splatting Inversion |
|
Wang, Hao | Beijing University of Posts and Telecommunications |
Zhao, Linqing | Tsinghua University |
Xu, Xiuwei | Tsinghua University |
Lu, Jiwen | Tsinghua University |
Yan, Haibin | Beijing University of Posts and Telecommunications |
Keywords: Computer Vision for Automation
Abstract: Recent trends in SLAM and visual navigation have embraced 3D Gaussians as the preferred scene representation, highlighting the importance of estimating camera poses from a single image using a pre-built Gaussian model. However, existing approaches typically rely on an iterative render-compare-refine loop, where candidate views are first rendered using NeRF or Gaussian Splatting, then compared against the target image, and finally, discrepancies are used to update the pose. This multi-round process incurs significant computational overhead, hindering real-time performance in robotics. In this paper, we propose iGaussian, a two-stage feed-forward framework that achieves real-time camera pose estimation through direct 3D Gaussian inversion. Our method first regresses a coarse 6DoF pose using a Gaussian Scene Prior-based Pose Regression Network with spatial uniform sampling and guided attention mechanisms, then refines it through feature matching and multi-model fusion. The key contribution lies in our cross-correlation module that aligns image embeddings with 3D Gaussian attributes without differentiable rendering, coupled with a Weighted Multiview Predictor that fuses features from multiple strategically sampled viewpoints. Experimental results on the NeRF Synthetic, Mip-NeRF 360, and T&T+DB datasets demonstrate a significant performance improvement over previous methods, reducing median rotation errors to 0.2° while achieving 2.87 FPS tracking on mobile robots, which is an impressive 10× speedup compared to optimization-based approaches. Project page: https://github.com/pythongod-exe/iGaussian
|
|
16:50-16:55, Paper WeDT16.3 | |
Vision-Driven 2D Supervised Fine-Tuning Framework for Bird’s Eye View Perception |
|
He, Lei | Tsinghua University |
Wang, Qiaoyi | Tsinghua University |
Sun, Honglin | Waseda University |
Xu, Qing | Tsinghua University |
Gao, Bolin | Tsinghua University |
Li, Shengbo Eben | Tsinghua University |
Wang, Jianqiang | Tsinghua University |
Li, Keqiang | Tsinghua University |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception, Visual Learning
Abstract: Visual bird’s eye view (BEV) perception, due to its excellent perceptual capabilities, is progressively replacing costly LiDAR-based perception systems, especially in the realm of urban intelligent driving. However, this type of perception still relies on LiDAR data to construct ground truth databases, a process that is both cumbersome and time-consuming. Additionally, most mass-produced autonomous driving systems are equipped solely with surround camera sensors and lack the LiDAR data necessary for precise annotation. To tackle this challenge, we propose a fine-tuning method for BEV perception network based on visual 2D semantic perception, aimed at enhancing the model’s generalization capabilities in new scene data. Leveraging the maturity of 2D perception technologies, our method utilizes only 2D semantic segmentation labels and monocular depth estimations, thereby significantly reducing the dependence on expensive BEV ground truths and offering strong potential for industrial deployment. Extensive experiments and comparative analyses on the nuScenes and Waymo datasets demonstrate the effectiveness of our method. Specifically, it improves mAP and NDS by 2.51% and 1.93% on nuScenes, and by 1.21% and 0.78% on Waymo, respectively, validating its practical utility and robustness across diverse domains.
|
|
16:55-17:00, Paper WeDT16.4 | |
Viewpoint Planning for Active 3D Reconstruction of Freeform Surface Parts Based on DDPG-3DCNN (I) |
|
Ye, Jun | Hunan University |
Fang, Qiu | Hunan University |
Peng, Weixing | Hunan University |
Zhong, Fuqiang | Hunan University |
Wang, Yaonan | Hunan University |
Keywords: Factory Automation, Reinforcement Learning, Computer Vision for Automation
Abstract: Three-dimensional (3D) measurement represents a promising approach for assessing the quality of freeform surface parts with stringent precision requirements, where viewpoint planning assumes a critical role in 3D reconstruction. Prevailing viewpoint planning methods typically operate within discrete spaces, which can result in the loss of critical details due to quantization errors, particularly when applied to products with intricate surface structures. Moreover, conventional approaches encounter difficulties in controlling the overlap rate of the point cloud due to the unique characteristics of freeform surface structures, resulting in diminished registration accuracy of the point cloud. This paper introduces a reinforcement learning framework for the viewpoint planning of freeform surface parts with high precision requirements, utilizing the Deep Deterministic Policy Gradient algorithm with 3D convolutional structures to determine optimal viewpoints in a continuous space. Effective control over the point cloud overlap rate is achieved through the design of a unique reward function. Furthermore, constraints related to the eye-in-hand robot measurement equipment are formulated. Through real experiments, our method demonstrates a substantial improvement in point cloud overlap rate control accuracy, surpassing existing methods by an order of magnitude. Our approach achieves higher coverage using fewer viewpoints compared to other reinforcement learning-based viewpoint planning methods. Additionally, it exhibits generalization capabilities and can assess models that have not been previously encountered.
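The overlap-rate control via reward design might look like the following toy shaping term, rewarding new coverage while penalizing deviation from a target overlap; the target value and weights are illustrative, not the paper's reward function.

def viewpoint_reward(new_coverage, overlap_rate, target_overlap=0.3,
                     w_cov=1.0, w_overlap=2.0):
    """new_coverage: fraction of surface newly observed from this viewpoint.
    overlap_rate: fraction of the new scan overlapping previous scans."""
    return w_cov * new_coverage - w_overlap * abs(overlap_rate - target_overlap)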
|
|
17:00-17:05, Paper WeDT16.5 | |
Adaptive Neural Uncalibrated Visual Servo with Zero-Shot Transfer of Extrinsics and Scenes |
|
Chen, Anzhe | Zhejiang University |
Li, Shuxin | Zhejiang University |
Yu, Hongxiang | Zhejiang University |
Zhou, Zhongxiang | Zhejiang University |
Xiong, Rong | Zhejiang University |
Wang, Yue | Zhejiang University |
Keywords: Computer Vision for Manufacturing, Visual Servoing, Perception for Grasping and Manipulation
Abstract: Deploying a visual servo controller to novel scenes with uncertain parameters requires additional manual calibration effort. Traditional methods tackle this problem by estimating the Jacobian matrix online. However, they struggle in challenging scenes due to intrinsic limitations. For instance, image-based uncalibrated visual servoing requires tracking a fixed set of points, which is impractical in texture-less scenes. Position-based uncalibrated visual servoing necessitates the absolute scale of translation, which requires a depth sensor or a model-based pose estimator, introducing extra hardware cost or model complexity. Recent advances in neural network-based visual servoing have shown improvements in convergence, precision, and generalization compared to traditional methods. However, uncalibrated neural visual servoing remains underexplored. In this paper, we propose a structured Jacobian estimator for a neural visual servo controller, enabling zero-shot transfer to novel environments with unknown extrinsics and scene scale. Stability of the pose error is analyzed under a bounded calibration error assumption. Moreover, we propose an automatic control gain scheduler to accelerate convergence while maintaining a high success rate and precision. The scheduling behavior is analyzed through greedy optimal control. Our method is validated with simulated and real-world experiments.
|
|
17:05-17:10, Paper WeDT16.6 | |
Improving 6D Object Pose Estimation of Metallic Household and Industry Objects |
|
Pöllabauer, Thomas | TU Darmstadt |
Gasser, Michael | TU Darmstadt |
Wirth, Tristan | TU Darmstadt |
Berkei, Sarah | Threedy GmbH |
Knauthe, Volker | TU Darmstadt |
Kuijper, Arjan | TU Darmstadt, Fraunhofer Institute for Computer Graphics Research |
Keywords: Computer Vision for Manufacturing, Data Sets for Robotic Vision, Deep Learning for Visual Perception
Abstract: 6D object pose estimation suffers from reduced accuracy when applied to metallic objects. We set out to improve the state-of-the-art by addressing challenges such as reflections and specular highlights in industrial applications. Our novel BOP-compatible dataset, featuring a diverse set of metallic objects (cans, household, and industrial items) under various lighting and background conditions, provides additional geometric and visual cues. We demonstrate that these cues can be effectively leveraged to enhance overall performance. To illustrate the usefulness of the additional features, we improve upon the GDRNPP algorithm by introducing an additional keypoint prediction and material estimator head in order to improve spatial scene understanding. Evaluations on the new dataset show improved accuracy for metallic objects, supporting the hypothesis that additional geometric and visual cues can improve learning.
|
|
WeDT17 |
210A |
Intelligent Transportation Systems 4 |
Regular Session |
|
16:40-16:45, Paper WeDT17.1 | |
Learning Predictive Control with Online Modeling for Agile Maneuvering of Autonomous Vehicles |
|
Yin, Xin | College of Intelligence Science and Technology, National University of Defense Technology |
Zhang, Zengyi | China Cec Engineering Corporation |
Cao, Haotian | National University of Defense Technology |
Liu, Tenglong | National University of Defense Technology |
Lan, Yixing | National University of Defense Technology |
Xu, Xin | National University of Defense Technology |
Zhang, Xinglong | National University of Defense Technology |
Keywords: Intelligent Transportation Systems, Reinforcement Learning, Motion Control
Abstract: The agile maneuvering control of autonomous vehicles (AVs) requires tracking reference trajectories characterized by high acceleration, sharp curvature, considerable disturbances, and significant time variation, all while ensuring stability and accuracy. The inherent uncertainty and time-varying nature of both the vehicle model and its environment pose significant challenges to achieving high-performance tracking during agile maneuvers. Developing a control algorithm that can solve for the optimal policy of nonlinear systems with uncertainties is therefore critical. In this paper, we propose a learning-based predictive control approach, namely adaptive model predictive control (AMPC) with Actor-Critic Learning (ACL), for generating closed-loop MPC policies for agile maneuvering of AVs. The proposed approach leverages neural networks to model the dynamics uncertainties online. The control policy and model are updated simultaneously to realize performance optimization under time-varying uncertainties. Simulation results demonstrate that our proposed algorithm outperforms other leading ACL methods, as well as MPC and the Linear Quadratic Regulator (LQR). Furthermore, field-test results validate its effectiveness on the HongQi-EHS3 electric vehicle, showing superior control performance compared to MPC on both paved roads and curved off-road terrain, with excellent stability.
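A minimal sketch of the online model-learning ingredient described above: a small network fits the residual between a nominal vehicle model and observed transitions and is updated from streaming data during operation. All names, sizes, and the update rule are illustrative assumptions, not the authors' AMPC formulation.

import torch
import torch.nn as nn

class ResidualDynamics(nn.Module):
    """Predicts the mismatch between nominal model and true next state."""
    def __init__(self, state_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def online_update(model, opt, s, a, s_next, nominal_next):
    """One gradient step on the residual target (s_next - nominal_next)."""
    target = s_next - nominal_next
    loss = ((model(s, a) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()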
|
|
16:45-16:50, Paper WeDT17.2 | |
Joint Pedestrian and Vehicle Traffic Optimization in Urban Environments Using Reinforcement Learning |
|
Poudel, Bibek | University of Tennessee Knoxville |
Wang, Xuan | George Mason University |
Li, Weizi | University of Tennessee, Knoxville |
Zhu, Lei | University of North Carolina, Charlotte |
Heaslip, Kevin | University of Tennessee Knoxville |
Keywords: Intelligent Transportation Systems, Reinforcement Learning, Optimization and Optimal Control
Abstract: Reinforcement learning (RL) holds significant promise for adaptive traffic signal control. While existing RL-based methods demonstrate effectiveness in reducing vehicular congestion, their predominant focus on vehicle-centric optimization leaves pedestrian mobility needs and safety challenges unaddressed. In this paper, we present a deep RL framework for adaptive control of eight traffic signals along a real-world urban corridor, jointly optimizing both pedestrian and vehicular efficiency. Our single-agent policy is trained using real-world pedestrian and vehicle demand data derived from Wi-Fi logs and video analysis. The results demonstrate significant performance improvements over traditional fixed-time signals, reducing average wait times per pedestrian and per vehicle by up to 67% and 52% respectively, while simultaneously decreasing total wait times for both groups by up to 67% and 53%. Additionally, our results demonstrate generalization capabilities across varying traffic demands, including conditions entirely unseen during training, validating RL's potential for developing transportation systems that serve all road users.
|
|
16:50-16:55, Paper WeDT17.3 | |
Large-Scale Mixed-Traffic and Intersection Control Using Multi-Agent Reinforcement Learning |
|
Liu, Songyang | University of Florida |
Fan, Muyang | University of Memphis |
Li, Weizi | University of Tennessee, Knoxville |
Du, Jing | University of Florida |
Li, Shuai | University of Florida |
Keywords: Intelligent Transportation Systems, Multi-Robot Systems, Reinforcement Learning
Abstract: Traffic congestion remains a significant challenge in modern urban networks. Autonomous driving technologies have emerged as a potential solution. Among traffic control methods, reinforcement learning has shown superior performance over traffic signals in various scenarios. However, prior research has largely focused on small-scale networks or isolated intersections, leaving large-scale mixed traffic control largely unexplored. This study presents the first attempt to use decentralized multi-agent reinforcement learning for large-scale mixed traffic control in which some intersections are managed by traffic signals and others by robot vehicles (RVs). Evaluating a real-world network in Colorado Springs, CO, USA with 14 intersections, we measure traffic efficiency via the average waiting time of vehicles at intersections and the number of vehicles reaching their destinations within a time window (i.e., throughput). At an 80% RV penetration rate, our method reduces waiting time from 6.17 s to 5.09 s and increases throughput from 454 vehicles per 500 seconds to 493 vehicles per 500 seconds, outperforming the baseline of fully signalized intersections. These findings suggest that integrating reinforcement learning-based control into large-scale traffic can improve overall efficiency and may inform future urban planning strategies.
|
|
16:55-17:00, Paper WeDT17.4 | |
Adapt-VRPD: Vehicle Routing Problem with Drones under Dynamically Changing Traffic Conditions |
|
Imran, Navid Mohammad | William Paterson University |
Won, Myounggyu | University of Memphis |
Keywords: Intelligent Transportation Systems, Logistics, Automation Technologies for Smart Cities
Abstract: The vehicle routing problem with drones (VRPD) involves determining the optimal routes for trucks and drones to collaboratively deliver parcels to customers, aiming to minimize total operational costs. While various heuristic algorithms have been developed to address the problem, existing solutions are built based on simplistic cost models, overlooking the temporal dynamics of the costs, which fluctuate depending on the dynamically changing traffic conditions. In this paper, we present a novel problem called the vehicle routing problem with drones under dynamically changing traffic conditions (Adapt-VRPD) to address the limitation of existing VRPD solutions. We design a novel cost model that factors in the actual travel distance and projected travel time, computed using a machine learning-driven travel time prediction algorithm. A variable neighborhood descent (VND) algorithm is developed to find the optimal truck-drone routes under the dynamics of traffic conditions through incorporation of the travel time prediction model. A simulation study was performed to compare our algorithm with a state-of-the-art VRPD heuristic. Our algorithm outperformed the benchmark, reducing the average and maximum discrepancies from the actual cost by 37.6% and 27.6%, respectively, across various delivery scenarios.
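A toy version of the time-dependent cost model described above; the predictor interface, weights, and the rush-hour heuristic are assumptions, not the paper's learned model.

from datetime import datetime

def arc_cost(dist_km, depart_time, predict_minutes, alpha=1.0, beta=0.5):
    """predict_minutes(dist_km, depart_time) -> forecast travel time,
    e.g. a regressor trained on historical traffic data."""
    return alpha * dist_km + beta * predict_minutes(dist_km, depart_time)

def toy_predictor(dist_km, depart_time):
    """Illustrative stand-in: rush-hour slowdown around 8:00 and 17:00."""
    rush = 1.5 if depart_time.hour in (8, 17) else 1.0
    return dist_km / 40.0 * 60.0 * rush   # minutes at a 40 km/h base speed

# example: cost of a 12 km arc departing at morning rush hour
cost = arc_cost(12.0, datetime(2025, 1, 1, 8, 30), toy_predictor)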
|
|
17:00-17:05, Paper WeDT17.5 | |
MATRICS: A Multi-Agent Deep Reinforcement Learning-Based Traffic-Aware Intelligent Lane-Change System |
|
Das, Lokesh Chandra | Wichita State University |
Won, Myounggyu | University of Memphis |
Keywords: Intelligent Transportation Systems, Autonomous Vehicle Navigation, Task and Motion Planning
Abstract: We present MATRICS, a traffic-aware multi-agent reinforcement learning (MARL)-based intelligent lane-change system designed for autonomous vehicles (AVs). While existing research primarily focuses on enhancing the local impact of the ego vehicle's lane-change decisions, MATRICS stands out by optimizing both local and global performance, i.e., aiming not only to improve the traffic efficiency, driving safety, and driver comfort of the ego vehicle, but also to enhance overall traffic flow within a designated road segment. Through an extensive review of the transportation literature, we construct a novel state space integrating local traffic information collected from surrounding vehicles and global traffic data obtained from road-side units (RSUs). We develop a reward function to guide judicious lane-change decisions, considering both ego vehicle performance and traffic flow enhancement. Our local density-aware multi-agent double deep Q-network (DDQN) algorithm facilitates effective cooperation among agents in executing lane-change maneuvers. Simulation results demonstrate MATRICS' superior performance across metrics of traffic efficiency, driving safety, and driver comfort in comparison with a state-of-the-art MARL model.
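One way to read the combined local/global objective above is a reward of the following shape; the terms and weights are our illustration, not the MATRICS reward function.

def lane_change_reward(ego_speed, target_speed, jerk, collision,
                       segment_flow, flow_ref, w=(1.0, 0.2, 10.0, 0.5)):
    """Local terms (efficiency, comfort, safety) plus a global flow term,
    e.g. segment_flow reported by road-side units."""
    w_eff, w_comf, w_safe, w_glob = w
    r = -w_eff * abs(ego_speed - target_speed)          # traffic efficiency
    r -= w_comf * abs(jerk)                             # driver comfort
    r -= w_safe * float(collision)                      # driving safety
    r += w_glob * (segment_flow / max(flow_ref, 1e-6))  # global traffic flow
    return r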
|
|
17:05-17:10, Paper WeDT17.6 | |
GraphSCENE: On-Demand Critical Scenario Generation for Autonomous Vehicles in Simulation |
|
Panagiotaki, Efimia | University of Oxford |
Pramatarov, Georgi | University of Oxford |
Kunze, Lars | UWE Bristol |
De Martini, Daniele | University of Oxford |
Keywords: Intelligent Transportation Systems, Autonomous Vehicle Navigation, Simulation and Animation
Abstract: Testing and validating Autonomous Vehicle (AV) performance in safety-critical and diverse scenarios is crucial before real-world deployment. However, manually creating such scenarios in simulation remains a significant and time-consuming challenge. This work introduces a novel method that generates dynamic temporal scene graphs corresponding to diverse traffic scenarios, on-demand, tailored to user-defined preferences, such as AV actions, sets of dynamic agents, and criticality levels. A temporal Graph Neural Network (GNN) model learns to predict relationships between ego-vehicle, agents, and static structures, guided by real-world spatiotemporal interaction patterns and constrained by an ontology that restricts predictions to semantically valid links. Our model consistently outperforms the baselines in accurately generating links corresponding to the requested scenarios. We render the predicted scenarios in simulation to further demonstrate their effectiveness as testing environments for AV agents.
|
|
17:10-17:15, Paper WeDT17.7 | |
An Efficient Real-Time Railway Container Yard Management Method Based on Partial Decoupling (I) |
|
Luan, Yao | Tsinghua University |
Jia, Qing-Shan | Tsinghua University |
Xing, Yi | China Railway Signal and Communication (CRSC) Research and Design Institute |
Li, Zhiyu | China Railway Signal and Communication (CRSC) Research and Design Institute |
Wang, Tengfei | China Railway Signal and Communication (CRSC) Research and Design Institute |
Keywords: Foundations of Automation, Reinforcement Learning, Logistics
Abstract: Sea-rail intermodal transportation is essential infrastructure in today's global supply chains. Since the yard is the interface between sea and land, optimizing the transportation process in yards is of significant interest for increasing transportation efficiency. However, the yard management problem usually suffers from a large state space and cannot be solved effectively. We consider this important problem from a decomposition perspective, focusing on real-time scheduling in railway container yards, and make the following contributions. We show the partial decomposition property of the yard management problem. This property establishes the equivalence between joint optimization of yard components and independent optimization of each component under mild assumptions, achieving optimality and effectiveness simultaneously. Utilizing this property, we further simplify each subproblem while preserving the optima, and develop algorithms for each simplified subproblem. Numerical experiments demonstrate the performance of the proposed algorithms and the contribution of optimizing each subproblem from various perspectives.
|
|
WeDT18 |
210B |
Probability and Statistical Methods |
Regular Session |
Co-Chair: Zhao, Lin | National University of Singapore |
|
16:40-16:45, Paper WeDT18.1 | |
Inverse Kinematics for Robot Arm Using Minimum Mean Square Error |
|
Shin, Changeui | LG Electronics |
Park, Junho | LG Electronics |
Jeong, Woong | LG Electronics |
Lee, Jaewook | LG Electronics |
Joo, Youngjun | Sookmyung Women's University |
Kwak, Hoseong | LG Electronics |
Keywords: Probability and Statistical Methods, Kinematics, Redundant Robots
Abstract: This paper considers the inverse kinematics problem of a robotic arm, applying minimum mean square error estimation with variance-based control. The proposed algorithm achieves optimal results by minimizing the average error, even when variance calculations are taken into account. Its performance is comparable to that of an algorithm using optimally tuned singular value decomposition (SVD). The calculated variance values are added to the diagonal terms of the matrix in the inverse matrix operation, as in the damped least squares method. This indicates that optimal performance can be achieved even when a Moore-Penrose pseudo-inverse matrix is employed instead of SVD. The effectiveness of the proposed method is validated on seven-degree-of-freedom (7-DoF) (1 rail + 6-DoF arm) and 6-DoF robots. By introducing practical error control methods, this paper contributes to enhancing the overall comprehension of error-related algorithms.
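The damped-least-squares update the abstract alludes to, with an estimated variance supplying the diagonal damping term, can be sketched as follows; using the variance directly as the regularizer is our reading of the abstract, not the authors' code.

import numpy as np

def dls_step(J, dx, var):
    """J: (m, n) Jacobian; dx: (m,) task-space error;
    var: scalar variance added to the diagonal (damping term).
    Returns the joint-space update dq = J^T (J J^T + var I)^{-1} dx."""
    m = J.shape[0]
    JJt = J @ J.T + var * np.eye(m)
    return J.T @ np.linalg.solve(JJt, dx)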
|
|
16:45-16:50, Paper WeDT18.2 | |
Masked Autoencoders Are Robust Task Offloaders for Timely and Accurate Inference |
|
Lee, Wonyeong | Sungkyunkwan University |
Lee, Seunghoon | Sungkyunkwan University |
Cho, Seungyeon | SungKyunKwan University |
Koo, Hyunwoo | Sungkyunkwan University |
Chwa, Hoon Sung | DGIST |
Lee, Jinkyu | Sungkyunkwan University |
Keywords: Robotics in Hazardous Fields, Distributed Robot Systems, Computer Vision for Transportation
Abstract: Edge devices for robotics in hazardous environments, such as rescue drones, navigate complex terrains while transmitting images to remote servers for the detection of anomalies such as wildfires. However, these devices operate under strict resource constraints, prioritizing operationally critical tasks (e.g., autonomous navigation) while handling image-processing workloads with minimal overhead. Offloading computation to a remote server can alleviate this burden, but unstable network conditions can degrade accuracy and timeliness. To address these challenges, this paper presents a novel offloading framework that balances computational efficiency and accuracy in image-processing tasks. Specifically, it ensures (R1) a minimum accuracy level for individual image-processing tasks associated with different camera sensors and (R2) maximizes the overall image-processing accuracy across all sensors. Our approach builds on an edge-server collaborative image reconstruction architecture, where images are divided into patches and selectively reconstructed. To achieve R1 and R2, we introduce: (i) a hierarchical scheduler that effectively prioritizes patch transmissions under resource constraints and (ii) a feedback mechanism that adapts to network instability, ensuring reliable offloading and inference. Experimental results demonstrate that our framework maintains high accuracy and timely processing, even under network failures.
|
|
16:50-16:55, Paper WeDT18.3 | |
Perpetua: Multi-Hypothesis Persistence Modeling for Semi-Static Environments |
|
Saavedra, Miguel | Université De Montréal |
Nashed, Samer | University of Massachusetts Amherst |
Gauthier, Charlie | Mila, University of Montreal |
Paull, Liam | Université De Montréal |
Keywords: Probability and Statistical Methods, Probabilistic Inference, Mapping
Abstract: Many robotic systems require extended deployments in complex, dynamic environments. In such deployments, parts of the environment may change between subsequent robot observations. Most robotic mapping or environment modeling algorithms are incapable of representing dynamic features in a way that enables predicting their future state. Instead, they opt to filter certain state observations, either by removing them or by some form of weighted averaging. This paper introduces Perpetua, a method for modeling the dynamics of semi-static features. Perpetua is able to incorporate prior knowledge about the dynamics of a feature if it exists, track multiple hypotheses, and adapt over time to enable prediction of future feature states. Specifically, we chain together mixtures of "persistence" and "emergence" filters to model the probability that features will disappear or reappear in a formal Bayesian framework. The approach is an efficient, scalable, general, and robust method for estimating the states of features in an environment, both in the present and at arbitrary future times. Through experiments on simulated and real-world data, we find that Perpetua yields better accuracy than similar approaches while also being online adaptable and robust to missing observations.
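A minimal persistence-filter-flavored sketch of the belief update described above: an exponential survival prior plus Bayesian hit/miss observation updates. Perpetua's chained mixtures of persistence and emergence filters are richer than this toy version; the detector rates and decay constant are assumptions.

import math

def predict(p_exist, dt, lam=0.01):
    """Decay the existence belief over elapsed time dt (exponential survival)."""
    return p_exist * math.exp(-lam * dt)

def update(p_exist, observed, p_det=0.9, p_false=0.05):
    """Bayes update of the existence belief from a hit (True) or miss (False)."""
    like_exist = p_det if observed else (1 - p_det)
    like_gone = p_false if observed else (1 - p_false)
    num = like_exist * p_exist
    return num / (num + like_gone * (1 - p_exist))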
|
|
16:55-17:00, Paper WeDT18.4 | |
DnD Filter: Differentiable State Estimation for Dynamic Systems Using Diffusion Models |
|
Wan, Ziyu | National University of Singapore |
Zhao, Lin | National University of Singapore |
Keywords: Sensorimotor Learning, Machine Learning for Robot Control, Probabilistic Inference
Abstract: This paper proposes the DnD Filter, a differentiable filter that utilizes diffusion models for state estimation of dynamic systems. Unlike conventional differentiable filters, which often impose restrictive assumptions on process noise (e.g., Gaussianity), DnD Filter enables a nonlinear state update without such constraints by conditioning a diffusion model on both the predicted state and observational data, capitalizing on its ability to approximate complex distributions. We validate its effectiveness on both a simulated task and a real-world visual odometry task, where DnD Filter consistently outperforms existing baselines. Specifically, it achieves a 25% improvement in estimation accuracy on the visual odometry task compared to state-of-the-art differentiable filters, and even surpasses differentiable smoothers that utilize future measurements. To the best of our knowledge, DnD Filter represents the first successful attempt to leverage diffusion models for state estimation, offering a flexible and powerful framework for nonlinear estimation under noisy measurements. The code is available at https://github.com/ZiyuNUS/DnDFilter.
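A skeleton of the diffusion-based state update as we read the abstract: a denoising network, conditioned on the motion-model prediction and the observation, iteratively refines a sample of the posterior state. The linear noise schedule and the eps_net interface are placeholders, not the authors' model.

import torch

@torch.no_grad()
def diffusion_state_update(eps_net, x_pred, obs, T=50):
    """Standard DDPM ancestral sampling, conditioned on [x_pred, obs]."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    x = torch.randn_like(x_pred)                 # start from pure noise
    for t in reversed(range(T)):
        cond = torch.cat([x_pred, obs], dim=-1)  # condition the denoiser
        eps = eps_net(x, cond, torch.tensor([t]))
        x = (x - betas[t] / torch.sqrt(1 - abar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                     # sample of the updated state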
|
|
17:00-17:05, Paper WeDT18.5 | |
Global Optimization of Stochastic Black-Box Functions with Arbitrary Noise Distributions Using Wilson Score Kernel Density Estimation |
|
Iversen, Thorbjørn Mosekjær | University of Southern Denmark |
Sørensen, Lars Carøe | University of Southern Denmark |
Mathiesen, Simon Faarvang | University of Southern Denmark |
Petersen, Henrik Gordon | University of Southern Denmark |
Keywords: Probability and Statistical Methods, Probabilistic Inference
Abstract: Many optimization problems in robotics involve the optimization of time-expensive black-box functions, such as those involving complex simulations or evaluation of real-world experiments. Furthermore, these functions are often stochastic, as repeated experiments are subject to unmeasurable disturbances. Bayesian optimization can be used to optimize such functions in an efficient manner by deploying a probabilistic function estimator whose confidence bounds allow regions of the search space to be pruned away. Consequently, the success of Bayesian optimization depends on the function estimator's ability to provide informative confidence bounds. Existing function estimators require many function evaluations to infer the underlying confidence or depend on modeling of the disturbances. In this paper, it is shown that the confidence bounds provided by the Wilson Score Kernel Density Estimator (WS-KDE) are applicable as excellent bounds to any stochastic function with an output confined to the closed interval [0;1], regardless of the distribution of the output. This finding opens up the use of WS-KDE for stable global optimization on a wider range of cost functions. The properties of WS-KDE in the context of Bayesian optimization are demonstrated in simulation and applied to the problem of automated trap design for vibrational part feeders.
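The two ingredients named above can be sketched together as follows: kernel-weighted effective counts feed a Wilson score interval, yielding bounds that stay in [0;1] for any output distribution. The function name ws_kde_bounds, bandwidth h, and this exact combination are our assumptions; WS-KDE as published may differ in detail.

import numpy as np

def ws_kde_bounds(x_query, X, y, h=0.1, z=1.96):
    """X: sampled points (n, d); y: outcomes in [0, 1]; h: kernel bandwidth.
    Returns (lower, upper) Wilson-score confidence bounds at x_query."""
    w = np.exp(-0.5 * np.sum((X - x_query) ** 2, axis=1) / h**2)  # kernel weights
    n_eff = w.sum()                                # effective sample count
    p_hat = (w * y).sum() / max(n_eff, 1e-12)      # kernel-weighted mean
    denom = 1.0 + z**2 / n_eff
    center = (p_hat + z**2 / (2 * n_eff)) / denom
    half = (z / denom) * np.sqrt(p_hat * (1 - p_hat) / n_eff
                                 + z**2 / (4 * n_eff**2))
    return center - half, center + half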
|
|
17:05-17:10, Paper WeDT18.6 | |
BaTCAVe: Trustworthy Explanations for Robot Behaviors |
|
Sagar, Som | Arizona State University |
Taparia, Aditya | Arizona State University |
Mankodiya, Harsh | Arizona State University |
Bidare, Pranav Ramesh | Arizona State University |
Zhou, Yifan | Arizona State University |
Senanayake, Ransalu | Arizona State University |
Keywords: Probability and Statistical Methods
Abstract: Black-box neural networks are an indispensable part of modern robots. Nevertheless, deploying such high-stakes systems in real-world scenarios poses significant challenges when stakeholders, such as engineers and legislative bodies, lack insight into the neural networks' decision-making process. Presently, explainable AI is primarily tailored to natural language processing and computer vision, falling short in two critical aspects when applied to robots: grounding in decision-making tasks and the ability to assess the trustworthiness of explanations. In this paper, we introduce a trustworthy explainable robotics technique based on human-interpretable, high-level concepts to which the decisions made by the neural network can be attributed. Our proposed technique provides explanations with associated uncertainty scores by matching the neural network's activations with human-interpretable visualizations. To validate our approach, we conducted a series of experiments with various simulated and real-world robot decision-making models, demonstrating the effectiveness of the proposed approach as a post-hoc, human-friendly robot diagnostic tool.
|
|
WeDT19 |
210C |
Biologically-Inspired Robots 4 |
Regular Session |
Chair: Xie, Fengran | Shenzhen Polytechnic University |
|
16:40-16:45, Paper WeDT19.1 | |
Development of a Soft Robotic Fish with Stiffness Modulation and Wriggling Locomotion |
|
Shao, Hua | Wuhan University of Science and Technology |
Geng, Jiazi | Wuhan University of Science and Technology |
Li, Yiyang | Wuhan University of Science and Technology |
Lian, Guoyun | Shenzhen Polytechnic University |
Yang, Jinfeng | Shenzhen Polytechnic University |
Zuo, Qiyang | Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences |
Xu, Yaohui | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences |
Xie, Fengran | Shenzhen Polytechnic University |
Keywords: Biologically-Inspired Robots, Biomimetics, Marine Robotics
Abstract: Live fish possess the ability to modulate their body stiffness to achieve diverse swimming characteristics, a feature that is largely absent in existing robotic fish designs. Most robotic fish are constrained by rigid or fixed-stiffness bodies, which limit their flexibility, axial modulation capabilities, and ability to navigate confined spaces. This paper presents a novel soft robotic fish capable of stiffness modulation and earthworm-inspired wriggling locomotion. The design incorporates a pneumatic stiffness modulation mechanism, a cable-driven actuation system, and two passive flapping foils near the caudal fin. Two distinct swimming modes are demonstrated: a body and/or caudal fin (BCF) mode with active stiffness modulation and an earthworm-inspired wriggling mode. Experimental results validate the effectiveness of the proposed design in both swimming modes. This work advances the development of soft robotic fish by introducing innovative structural and actuation mechanisms, enhancing flexibility and adaptability for future applications in underwater exploration and related fields.
|
|
16:45-16:50, Paper WeDT19.2 | |
Jumping Mechanism Assists Takeoff for Large-Sized Flapping-Wing Robots |
|
Zhang, Zeyu | Southeast University |
Li, Xinde | Southeast University |
Zhang, Zhentong | Southeast University |
Chengxiang, Yu | Southeast University |
Zhang, Pengfei | Southeast University |
Keywords: Biologically-Inspired Robots, Biomimetics, Sensor Fusion
Abstract: Flapping-wing robots, which mimic the natural flight of birds or insects, exhibit numerous advantages in flight performance. However, autonomous takeoff remains a significant challenge for large bird-like flapping-wing robots. To address this challenge, we design a jumping mechanism based on a bow-like carbon fiber spring. This mechanism is capable of repeated self-compression and release and can be readily integrated into flapping-wing robots to assist their jumping takeoff. We then conduct a mechanical analysis of the motion states during the jumping takeoff process. Additionally, we propose a jump-flapping coupling control method based on sensor data to ensure a seamless transition and smooth coordination between the jumping and flapping actions, thus enabling a smooth takeoff. Experimental results validate the effectiveness of the proposed jumping mechanism and its collaborative control strategy. This study provides support for advancing flapping-wing robots toward autonomous multi-modal locomotion and further deepens research in this field.
|
|
16:50-16:55, Paper WeDT19.3 | |
Multi-Functional Granular Propulsion: Bio-Inspired Orientation Control and Local Fluidization for Crawl-To-Dig Transitions |
|
Li, Dongting | University of California San Diego |
Tolley, Michael T. | University of California, San Diego |
Gravish, Nick | UC San Diego |
Keywords: Biologically-Inspired Robots, Field Robots, Mechanism Design
Abstract: Existing robots designed for locomotion in granular media typically excel at a single purpose, either surface travel or subsurface digging, while lacking the ability to perform both within the same platform. In contrast, nature offers various examples of burrowing organisms that exhibit multi-functional digging behaviors by separating their body into two essential parts: a digger for substrate intrusion, and the rest of the body as an anchor for stabilization and for controlling digger orientation. Inspired by these biological strategies, we present an extension to an existing Screw Propelled Vehicle (SPV) that incorporates an adjustable body anchor to reduce drag and enable orientation control. This integration allows the robot to transition between horizontal crawling and vertical digging. We also investigate the effect of local fluidization (LF), a bio-inspired technique that temporarily reduces the resistive forces in granular media. Experimental results show that integrating LF improves propulsion performance in terms of speed and depth, with improvements of over 5x compared to the baseline configuration. These findings support the hypothesis that bio-inspired design principles, specifically body-anchor separation and local fluidization, significantly enhance both the functionality and efficiency of granular locomotion robots, providing a pathway toward more versatile, autonomous, and high-performance subsurface exploration.
|
|
16:55-17:00, Paper WeDT19.4 | |
Vision-Guided Loco-Manipulation with a Snake Robot |
|
Salagame, Adarsh | Northeastern University |
Gangaraju, Kruthika | Northeastern University |
Sihite, Eric | California Institute of Technology |
Ramezani, Milad | CSIRO |
Ramezani, Alireza | Northeastern University |
Keywords: Biologically-Inspired Robots, Biomimetics, Vision-Based Navigation
Abstract: This paper presents the development and integration of a vision-guided loco-manipulation pipeline for Northeastern University's snake robot, COBRA. The system leverages a YOLOv8-based object detection model and depth data from an onboard stereo camera to estimate the 6-DOF pose of target objects in real time. We introduce a framework for autonomous detection and control, enabling closed-loop loco-manipulation for transporting objects to specified goal locations. Additionally, we demonstrate open-loop experiments in which COBRA successfully performs real-time object detection and loco-manipulation tasks.
|
|
17:00-17:05, Paper WeDT19.5 | |
Servo-Driven Flapping Robot That Uses Its Tail for Self-Standing Takeoff |
|
Anuar, Kaspul | Tokyo Metropolitan University |
Amemiya, Asami | Tokyo Metropolitan University |
Sato, Hidaka | Tokyo Metropolitan University |
Afakh, Muhammad Labiyb | Tokyo Metropolitan University |
Yunanto, Bagus | Tokyo Metropolitan University, Politeknik Negeri Semarang |
Wada, Kazuyoshi | Tokyo Metropolitan University |
Inasawa, Ayumu | Tokyo Metropolitan University |
Takesue, Naoyuki | Tokyo Metropolitan University |
Keywords: Biologically-Inspired Robots, Biomimetics, Aerial Systems: Mechanics and Control
Abstract: In this study, a self-standing takeoff method was developed for a servo-driven flapping robot. This method does not use any additional mechanisms or external platforms to maintain the robot’s position before takeoff. In addition to its efficiency in terms of weight, this method is also relatively easy to implement. The takeoff method extends the function of the tail, using it not only for longitudinal direction control during flight but also as a base that supports the standing posture of the flapping robot. The objective of this investigation was to determine the optimal parameters to be implemented in the self-standing takeoff algorithm. To enhance the probability of successful takeoff, the various parameters that influence the self-takeoff process were investigated. The investigative process began with a static experiment to determine the thrust generated by the flapping robot. Subsequently, the center flapping angle, timing adjustment, and initial flapping direction were examined. A series of indoor flight experiments were conducted to evaluate the self-standing takeoff performance of the flapping robot. The experiments tested two robot weights, 43 g and 45 g (thrust-to-weight (T/W) ratios of 1.02 and 0.97, respectively). The results showed that the proposed takeoff method requires a T/W ratio of only 1.02 for takeoff, less than the 1.2 previously required.
|
|
17:05-17:10, Paper WeDT19.6 | |
A Robust Stereo Splatting SLAM System with Inertial-Legged Fusion |
|
Chen, Zuowei | Beijing Institute of Technology |
Zhang, Yulai | Beijing Institute of Technology |
Li, Chengyang | Beijing Institute of Technology |
Li, Shengming | Beijing Institute of Technology |
Fukuda, Toshio | Nagoya University |
Shi, Qing | Beijing Institute of Technology |
Keywords: Biologically-Inspired Robots, SLAM, Legged Robots
Abstract: Recent progress in stereo-based 3D Gaussian Splatting (3DGS) SLAM has enabled small-scale robots, which are too small to carry depth cameras, to achieve localization and reconstruct photorealistic scenes with high-speed rendering. However, initializing 3D Gaussians from binocular vision still requires further improvement, and the potential of robot proprioception has not been fully leveraged. This work presents a robust stereo 3DGS SLAM system with efficient inertial-legged fusion for small-scale quadruped robots (SaQu-SLAM). We develop a lightweight network to densely initialize the 3D Gaussians in space. Besides, an efficient Kalman-filter-based method for fusing inertial and leg-encoder data is introduced. To improve the cross-platform generalization of our algorithm, multiple configuration combinations of these three types of sensors are provided. Moreover, we propose a mode-switching mechanism to handle intermittent visual failures. Finally, we evaluate our system on a benchmark dataset, which includes large- and small-scale scenes, and on a small quadruped robot in real-world confined-scale scenes, reducing the absolute trajectory error by an average of 19%, 13%, and 25%, respectively, compared with other state-of-the-art methods in a similar context. It is also the only successful method in our self-customized confined scene with mixed textured and textureless regions, where all vision-based and visual-inertial methods fail. Our system achieves real-time performance even on an embedded platform (Jetson AGX Orin).
|
|
17:10-17:15, Paper WeDT19.7 | |
Optimal Trajectory Planning in a Vertically Undulating Snake Locomotion Using Contact-Implicit Optimization |
|
Salagame, Adarsh | Northeastern University |
Sihite, Eric | California Institute of Technology |
Ramezani, Alireza | Northeastern University |
Keywords: Biologically-Inspired Robots, Optimization and Optimal Control, Biomimetics
Abstract: Contact-rich problems, such as snake robot locomotion, offer unexplored yet rich opportunities for optimization-based trajectory and acyclic contact planning. So far, a substantial body of control research has focused on emulating snake locomotion and replicating its distinctive movement patterns using shape functions that either ignore the complexity of interactions or focus on complex interactions with matter (e.g., burrowing movements). However, models and control frameworks that lie between these two paradigms and are based on simple, fundamental rigid body dynamics, which alleviate the challenging contact and control allocation problems in snake locomotion, remain absent. This work makes meaningful contributions, substantiated by simulations and experiments, in the following directions: 1) introducing a reduced-order model based on Moreau's stepping-forward approach from differential inclusion mathematics, 2) verifying model accuracy, and 3) validating the approach experimentally.
|
|
17:15-17:20, Paper WeDT19.8 | |
Comparative Analysis of CSP, CSTP and Max-SNR Filters for P300 Detection in Brain Computer Interface |
|
Piri, Saeid | Imam Reza International University |
Wang, Jiachen | Shandong University |
Zhang, Huanghe | Shandong University |
Keywords: Brain-Machine Interfaces
Abstract: Event-related potentials (ERPs) are essential for the development of brain-computer interface (BCI) systems, particularly for their ability to facilitate communication by detecting specific brain activity patterns. To improve the detection accuracy of these signals, advanced filtering techniques are employed to enhance the signal-to-noise ratio (SNR), enabling more reliable classification of ERPs. This study evaluates the performance of three widely used filtering methods—Common Spatial Pattern (CSP), Common Spatio-Temporal Pattern (CSTP), and Max-SNR—in detecting the P300 component, a prominent ERP used in many BCI applications. Building upon the CSTP method, we propose a novel Max-SNR-based spatio-temporal filter designed to leverage both spatial and temporal features of the signal. The features extracted using these filters were classified with the Stepwise Linear Discriminant Analysis (SWLDA) classifier, a commonly adopted method in the BCI domain. Our results demonstrate that the proposed Max-SNR-based spatio-temporal filter outperformed other approaches, achieving an average classification accuracy of 96.0%. These findings highlight the potential of the proposed method to enhance P300 detection and improve the overall efficiency of BCI systems.
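Max-SNR spatial filtering is commonly posed as a generalized eigenvalue problem between a signal covariance (ERP-locked epochs) and a noise covariance; the sketch below shows that standard construction and should be read as a generic illustration, not the paper's exact spatio-temporal filter.

```python
import numpy as np
from scipy.linalg import eigh

def max_snr_filters(epochs_signal, epochs_noise, n_filters=4):
    """Spatial filters w maximizing (w' Rs w) / (w' Rn w).

    epochs_*: arrays of shape (n_trials, n_channels, n_samples).
    Solves the generalized eigenproblem Rs w = lambda Rn w.
    """
    def avg_cov(epochs):
        # Per-trial channel covariance, averaged over trials.
        return np.mean([np.cov(trial) for trial in epochs], axis=0)

    Rs, Rn = avg_cov(epochs_signal), avg_cov(epochs_noise)
    eigvals, eigvecs = eigh(Rs, Rn)          # ascending eigenvalues
    return eigvecs[:, ::-1][:, :n_filters]   # highest-SNR filters first
```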
|
|
WeDT20 |
210D |
Haptics and Haptic Interfaces |
Regular Session |
Co-Chair: Nisar, Sajid | Kyoto University of Advanced Science |
|
16:40-16:45, Paper WeDT20.1 | |
Intuitive Hand Positional Guidance Using McKibben-Based Surface Tactile Sensations to Shoulder and Elbow |
|
Yokoe, Kenta | Nagoya University |
Funabora, Yuki | Nagoya University |
Aoyama, Tadayoshi | Nagoya University |
Keywords: Haptics and Haptic Interfaces, Wearable Robotics, Physically Assistive Devices
Abstract: Hand positional guidance with intuitive perception is crucial for enhancing user interaction and task performance in immersive environments. However, conventional hand positional guidance methods, relying on tactile sensations, lack intuitiveness. Consequently, users require instruction on the relationship between the tactile sensation and target position of the guidance before using these methods. Additionally, the user needs training to become familiar with tactile sensations. This study presents a hand positional guidance system with intuitive perception that leverages McKibben-based surface tactile sensations directed to the shoulder and elbow. We developed a wearable fabric actuator that provides McKibben-based surface tactile sensations to induce six specific movements: elbow flexion, extension, shoulder abduction, adduction, horizontal abduction, and horizontal adduction. The effectiveness of the actuator was experimentally validated, demonstrating its high accuracy in intuitively inducing the six movements. An algorithm based on the equilibrium point hypothesis and Weber-Fechner law was implemented to regulate the intensity of the tactile sensations for hand positional guidance. Furthermore, the accuracy and speed of the proposed system were compared with those of conventional guidance methods utilizing synthesized speech and vibrotactile guidance.
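The abstract pairs the equilibrium point hypothesis with the Weber-Fechner law, which states that perceived intensity grows logarithmically with stimulus magnitude. A minimal, hypothetical reading (the paper's actual gains are not given) is to drive tactile intensity so that the perceived cue scales with the remaining distance to the target hand position:

```python
import math

def tactile_intensity(distance_to_target, d_ref=0.01, k=1.0, max_cmd=1.0):
    """Weber-Fechner-style command: perceived intensity ~ k * ln(S / S0).

    distance_to_target: current hand-to-goal distance in metres.
    d_ref: reference (just-noticeable) distance S0 -- an assumed value.
    """
    if distance_to_target <= d_ref:
        return 0.0  # close enough: no guidance cue
    cmd = k * math.log(distance_to_target / d_ref)
    return min(cmd, max_cmd)
```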
|
|
16:45-16:50, Paper WeDT20.2 | |
Touch-Linked Sleeve: A Haptic Interface for Augmented Tactile Perception in Robotic Teleoperation |
|
Leng, Yatao | ShanghaiTech University |
Chen, Yuzhou | ShanghaiTech University |
Tang, Ziyuan | ShanghaiTech University |
Xiao, Chenxi | ShanghaiTech University |
Keywords: Haptics and Haptic Interfaces, Force and Tactile Sensing, Telerobotics and Teleoperation
Abstract: Tactile perception is crucial for robots to interact effectively with their environments, particularly in cluttered settings or when visual sensing is unavailable. However, a major limitation is the insufficient coverage of tactile sensors on current robots, which makes navigating cluttered spaces challenging due to the lack of capability to detect collisions. This limitation also hinders the use of teleoperation systems in such spaces by reducing the human operator’s situational awareness. To address this issue, this paper proposes the Touch-Linked Sleeve (TLS), a haptic mapping system that redirects contact on robot arms to human skin. The system consists of a tactile skin for contact detection and a haptic sleeve that enables human operators to experience telepresented contact. By establishing a transparent mapping between the robot's tactile skin and the user's haptic sleeve, operators can intuitively sense contacts from the robot’s perspective. To evaluate the system’s effectiveness, we conducted experiments demonstrating the functionality of both the tactile skin and the haptic sleeve. Moreover, we performed human studies using a virtual reality robot teleoperation interface to simulate navigation and manipulation in a cluttered scenario. The results indicate that the proposed system enhances perceptual transparency during object grasping tasks, leading to improved task completion times, fewer collisions, and improved overall usability.
|
|
16:50-16:55, Paper WeDT20.3 | |
Quasi-God Object and Geodesically Restricted 6-DOF Haptic Forces for Compliant Constraints and Low Frequency Simulation |
|
Montesino, Ignacio | Universidad Carlos III De Madrid |
Bachiller Gomez, Aroa | Universidad Carlos III De Madrid |
Victores, Juan G. | Universidad Carlos III De Madrid |
Balaguer, Carlos | Universidad Carlos III De Madrid |
Jardon, Alberto | Universidad Carlos Iii De Madrid |
Keywords: Haptics and Haptic Interfaces, Computational Geometry, Rehabilitation Robotics
Abstract: Haptic interaction plays a crucial role in enhancing realism and immersion in virtual environments, particularly in applications such as robotic rehabilitation. In these environments, therapy based on serious games that combine VR and haptic feedback promises to offer engaging, affordable, and reliable treatments. Most 6-DoF haptic methods have focused on precision in the kinematic fidelity of the haptic feedback. However, rehabilitation games for the upper limb focus more on the forces exerted by the patient than on the kinematic precision of contacts with the virtual environment, since these exerted forces are what power the neuromuscular recovery of lost movement. We believe that a haptic control method focused on providing stable, compliant, and safe collisions is required to further research into VR rehabilitation gamification. In addition, existing solutions often rely on specifically tailored simulators running at very high frequencies. In contrast, the commercial game engines that bring the most capabilities for gamification run at comparatively very low frequencies (≈50 Hz). In this work, we present a novel approach that integrates god-object and penalty-based methods to achieve a balance between stability and computational efficiency, enabling 6-DoF haptic force generation on a collaborative robot. This is achieved through a relaxed quasi-god-object simulation in the game engine and geodesic constraints in the robot's haptic loop.
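In classical god-object/penalty hybrids, a proxy (the "god object") is kept on the constraint surface while the rendered force is a spring-damper penalty between the proxy and the device pose; the sketch below shows that textbook coupling, with gains chosen arbitrarily for illustration rather than taken from the paper.

```python
import numpy as np

def penalty_force(proxy_pos, device_pos, device_vel, k=400.0, b=5.0):
    """Spring-damper penalty force pulling the device toward the proxy.

    proxy_pos: god-object position constrained to the surface (m).
    device_pos / device_vel: haptic device state (m, m/s).
    k, b: illustrative stiffness (N/m) and damping (N*s/m) gains.
    """
    proxy = np.asarray(proxy_pos)
    pos, vel = np.asarray(device_pos), np.asarray(device_vel)
    return k * (proxy - pos) - b * vel
```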
|
|
16:55-17:00, Paper WeDT20.4 | |
Design of an Electromagnetically Modulated Resistance Mechanism to Realize Compact Passive Force-Feedback Wearable Devices |
|
Suarez Flores, Rene Manuel | Kyoto University of Advanced Science |
K C, Karan | Kyoto University of Advanced Science |
Nisar, Sajid | Kyoto University of Advanced Science |
Keywords: Haptics and Haptic Interfaces, Wearable Robotics, Tendon/Wire Mechanism
Abstract: Passive haptic feedback devices are designed to be lightweight, compact, and easier to control than their active counterparts. However, most existing passive force feedback wearable devices rely on mechanical locking or jamming mechanisms, which are bulky and often require large external power sources such as air compressors, fluid pressure systems, or high-voltage supplies for stiffness modulation. This study introduces a new electromagnetically modulated resistance mechanism to achieve compact and efficient passive force feedback in wearable devices. The proposed system employs two electromagnets to dynamically modulate tendon tension, generating resistance forces for the fingers. This approach enables passive force feedback without bulky mechanisms, complex actuation, or intricate control strategies. To validate the effectiveness of the proposed mechanism, we developed a kinesthetic wearable device for the index finger and thumb. Experimental evaluations demonstrated that the device achieved a peak tendon locking force of 5.8 N with a current of 0.1 A at 9 V. A user study with nine participants assessed stiffness discrimination using resistive force feedback in three tasks: index finger only, thumb only, and pinch action. The study yielded accuracy rates of 75%, 66%, and 69%, respectively. Participants found the device comfortable and easy to use, highlighting its potential for realizing compact, lightweight, and effective passive force feedback devices.
|
|
17:00-17:05, Paper WeDT20.5 | |
Wearing a Robotic Hand to Feel 3D Force Feedback: Analysis and Virtual Reality Application of the Hand-In-Hand System |
|
Kosanovic, Nicolas | University of Louisville |
Chagas Vaz, Jean | University of Louisville |
Keywords: Haptics and Haptic Interfaces, Virtual Reality and Interfaces, Multifingered Hands
Abstract: Virtual Reality (VR) robot embodiment is a popular method for teleoperation and generating data to train AI control systems. A potential flaw with this approach is an incongruence in human-robot haptics. Without tactile feedback, teleoperators depend on visual cues, leading to suboptimal performance (slow movement, faulty grasps, excessive contact force). This problem is exacerbated during robot hand teleoperation. Worse, many haptic hand wearables are cable-driven and only produce unidirectional resistive force, thus failing to cover the gamut of 3D finger interactions. In this paper, a critical gap in teleoperated human-robot hand haptics is addressed by transforming an inexpensive (~500 USD) 3D-printed robotic hand into a wearable exoskeleton—the "Hand-in-Hand" (HiH) system—which provides fingertip 3D force feedback. Methods for force control, null-space optimization, and human finger pose estimation are presented and experimentally validated. Each finger of the device can produce a maximum of 1 N of force feedback in any 3D direction. HiH finger tracking is compared to an industry-grade device (MANUS Metagloves, 5000 USD) and realizes an average inferred joint position Normalized Root Mean Square Error of 33.46%. Lastly, the HiH is demonstrated within a VR robot embodiment experiment with force feedback. Operator hand-manipulation performance improved when using force feedback, emphasizing the HiH's potential for teleoperated robot control and data collection.
|
|
17:05-17:10, Paper WeDT20.6 | |
AeroHaptix: A Wearable Vibrotactile Feedback System for Enhancing Collision Avoidance in UAV Teleoperation |
|
Huang, Bingjian | University of Toronto |
Wang, Zhecheng | University of Toronto |
Cheng, Qilong | University of Toronto |
Ren, Siyi | University of Toronto |
Cai, Hanfeng | University of Toronto |
Alvarez Valdivia, Antonio | Purdue University |
Mahadevan, Karthik | University of Toronto |
Wigdor, Daniel | University of Toronto |
Keywords: Haptics and Haptic Interfaces, Telerobotics and Teleoperation, Collision Avoidance
Abstract: Haptic feedback enhances collision avoidance by providing directional obstacle information to operators during unmanned aerial vehicle (UAV) teleoperation. However, such feedback is often rendered via haptic joysticks, which are unfamiliar to UAV operators and limited to single-direction force feedback. Additionally, the direct coupling between the input device and the feedback method diminishes operators' sense of control and induces oscillatory movements. To overcome these limitations, we propose AeroHaptix, a wearable haptic feedback system that uses spatial vibrations to simultaneously communicate multiple obstacle directions to operators, without interfering with their input control. The layout of vibrotactile actuators was optimized via a perceptual study to eliminate perceptual biases and achieve uniform spatial coverage. A novel rendering algorithm, MultiCBF, extended control barrier functions to support multi-directional feedback. Our system evaluation showed that compared to a no-feedback condition, AeroHaptix effectively reduced the number of collisions and input disagreement. Furthermore, operators reported that AeroHaptix was more helpful than force feedback, with improved situational awareness and comparable workload.
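As background for MultiCBF, recall that a control barrier function h certifies forward invariance of a safe set by bounding how fast h may decay; the standard condition is shown below. Extending this to several simultaneous obstacle directions, and rendering the result as spatial vibration, is the paper's contribution and is not reproduced here.

```latex
% Safe set and the standard control-barrier-function condition:
\mathcal{S} = \{\, x : h(x) \ge 0 \,\}, \qquad
\sup_{u \in U} \big[ L_f h(x) + L_g h(x)\, u \big] \;\ge\; -\alpha\big(h(x)\big),
```

where α is an extended class-K function and L_f, L_g are Lie derivatives along the drift and input fields; any control satisfying the inequality keeps the state inside S.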
|
|
17:10-17:15, Paper WeDT20.7 | |
A Lightweight Haptic Interface for Hand-To-Object Tasks with Spatiotemporal Displays (I) |
|
Fang, Yun | Shanghai Jiao Tong University |
Guo, Weichao | Shanghai Jiao Tong University |
Chai, Guohong | Ningbo Institute of Materials Technology and Engineering |
Sheng, Xinjun | Shanghai Jiao Tong University |
Keywords: Haptics and Haptic Interfaces, Wearable Robotics
Abstract: Haptic interfaces create information transmission channels from machine to human in a natural way. Robust haptic interfaces with compact forms that provide high information throughput are of particular interest in industrial applications such as teleoperation, soft displays, and virtual and augmented reality. Here, we propose a haptic interface with spatiotemporal displays to convey high-dimensional information during successive hand-to-object tasks, which is both architecturally and psychophysically lightweight. The wearable and flexible architecture incorporates seven independently addressable vibration units at a pitch of 16 mm to allow spatiotemporal modulation. To demonstrate the advantages of the haptic interface, four-day perceptual experiments were performed on 12 subjects with pattern discrimination tasks. We compared recognition accuracy (RA) between space-dependent and spatiotemporal strategies across the forearm, which are encoded based on finger states and morphological deformations. Furthermore, the mixed biomimetic strategy derived from both is intuitive and showed strength in terms of RA (over 88%) and information transfer (IT) (over 2.4 bit/s). With prior knowledge, subjects are significantly better (RA over 90%) at discriminating sequential patterns. Moreover, the results demonstrate robust performance of the biomimetic strategy under location shift, which proves that subjects perceive relative information among vibratory units rather than absolute locations on the skin.
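Information transfer in bit/s for a discrete display is conventionally computed with the Wolpaw formula from the number of classes N, accuracy P, and trial rate; the helper below implements that standard convention as a point of reference, not a computation taken from this paper.

```python
import math

def wolpaw_itr_bits_per_sec(n_classes, accuracy, trials_per_sec):
    """Wolpaw ITR: rate * [log2 N + P log2 P + (1-P) log2((1-P)/(N-1))]."""
    N, P = n_classes, accuracy
    if P >= 1.0:                       # perfect accuracy: log2 N bits per trial
        return math.log2(N) * trials_per_sec
    bits = (math.log2(N) + P * math.log2(P)
            + (1.0 - P) * math.log2((1.0 - P) / (N - 1)))
    return bits * trials_per_sec

# Example (hypothetical numbers): a binary decision at 88% accuracy,
# 2.5 decisions per second.
print(wolpaw_itr_bits_per_sec(2, 0.88, 2.5))
```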
|
|
17:15-17:20, Paper WeDT20.8 | |
Model-Free Energy-Based Friction Compensation for Industrial Collaborative Robots As Haptic Displays (I) |
|
Dinc, Huseyin Tugcan | Korea Advanced Institute of Science and Technology (KAIST) |
Lee, Joong-Ku | Korea Advanced Institute of Science and Technology (KAIST) |
Ryu, Jee-Hwan | Korea Advanced Institute of Science and Technology |
Keywords: Haptics and Haptic Interfaces, Human-Robot Collaboration, Physical Human-Robot Interaction
Abstract: Collaborative robots are a promising alternative to traditional haptic displays due to their expansive workspaces, ability to generate substantial forces, and cost-effectiveness. However, they have higher levels of friction compared to conventional haptic displays, which negatively impact the precise movement and accuracy of haptic feedback, potentially causing operator fatigue. This paper proposes a novel model-free, energy-based approach for estimating the friction coefficient and compensating for friction. Our approach calculates the time-varying impact of frictional forces during an energy cycle, defined as the period between two consecutive zero crossings of the system's kinetic energy, based on the energy dissipated by friction. This approach directly estimates the friction coefficients, based on the chosen friction model, without requiring any prior system model information or tuning parameters. The effectiveness of our approach is demonstrated through single and multi-degree-of-freedom human interaction experiments using a Franka Emika Panda robot. The results indicate that the proposed approach outperforms state-of-the-art friction compensation methods.
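The abstract defines an energy cycle as the interval between consecutive zero crossings of kinetic energy and attributes the energy dissipated over that cycle to friction. Under a plain Coulomb model, equating dissipated energy to μ·N·(path length) gives one direct estimator; the sketch below implements that hedged reading, not the paper's exact derivation, for a scalar 1-DoF case.

```python
import numpy as np

def estimate_coulomb_mu(joint_vel, applied_torque, inertia, normal_force, dt):
    """Estimate a Coulomb friction coefficient over one energy cycle.

    Energy balance over the cycle (kinetic energy ~0 at both ends):
        input work - friction dissipation = delta KE ~ 0,
    so  mu ~ (dissipated energy) / (normal_force * path length).
    Gravity and other conservative terms are ignored for brevity.
    """
    v = np.asarray(joint_vel)
    tau = np.asarray(applied_torque)
    work_in = np.sum(tau * v) * dt                    # actuator work
    delta_ke = 0.5 * inertia * (v[-1]**2 - v[0]**2)   # ~0 between zero crossings
    dissipated = work_in - delta_ke
    path = np.sum(np.abs(v)) * dt                     # total distance travelled
    return dissipated / (normal_force * path)
```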
|
|
WeDT21 |
101 |
SLAM and Control |
Regular Session |
Co-Chair: Navarro-Alarcon, David | The Hong Kong Polytechnic University |
|
16:40-16:45, Paper WeDT21.1 | |
PGD-VIO: A Plane-Aided RGB-D Inertial Odometry with Graph-Based Drift Suppression |
|
Zhang, Yidi | University of Chinese Academy of Sciences |
Tang, Fulin | Institute of Automation, Chinese Academy of Sciences, University |
Xu, Zewen | Institute of Automation, Chinese Academy of Science |
Wu, Yihong | National Laboratory of Pattern Recognition, InstituteofAutomatio |
Ma, Pengju | Sinopec Shengli Oilfield |
Keywords: Visual-Inertial SLAM, SLAM, Localization
Abstract: Generally, high-level features provide more geometrical information compared to point features, which can be exploited to further constrain motions. Planes are commonplace in man-made environments, offering an active means to reduce drift, due to their extensive spatial and temporal observability. To make full use of planar information, we propose a novel visual-inertial odometry (VIO) using an RGB-D camera and an inertial measurement unit (IMU), effectively integrating point and plane features in an extended Kalman filter (EKF) framework. Depth information of point features is leveraged to improve the accuracy of point triangulation, while plane features serve as direct observations added into the state vector. Notably, to benefit long-term navigation, a novel graph-based drift detection strategy is proposed to search overlapping and identical structures in the plane map so that the cumulative drift is suppressed subsequently. The experimental results on two public datasets demonstrate that our system outperforms state-of-the-art methods in localization accuracy and meanwhile generates a compact and consistent plane map, free of expensive global bundle adjustment and loop closing techniques.
|
|
16:45-16:50, Paper WeDT21.2 | |
T-ESKF: Transformed Error-State Kalman Filter for Consistent Visual-Inertial Navigation |
|
Tian, Chungeng | Harbin Institute of Technology |
Hao, Ning | Harbin Institute of Technology |
He, Fenghua | Harbin Institute of Technology |
Keywords: Visual-Inertial SLAM, SLAM, Localization
Abstract: This paper presents a novel approach to address the inconsistency problem caused by observability mismatch in visual-inertial navigation systems (VINS). The key idea involves applying a linear time-varying transformation to the error-state within the Error-State Kalman Filter (ESKF). This transformation ensures that the unobservable subspace of the transformed error-state system becomes independent of the state, thereby preserving the correct observability of the transformed system against variations in linearization points. We introduce the Transformed ESKF (T-ESKF), a consistent VINS estimator that performs state estimation using the transformed error-state system. Furthermore, we develop an efficient propagation technique to accelerate the covariance propagation based on the transformation relationship between the transition and accumulated matrices of T-ESKF and ESKF. We validate the proposed method through extensive simulations and experiments, demonstrating better (or at least competitive) performance compared to state-of-the-art methods. The code is available at github.com/HITCSC/T-ESKF.
|
|
16:50-16:55, Paper WeDT21.3 | |
Voxel-SVIO: Stereo Visual-Inertial Odometry Based on Voxel Map |
|
Yuan, Zikang | Huazhong University, Wuhan, 430073, China |
Lang, Fengtian | Huazhong University of Science and Technology |
Deng, Jie | Huazhong University of Science and Technology |
Luo, Hongcheng | Xiaomi Car |
Yang, Xin | Huazhong University of Science and Technology |
Keywords: Visual-Inertial SLAM, Localization, Sensor Fusion
Abstract: In VIO systems, allocating limited computational resources to constraints for new frames is more conducive to enhancing the accuracy of state estimation, as old frames have already undergone multiple updates. To enable VIO to efficiently index the observed map points of new frames in 3D space, this paper proposes to introduce voxel map management to the VIO field and self-develop a stereo VIO system, named Voxel-SVIO. Based on the triangulation results of feature correspondences on current stereo image, we can directly index to the recently visited voxels in 3D space. The map points in these voxels can provide enough constraints for new frames, thus are suitable to be fed into the estimator. Experimental results on three public datasets demonstrate that our Voxel-SVIO outperforms most existing state-of-the-art approaches on accuracy, and the map points selected by recently visited voxels are crucial for ensuring the performance of the proposed system. We will release the source code of this work immediately upon the acceptance of the article.
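Voxel map management typically amounts to hashing points into integer grid coordinates so that the voxels touched by newly triangulated features can be indexed in O(1); the sketch below shows that generic pattern, with the voxel size and data layout chosen for illustration only (not Voxel-SVIO's actual structures).

```python
from collections import defaultdict
import numpy as np

class VoxelMap:
    """Hash map from integer voxel coordinates to the points inside them."""

    def __init__(self, voxel_size=0.2):
        self.voxel_size = voxel_size
        self.voxels = defaultdict(list)

    def key(self, point):
        # Integer grid coordinates of the voxel containing `point`.
        return tuple(np.floor(np.asarray(point) / self.voxel_size).astype(int))

    def insert(self, point):
        self.voxels[self.key(point)].append(point)

    def points_near(self, point):
        """Map points in the voxel containing `point` (O(1) lookup)."""
        return self.voxels.get(self.key(point), [])
```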
|
|
16:55-17:00, Paper WeDT21.4 | |
Traversing Mars: Cooperative Informative Path Planning to Efficiently Navigate Unknown Scenes |
|
Rockenbauer, Friedrich Martin | Daedalean AG |
Lim, Jaeyoung | ETH Zurich |
Müller, Marcus Gerhard | German Aerospace Center |
Siegwart, Roland | ETH Zurich |
Schmid, Lukas M. | Massachusetts Institute of Technology (MIT) |
Keywords: View Planning for SLAM, Path Planning for Multiple Mobile Robots or Agents, Space Robotics and Automation
Abstract: The ability to traverse an unknown environment is crucial for autonomous robot operations. However, due to the limited sensing capabilities and system constraints, approaching this problem with a single robot agent can be slow, costly, and unsafe. For example, in planetary exploration missions, the wear on the wheels of a rover from abrasive terrain should be minimized at all costs as repairs are infeasible. On the other hand, utilizing a scouting robot such as a micro aerial vehicle (MAV) has the potential to reduce wear and time costs and increase safety of a follower robot. This work proposes a novel cooperative informative path planning (IPP) framework that allows a scout (e.g., an MAV) to efficiently discover the minimum-cost path for a follower (e.g., a rover) to reach the goal. We derive theoretical guarantees for our algorithm, and prove that the algorithm always terminates, always finds the optimal path if it exists, and terminates early when the found path is shown to be optimal or infeasible. We show in thorough experimental evaluation that the guarantees hold in practice, and that our algorithm is 22.5% quicker to find the optimal path and 15% quicker to terminate compared to existing methods.
|
|
17:00-17:05, Paper WeDT21.5 | |
Equilibrium Compensation Based Control: A Universal Control Scheme for Systems with Mismatched Disturbances and Sensor Errors (I) |
|
Wang, Boyi | Tsinghua University |
Deng, Yang | Tsinghua University |
Chen, Zhang | Tsinghua University |
Liang, Bin | Tsinghua University |
Keywords: Wheeled Robots, Motion Control, Robust/Adaptive Control
Abstract: There are significant couplings between control techniques and disturbance rejection techniques in traditional control schemes for systems with mismatched disturbances, such that both techniques have to be designed simultaneously. In this paper, independent of specific control methods, a universal disturbance rejection scheme named equilibrium compensation based control, is proposed for general systems to attenuate mismatched disturbances and sensor errors while retaining the nominal controller's performance. In addition, several criteria are provided to analyze the observability and compensability of the disturbances and sensor errors. Simulations and experiments of tracking tasks for a single-track-two-wheeled robot with four different controllers demonstrate the effectiveness of the proposed scheme.
|
|
17:05-17:10, Paper WeDT21.6 | |
InPTC: Integrated Planning and Tube-Following Control for Prescribed-Time Collision-Free Navigation of Wheeled Mobile Robots (I) |
|
Shao, Xiaodong | Beihang University |
Zhang, Bin | The Hong Kong Polytechnic University |
Zhi, Hui | The Hong Kong Polytechnic University |
Romero Velazquez, Jose Guadalupe | ITAM |
Bowen, Fan | The Hong Kong Polytechnic University |
Hu, Qinglei | Beihang University |
Navarro-Alarcon, David | The Hong Kong Polytechnic University |
Keywords: Wheeled Robots, Robust/Adaptive Control, Motion and Path Planning
Abstract: In this article, we propose a novel approach, called InPTC (Integrated Planning and Tube-Following Control), for prescribed-time collision-free navigation of wheeled mobile robots in a compact convex workspace cluttered with static, sufficiently separated, and convex obstacles. A path planner with prescribed-time convergence is presented based upon Bouligand's tangent cones and time scale transformation (TST) techniques, yielding a continuous vector field that can guide the robot from almost all initial positions in the free space to the designated goal at a prescribed time, while avoiding entering the obstacle regions augmented with safety margin. By leveraging barrier functions and TST, we further derive a tube-following controller to achieve robot trajectory tracking within a prescribed time less than the planner's settling time. This controller ensures the robot moves inside a predefined "safe tube" around the reference trajectory, where the tube radius is set to be less than the safety margin. Consequently, the robot will reach the goal location within a prescribed time while avoiding collision with any obstacles along the way. The proposed InPTC is implemented on a Mona robot operating in an arena cluttered with obstacles of various shapes. Experimental results demonstrate that InPTC not only generates smooth collision-free reference trajectories that converge to the goal location at the preassigned time of 250 s (i.e., the required task completion time), but also achieves tube-following trajectory tracking with tracking accuracy higher than 0.01 m after the preassigned time of 150 s. This enables the robot to accomplish the navigation task within the required time of 250 s.
|
|
17:10-17:15, Paper WeDT21.7 | |
MPDG-SLAM: Motion Probability-Based 3DGS-SLAM in Dynamic Environment |
|
Huang, Conghao | Hefei University of Technology |
Zhang, Li | Hefei University of Technology |
Deng, Tianchen | Shanghai Jiao Tong University |
Wang, Kangxu | Tsinghua University |
Li, Mingrui | Dalian University of Technology |
Keywords: Audio-Visual SLAM, Data Sets for SLAM, Visual Tracking
Abstract: We present MPDG-SLAM, a novel 3D Gaussian point cloud rendering SLAM method based on Motion Probability (MP) for dynamic interference handling. Current 3DGS-SLAM approaches for dynamic environments often rely on optical flow estimation masks. However, these deep learning-based optical flow models are computationally intensive and limited by processing speed, posing challenges for deployment on mobile devices in real-world scenarios. Moreover, existing systems depend on precise mask segmentation and corresponding loss functions for artifact removal, yet the pixel accuracy of optical flow estimation is constrained by real-world lighting conditions. To address these issues, we introduce a mobile-deployable YOLO detector and a mathematically derived Motion Probability attribute to label Gaussian points, which are then inversely mapped to the front-end feature tracking system to correct for dynamic object influences. By incorporating an MP-based penalty term, we directly prune dynamic Gaussians to eliminate the impact of moving objects. Additionally, we design an edge warp loss based on MP estimation, enabling accurate artifact removal even with coarse segmentation masks. The experiments show that our approach notably improves the reconstruction quality of dynamic scenes, surpassing baseline methods and reaching speeds over 30 FPS on high-end GPUs, which suggests its potential for real-time use on mobile platforms after further optimization.
|
|
WeDT22 |
102A |
Mechanism Design 3 |
Regular Session |
|
16:40-16:45, Paper WeDT22.1 | |
A Dual Tiltrotor UAV with Foldable Wings for Passive Perching and Belly/Back Takeoff |
|
Wang, Luyao | Beihang University |
Zhang, Jiangyi | Beihang University |
Cheng, Liangliang | Beihang University |
Yang, Jingrui | Beihang University |
Ma, Tianchi | Beihang University |
He, Xiang | Beihang University |
Keywords: Aerial Systems: Mechanics and Control, Mechanism Design
Abstract: This paper presents a novel dual tiltrotor UAV design featuring foldable wings and a strategically positioned center of gravity (CG) to enable passive perching and multi-modal flight. Traditional UAVs rely on additional mechanical components for operations such as takeoff and perching, which increase weight and complexity. Inspired by the mechanics of a balanced bird toy, our design achieves stability in horizontal flight and secure power-off perching on branches or cables. The proposed blade-tip plane-based controller facilitates belly/back takeoff without landing gear, enabling seamless transitions between hovering and horizontal flight within 2 seconds. Wind resistance tests were conducted indoors to assess disturbance rejection capabilities during perching, while the transition performance was evaluated outdoors.
|
|
16:45-16:50, Paper WeDT22.2 | |
A Fast-Moving Underwater Wall-Climbing Robotic Fish Inspired by Rock-Climbing Fish |
|
Qin, Hengshen | Shenyang Institute of Automation, Chinese Academy of Sciences |
Zhang, Chuang | Shenyang Institute of Automation Chinese Academy of Sciences |
Tan, Wenjun | Chengdu Institute of Biology, Chinese Academy of Sciences |
Wang, Ruiqian | Shenyang Institute of Automation, Chinese Academy of Sciences |
Zhang, Yiwei | Chinese Academy of Sciences |
Yang, Lianchao | Shenyang Institute of Automation, Chinese Academy of Sciences |
Zhang, Qi | Shenyang Institute of Automation, Chinese Academy of Sciences |
Liu, Lianqing | Shenyang Institute of Automation |
Keywords: Climbing Robots, Biologically-Inspired Robots, Mechanism Design
Abstract: The rock-climbing fish is a benthic organism that can move rapidly and flexibly on rock surfaces in complex underwater environments. Studies have shown that this unique adhesion-sliding movement mechanism of the rock-climbing fish relies on the anisotropic friction exhibited by its sucker structure, which helps to reduce friction in the forward direction and defend against the impact of the flow field. In this work, inspired by the anisotropic friction phenomenon of the sucker of the rock-climbing fish, we designed the absorption module and the pectoral and pelvic fin flapping module of the robotic fish to realize contact switching from low friction to high friction. Meanwhile, the propulsion module adopts a novel wire-driven caudal fin design that can oscillate at high frequency (exceeding ~5 Hz). The resulting robotic fish can realize different motion modes such as adhesion-sliding movement (~0.5 BL/s) and wall-stabilized adsorption. This work provides a new solution for the design of underwater wall-climbing robots.
|
|
16:50-16:55, Paper WeDT22.3 | |
Design of a Swimming Microrobot Powered by a Single Piezoelectric Bender |
|
Urban, Cameron | Cornell University |
King, Tyler | Cornell University |
Gottlieb, Rafael | Cornell University |
Gao, Hang | Cornell University |
Helbling, E. Farrell | Cornell University |
Keywords: Biologically-Inspired Robots, Mechanism Design, Micro/Nano Robots
Abstract: Countless underwater robots seek to monitor aquatic environments while minimizing their impact on fragile ecosystems. At mm-scales, these systems can be used in a range of waterways, from shallow streams and rivers, to larger ponds and lakes, and navigate around large obstacles or through tight spaces in coral reefs, mangroves, or pipe systems. They can also be more readily used as platforms for biological study, as small-scale robots can more easily be integrated into bench-top characterization systems to verify hydrodynamic performance. Here, we present a new robotic platform, the Daniobot, a 16.5 mm body length (BL) microrobotic fish that is capable of achieving top speeds of 2.84 BL/s. At 23.8 mm total length (TL), Daniobot is, to the best of our knowledge, the smallest fish-inspired robot propelled by onboard actuators. We present the design, fabrication, and assembly of this robot as well as detailed position and velocity results at varying tail amplitudes and frequencies, and compare their trends to a simple analytical model. This design uses a single PZT bimorph actuator operating at 175 V, enabling future untethered experiments.
|
|
16:55-17:00, Paper WeDT22.4 | |
A Novel Inflated Tube Robot with External Multi-Modal Shaping Joint Mechanism (I) |
|
Chen, Jian | Harbin Institute of Technology |
Gao, Haibo | Harbin Institute of Technology |
Cheng, Tianyi | Harbin Institute of Technology |
Gong, Wei | Harbin Institute of Technology |
Tian, Baolin | Harbin Institute of Technology |
Deng, Zongquan | Harbin Institute of Technology |
Yu, Haitao | Harbin Institute of Technology |
Keywords: Actuation and Joint Mechanisms, Compliant Joints and Mechanisms, Soft Robot Applications
Abstract: Continuum robots have recently gained remarkable potential in exploring unstructured environments due to the merits of adaptability and flexibility. However, achieving flexible and steerable deformation in wide-range three-dimensional (3D) spaces is still challenging for continuum robots. This paper presents the design and control of a novel inflated tube robot with an external multi-modal shaping joint (MMSJ) mechanism. The MMSJ works in conjunction with an electromagnetic switching element to exhibit multiple modes including crawling, spinning, and bending alongside the tube exterior by using only two motors, guiding the inflated tube to generate 3D deformation. By developing tailored apparatus to acquire the friction resistance in crawling mode and the torque threshold in bending mode, the proposed inflated tube robot is elaborately devised based on parametrical analysis. Experiments on a real robot prototype demonstrate the capability of the devised robot in achieving steerable 3D deformation, with a maximal bending range of 130 degrees and a fastest MMSJ crawling rate of 10 mm/s, which endows the potential for traversing and operating in diverse 3D environments.
|
|
17:00-17:05, Paper WeDT22.5 | |
An Ultra-Durable Piezoelectric Inertia Actuator Via Wear-Adaptive Mechanism (I) |
|
Qiao, Guangda | Zhejiang Univ, Sch Mech Engn, State Key Lab Fluid Power & Mechat |
Zhang, Yangqianhui | ZHEJIANG UNIVERSITY |
Cao, Qing | Zhejiang Univ, Sch Mech Engn, State Key Lab Fluid Power & Mechat |
Wang, Chaoying | Zhejiang Univ, Sch Mech Engn, State Key Lab Fluid Power & Mechat |
Chen, Zhe | Zhejiang Univ, Sch Mech Engn, State Key Lab Fluid Power & Mechat |
Gong, Guofang | Zhejiang University |
Yang, Huayong | ZheJiang University |
Han, Dong | Zhejiang University |
Keywords: Actuation and Joint Mechanisms, Automation at Micro-Nano Scales, Biological Cell Manipulation
Abstract: Piezoelectric inertial actuators (PIAs) enable cross-scale motion ranging from nanometers to millimeters with simple and compact structures. However, friction and wear issues limit their service life and reliability. This work proposes a cascaded magnet piezoelectric inertial actuator (CM-PIA) for durable service via a wear-adaptive mechanism. Additionally, the displacement vibration characteristic in PIAs, typically considered a defect, has been positively utilized for the first time. Through friction experiments, we demonstrated the CM-PIA's advantages in wear adaptability, modular interchangeability, and assembly adaptiveness. The actuator operated over 190 km with 2.5 μm wear and showed a 16% frictional force reduction with 4 mm wear, theoretically enabling operation for 300,000 km and enhancing the travel life by three orders of magnitude. The replacement cost for the modular magnetic actuated foot is only 4 cents, and it exhibits stable friction over an extended stroke of 400 mm. Employing three-degrees-of-freedom (DOF) integration, we have applied the CM-PIA, which possesses a stroke of up to 18 mm and a tested resolution of 4 nm, in minimally invasive cellular manipulation utilizing displacement vibration by a singular core for the first time.
|
|
WeDT23 |
102B |
Path Planning for Multiple Mobile Robots or Agents 4 |
Regular Session |
Chair: Ren, Zhongqiang | Shanghai Jiao Tong University |
|
16:40-16:45, Paper WeDT23.1 | |
MAC-Planner: A Novel Task Allocation and Path Planning Framework for Multi-Robot Online Coverage Processes |
|
Wang, Zikai | Hongkong University of Science and Technology |
Lyu, Xiaoxu | Hong Kong University of Science and Technology |
Zhang, Jiekai | Hong Kong University of Science and Technology |
Wang, Pengyu | Hong Kong University of Science and Technology |
Zhong, Yuxing | The Hong Kong University of Science and Technology |
Shi, Ling | The Hong Kong University of Science and Technology |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Planning, Scheduling and Coordination
Abstract: This paper presents a unified framework called MAC-Planner that combines Multi-Robot Task Allocation with Coverage Path Planning to better solve the online multi-robot coverage path planning (MCPP) problem. By dynamically assigning tasks and planning coverage paths based on the system's real-time completion status, the planner enables robots to operate efficiently within their designated areas. This framework not only achieves outstanding coverage efficiency but also reduces conflict risk among robots. We propose a novel task allocation mechanism. This mechanism reformulates the area coverage problem into a point coverage problem by constructing a coarse map of the target coverage terrain and utilizing K-means clustering along with pairwise optimization methods to achieve efficient and equitable task allocation. We also introduce an effective coverage path planning mechanism to generate efficient coverage paths and foster robot cooperation. Extensive comparative experiments against state-of-the-art (SOTA) methods highlight MAC-Planner's remarkable coverage efficiency and effectiveness in reducing conflict risks.
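Reformulating area coverage as point coverage and splitting it with K-means is straightforward to prototype: cluster the free cells of the coarse map with one cluster per robot, then let each robot cover its cluster. The sketch below shows only that baseline step; the paper's pairwise-optimization rebalancing is not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans

def allocate_coverage_points(free_cells, n_robots, seed=0):
    """Partition free-space grid cells among robots via K-means.

    free_cells: (N, 2) NumPy array of (x, y) coordinates of cells to cover.
    Returns one array of assigned points per robot.
    """
    km = KMeans(n_clusters=n_robots, n_init=10, random_state=seed).fit(free_cells)
    return [free_cells[km.labels_ == r] for r in range(n_robots)]
```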
|
|
16:45-16:50, Paper WeDT23.2 | |
Human-Robot Collaborative Minimum Time Search through Sub-Priors in Ant Colony Optimization |
|
Gil, Oscar | Universitat Politècnica De Catalunya |
Sanfeliu, Alberto | Universitat Politècnica De Cataluyna |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Human-Robot Collaboration, Deep Learning Methods
Abstract: Human-Robot Collaboration (HRC) has evolved into a highly promising issue owing to the latest breakthroughs in Artificial Intelligence (AI) and Human-Robot Interaction (HRI), among other reasons. This emerging growth increases the need to design multi-agent algorithms that can also manage human preferences. This paper presents an extension of the Ant Colony Optimization (ACO) meta-heuristic to solve the Minimum Time Search (MTS) task, in the case where humans and robots perform an object searching task together. The proposed model consists of two main blocks. The first one is a convolutional neural network (CNN) that provides the prior probabilities about where an object may be from a segmented image. The second one is the Sub-prior MTS-ACO algorithm (SP-MTS-ACO), which takes as inputs the prior probabilities and the particular search preferences of the agents in different sub-priors to generate search plans for all agents. The model has been tested in real experiments for the joint search of an object through a Vizanti web-based visualization on a tablet computer. The designed interface allows communication between a human and our humanoid robot named IVO. The obtained results show an improvement in the search perception of the users without loss of efficiency.
|
|
16:50-16:55, Paper WeDT23.3 | |
Accelerating Focal Search in Multi-Agent Path Finding with Tighter Lower Bounds |
|
Tang, Yimin | University of Southern California |
Yu, Zhenghong | University of Wisconsin–Madison |
Li, Jiaoyang | Carnegie Mellon University |
Koenig, Sven | University of Southern California |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning, Collision Avoidance
Abstract: Multi-Agent Path Finding (MAPF) involves finding collision-free paths for multiple agents while minimizing a cost function—an NP-hard problem. Bounded suboptimal methods like Enhanced Conflict-Based Search (ECBS) and Explicit Estimation CBS (EECBS) balance solution quality with computational efficiency using focal search mechanisms. While effective, traditional focal search faces a limitation: the lower bound (LB) value determining which nodes enter the FOCAL list often increases slowly in early search stages, resulting in a constrained search space that delays finding valid solutions. In this paper, we propose a novel bounded suboptimal algorithm, double-ECBS (DECBS), to address this issue by first determining the maximum LB value and then employing a best-first search guided by this LB to find a collision-free path. Experimental results demonstrate that DECBS outperforms ECBS in most test cases and is compatible with existing optimization techniques. DECBS can reduce nearly 30% high-level Constraint Tree (CT) nodes and 50% low-level focal search nodes. When agent density is moderate to high, DECBS achieves a 23.5% average runtime improvement over ECBS with identical suboptimality bounds and optimizations.
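In focal search, OPEN is ordered by f, LB is the minimum f over OPEN, and FOCAL holds every node with f ≤ w·LB, ordered by a secondary heuristic such as the number of conflicts. The snippet below sketches that standard selection rule to make the role of the LB value concrete; it is the generic mechanism DECBS builds on, not the DECBS algorithm itself.

```python
import heapq

def pop_focal(open_heap, w, conflicts_of):
    """Pop the node with the fewest conflicts among those with f <= w * LB.

    open_heap: min-heap of (f, node) tuples; conflicts_of: secondary heuristic.
    The linear scan over OPEN is for clarity, not efficiency.
    """
    lb = open_heap[0][0]                              # LB = min f over OPEN
    focal = [(f, n) for f, n in open_heap if f <= w * lb]
    f_best, best = min(focal, key=lambda fn: conflicts_of(fn[1]))
    open_heap.remove((f_best, best))
    heapq.heapify(open_heap)                          # restore heap order
    return best
```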
|
|
16:55-17:00, Paper WeDT23.4 | |
D4orm: Multi-Robot Trajectories with Dynamics-Aware Diffusion Denoised Deformations |
|
Zhang, Yuhao | University of Cambridge |
Okumura, Keisuke | University of Cambridge |
Woo, Heedo | University of Cambridge |
Shankar, Ajay | University of Cambridge, UK |
Prorok, Amanda | University of Cambridge |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Motion and Path Planning
Abstract: This work presents an optimization method for generating kinodynamically feasible and collision-free multi-robot trajectories that exploits an incremental denoising scheme in diffusion models. Our key insight is that high-quality trajectories can be discovered merely by denoising noisy trajectories sampled from a distribution. This approach has no learning component, relying instead on only two ingredients: a dynamical model of the robots to obtain feasible trajectories via rollout, and a fitness function to guide denoising with Monte Carlo gradient approximation. The proposed framework iteratively optimizes a deformation for the previous trajectory with the current denoising process, allows anytime refinement as time permits, supports different dynamics, and benefits from GPU acceleration. Our evaluations for differential-drive and holonomic teams with up to 16 robots in 2D and 3D worlds show its ability to discover high-quality solutions faster than other black-box optimization methods such as MPPI. In a 2D holonomic case with 16 robots, it is almost twice as fast. As evidence for feasibility, we demonstrate zero-shot deployment of the planned trajectories on eight multirotors.
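The two ingredients named in the abstract — rollout through a dynamics model for feasibility and a fitness function guiding denoising via Monte Carlo gradient approximation — suggest an MPPI-like loop: perturb the current control sequence with noise samples, score each rollout, and take a fitness-weighted average as the denoised update. The sketch below captures that structure under our own simplifying assumptions; D4orm's actual diffusion schedule and deformation parameterization are not reproduced.

```python
import numpy as np

def denoise_step(controls, rollout_fn, fitness_fn, n_samples=64, sigma=0.1, temp=1.0):
    """One fitness-weighted denoising update of a control sequence.

    controls: (T, u_dim) current control sequence.
    rollout_fn: controls -> trajectory via the dynamics model
                (feasible by construction).
    fitness_fn: trajectory -> scalar score (higher is better).
    """
    noise = np.random.randn(n_samples, *controls.shape) * sigma
    scores = np.array([fitness_fn(rollout_fn(controls + eps)) for eps in noise])
    weights = np.exp((scores - scores.max()) / temp)   # softmax-style weighting
    weights /= weights.sum()
    # Monte Carlo gradient approximation: fitness-weighted noise average.
    return controls + np.tensordot(weights, noise, axes=1)
```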
|
|
17:00-17:05, Paper WeDT23.5 | |
Multi-Agent Combinatorial Path Finding for Tractor-Trailers in Occupancy Grids |
|
Wu, Xuemian | Shanghai Jiao Tong University |
Ren, Zhongqiang | Shanghai Jiao Tong University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Task Planning, Multi-Robot Systems
Abstract: This paper investigates a problem called Multi-Agent Combinatorial Path Finding for Tractor-Trailers (MCPF-TT), which seeks collision-free paths for multiple agents from their start to goal locations, visiting a set of intermediate target locations in the middle of the paths, while minimizing the sum of arrival times. Additionally, each agent behaves like a tractor, and a trailer is attached to the agent at each intermediate target location, which increases the "body length" of that agent by one unit. Planning for those tractor-trailers in a cluttered environment introduces additional challenges, since the planner has to plan each agent in a larger state space that includes the position of the attached trailers to avoid self-collision. Furthermore, agents are more likely to collide with each other due to the increasing body lengths, and the conventional collision resolution techniques turn out to be computationally inefficient. This paper develops a new planner called CBSS-TT that includes both novel inter-agent conflict resolution techniques, and a new single-agent planner TTCA* that finds optimal single-agent paths while avoiding self-collision. Our test results show that CBSS-TT sometimes requires 60% fewer iterations while finding solutions with lower costs than the baselines.
|
|
17:05-17:10, Paper WeDT23.6 | |
WiTAH A*: Winding-Constrained Anytime Heuristic Search for a Pair of Tethered Robots |
|
Xue, Xingjian | Northeastern University |
Yong, Sze Zheng | Northeastern University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Nonholonomic Motion Planning, Motion and Path Planning
Abstract: In this paper, we propose a variant of the anytime hybrid A* algorithm that generates a fast but suboptimal solution before progressively optimizing the paths to find the shortest winding-constrained paths for a pair of tethered robots under curvature constraints. Specifically, our proposed algorithm uses a tangent graph as its underlying search graph and leverages an anytime A* search framework with appropriately defined cost metrics in order to reduce the overall computation and to ensure that a winding angle constraint is satisfied. Moreover, we prove the completeness and optimality of the algorithm for finding the shortest winding-constrained paths in an anytime fashion. The effectiveness of the proposed algorithm is demonstrated via simulation experiments.
|
|
17:10-17:15, Paper WeDT23.7 | |
LLM-Driven Hierarchical Planning: Long-Horizon Task Allocation for Multi-Robot Systems in Cross-Regional Environments |
|
Wang, Yachao | Shandong University |
Yangshuo, Dong | Shandong University |
Yang, Yunting | Shandong University |
Zhang, Xiang | School of Control Science and Engineering, Shandong University |
Wang, Yinchuan | Shandong University |
Wang, Yuhan | Shandong University |
Wang, Chaoqun | Shandong University |
Meng, Max Q.-H. | The Chinese University of Hong Kong |
Keywords: Perception-Action Coupling, Legged Robots
Abstract: Long-horizon composite task planning for multi-robot systems in cross-regional complex scenarios faces dual challenges: spatial-semantic comprehension of natural language described tasks and collaborative optimization of subtask allocation. To address these challenges, this paper proposes a progressive three-stage task planning framework. First, an augmented scene graph is constructed to enable large language models (LLMs) to comprehend environmental structures, thereby generating simplified Linear Temporal Logic (LTL) task sequences. Subsequently, a novel heuristic function is employed to select optimal task allocation plans. Finally, LLMs are used to generate low-level executable robot instructions based on robotic system instruction templates. We establish a long-horizon composite task dataset for experimental validation on real-world quadrupedal multi-robot systems. Experimental results demonstrate the effectiveness of our approach in resolving cross-regional composite tasks.
|
|
17:15-17:20, Paper WeDT23.8 | |
Scalable MARL for Cooperative Exploration with Dynamic Robot Populations Via Graph-Based Information Aggregation |
|
Ren, Xiaoqi | South China University of Technology |
Du, Guanglong | South China University of Technology |
Wang, Zhuoyao | Peng Cheng Laboratory |
Dong, Xu | South China University of Technology |
Wang, Xueqian | Tsinghua University |
Guan, Quanlong | Jinan University |
Qiu, Xiaojian | Institute for Military-Civilian Integration of Jiangxi Province |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Reinforcement Learning, Deep Learning Methods
Abstract: This study addresses the challenge of multi-robot cooperative exploration under limited local observations in environments with dynamic robot populations. To achieve efficient area coverage within constrained timeframes, we propose the Multi-Robot Informative Planner (MIP), a novel reinforcement learning (RL)-based planning module. The core component of MIP is the Neighborhood Information Aggregator, which employs a graph neural network (GNN) to integrate local neighborhood information for each robot. Our design enhances sample efficiency by minimizing information requirements while ensuring scalability across environments with varying robot numbers. To generate high-quality, expressive neighborhood feature representations, we utilize Graphical Mutual Information (GMI) to maximize the correlation between neighboring robots' input features and their high-level hidden representations. Furthermore, MIP incorporates the Spatial-Neighborhood Transformer, which captures spatial features and inter-robot interactions through spatial self-attention mechanisms. These components collectively form the Multi-Robot Neural Informative Mapping (MRNIM) framework, outperforming traditional benchmarks in the Habitat simulator.
|
|
WeDT24 |
102C |
Computer Vision for Transportation 2 |
Regular Session |
Co-Chair: Li, Liang | Zhejiang Univerisity |
|
16:40-16:45, Paper WeDT24.1 | |
Saliency-Guided Domain Adaptation for Left-Hand Driving in Autonomous Steering |
|
Mehraban, Zahra | Queensland University of Technology |
Glaser, Sebastien | INRETS/LCPC |
Milford, Michael J | Queensland University of Technology |
Schroeter, Ronald | Queensland University of Technology |
Keywords: Computer Vision for Transportation, Transfer Learning
Abstract: Domain adaptation is required for automated driving models to generalize well across diverse road conditions. This paper explores a training method for domain adaptation to adapt PilotNet, an end-to-end deep learning-based model, for left-hand driving conditions using real-world Australian highway data. Four training methods were evaluated: (1) a baseline model trained on U.S. right-hand driving data, (2) a model trained on flipped U.S. data, (3) a model pretrained on U.S. data and then fine-tuned on Australian highways, and (4) a model pretrained on flipped U.S. data and then fine-tuned on Australian highways. This setup examines whether incorporating flipped data enhances the model adaptation by providing an initial left-hand driving alignment. The paper compares model performance regarding steering prediction accuracy and attention, using saliency-based analysis to measure attention shifts across significant road regions. Results show that pretraining on flipped data alone worsens prediction stability due to misaligned feature representations, but significantly improves adaptation when followed by fine-tuning, leading to lower prediction error and stronger focus on left-side cues. To validate this approach across different architectures, the same experiments were done on ResNet, which confirmed similar adaptation trends. These findings emphasize the importance of preprocessing techniques, such as flipped-data pretraining, followed by fine-tuning to improve model adaptation with minimal retraining requirements.
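Flipping right-hand-drive footage to mimic left-hand driving is a one-line preprocessing step, but the steering label must be negated along with the image; a hedged, framework-agnostic sketch of that pairing is shown below (the function name and NumPy layout are our own, not the paper's pipeline).

```python
import numpy as np

def flip_driving_sample(image, steering_angle):
    """Mirror an image horizontally and negate the steering label.

    image: (H, W, C) array; steering_angle: signed scalar whose
    left/right sign convention must match the dataset's.
    """
    return image[:, ::-1, :].copy(), -steering_angle
```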
|
|
16:45-16:50, Paper WeDT24.2 | |
STAGE: A Stream-Centric Generative World Model for Long-Horizon Driving-Scene Simulation |
|
Wang, Jiamin | Shanghaitech University |
Yao, Yichen | ShanghaiTech University |
Feng, Xiang | Shanghai Tech University |
Wu, Hang | Technical University of Munich |
Wang, Yaming | Yinwang Intelligent Technology Co. Ltd |
Huang, Qingqiu | Yinwang Intelligent Technology Co. Ltd |
Ma, Yuexin | ShanghaiTech University |
Zhu, Xinge | CUHK |
Keywords: Computer Vision for Transportation, Visual Learning
Abstract: The generation of temporally consistent, high-fidelity driving videos over extended horizons presents a fundamental challenge in autonomous driving world modeling. Existing approaches often suffer from error accumulation and feature misalignment due to inadequate decoupling of spatio-temporal dynamics and limited cross-frame feature propagation mechanisms. To address these limitations, we present STAGE (Streaming Temporal Attention Generative Engine), a novel auto-regressive framework that pioneers hierarchical feature coordination and multi-phase optimization for sustainable video synthesis. To achieve high-quality long-horizon driving video generation, we introduce Hierarchical Temporal Feature Transfer (HTFT) and a novel multi-stage training strategy. HTFT enhances temporal consistency between video frames throughout the video generation process by modeling the temporal and denoising processes separately and transferring denoising features between frames. The multi-stage training strategy divides training into three stages, through model decoupling and auto-regressive inference process simulation, thereby accelerating model convergence and reducing error accumulation. Experiments on the nuScenes dataset show that STAGE significantly surpasses existing methods on the long-horizon driving video generation task. In addition, we explored STAGE's ability to generate unlimited-length driving videos: we generated 600 frames of high-quality driving video on the nuScenes dataset, which far exceeds the maximum length achievable by existing methods.
|
|
16:50-16:55, Paper WeDT24.3 | |
PB-MOT: Pose-Aware Association Boosted Online 3D Multi-Object Tracking |
|
Pang, Bo | Zhejiang University |
Xu, Yang | Zhejiang University |
Chen, Jiming | Zhejiang University |
Li, Liang | Zhejiang Univerisity |
Keywords: Computer Vision for Transportation, Intelligent Transportation Systems, Visual Tracking
Abstract: Robotic and autonomous driving platforms necessitate efficient 3D Multi-Object Tracking (MOT) that harmonizes geometric precision, motion robustness, and computational efficiency. Traditional 3D MOT approaches face critical challenges: geometric similarity metrics (e.g., IoU-based) degrade at long ranges with high computational costs, while distance-based methods fail to capture object orientation and shape; the effects of occlusion and the intricate relative ego-object motion degrade tracking performance in dynamic scenes. To this end, we propose PB-MOT, an online framework integrating two key innovations: ego-motion-compensated state estimation that decouples dynamic interactions; and a rotated ellipse association algorithm unifying pose and shape-aware matching with adaptive distance constraints. Evaluations on the KITTI benchmark show that our PB-MOT achieves state-of-the-art performance with a HOTA score of 81.94%, while running at an impressive 2,402.76 FPS on CPU. This enables real-time, high-fidelity perception and tracking for resource-constrained robotic systems.
|
|
16:55-17:00, Paper WeDT24.4 | |
LEGO-Motion: Learning-Enhanced Grids with Occupancy Instance Modeling for Class-Agnostic Motion Prediction |
|
Qian, Kangan | Tsinghua University |
Miao, Jinyu | Tsinghua University |
Luo, Ziang | TsingHua University |
Fu, Zheng | Tsinghua University |
Li, Jinchen | Tsinghua University |
Shi, Yining | Tsinghua University |
Wang, Yunlong | NIO Inc |
Yang, Mengmeng | Tsinghua University |
Jiang, Kun | Tsinghua University |
Yang, Diange | Tsinghua University |
Keywords: Computer Vision for Transportation, Autonomous Agents, Computer Vision for Automation
Abstract: Accurate spatial and motion understanding is critical for autonomous driving systems. While object-level perception models excel in structured environments, they struggle with open-set categories and often lack precise geometric representation. Occupancy-based, class-agnostic methods offer better scene expressiveness but typically ignore inter-agent interactions and fail to ensure physical consistency in motion predictions, limiting their reliability in complex traffic scenarios. In this paper, we propose LEGO-Motion, a novel class-agnostic motion prediction framework that bridges the gap between instance-level reasoning and occupancy-based modeling. Unlike conventional grid-based methods that treat each cell independently, LEGO-Motion introduces two key components: (1) the Interaction-Augmented Instance Encoder (IaIE), which models interactions among dynamic agents via cross-attention, and (2) the Instance-Enhanced BEV Encoder (IeBE), which improves motion consistency across instances through multi-stage feature fusion. These components enable our model to learn semantically coherent and physically plausible motion fields. Extensive experiments on the nuScenes dataset show that LEGO-Motion achieves an approximately 6% improvement in motion prediction accuracy over the previous state-of-the-art, while maintaining real-time inference at 21 ms. Moreover, our method demonstrates strong generalization on a proprietary FMCW LiDAR benchmark. These results validate LEGO-Motion's effectiveness in capturing both global scene structure and fine-grained motion dynamics, making it a promising foundation for next-generation perception systems.
|
|
17:00-17:05, Paper WeDT24.5 | |
DashGaze: Driver Gaze through Dashcam |
|
John, Thrupthi Ann | IIIT Hyderabad |
Balasubramanian, Vineeth | Indian Institute of Technology, Hyderabad |
Jawahar, C.V. | IIIT, Hyderabad |
Keywords: Computer Vision for Transportation, Datasets for Human Motion, Intelligent Transportation Systems
Abstract: Driver gaze monitoring is crucial for road safety, but existing methods rely on expensive, cumbersome technologies like wearable eye trackers or fixed-camera setups. To address this, we propose a low-cost approach using dashcams to capture driver gaze data. We introduce DashGaze, a large-scale dataset for training appearance-based gaze estimation models, featuring over 900,000 frames collected over 10 hours with 28 unique drivers. DashGaze includes synchronized views of the road, driver, and driver's egocentric perspective, along with the driver's gaze in both the driver and ego views. We also present DashGazeNet, a baseline model that generalizes well to unseen drivers and diverse conditions, achieving gaze angle errors within 6° and gaze location errors within 225 pixels.
|
|
17:05-17:10, Paper WeDT24.6 | |
SCORPION: Robust Spatial-Temporal Collaborative Perception Model on Lossy Wireless Network |
|
Zhu, Ruiyang | University of Michigan |
Cho, Minkyoung | University of Michigan |
Zeng, Shuqing | General Motors Research and Development |
Bai, Fan | GM |
Mao, Morley | University of Michigan |
Keywords: Computer Vision for Transportation, Networked Robots, Intelligent Transportation Systems
Abstract: Collaborative Perception enables multiple agents, such as autonomous vehicles and infrastructure, to share sensor data via vehicular networks so that each agent gains an extended sensing range and better perception quality. Despite its promising benefits, realizing the full potential of such systems faces significant challenges due to inherent imperfections in underlying system layers, consisting of network layer imperfections and hardware-level noise. Such imperfections and noise sources include packet loss in vehicular networks, localization errors from GPS measurements, and synchronization errors caused by clock deviation and network latency. To address these challenges, we propose a novel end-to-end collaborative perception framework, SCORPION, that harnesses the AI co-design of the application layer and system layer to tackle the aforementioned imperfections. SCORPION consists of three main components: lost bird’s eye view feature reconstruction (L-BEV-R) recovers lost spatial features during lossy V2X communication, while deformable spatial cross attention (DSCA) and temporal alignment (TA) compensate for localization and synchronization errors in feature fusion. Experimental results on both synthetic and real-world collaborative 3D object detection datasets demonstrate that SCORPION advances the state-of-the-art collaborative perception methods by 5.9-13.2 absolute AP.
|
|
17:10-17:15, Paper WeDT24.7 | |
Mapping in Indoor Environments Including Transparent Objects Using Stereo Polarization Camera and Projector |
|
Ogihara, Yusuke | The University of Tokyo |
Higuchi, Hiroshi | The University of Tokyo |
Igaue, Takuya | The University of Tokyo |
An, Qi | The University of Tokyo |
Yamashita, Atsushi | The University of Tokyo |
Keywords: Computer Vision for Transportation, Mapping, RGB-D Perception
Abstract: This paper proposes a method for generating maps in indoor environments that include transparent objects by using a stereo polarization camera and projector. Conventional sensors like LiDAR and stereo cameras struggle with glass, as they rely on diffuse reflection, while glass allows light to pass through. In contrast, polarization cameras can measure light polarization and estimate surface normals, enabling depth estimation by combining polarization and RGB information. However, when measuring transparent objects, reflected and transmitted light cancel each other out, reducing polarization contrast, and the RGB information causes the depth estimation to output the depth of objects behind the glass. To address this issue, this paper proposes a novel method that (1) improves the S/N ratio in polarization measurement via diffuse reflection on non-glass regions and (2) masks out the RGB color from polarimetric depth estimation so that the depth of objects behind the glass is not computed, yielding depth images that include glass surfaces. Additionally, (3) in the mapping part, depth estimation is repeated at multiple locations, and the results are integrated using self-localization to generate a complete environmental map. Experiments in an indoor environment confirmed the effectiveness of the proposed method, enabling glass-inclusive depth estimation and successful map generation on a mobile robot.
|
|
17:15-17:20, Paper WeDT24.8 | |
RoadsideSplat: Robust 3D Gaussian Reconstruction from Monocular Roadside Surveillance |
|
Liang, Zhaoxiang | School of Automation, Beijing Institute of Technology |
Guo, Wenjun | Beijing Institute of Technology |
Ren, Bohan | Beijing Institute of Technology |
Yang, Yi | Beijing Institute of Technology |
Keywords: Computer Vision for Transportation, Automation Technologies for Smart Cities, Intelligent Transportation Systems
Abstract: Reconstructing dynamic roads from roadside traffic surveillance cameras is crucial for smart cities and digital twin applications. While the latest monocular depth estimation methods demonstrate strong performance, they exhibit instability in roadside scenarios. Existing reconstruction approaches for autonomous driving scenes predominantly adopt vehicle-mounted perspectives, accumulating vehicle point clouds from per-frame depth maps using 3D bounding boxes. These point clouds are used to initialize the center positions and colors of 3D Gaussians to improve reconstruction performance. However, the compressed depth discrepancy between vehicles and road surfaces in roadside views leads to model confusion between vehicle and background depth estimations. To address these challenges, we propose a robust reconstruction framework based on a single fixed RGB traffic camera. Differing from conventional frame-wise depth prediction followed by 3D box-based accumulation, our method processes masked vehicle foreground sequences through existing models, directly predicting complete vehicle point clouds via local feature matching and global alignment while iteratively refining 3D boxes to enhance reconstruction quality. Leveraging the explicit nature of 3D Gaussians for scene editing, we introduce simple yet effective road constraints to mitigate penetration artifacts during scene manipulation. Extensive evaluations on the TUMTraf-V2X and RCooper datasets under monocular roadside settings validate the effectiveness of our approach.
|
|
WeDT25 |
103A |
Legged Robots 4 |
Regular Session |
|
16:40-16:45, Paper WeDT25.1 | |
MoE-Loco: Leveraging Mixture of Experts for Multi-Task Locomotion |
|
Huang, Runhan | Tsinghua University |
Zhu, Shaoting | Tsinghua University |
Du, Yilun | MIT |
Zhao, Hang | Tsinghua University |
Keywords: Legged Robots, AI-Based Methods, Deep Learning Methods
Abstract: We present MoE-Loco, a Mixture of Experts (MoE) framework for multitask locomotion for legged robots. Our method enables a single policy to handle diverse terrains, including bars, pits, stairs, slopes, and baffles, while supporting quadrupedal and bipedal gaits. Using MoE, we mitigate the gradient conflicts that typically arise in multitask reinforcement learning, improving both training efficiency and performance. Our experiments demonstrate that different experts naturally specialize in distinct locomotion behaviors, which can be leveraged for task migration and skill composition. We further validate our approach in both simulation and real-world deployment, showcasing its robustness and adaptability.
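For readers unfamiliar with mixture-of-experts policies, the minimal PyTorch sketch below shows the generic pattern the abstract refers to: a softmax gate blends several expert networks so that experts can specialize per task. Layer sizes, the ELU activation, and the dense (non-sparse) gating are illustrative assumptions, not the released MoE-Loco architecture.

    import torch
    import torch.nn as nn

    class MoEPolicy(nn.Module):
        def __init__(self, obs_dim, act_dim, n_experts=4, hidden=256):
            super().__init__()
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                              nn.Linear(hidden, act_dim))
                for _ in range(n_experts)])
            self.gate = nn.Linear(obs_dim, n_experts)

        def forward(self, obs):
            w = torch.softmax(self.gate(obs), dim=-1)                  # (B, E) gate weights
            acts = torch.stack([e(obs) for e in self.experts], dim=1)  # (B, E, A)
            return (w.unsqueeze(-1) * acts).sum(dim=1)                 # gated blend of experts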
|
|
16:45-16:50, Paper WeDT25.2 | |
Whole-Body Admittance Control of Anti-Saturation for Quadruped Manipulators with Impact Force Observer |
|
Lin, Fenghao | Harbin Institute of Technology, Shenzhen |
Zhang, Tianlin | Harbin Institute of Technology |
Xiong, Xiaogang | Harbin Institute of Technology, Shenzhen |
Lou, Yunjiang | Harbin Institute of Technology, Shenzhen |
Keywords: Legged Robots, Whole-Body Motion Planning and Control, Compliance and Impedance Control
Abstract: Quadruped manipulators require precise detection of external impact forces to ensure safe and compliant responses during environmental interactions. However, these systems often lack tactile sensors on their body surfaces or force/torque sensors at critical joints. This study introduces a whole-body admittance control framework for quadruped manipulators, utilizing a novel external impact force observer that estimates impact forces acting on the manipulator or the quadruped's base without relying on dedicated force sensors. The observer leverages the robustness of a super-twisting algorithm (STA) based on the momentum model of quadruped manipulators. Model uncertainties are mitigated using a low-pass filter (LPF) and compensated by ground reaction forces, significantly reducing estimation oscillations during dynamic gaits. By integrating these estimated impact forces, the whole-body admittance control framework enables compliant interactions with the environment and mitigates unsafe behaviors caused by torque saturation through a set-valued feedback loop that constrains command torques within actuation limits, including joint torque boundaries and friction cone constraints of the ground reaction force. Experimental validation across diverse scenarios confirms the effectiveness of this approach in facilitating safe and adaptive interactions between quadruped manipulators and external forces.
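The super-twisting algorithm named in the abstract has a standard second-order sliding mode form; the generic, problem-agnostic Euler-step sketch below is for orientation only. Gains k1 and k2 and the sliding variable s are application-specific, and the paper's observer additionally builds on the momentum model and LPF compensation.

    import numpy as np

    def sta_step(s, v, k1, k2, dt):
        """One Euler step of super-twisting dynamics:
        u  = -k1 * sqrt(|s|) * sign(s) + v,   v' = -k2 * sign(s).
        Returns the injection u and the updated integral state v."""
        u = -k1 * np.sqrt(abs(s)) * np.sign(s) + v
        v = v - k2 * np.sign(s) * dt
        return u, v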
|
|
16:50-16:55, Paper WeDT25.3 | |
Online Friction Coefficient Identification for Legged Robots on Slippery Terrain Using Smoothed Contact Gradients |
|
Kim, Hajun | Korea Advanced Institute of Science and Technology |
Kang, Dongyun | Korea Advanced Institute of Science and Technology |
Kim, Min-Gyu | KAIST |
Kim, Gijeong | Korea Advanced Institute of Science and Technology, KAIST |
Park, Hae-Won | Korea Advanced Institute of Science and Technology |
Keywords: Legged Robots, Contact Modeling, Calibration and Identification
Abstract: This paper proposes an online friction coefficient identification framework for legged robots on slippery terrain. The approach formulates the optimization problem to minimize the sum of residuals between actual and predicted states parameterized by the friction coefficient in rigid body contact dynamics. Notably, the proposed framework leverages the analytic smoothed gradient of contact impulses, obtained by smoothing the complementarity condition of Coulomb friction, to solve the issue of non-informative gradients induced from the nonsmooth contact dynamics. Moreover, we introduce the rejection method to filter out data with high normal contact velocity following contact initiations during friction coefficient identification for legged robots. To validate the proposed framework, we conduct the experiments using a quadrupedal robot platform, KAIST HOUND, on slippery and nonslippery terrain. We observe that our framework achieves fast and consistent friction coefficient identification within various initial conditions.
|
|
16:55-17:00, Paper WeDT25.4 | |
Leg State Estimation for Quadruped Robot by Using Probabilistic Model with Proprioceptive Feedback (I) |
|
Sun, Jingyu | Shandong University |
Zhou, Lelai | Shandong University |
Geng, Binghou | Shandong University |
Zhang, Yi | Shandong University |
Li, Yibin | Shandong University |
Keywords: Legged Robots, Contact Modeling
Abstract: Legged robots are sent into outdoor environments and are expected to explore unstructured terrains like animals in nature. Therefore, the ability to robustly detect leg phase transitions is a critical skill. However, many current approaches rely on external sensors mounted on legged robots, which increases the overall cost or renders the robot useless if the sensors fail. Conversely, when a robot's proprioceptive sensors fail, its ability to control its motion is compromised. Therefore, as long as the robot is capable of locomotion, the proprioceptor-based leg state estimation method remains applicable. Based on this feature, we propose a novel leg phase detection method for quadruped robots that uses proprioceptive feedback to estimate leg state while overcoming the inaccuracy that arises in the absence of external devices. The innovative estimation method deftly identifies leg phases even in the absence of a priori terrain features, allowing the robot to traverse the terrain without prior knowledge or reliance on vision-based detection. Through extensive hardware experiments in different scenarios, the proposed approach demonstrates robust estimation of leg states.
|
|
17:00-17:05, Paper WeDT25.5 | |
FR-Net: Learning Robust Quadrupedal Fall Recovery on Challenging Terrains through Mass-Contact Prediction |
|
Lu, Yidan | The University of Hong Kong |
Dong, Yinzhao | The University of Hong Kong |
Zhang, Jiahui | The University of Hong Kong |
Ma, Ji | The University of Hong Kong |
Lu, Peng | The University of Hong Kong |
Keywords: Legged Robots, Failure Detection and Recovery, Reinforcement Learning
Abstract: Fall recovery for legged robots remains challenging, particularly on complex terrains where traditional controllers fail due to incomplete terrain perception and uncertain interactions. We present FR-Net, a learning-based framework that enables quadrupedal robots to recover from arbitrary fall poses across diverse environments. Central to our approach is a Mass-Contact Predictor network that estimates the robot's mass distribution and contact states from limited sensory inputs, facilitating effective recovery strategies. Our carefully designed reward functions ensure safe recovery even on steep stairs without dangerous rolling motions common to existing methods. Trained entirely in simulation using privileged learning, our framework guides policy learning without requiring explicit terrain data during deployment. We demonstrate the generalization capabilities of FR-Net across different quadrupedal platforms in simulation and validate its performance through extensive real-world experiments on the Go2 robot in 10 challenging scenarios. Our results indicate that explicit mass-contact prediction is key to robust fall recovery, offering a promising direction for generalizable quadrupedal skills.
|
|
17:05-17:10, Paper WeDT25.6 | |
TRG-Planner: Traversal Risk Graph-Based Path Planning in Unstructured Environments for Safe and Efficient Navigation |
|
Lee, Dongkyu | KAIST |
Nahrendra, I Made Aswin | KAIST |
Oh, Minho | KAIST |
Yu, Byeongho | URobotics Corp |
Myung, Hyun | KAIST (Korea Advanced Institute of Science and Technology) |
Keywords: Legged Robots, Field Robots, Motion and Path Planning
Abstract: Unstructured environments such as mountains, caves, construction sites, or disaster areas are challenging for autonomous navigation because of terrain irregularities. In particular, it is crucial to plan a path to avoid risky terrain and reach the goal quickly and safely. In this paper, we propose a method for safe and distance-efficient path planning, leveraging the Traversal Risk Graph (TRG), a novel graph representation that takes into account geometric traversability of the terrain. TRG nodes represent stability and reachability of the terrain, while edges represent relative traversal risk-weighted path candidates. Additionally, TRG is constructed in a wavefront propagation manner and managed hierarchically, enabling real-time planning even in large-scale environments. Lastly, we formulate a graph optimization problem on TRG that leads the robot to navigate by prioritizing both safe and short paths. Our approach demonstrated superior safety, distance efficiency, and fast processing time compared to the conventional methods. It was also validated in several real-world experiments using a quadrupedal robot. Notably, TRG-Planner contributed as the global path planner of an autonomous navigation framework for the DreamSTEP team, which won the Quadruped Robot Challenge at ICRA 2023. The project page is available at https://trg-planner.github.io.
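As a toy illustration of risk-weighted graph search in the spirit of a traversal risk graph (not the paper's hierarchical wavefront construction), the sketch below mixes metric length and a traversal-risk term into each edge cost and runs standard Dijkstra; the weighting alpha and the adjacency format are assumptions.

    import heapq

    def risk_dijkstra(adj, start, goal, alpha=0.5):
        """adj: {node: [(nbr, length, risk), ...]}; edge cost = length + alpha * risk."""
        dist, prev = {start: 0.0}, {}
        pq = [(0.0, start)]
        while pq:
            d, u = heapq.heappop(pq)
            if u == goal:
                break
            if d > dist.get(u, float("inf")):
                continue  # stale queue entry
            for v, length, risk in adj.get(u, []):
                nd = d + length + alpha * risk
                if nd < dist.get(v, float("inf")):
                    dist[v], prev[v] = nd, u
                    heapq.heappush(pq, (nd, v))
        path, n = [], goal  # reconstruct the safest-short path
        while n == start or n in prev:
            path.append(n)
            if n == start:
                break
            n = prev[n]
        return path[::-1], dist.get(goal)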
|
|
17:10-17:15, Paper WeDT25.7 | |
Non-Gaited Legged Locomotion with Monte-Carlo Tree Search and Supervised Learning |
|
Taouil, Ilyass | Technical University of Munich (TUM) |
Amatucci, Lorenzo | Istituto Italiano Di Tecnologia |
Khadiv, Majid | Technical University of Munich |
Dai, Angela | Technical University of Munich |
Barasuol, Victor | Istituto Italiano Di Tecnologia |
Turrisi, Giulio | Istituto Italiano Di Tecnologia |
Semini, Claudio | Istituto Italiano Di Tecnologia |
Keywords: Legged Robots, Machine Learning for Robot Control, Optimization and Optimal Control
Abstract: Legged robots are able to navigate complex terrains by continuously interacting with the environment through careful selection of contact sequences and timings. However, the combinatorial nature behind contact planning hinders the applicability of such optimization problems on hardware. In this work, we present a novel approach that optimizes gait sequences and respective timings for legged robots in the context of optimization-based controllers through the use of sampling-based methods and supervised learning techniques. We propose to bootstrap the search by learning an optimal value function in order to speed-up the gait planning procedure making it applicable in real-time. To validate our proposed method, we showcase its performance both in simulation and on hardware using a 22 kg electric quadruped robot. The method is assessed on different terrains, under external perturbations, and in comparison to a standard control approach where the gait sequence is fixed a priori.
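To make the idea of sampling-based search bootstrapped by a learned value function concrete, here is a compact UCT sketch in which a value network evaluates leaves instead of running full rollouts; the tree structures, constants, and interfaces (actions, step, value_fn) are illustrative assumptions rather than the authors' implementation.

    import math, random

    class Node:
        def __init__(self, state, parent=None):
            self.state, self.parent = state, parent
            self.children, self.n, self.w = {}, 0, 0.0

    def uct_search(root, actions, step, value_fn, iters=200, c=1.4):
        for _ in range(iters):
            node = root
            while True:  # selection / expansion
                untried = [a for a in actions(node.state) if a not in node.children]
                if untried:
                    a = random.choice(untried)
                    node.children[a] = Node(step(node.state, a), node)
                    node = node.children[a]
                    break
                if not node.children:
                    break  # terminal state
                node = max(node.children.values(),
                           key=lambda ch: ch.w / (ch.n + 1e-9)
                           + c * math.sqrt(math.log(node.n + 1) / (ch.n + 1e-9)))
            v = value_fn(node.state)  # learned leaf evaluation replaces rollout
            while node is not None:   # backpropagation
                node.n += 1
                node.w += v
                node = node.parent
        return max(root.children.items(), key=lambda kv: kv[1].n)[0]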
|
|
17:15-17:20, Paper WeDT25.8 | |
Fear-Based Behavior Adaptation for Robust Walking Robots Using Unsupervised Health Estimation |
|
Schnell, Tristan | FZI Forschungszentrum Informatik |
Grosse Besselmann, Marvin | FZI Forschungszentrum Informatik |
Eichmann, Christian | FZI Research Center for Information Technology |
Roennau, Arne | Karlsruhe Institute of Technology (KIT) |
Dillmann, Rüdiger | FZI - Forschungszentrum Informatik - Karlsruhe |
Keywords: Legged Robots, Failure Detection and Recovery, Field Robots
Abstract: Mobile robots can perform increasingly impressive feats in controlled environments. Many real applications, though, especially for walking robots, introduce a high degree of unforeseen difficulties, yet require very robust robot operation. In these cases, it is still often not possible to guarantee the needed reliability. We present an approach to utilize unsupervised anomaly detection to implement a fear-based adaptation of robot behavior. This allows robots to automatically and quickly react to any type of unexpected problems. Neither the environment nor the type of disturbance has to be known beforehand, as the system requires only a small amount of baseline data for training, which can be collected in a laboratory environment. Additionally, it can work on arbitrary robot hardware and be integrated in all types of robot control structures. We evaluated our approach in simulation and on state-of-the-art walking robots, ANYmal, Spot, and our own six-legged walking robot prototype, in a realistic field test environment in the Tabernas desert in Spain. Our results showcase that we can quickly detect arbitrary problems based on significantly different types of sensor data and decrease robot fall rates in the most extreme scenarios from 56% to 4%. This promises significant increases in robustness for all types of walking robots in highly challenging and previously unknown environments.
|
|
WeDT26 |
103B |
Localization 4 |
Regular Session |
|
16:40-16:45, Paper WeDT26.1 | |
Single-Beacon Localization for Mobile Robot: A Set Membership Filtering Approach |
|
Qin, Xujie | National University of Defense Technology |
Cong, Yirui | National University of Defense Technology |
Lai, Jun | National University of Defense Technology |
Yang, Jinyi | National University of Defense Technology |
Wang, Xiangke | National University of Defense Technology |
Keywords: Localization, Range Sensing, Sensor Fusion
Abstract: In this letter, we study the localization problem of a mobile robot with range measurement from a single beacon. Previous filtering-based studies usually required accurate statistics of noises to be theoretically sound and reliable, which is difficult to obtain in practical systems. To solve this problem, we propose an accurate and efficient localization method based on a set-membership filtering framework with constrained zonotopes. This method includes three novel steps. First, we design a convex optimization relaxation method to handle the non-convexity caused by the single-beacon range measurement. Then, a halfspace-intersection refinement is proposed which improves the estimation accuracy. Finally, we provide a sliding-window recursive method that simultaneously guarantees the accuracy and the efficiency. Simulations and field experiments corroborate the effectiveness of our proposed method.
|
|
16:45-16:50, Paper WeDT26.2 | |
High-Performance Relative Localization Based on Key-Node Seeking Considering Aerial Drags Using Range and Odometry Measurements (I) |
|
Chen, Sijia | Shanghai Jiao Tong University |
Li, Yuzhu | Shanghai Jiao Tong University |
Dong, Wei | Shanghai Jiao Tong University |
Keywords: Localization, Range Sensing, Sensor Fusion
Abstract: Using an inertial measurement unit and a single ultra-wideband radio can provide effective relative localization only if the system observability is guaranteed. To ensure observability, recent studies utilize extended sliding window filter for state augmentation, selecting key-nodes as replacements for continuous states. However, as the velocity measurements are obtained indirectly through pre-integration, integrating over key-nodes far apart could lead to velocity divergence. To tackle this issue, this work proposes the reformulated key-node seeking approach considering aerial drag effects, which inherently reflects the dissipative nature and enhances system observability and estimation precision. Considering the potential velocity divergence, aerial drag effect is extended to reformulate the process model and observability matrix. Then, to further eliminate a possible ill-conditioned observability matrix, an index naturally similar to the condition number of the reconstructed matrix is proposed to measure the degree of observability. Furthermore, the key-node selection strategy is developed, selecting the optimal measurements regarding the least ill-conditioned requirement as optimization. Thus, the selected key-nodes can achieve maximum observability within each sliding window. Finally, validated in single-agent, homogeneous and heterogeneous multiple-agent systems, the proposed method steadily reduces estimation RMSE and the condition number compared with state-of-the-art methods, showing its effectiveness.
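A common way to score the degree of observability, as the abstract's condition-number-like index suggests, is via the singular values of the (reconstructed) observability matrix; the small sketch below is a generic version of that idea, with the rank tolerance as an assumption.

    import numpy as np

    def observability_index(O):
        """Condition number of an observability matrix O; smaller means
        better conditioned. Returns inf when O is numerically rank-deficient."""
        s = np.linalg.svd(np.asarray(O, float), compute_uv=False)
        return s[0] / s[-1] if s[-1] > 1e-12 else np.inf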
|
|
16:50-16:55, Paper WeDT26.3 | |
Matched Filtering Based LiDAR Place Recognition for Urban and Natural Environments |
|
Joseph, Therese | Queensland University of Technology |
Fischer, Tobias | Queensland University of Technology |
Milford, Michael J | Queensland University of Technology |
Keywords: Localization, Range Sensing
Abstract: Place recognition is an important task within autonomous navigation, involving the re-identification of previously visited locations from an initial traverse. Unlike visual place recognition (VPR), LiDAR place recognition (LPR) is tolerant to changes in lighting, seasons, and textures, leading to high performance on benchmark datasets from structured urban environments. However, there is a growing need for methods that can operate in diverse environments with high performance and minimal training. In this paper, we propose a handcrafted matching strategy that performs roto-translation invariant place recognition and relative pose estimation for both urban and unstructured natural environments. Our approach constructs Bird's Eye View (BEV) global descriptors and employs a two-stage search using matched filtering, a signal processing technique for detecting known signals amidst noise. Extensive testing on the NCLT, Oxford Radar and WildPlaces datasets consistently demonstrates state-of-the-art (SoTA) performance across place recognition and relative pose estimation metrics, with up to 15% higher recall than previous SoTA.
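Matched filtering over BEV descriptors can be realized as an FFT cross-correlation whose peak location gives a translation hypothesis and whose peak height gives a match score; the sketch below illustrates that general signal processing step only (circular correlation, zero-mean normalization, and the scoring are our simplifications, not the paper's full two-stage search).

    import numpy as np

    def matched_filter_score(bev_query, bev_ref):
        """Both inputs are 2D arrays (e.g., BEV occupancy/intensity grids)."""
        q = bev_query - bev_query.mean()
        r = bev_ref - bev_ref.mean()
        # circular cross-correlation via the FFT correlation theorem
        corr = np.fft.ifft2(np.fft.fft2(q) * np.conj(np.fft.fft2(r))).real
        idx = np.unravel_index(np.argmax(corr), corr.shape)
        shift = [i if i <= s // 2 else i - s for i, s in zip(idx, corr.shape)]
        score = corr[idx] / (np.linalg.norm(q) * np.linalg.norm(r) + 1e-12)
        return score, shift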
|
|
16:55-17:00, Paper WeDT26.4 | |
Wideband USBL Localization by RANSAC-Type Linear Fitting (I) |
|
Xuan, Li | Institute of Acoustics, Chinese Academy of Sciences |
Chengpeng, Hao | Institute of Acoustics, Chinese Academy of Sciences |
Shefeng, Yan | Institute of Acoustics, Chinese Academy of Sciences |
Keywords: Localization, Range Sensing
Abstract: Underwater acoustic positioning using ultra-short baseline (USBL) technology is essential for underwater navigation and ocean surveillance. Many research institutions and commercial organizations have conducted extensive studies on USBL, with wideband processing widely applied to enhance the capabilities of related methods. However, the properties of the wideband correlation spectrum have not been thoroughly explored. This study introduces a linear fitting technique, leveraging the frequency-phase relationship within the spectrum to enhance USBL performance. By combining iterative reweighted least squares and a RANSAC-type approach, the method reduces the impact of outliers, particularly in multipath and low-SNR scenarios. Furthermore, we apply a matched filter before the linear fitting process and suggest re-screening based on the consistency of all array elements in the frequency domain to eliminate more outliers. The proposed method significantly reduces positioning errors in both numerical simulations and sea trial data processing.
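The RANSAC-type linear fitting of the frequency-phase relationship can be pictured with the generic sketch below: sample two points, hypothesize a line, count inliers, then refine on the best inlier set by least squares. The residual tolerance and iteration count are illustrative assumptions, and the paper adds reweighting and frequency-domain re-screening on top.

    import numpy as np

    def ransac_line(freqs, phases, n_iter=200, tol=0.1, seed=0):
        freqs, phases = np.asarray(freqs, float), np.asarray(phases, float)
        rng = np.random.default_rng(seed)
        best = np.zeros(len(freqs), bool)
        for _ in range(n_iter):
            i, j = rng.choice(len(freqs), 2, replace=False)
            if freqs[i] == freqs[j]:
                continue
            slope = (phases[j] - phases[i]) / (freqs[j] - freqs[i])
            icpt = phases[i] - slope * freqs[i]
            inliers = np.abs(phases - (slope * freqs + icpt)) < tol
            if inliers.sum() > best.sum():
                best = inliers
        # least-squares refinement on the best inlier set
        A = np.vstack([freqs[best], np.ones(best.sum())]).T
        slope, icpt = np.linalg.lstsq(A, phases[best], rcond=None)[0]
        return slope, icpt, best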
|
|
17:00-17:05, Paper WeDT26.5 | |
MR-ULINS: A Tightly-Coupled UWB-LiDAR-Inertial Estimator with Multi-Epoch Outlier Rejection |
|
Zhang, Tisheng | Wuhan University |
Yuan, Man | Wuhan University |
Wei, Linfu | Wuhan University |
Wang, Yan | Wuhan University |
Tang, Hailiang | Wuhan University |
Niu, Xiaoji | Wuhan University |
Keywords: Localization, SLAM, Range Sensing
Abstract: The LiDAR-inertial odometry (LIO) and the ultra-wideband (UWB) have been integrated to achieve driftless positioning in global navigation satellite system (GNSS)-denied environments. However, the UWB may be affected by systematic range errors (such as the clock drift and the antenna phase center offset) and non-line-of-sight (NLOS) signals, resulting in reduced robustness. In this study, we propose a UWB-LiDAR-inertial estimator (MR-ULINS) that tightly integrates the UWB range, LiDAR frame-to-frame, and IMU measurements within the multi-state constraint Kalman filter (MSCKF) framework. The systematic range errors are precisely modeled to be estimated and compensated online. Besides, we propose a multi-epoch outlier rejection algorithm for UWB NLOS by utilizing the relative accuracy of the LIO. Specifically, the relative trajectory of the LIO is employed to verify the consistency of all range measurements within the sliding window. Extensive experiment results demonstrate that MR-ULINS achieves a positioning accuracy of around 0.1 m in complex indoor environments with severe NLOS interference. Ablation experiments show that the online estimation and multi-epoch outlier rejection can effectively improve the positioning accuracy. Moreover, MR-ULINS maintains high accuracy and robustness in LiDAR-degenerated scenes and UWB-challenging conditions with sparse base stations.
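One simple way to picture the multi-epoch outlier rejection is the toy consistency check below: within a sliding window, increments of the measured UWB range are compared against increments of the range predicted from the LIO relative trajectory, so a constant systematic offset cancels and NLOS jumps stand out. The gate value and interfaces are our assumptions, not the paper's algorithm.

    import numpy as np

    def flag_nlos(ranges, traj_positions, anchor, gate=0.3):
        """ranges: (N,) measured UWB ranges; traj_positions: (N, 3) LIO
        positions; anchor: (3,) base-station position, all in one frame."""
        pred = np.linalg.norm(np.asarray(traj_positions) - np.asarray(anchor), axis=1)
        # compare increments so a constant systematic offset cancels out
        resid = np.abs(np.diff(ranges) - np.diff(pred))
        flags = np.zeros(len(ranges), bool)
        flags[1:] = resid > gate
        return flags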
|
|
17:05-17:10, Paper WeDT26.6 | |
A LiDAR Odometry with Multi-Metric Feature Association and Contribution Constraint Selection |
|
Li, Nuo | Southeast University |
Yao, Yiqing | Southeast University |
Xu, Xiaosu | Southeast University |
Wang, Zijian | Southeast University |
Keywords: Localization, SLAM, Range Sensing
Abstract: LiDAR-based simultaneous localization and mapping (SLAM) is crucial for achieving accurate pose estimation and map generation, thus serving as a foundational technology in the advancement of autonomous driving systems. In this letter, we introduce an accurate and robust feature-based LiDAR odometry method. Initially, we propose a feature extraction method centered on local extreme points, which capitalizes on the structural characteristics of local regions in LiDAR scans. Secondly, we propose a multi-metric feature association approach for keyframe registration. This method leverages sparse and abstract geometric primitives to improve the accuracy and speed of keyframe matching. Additionally, considering the varying impact of different metric features on pose constraints, a constraint contribution selection method is introduced to identify the most valuable features within the multi-metric feature set. Finally, the performance and efficiency of the proposed method are evaluated on the public KITTI, M2DGR, and Newer College datasets, as well as our collected campus dataset. Experimental results demonstrate that the proposed method achieves performance comparable to state-of-the-art LiDAR odometry methods across various scenarios.
|
|
17:10-17:15, Paper WeDT26.7 | |
DWE-Based SRIBO: An Efficient and Resilient Single-Range and Inertial Based Odometry with Dimension-Reduced Wriggling Estimator (I) |
|
Dong, Wei | Shanghai Jiao Tong University |
Chen, Sijia | Shanghai Jiao Tong University |
Mei, Zheyuan | Shanghai Jiao Tong University |
Ying, Yuanjiong | Shanghai Jiao Tong University |
Zhu, Xiangyang | Shanghai Jiao Tong University |
Keywords: Localization, Sensor Fusion, Range Sensing
Abstract: The single-range and inertial-based positioning approach is a low-cost and lightweight solution for multirotor flying robots in challenging scenarios where visual perception degradation occurs. Unfortunately, it may still encounter local observability issues in certain motion patterns. In this study, we address this limitation by proposing a computationally efficient dimension-reduced wriggling estimator, which enhances the positioning performance and robustness in an expanded estimation horizon. This estimator slides the estimation horizon with output matrices approximation based on adjacent historical estimation sequences. It then reduces the computational complexity through dimension reduction using a polynomial approximation approach. The dimension-reduced wriggling estimator enables each estimation to cover a sufficiently long interval with linear programming, thereby enhancing the degree of observability and robustness by leveraging adequate measurement redundancy. Additionally, this estimator can incorporate additional sensors, such as an optical flow sensor. We also prove the theoretical convergence and numerical stability of the position estimator. Indoor and outdoor experiments validate that the proposed single-range and inertial-based system achieves decimeter-level precision at a high frequency of hundreds of Hertz while remaining resilient to sensor failures. Particularly, this positioning system demonstrates excellent robustness in dark, smoky, and blinking scenarios, where commonly used optical sensors fail to provide consistent estimation.
|
|
17:15-17:20, Paper WeDT26.8 | |
MRMT-PR: A Multi-Scale Reverse-View Mamba-Transformer for LiDAR Place Recognition |
|
Luo, Kan | Hnu. ; Csnu |
Wang, Jingwen | Hunan University |
Yu, Hongshan | National Engineering Laboratory for Robot Visual Perception And |
Wang, Yaonan | Hunan University |
Civera, Javier | Universidad De Zaragoza |
Chen, Xieyuanli | National University of Defense Technology |
Keywords: Localization, Recognition
Abstract: Place recognition is a fundamental technology of high relevance for autonomous robot navigation. Existing methods encounter significant challenges arising from scene variations (e.g., illumination changes, dynamic objects), viewpoint shifts, and difficulties in data fusion and alignment. These factors often lead to a substantial drop in recognition recall, which is typically addressed in the literature by training deep neural networks to learn invariant feature representations. In this paper, we propose MRMT-PR, a novel multi-scale reverse-view Mamba-Transformer architecture for LiDAR-based place recognition that uses a single-frame point cloud as its input. Our MRMT-PR framework consists of a multi-scale reverse-view preprocessing module for LiDAR point clouds, a Mamba-Transformer feature encoder, and a global feature fusion module. This architecture effectively mitigates the impact of perspective and illumination variations, enhances the global representational capacity of LiDAR features, and significantly improves recognition robustness under challenging conditions such as viewpoint changes and long-term localization. Experiments conducted on NCLT dataset with challenging scenarios demonstrate that MRMT-PR outperforms existing LiDAR-based place recognition baselines in terms of overall performance.
|
|
WeDT27 |
103C |
Planning, Scheduling and Coordination 2 |
Regular Session |
Chair: Wan, Weiwei | Osaka University |
Co-Chair: Harada, Kensuke | Osaka University |
|
16:40-16:45, Paper WeDT27.1 | |
Reinforcement Learning-Based Scheduling for Dual-Arm Cluster Tool with Multifunctional Process Modules |
|
Liu, LangJin | Guangdong University of Technology |
Zhu, Qinghua | Guangdong University of Technology |
Liang, Weixin | Guangdong University of Technology |
Hou, Yan | Guangdong University of Technology |
Keywords: Planning, Scheduling and Coordination, Semiconductor Manufacturing, Reinforcement Learning
Abstract: Cluster tools are vital in semiconductor manufacturing, where multifunctional process modules (MPMs) enhance flexibility and efficiency. However, variable MPMs and processing times in dual-arm cluster tools (DACTs) complicate scheduling, as variable MPM allocation patterns yield distinct productivity. This paper proposes a reinforcement learning-based method for DACTs with MPMs. Firstly, an algorithm enumerates all valid MPM allocation patterns. Then, an adaptive deep Q-Network (DQN) with masking techniques selects the most efficient pattern and generates robot schedules, minimizing makespan and wafer post-processing residency time across diverse DACT configurations. Experiments validate that the proposed approach offers robust, flexible scheduling solutions to boost semiconductor manufacturing productivity.
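The masking technique mentioned in the abstract typically amounts to suppressing invalid actions before the greedy argmax; a minimal PyTorch sketch of that generic pattern follows (shapes and names are illustrative). During training, the same mask is usually also applied when computing the target network's maximum.

    import torch

    def masked_greedy_action(q_values, valid_mask):
        """q_values: (B, A) tensor; valid_mask: (B, A) bool tensor of
        currently feasible actions. Invalid actions get -inf Q-values."""
        masked_q = q_values.masked_fill(~valid_mask, float("-inf"))
        return masked_q.argmax(dim=-1)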
|
|
16:45-16:50, Paper WeDT27.2 | |
Focus Bug: Learning Environmental Awareness for Efficient Mapless Navigation |
|
Dansereau, Charles | Polytechnique Montreal |
Duisterhof, Bardienus P | Carnegie Mellon University |
Nicolescu, Gabriela | Polytechnique Montreal |
Keywords: Reactive and Sensor-Based Planning, Micro/Nano Robots, Reinforcement Learning
Abstract: Tiny robots such as nano quadcopters or micro rovers are highly beneficial for various applications as they are inexpensive, agile, and safe for humans. However, their extreme size, weight, and power (SWAP) constraints lead to extremely limited compute, making autonomous navigation challenging. Existing approaches have enabled navigation within these tight constraints, but struggle in dynamic and cluttered scenes. To this end, we present Focus Bug, a novel and robust mapless navigation algorithm that can run on extremely limited hardware. Focus Bug reduces the amount of processed sensory data using a tiny reinforcement learning policy, only processing the inputs necessary for navigation. We use deep reinforcement learning (DRL) to identify critical parts of the robot's range data and combine it with classical mapless navigation methods to benefit from their robustness and established performance. We implement and evaluate Focus Bug both on a drone in simulation and a micro-rover in the real world to show it can be applied across embodiments. Our hybrid approach outperforms the state-of-the-art in DRL navigation (57% fewer collisions in dynamic environments) while reducing the amount of range data processed by 87%, and achieving a 2.6X improvement in processing time compared to classical methods. Focus Bug is the first method to achieve the high success rate of robust methods (97%) within such a tight compute budget.
|
|
16:50-16:55, Paper WeDT27.3 | |
HEATS: A Hierarchical Framework for Efficient Autonomous Target Search with Mobile Manipulators |
|
Zhang, Hao | Harbin Institute of Technology, Shenzhen |
Wang, Yifei | Harbin Institute of Technology, Shenzhen |
Zhang, Weifan | Harbin Institute of Technology, Shenzhen |
Wang, Yu | University of Science and Technology of China |
Chen, Haoyao | Harbin Institute of Technology, Shenzhen |
Keywords: Reactive and Sensor-Based Planning, Search and Rescue Robots, Motion and Path Planning
Abstract: Utilizing robots for autonomous target search in complex and unknown environments can greatly improve the efficiency of search and rescue missions. However, existing methods have shown inadequate performance due to hardware platform limitations, inefficient viewpoint selection strategies, and conservative motion planning. In this work, we propose HEATS, which enhances the search capability of mobile manipulators in complex and unknown environments. We design a target viewpoint planner tailored to the strengths of mobile manipulators, ensuring efficient and comprehensive viewpoint planning. Supported by this, a whole-body motion planner integrates global path search with local IPC optimization, enabling the mobile manipulator to safely and agilely visit target viewpoints, significantly improving search performance. We present extensive simulated and real-world tests, in which our method demonstrates reduced search time, higher target search completeness, and lower movement cost compared to classic and state-of-the-art approaches. Our method will be open-sourced for community benefit.
|
|
16:55-17:00, Paper WeDT27.4 | |
Assembly Sequence Planning Considering Robotic Motion Costs and Multi-Operation Constraints |
|
Nagai, Haruto | Osaka University |
Wan, Weiwei | Osaka University |
Suemoto, Hiroki | Kawasaki Heavy Industry, Ltd |
Masaoka, Kouichi | Kawasaki Heavy Industry, Ltd |
Harada, Kensuke | Osaka University |
Keywords: Planning, Scheduling and Coordination, Intelligent and Flexible Manufacturing, Factory Automation
Abstract: In assembly tasks, multiple operations, such as positioning, snap-fitting, and screw fastening are often required for a single workpiece. The multiple operations add complexity to the planning process. To address this challenge, we propose an assembly sequence planning method that considers the combination of multiple operations associated with each workpiece. We define the sequence of these operations as a “workflow” and search for an optimal assembly sequence while respecting the workflow constraints of the workpieces. Beyond handling multi-operation constraints, our method optimizes the robot’s motion costs by assigning weights to the search tree and minimizing these costs accordingly. To evaluate the effectiveness of the proposed approach, we compare assembly planning results for multi-operation tasks with and without workflow decomposition. Additionally, we analyze the influence of motion cost minimization on planning performance and computational efficiency. Experimental results verified the effectiveness of the proposed method in improving assembly planning efficiency.
|
|
17:00-17:05, Paper WeDT27.5 | |
Learning to Solve the Multi-Agent Task Assignment Problem for Automated Data Centers |
|
Loiodice, Christelle | NAVER LABS Europe |
Michel, Sofia | NAVER LABS Europe |
Drakulic, Darko | NAVER LABS Europe |
Andreoli, Jean-Marc | NAVER LABS Europe |
Keywords: Planning, Scheduling and Coordination, Multi-Robot Systems, Deep Learning Methods
Abstract: We consider a large-scale data center where a fleet of heterogeneous mobile robots and human workers collaborate to handle various installation and maintenance tasks. We focus on the underlying multi-agent task assignment problem which is crucial to optimize the overall system. We formalize the problem as a Markov Decision Process and propose an end-to-end learning approach to solve it. We demonstrate the effectiveness of our approach in simulation with realistic data and in the presence of uncertainty.
|
|
17:05-17:10, Paper WeDT27.6 | |
Encoding Robot Behavior As Sensory-Based Adaptation of Learned Skillful Trajectories |
|
Madera, Jonathan | University of Texas at Austin |
Varveropoulos, Leonidas Giorgos | University of Texas at Austin |
Majewicz Fey, Ann | The University of Texas at Austin |
Keywords: Reactive and Sensor-Based Planning, Motion Control, Medical Robots and Systems
Abstract: The imitation learning paradigm is a systematic approach for encoding intelligent behaviors into robotic systems. While a model representation of the ideal task behavior can be learned by processing a set of human demonstrations, learning a modeling representation that can generalize the desired behavior to perform well in dynamic environments with human collaborators is an open challenge. To address this problem, we encode intelligent robot behavior as a combination of a popular learned baseline control policy (Gaussian Mixture Model, GMM) with reactive control policies that activate based on triggers from online sensory information during task execution. Two contributions encapsulate the approach: an iterative algorithm to combine the learned and reactive policies and examples for mapping sensory information into desired robot reactive behaviors. The proposed approach was implemented on a bi-manual surgical robot and evaluated on how well the combined control policy balanced the behavioral constraints imposed during a collision avoidance and compliance tasks. Successful dynamic collision avoidance results and compliance responses that reduce environmental forces on the manipulator support the use of this paradigm for designing intelligent robot behaviors which can complement learned models to program complex robot behaviors that can balance task performance in scenarios with human collaborators.
|
|
WeDT28 |
104 |
Cognitive Robotics |
Regular Session |
Chair: Guerrero Rosado, Oscar | Donders Institute for Brain, Cognition and Behaviour–Radboud Universiteit, Nijmegen, the Netherlands |
|
16:40-16:45, Paper WeDT28.1 | |
Motivational Cognitive Maps Allow Robot Biomimetic Autonomy |
|
Guerrero Rosado, Oscar | Donders Institute for Brain, Cognition and Behaviour–Radboud Uni |
F. Amil, Adrián | Donders Institute for Brain, Cognition and Behaviour–Radboud Uni |
T. Freire, Ismael | Institute of Intelligent Systems and Robotics, Sorbonne Universi |
Vinck, Martin | Donders Institute for Brain, Cognition and Behaviour–Radboud Uni |
F.M.J. Verschure, Paul | Department of Health Psychology, Universidad Miguel Hernández De |
Keywords: Neurorobotics, Reinforcement Learning, Cognitive Control Architectures
Abstract: The mammalian hippocampal formation plays a critical role in efficient and flexible navigation. Hippocampal place cells exhibit spatial tuning, characterized by increased firing rates when an animal occupies specific locations in its environment. However, the mechanisms underlying the encoding of spatial information by hippocampal place cells remain not fully understood. Evidence suggests that spatial preferences are shaped by multimodal sensory inputs. Yet, existing hippocampal models typically rely on a single sensory modality, overlooking the role of interoceptive information in the formation of cognitive maps. In this paper, we introduce the Motivational Hippocampal Autoencoder (MoHA), a biologically inspired model that integrates interoceptive (motivational) and exteroceptive (visual) information to generate motivationally modulated cognitive maps. MoHA captures key hippocampal firing properties across different motivational states and, when embedded in a reinforcement learning agent, generates adaptive internal representations that drive goal-directed foraging behavior. Grounded in the principle of biological autonomy, MoHA enables the agent to dynamically adjust its navigation strategies based on internal drives, ensuring that behavior remains flexible and context-dependent. Our results show the benefits of integrating motivational cognitive maps into artificial agents with a varying set of goals, laying the foundation for self-regulated multi-objective reinforcement learning.
|
|
16:45-16:50, Paper WeDT28.2 | |
Human-Inspired Soft Anthropomorphic Hand System for Neuromorphic Object and Pose Recognition Using Multimodal Signals |
|
Wang, Fengyi | Technical University of Munich |
Fu, Xiangyu | Technical University of Munich |
Thakor, Nitish V. | Johns Hopkins University, Baltimore, USA |
Cheng, Gordon | Technical University of Munich |
Keywords: Neurorobotics, Soft Robot Applications, Bioinspired Robot Learning
Abstract: The human somatosensory system integrates multimodal sensory feedback, including tactile, proprioceptive, and thermal signals, to enable comprehensive perception and effective interaction with the environment. Inspired by the biological mechanism, we present a sensorized soft anthropomorphic hand equipped with diverse sensors designed to emulate the sensory modalities of the human hand. This system incorporates biologically inspired encoding schemes that convert multimodal sensory data into spike trains, enabling highly efficient processing through Spiking Neural Networks (SNNs). By utilizing these neuromorphic signals, the proposed framework achieves 97.14% accuracy in object recognition across varying poses, significantly outperforming previous studies on soft hands. Additionally, we introduce a novel differentiator neuron model to enhance material classification by capturing dynamic thermal responses. Our results demonstrate the benefits of multimodal sensory fusion and highlight the potential of neuromorphic approaches for achieving efficient, robust, and human-like perception in robotic systems.
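As a generic example of the biologically inspired encoding step described here, the sketch below rate-codes a normalized sensor reading into a Bernoulli spike train suitable for an SNN; the time horizon and maximum firing probability are illustrative assumptions, not the paper's encoders.

    import numpy as np

    def rate_encode(signal, t_steps=100, max_rate=0.5, seed=0):
        """signal in [0, 1] -> binary spike train of length t_steps."""
        rng = np.random.default_rng(seed)
        p = np.clip(signal, 0.0, 1.0) * max_rate  # per-step spike probability
        return (rng.random(t_steps) < p).astype(np.uint8)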
|
|
16:50-16:55, Paper WeDT28.3 | |
High-Precision Tracking of Time-Varying Trajectories for Microsurgical Robots in Constrained Environments |
|
Zhai, Yu-Peng | Taiyuan University of Technology |
Bian, Gui-Bin | Institute of Automation, Chinese Academy of Sciences |
Li, Zhen | Institute of Automation, Chinese Academy of Sciences |
Ye, Qiang | Institute of Automation, Chinese Academy of Sciences |
Deng, Tianqi | Institute of Automation, Chinese Academy of Sciences |
Zhang, Ming-Yang | Institute of Automation, Chinese Academy of Sciences |
Fu, Pan | Beijing Institute of Technology |
He, Wenhao | Institute of Automation, Chinese Academy of Sciences |
Deng, Yawen | Beijing Institute of Technology |
Keywords: Neural and Fuzzy Control, Motion Control, Human Factors and Human-in-the-Loop
Abstract: This study addresses the challenge of achieving high-precision tracking of time-varying trajectories for microsurgical robots under nonlinear disturbances and motion constraints. A hybrid control framework that integrates fuzzy adaptive sliding mode control with a radial basis function neural network is proposed. The framework dynamically adjusts the sliding mode gains to suppress high-frequency chattering and compensates for unmodeled disturbances such as joint friction and tissue contact forces. Experiments on a self-developed ophthalmic microsurgical robot platform show that the trajectory tracking error is reduced to 1.1 μm, improvements of 85.9%, 76.1%, and 66.7% over PID control, sliding mode control, and nonsingular fast terminal sliding mode control, respectively. The tracking latency is 19 ms. In central retinal artery occlusion experiments on live pigs, the system successfully performed intravascular injection with a maximum error of 3.97 μm. Through fuzzy logic and neural network optimization, this solution achieves micrometer-level precision and robustness, effectively suppressing high-frequency control noise and low-frequency environmental disturbances and ensuring the accuracy and safety of microsurgical robots.
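For orientation, a radial basis function compensator of the general kind combined here with fuzzy adaptive sliding mode control can be sketched as below: Gaussian basis activations with adaptively updated weights approximate unmodeled disturbances, driven by the sliding variable. All parameters, the update law, and the class interface are illustrative assumptions rather than the paper's controller.

    import numpy as np

    class RBFCompensator:
        def __init__(self, centers, width=1.0, gamma=0.01):
            self.c = np.asarray(centers, float)  # (M, dim) basis centers
            self.w = np.zeros(len(self.c))       # adaptive output weights
            self.width, self.gamma = width, gamma

        def phi(self, x):
            d2 = np.sum((self.c - np.asarray(x, float)) ** 2, axis=1)
            return np.exp(-d2 / (2.0 * self.width ** 2))

        def output(self, x):
            return self.w @ self.phi(x)  # disturbance estimate

        def adapt(self, x, s, dt):
            # gradient-style weight update driven by the sliding variable s
            self.w += self.gamma * s * self.phi(x) * dt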
|
|
16:55-17:00, Paper WeDT28.4 | |
NeuroVE: Brain-Inspired Linear-Angular Velocity Estimation with Spiking Neural Networks |
|
Li, Xiao | National University of Defense Technology |
Chen, Xieyuanli | National University of Defense Technology |
Guo, Ruibin | National University of Defense Technology |
Wu, Yujie | Hong Kong Polytechnic University |
Zhou, Zongtan | National University of Defense Technology |
Yu, Fangwen | Tsinghua University |
Lu, Huimin | National University of Defense Technology |
Keywords: Neurorobotics, Biologically-Inspired Robots, SLAM
Abstract: Vision-based ego-velocity estimation is a fundamental problem in robot state estimation. However, the constraints of frame-based cameras, including motion blur and insufficient frame rates in dynamic settings, readily lead to the failure of conventional velocity estimation techniques. Mammals exhibit a remarkable ability to accurately estimate their ego-velocity during aggressive movement. Hence, integrating this capability into robots shows great promise for addressing these challenges. In this paper, we propose a brain-inspired framework for linear-angular velocity estimation, dubbed NeuroVE. The NeuroVE framework employs an event camera to capture the motion information and implements spiking neural networks (SNNs) to simulate the brain's spatial cells' function for velocity estimation. We formulate the velocity estimation as a time-series forecasting problem. To this end, we design an Astrocyte Leaky Integrate-and-Fire (ALIF) neuron model to encode continuous values. Additionally, we have developed an Astrocyte Spiking Long Short-term Memory (ASLSTM) structure, which significantly improves the time-series forecasting capabilities, enabling an accurate estimate of ego-velocity. Results from both simulation and real-world experiments indicate that NeuroVE has achieved an approximate 60% increase in accuracy compared to other SNN-based approaches.
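The astrocyte-augmented ALIF neuron extends the standard leaky integrate-and-fire (LIF) dynamics, which for reference look like the sketch below; the time constant, threshold, and reset are illustrative, and the paper's ALIF adds an astrocyte mechanism for encoding continuous values on top.

    import numpy as np

    def lif_step(v, i_in, tau=20.0, v_th=1.0, v_reset=0.0, dt=1.0):
        """One Euler step of LIF membrane dynamics with reset on spike."""
        v = np.asarray(v, float)
        v = v + dt / tau * (-(v - v_reset) + i_in)  # leaky integration
        spike = v >= v_th
        v = np.where(spike, v_reset, v)             # reset fired neurons
        return v, spike.astype(np.uint8)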
|
|
17:00-17:05, Paper WeDT28.5 | |
User Experience Estimation in Human-Robot Interaction Via Multi-Instance Learning of Multimodal Social Signals |
|
Miyoshi, Ryo | CyberAgent |
Okafuji, Yuki | CyberAgent, Inc |
Iwamoto, Takuya | CyberAgent |
Nakanishi, Junya | Osaka Univ |
Baba, Jun | CyberAgent, Inc |
Keywords: Multi-Modal Perception for HRI, Gesture, Posture and Facial Expressions
Abstract: In recent years, the demand for social robots has grown, requiring them to adapt their behaviors based on users' states. Accurately assessing user experience (UX) in human-robot interaction (HRI) is crucial for achieving this adaptability. UX is a multi-faceted measure encompassing aspects such as sentiment and engagement, yet existing methods often focus on these individually. This study proposes a UX estimation method for HRI by leveraging multimodal social signals. We construct a UX dataset and develop a Transformer-based model that utilizes facial expressions and voice for estimation. Unlike conventional models that rely on momentary observations, our approach captures both short- and long-term interaction patterns using a multi-instance learning framework. This enables the model to capture temporal dynamics in UX, providing a more holistic representation. Experimental results demonstrate that our method outperforms third-party human evaluators in UX estimation.
|
|
WeDT29 |
105 |
SLAM: Sensing and Mapping |
Regular Session |
|
16:40-16:45, Paper WeDT29.1 | |
MapEval: Towards Unified, Robust and Efficient SLAM Map Evaluation Framework |
|
Hu, Xiangcheng | Hong Kong University of Science and Technology |
Wu, Jin | University of Science and Technology Beijing |
Jia, Mingkai | The Hong Kong University of Science and Technology |
Yan, Hongyu | Hong Kong University of Science and Technology |
Jiang, Yi | City University of Hong Kong |
Jiang, Binqian | Hong Kong University of Science and Technology |
Zhang, Wei | HKUST |
He, Wei | University of Science and Technology Beijing |
Tan, Ping | The Hong Kong University of Science and Technology |
Keywords: SLAM, Mapping, Performance Evaluation and Benchmarking
Abstract: Evaluating massive-scale point cloud maps in Simultaneous Localization and Mapping (SLAM) still remains challenging due to three limitations: lack of unified standards, poor robustness to noise, and computational inefficiency. We propose MapEval, a novel framework for point cloud map assessment. Our key innovation is a voxelized Gaussian approximation method that enables efficient Wasserstein distance computation while maintaining physical meaning. This leads to two complementary metrics: Voxelized Average Wasserstein Distance (AWD) for global geometry and Spatial Consistency Score (SCS) for local consistency. Extensive experiments demonstrate that MapEval achieves a 100-500 times speedup while maintaining evaluation performance compared to traditional metrics like Chamfer Distance (CD) and Mean Map Entropy (MME). Our framework shows robust performance across both simulated and real-world datasets with million-scale point clouds. The MapEval library (https://github.com/JokerJohn/Cloud_Map_Evaluation) will be publicly available to promote map evaluation practices in the robotics community.
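The voxelized Gaussian approximation pays off because the 2-Wasserstein distance between two Gaussians has a closed form; the sketch below evaluates that textbook formula per voxel pair (aggregation into the AWD metric follows the paper, and the helper name is ours).

    import numpy as np
    from scipy.linalg import sqrtm

    def gaussian_w2(mu1, cov1, mu2, cov2):
        """W2^2 = |mu1 - mu2|^2 + Tr(cov1 + cov2 - 2 (cov2^0.5 cov1 cov2^0.5)^0.5)."""
        s2 = sqrtm(cov2).real
        cross = sqrtm(s2 @ cov1 @ s2).real
        w2_sq = np.sum((np.asarray(mu1) - np.asarray(mu2)) ** 2) \
            + np.trace(cov1 + cov2 - 2.0 * cross)
        return np.sqrt(max(w2_sq, 0.0))  # clamp tiny negative round-off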
|
|
16:45-16:50, Paper WeDT29.2 | |
UA-MPC: Uncertainty-Aware Model Predictive Control for Motorized LiDAR Odometry |
|
Li, Jianping | Nanyang Technological University |
Xu, Xinhang | Nanyang Technological University |
Liu, Jinxin | Nanyang Technological University |
Cao, Kun | Nanyang Technological University |
Yuan, Shenghai | Nanyang Technological University |
Xie, Lihua | Nanyang Technological University |
Keywords: SLAM, Mapping, Sensor-based Control
Abstract: Accurate and comprehensive 3D sensing using LiDAR systems is crucial for various applications in photogrammetry and robotics, including facility inspection, Building Information Modeling (BIM), and robot navigation. Motorized LiDAR systems can expand the Field of View (FoV) without adding multiple scanners, but existing motorized LiDAR systems often rely on constant-speed motor control, leading to suboptimal performance in complex environments. To address this, we propose UA-MPC, an uncertainty-aware motor control strategy that balances scanning accuracy and efficiency. By predicting discrete observabilities of LiDAR Odometry (LO) through ray tracing and modeling their distribution with a surrogate function, UA-MPC efficiently optimizes motor speed control according to different scenes. Additionally, we develop a ROS-based realistic simulation environment for motorized LiDAR systems, enabling the evaluation of control strategies across diverse scenarios. Extensive experiments, conducted on both simulated and real-world scenarios, demonstrate that our method significantly improves odometry accuracy while preserving the scanning efficiency of motorized LiDAR systems. Specifically, it achieves over a 60% reduction in positioning error with less than a 2% decrease in efficiency compared to constant-speed control, offering a smarter and more effective solution for active 3D sensing tasks. The simulation environment for controlling motorized LiDAR is open-sourced at: https://github.com/kafeiyin00/UA-MPC.git.
|
|
16:50-16:55, Paper WeDT29.3 | |
EFEAR-4D: Ego-Velocity Filtering for Efficient and Accurate 4D Radar Odometry |
|
Wu, Xiaoyi | Harbin Institute of Technology, Shenzhen |
Chen, YuShuai | Harbin Institute of Technology, Shenzhen |
Li, Zhan | Harbin Institute of Technology |
Hong, Ziyang | Heriot-Watt University |
Hu, Liang | Harbin Institute of Technology, Shenzhen |
Keywords: SLAM, Range Sensing, Localization
Abstract: Odometry is a crucial component for successfully implementing autonomous navigation, relying on sensors such as cameras, LiDARs and IMUs. However, these sensors may encounter challenges in extreme weather conditions, such as snowfall and fog. The emergence of FMCW radar technology offers the potential for robust perception in adverse conditions. As the latest generation of FMCW radars, the 4D mmWave radar provides point clouds with range, azimuth, elevation, and Doppler velocity information, despite inherent sparsity and noise in the point cloud. In this paper, we propose EFEAR-4D, an accurate, highly efficient, and learning-free method for large-scale 4D radar odometry estimation. EFEAR-4D exploits Doppler velocity information delicately for robust ego-velocity estimation, resulting in a highly accurate prior guess. EFEAR-4D maintains robustness against point-cloud sparsity and noise across diverse environments through dynamic object removal and effective region-wise feature extraction. Extensive experiments on two publicly available 4D radar datasets demonstrate state-of-the-art reliability and localization accuracy of EFEAR-4D under various conditions. Furthermore, we have collected a dataset following the same route but varying installation heights of the 4D radar, emphasizing the significant impact of radar height on point cloud quality - a crucial consideration for real-world deployments. Our algorithm and dataset will be available soon at https://github.com/CLASS-Lab/EFEAR-4D.
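Doppler-based ego-velocity estimation commonly reduces to a small least-squares problem: for static points, each measured radial speed equals minus the dot product of the point's unit direction with the ego velocity. The sketch below shows that standard model only; EFEAR-4D's robustness additions (dynamic object removal, region-wise features) are beyond it.

    import numpy as np

    def ego_velocity(points, doppler):
        """points: (N, 3) positions in the radar frame; doppler: (N,)
        measured radial speeds of static points. Solves d @ v = -doppler."""
        points = np.asarray(points, float)
        d = points / np.linalg.norm(points, axis=1, keepdims=True)
        v, *_ = np.linalg.lstsq(d, -np.asarray(doppler, float), rcond=None)
        return v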
|
|
16:55-17:00, Paper WeDT29.4 | |
Incorporating Point Uncertainty in Radar SLAM |
|
Xu, Yang | The Hong Kong University of Science and Technology |
Huang, Qiucan | Hong Kong University of Science and Technology |
Shen, Shaojie | Hong Kong University of Science and Technology |
Yin, Huan | Hong Kong University of Science and Technology |
Keywords: SLAM, Range Sensing, Robotics in Hazardous Fields
Abstract: Radar SLAM is robust in challenging conditions, such as fog, dust, and smoke, but suffers from the sparsity and noisiness of radar sensing, including speckle noise and multipath effects. This study presents a performance-enhanced radar SLAM system that incorporates point uncertainty. The base system is a radar-inertial odometry pipeline that leverages velocity-aided radar points and high-frequency inertial measurements. We first propose to model the uncertainty of radar points in polar coordinates, reflecting the nature of radar sensing. The proposed uncertainty model is then integrated into the data association module and incorporated into back-end state estimation. Real-world experiments on both public and self-collected datasets validate the effectiveness of the proposed models and approaches. The findings highlight the potential of point uncertainty modeling to improve radar SLAM systems. We make the code and collected dataset publicly available at https://github.com/HKUST-Aerial-Robotics/RIO.
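The standard way to realize such a polar-coordinate uncertainty model is first-order covariance propagation through the polar-to-Cartesian map. The 2D sketch below illustrates this; the noise magnitudes are placeholders, and the paper's full model and its back-end integration are not reproduced.

import numpy as np

def point_covariance(r, theta, sigma_r=0.1, sigma_theta=np.deg2rad(1.0)):
    # x = r cos(theta), y = r sin(theta); first-order propagation C = J S J^T
    J = np.array([[np.cos(theta), -r * np.sin(theta)],
                  [np.sin(theta),  r * np.cos(theta)]])
    S = np.diag([sigma_r**2, sigma_theta**2])
    return J @ S @ J.T

# Distant points get elongated tangential uncertainty, which can then weight
# matching residuals (e.g., Mahalanobis distance instead of Euclidean).
print(point_covariance(5.0, 0.0))
print(point_covariance(50.0, 0.0))   # much larger tangential (y) variance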
|
|
17:00-17:05, Paper WeDT29.5 | |
SceneFactory: A Workflow-Centric and Unified Framework for Incremental Scene Modeling |
|
Yuan, Yijun | University of Wuerzburg |
Bleier, Michael | Julius Maximilian University of Würzburg |
Nuechter, Andreas | University of Würzburg |
Keywords: SLAM, RGB-D Perception, Mapping, Photogrammetry
Abstract: We present SceneFactory, a workflow-centric and unified framework for incremental scene modeling that conveniently supports a wide range of applications, such as (unposed and/or uncalibrated) multi-view depth estimation, LiDAR completion, (dense) RGB-D/RGB-L/Mono/Depth-only reconstruction, and SLAM. The workflow-centric design uses multiple blocks as the basis for constructing different production lines. The supported applications, i.e., production lines, thereby avoid redundancy in their designs, and the focus is placed on each block itself for independent expansion. To support all input combinations, our implementation consists of four building blocks that form SceneFactory: (1) tracking, (2) flexion, (3) depth estimation, and (4) scene reconstruction. The tracking block is based on Mono SLAM and is extended to support RGB-D and RGB-LiDAR (RGB-L) inputs. Flexion is used to convert the (untrackable) depth image into a trackable image. For general-purpose depth estimation, we propose an unposed and uncalibrated multi-view depth estimation model (U^2-MVD) to estimate dense geometry. U^2-MVD exploits dense bundle adjustment to solve for poses, intrinsics, and inverse depth.
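The workflow-centric design can be read as block composition: each supported application is an ordered production line over the same four blocks. The sketch below shows only that pattern; all block bodies are placeholders, not SceneFactory's implementation.

def flexion(d):                  # depth (untrackable) -> trackable image
    d["track_image"] = d["depth"]
    return d

def tracking(d):                 # Mono-SLAM-based, extended to RGB-D/RGB-L
    d["pose"] = "pose"
    return d

def depth_estimation(d):         # stand-in for U^2-MVD dense geometry
    d.setdefault("depth", "dense_depth")
    return d

def reconstruction(d):
    d["model"] = (d.get("pose"), d.get("depth"))
    return d

def run_line(blocks, d):
    for block in blocks:
        d = block(d)
    return d

LINES = {
    "mono":       [tracking, depth_estimation, reconstruction],
    "rgb_d":      [tracking, reconstruction],
    "depth_only": [flexion, tracking, depth_estimation, reconstruction],
}
print(run_line(LINES["depth_only"], {"depth": "raw"})["model"])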
|
|
17:05-17:10, Paper WeDT29.6 | |
Multimodal Fusion SLAM with Fourier Attention |
|
Zhou, Youjie | Shandong University |
Mei, Guofeng | Fondazione Bruno Kessler |
Wang, Yiming | Fondazione Bruno Kessler |
Wan, Yi | Shandong University |
Poiesi, Fabio | Fondazione Bruno Kessler |
Keywords: SLAM, RGB-D Perception, Sensor Fusion
Abstract: Visual SLAM is particularly challenging in environments affected by noise, varying lighting conditions, and darkness. Learning-based optical flow algorithms can leverage multiple modalities to address these challenges, but traditional optical flow-based visual SLAM approaches often require significant computational resources. To overcome this limitation, we propose FMF-SLAM, an efficient multimodal fusion SLAM method that utilizes the fast Fourier transform (FFT) to enhance algorithmic efficiency. Specifically, we introduce a novel Fourier-based self-attention and cross-attention mechanism to extract features from RGB and depth signals. We further enhance the interaction of multimodal features by incorporating multi-scale knowledge distillation across modalities. We also demonstrate the practical feasibility of FMF-SLAM in real-world scenarios with real-time performance by integrating it with a security robot, fusing it with a GNSS-RTK global positioning module and global bundle adjustment. Our approach is validated using video sequences from TUM, TartanAir, and our real-world datasets, showcasing state-of-the-art performance under noisy, varying-lighting, and dark conditions. Our code and datasets are available at https://github.com/youjie-zhou/FMF-SLAM.git.
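The efficiency claim rests on replacing quadratic-cost attention with FFT-based mixing. The NumPy sketch below shows the generic idea: self-attention-like token mixing via a 2D FFT, and a spectral product as one plausible cross-modal coupling. Both are hedged stand-ins; FMF-SLAM's actual Fourier self-/cross-attention layers are not reproduced here.

import numpy as np

def fourier_mix(tokens):
    # FNet-style mixing: one 2D FFT touches every token/channel pair,
    # at O(N log N) instead of attention's O(N^2).
    return np.real(np.fft.fft2(tokens))

def fourier_cross(rgb_feats, depth_feats):
    # Hypothetical cross-modal coupling: multiply one spectrum by the
    # conjugate of the other (a correlation), then return to feature space.
    spec = np.fft.fft2(rgb_feats) * np.conj(np.fft.fft2(depth_feats))
    return np.real(np.fft.ifft2(spec))

x = np.random.randn(64, 32)      # (tokens, channels)
print(fourier_mix(x).shape, fourier_cross(x, x).shape)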
|
|
WeDT30 |
106 |
Aerial Systems: Perception and Autonomy 2 |
Regular Session |
Chair: Vanderborght, Bram | Vrije Universiteit Brussel |
|
16:40-16:45, Paper WeDT30.1 | |
UEVAVD: A Dataset for Developing UAV's Eye View Active Object Detection |
|
Jiang, Xinhua | National University of Defense Technology |
Liu, Tianpeng | National University of Defense Technology |
Liu, Li | National University of Defense Technology |
Liu, Zhen | National University of Defense Technology |
Liu, Yongxiang | National University of Defense Technology |
Keywords: Aerial Systems: Perception and Autonomy, Data Sets for Robotic Vision, Reinforcement Learning
Abstract: Occlusion is a longstanding difficulty that challenges UAV-based object detection. Many works address this problem by adapting the detection model. However, few of them exploit the fact that the UAV can fundamentally improve detection performance by changing its viewpoint. Active Object Detection (AOD) offers an effective way to achieve this. Through Deep Reinforcement Learning (DRL), AOD endows the UAV with the ability to plan its own path in search of observations more conducive to target identification. Unfortunately, no dataset has been available for developing UAV AOD methods. To fill this gap, we release a UAV's-eye-view active vision dataset named UEVAVD and hope it can facilitate research on the UAV AOD problem. Additionally, we improve the existing DRL-based AOD method by incorporating inductive biases when learning the state representation. First, owing to partial observability, we use a gated recurrent unit to extract state representations from the observation sequence rather than from a single-view observation. Second, we pre-decompose the scene with the Segment Anything Model (SAM) and filter out irrelevant information with the derived masks. With these practices, the agent can learn an active viewing policy with better generalization capability. The effectiveness of our innovations is validated by experiments on the UEVAVD dataset. Our dataset will soon be available at https://github.com/Leo000ooo/UEVAVD_datase.
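The two inductive biases translate directly into code: a recurrent encoder that accumulates evidence across views (for partial observability), and SAM-derived masks that strip background before encoding. A minimal PyTorch sketch follows; the dimensions and the masking step are placeholders, not the paper's exact architecture.

import torch
import torch.nn as nn

class RecurrentStateEncoder(nn.Module):
    """Aggregates per-view features into one belief state for the DRL policy."""
    def __init__(self, feat_dim=256, state_dim=128):
        super().__init__()
        self.gru = nn.GRU(feat_dim, state_dim, batch_first=True)

    def forward(self, view_feats):      # (B, T, feat_dim) observation sequence
        _, h = self.gru(view_feats)     # memory across views, not a single view
        return h[-1]                    # (B, state_dim)

def mask_background(obs, sam_mask):
    # SAM pre-decomposition: zero out pixels outside the relevant segments.
    return obs * sam_mask

enc = RecurrentStateEncoder()
print(enc(torch.randn(4, 5, 256)).shape)   # torch.Size([4, 128])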
|
|
16:45-16:50, Paper WeDT30.2 | |
Aerial Gym Simulator: A Framework for Highly Parallelized Simulation of Aerial Robots |
|
Kulkarni, Mihir | NTNU: Norwegian University of Science and Technology |
Rehberg, Welf | Norwegian University of Science and Technology |
Alexis, Kostas | NTNU - Norwegian University of Science and Technology |
Keywords: Aerial Systems: Perception and Autonomy, Machine Learning for Robot Control, Reinforcement Learning
Abstract: This paper contributes the Aerial Gym Simulator, a highly parallelized, modular framework for simulation and rendering of arbitrary multirotor platforms based on NVIDIA Isaac Gym. Aerial Gym supports the simulation of under-, fully-, and over-actuated multirotors, offering parallelized geometric controllers alongside a custom GPU-accelerated rendering framework for ray-casting capable of capturing depth, segmentation, and vertex-level annotations from the environment. Multiple examples for key tasks, such as depth-based navigation through reinforcement learning, are provided. The comprehensive set of tools developed within the framework makes it a powerful resource for research on learning for control, planning, and navigation using state information as well as exteroceptive sensor observations. Extensive simulation studies are conducted and successful sim2real transfer of trained policies is demonstrated. The Aerial Gym Simulator is open-sourced at: https://github.com/ntnu-arl/aerial_gym_simulator.
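The speedups of such simulators come from keeping the whole batch of environments as one tensor and stepping them with a single vectorized update. The toy NumPy sketch below shows only that pattern; it is not the Aerial Gym or Isaac Gym API, and the point-mass dynamics stand in for the simulator's full multirotor models.

import numpy as np

N, dt = 4096, 0.01                       # environments stepped together
pos = np.zeros((N, 3))
vel = np.zeros((N, 3))
g = np.array([0.0, 0.0, -9.81])

def step(accel_cmd):                     # (N, 3) commanded specific force
    vel[:] += (accel_cmd + g) * dt       # one batched update, no per-env loop
    pos[:] += vel * dt

step(np.tile([0.0, 0.0, 9.81], (N, 1)))  # hover command for every environment
print(pos.shape)                         # (4096, 3)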
|
|
16:50-16:55, Paper WeDT30.3 | |
Automated Behavior Planning for Fruit Tree Pruning Via Redundant Robot Manipulators: Addressing the Behavior Planning Challenge |
|
Liu, Gaoyuan | Vrije Universiteit Brussel |
Boom, Bas | Imec |
Slob, Naftali | OnePlanet Research Center |
Durodié, Yuri | Vrije Universiteit Brussel |
Nowé, Ann | VUB |
Vanderborght, Bram | Vrije Universiteit Brussel |
Keywords: Agricultural Automation, Robotics and Automation in Agriculture and Forestry, Manipulation Planning
Abstract: Pruning is an essential agricultural practice for orchards. Proper pruning can promote healthier growth and optimize fruit production throughout the orchard's lifespan. Robot manipulators have been developed as an automated solution for this repetitive task, which typically requires seasonal labor with specialized skills. While previous research has primarily focused on the challenges of perception, the complexities of manipulation are often overlooked. These challenges involve planning and control in both joint and Cartesian spaces to guide the end-effector through intricate, obstructive branches. Our work addresses the behavior planning challenge for a robotic pruning system, which entails a multi-level planning problem in environments with complex collision dynamics. In this paper, we formulate the planning problem for a high-dimensional robotic arm in a pruning scenario, investigate the system's intrinsic redundancies, and propose a comprehensive pruning workflow that integrates perception, modeling, and holistic planning. In our experiments, we demonstrate that more comprehensive planning methods can significantly enhance the performance of the robotic manipulator. Finally, we
|
|
16:55-17:00, Paper WeDT30.4 | |
Using YOLOv5-DSE for Egg Counting in Conventional Scale Layer Farms (I) |
|
Wu, Dihua | Zhejiang University |
Cui, Di | Zhejiang University |
Zhou, Mingchuan | Zhejiang University |
Wang, Yanan | Zhejiang University |
Pan, Jinming | Zhejiang University |
Ying, Yibin | Zhejiang University |
Keywords: Agricultural Automation, Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception
Abstract: Given that common egg counting methods in conventional layer farms are inefficient and costly, there is a growing demand for cost-effective solutions with high counting accuracy, expandable functionality, and flexibility that can be easily shared between different coops. However, accurate real-time egg counting faces challenges due to the eggs' small size, density variation, and mutual similarity, exacerbated by dynamic poses. Moreover, current animal-industry methods emphasize single-image counting, limiting their suitability for video-based counting due to a lack of frame-to-frame target association. The you-only-look-once version 5-DeepSORT-spatial encoding (YOLOv5-DSE) algorithm is proposed to tackle these issues and deliver efficient and reliable egg counting. The algorithm contains three main modules: 1) the egg detector utilizes an improved YOLOv5 to locate eggs in video frames automatically, 2) the DeepSORT-based tracking module continuously tracks each egg's position between frames, preventing the detector from losing egg localization, and 3) the spatial encoding (SE) module counts the eggs. Extensive experiments are conducted on 4808 eggs on a commercial farm. Our proposed egg-counting approach achieves a counting accuracy of 99.52% and a speed of 22.57 fps, surpassing not only the DeepSORT-SE and ByteTrack-SE versions of eight advanced YOLO-series object detectors (YOLOX and YOLOv6-v9) but also other egg-counting methods. The proposed YOLOv5-DSE provides real-time and reliable egg counting for commercial layer farms. This approach could be further expanded to the egg conveyor to locate the cages of low-laying hens and help companies cull more efficiently.
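The frame-to-frame association the abstract highlights is what makes video counting reliable: an egg is counted once per track ID, not once per detection. The sketch below uses a virtual counting line as a stand-in for the spatial encoding (SE) module, whose exact design is not given in the abstract.

def update_counts(tracks, line_y, counted, count):
    """tracks: {track_id: (prev_cy, curr_cy)} center heights from detector+tracker."""
    for tid, (prev_cy, curr_cy) in tracks.items():
        if tid not in counted and prev_cy < line_y <= curr_cy:
            counted.add(tid)            # each track ID is counted exactly once,
            count += 1                  # so re-detections never double-count
    return count

counted = set()
count = update_counts({1: (10, 30), 2: (5, 8)}, line_y=20, counted=counted, count=0)
print(count, counted)   # 1 {1}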
|
|
17:00-17:05, Paper WeDT30.5 | |
Bimanual Grape Manipulation for Human-Inspired Robotic Harvesting (I) |
|
Stavridis, Sotiris | Aristotle University of Thessaloniki |
Droukas, Leonidas | Aristotle University of Thessaloniki |
Doulgeri, Zoe | Aristotle University of Thessaloniki |
Keywords: Agricultural Automation, Bimanual Manipulation
Abstract: Most existing robotic harvesters utilize a unimanual approach, with a single arm grasping and detaching the crop either via a detachment movement or via stem cutting by a specially designed gripper/cutter end-effector. However, such unimanual solutions cannot be applied to sensitive crops and cluttered environments such as grapes, where obstacles may occlude the stem and leave no space for the cutter's placement. In such cases, the solution requires a bimanual robot that visually unveils the stem while manipulating the grasped crop to create cutting affordances. Considering vertical trellis setups for grapes, a dual-arm coordinated motion control methodology is proposed in this work for reaching a stem pre-cut state. The camera-equipped arm with the cutter reaches the stem, visually unveiling it, while the second arm moves the grasped grape toward the surrounding free space, facilitating its stem cutting. In-lab experimental validation and extensive evaluation in a real vineyard with the BACCHUS bimanual harvesting platform demonstrate the performance of the proposed approach.
|
|
17:05-17:10, Paper WeDT30.6 | |
Differentiable Space Carving for 3D Reconstruction Using Imaging Sonar |
|
Feng, Yunxuan | Harbin Institute of Technology (Shenzhen) |
Lu, Wenjie | Harbin Institute of Technology (Shenzhen) |
Gao, Haowen | Harbin Institute of Technology, Shenzhen |
Nie, Binyu | Harbin Institute of Technology, Shenzhen |
Lin, Kaiyang | Harbin Institute of Technology, Shenzhen |
Hu, Liang | Harbin Institute of Technology, Shenzhen |
Keywords: AI-Based Methods, Sensorimotor Learning, Learning from Experience
Abstract: Effective 3D reconstruction utilizing imaging sonars is vital for underwater robots, particularly in turbid water conditions. The absence of elevation angles in acoustic echo measurements significantly slows down the Neural Radiance Field (NeRF) method: the differentiable rendering model for sonar images, unlike that for visual imagery, must generate samples covering a spatial sector rather than a single ray per pixel. To address this, we present a fast 3D reconstruction method using sonar images, termed Differentiable Space Carving (DSC). DSC carves the space iteratively by rendering echo probabilities instead of echo intensities, eliminating the intensity network commonly found in NeRF models. The absence of occupancy-echo correspondences is effectively handled through backpropagation guided by rendering losses. Additionally, we leverage occupancy probability grids and multiresolution hash encoding to construct differentiable occupancy models, ensuring faster convergence than multilayer perceptrons. Experiments were conducted in numerically simulated environments and on datasets from a laboratory tank. Compared with existing NeRF methods, DSC reconstructs objects about ten times faster and recovers more detail.
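Rendering echo probabilities rather than intensities has a compact form: a range bin returns an echo if at least one occupancy sample along its (unobserved) elevation arc is occupied. The NumPy sketch below shows that rendering model only; DSC's sampling scheme, losses, grids, and hash encoding are not reproduced, and in practice the product would run under an autodiff framework so the occupancies receive gradients.

import numpy as np

def echo_probability(occ_along_arc):
    # occ_along_arc: occupancy probabilities sampled over the elevation arc
    # of one (range, azimuth) bin; echo <=> at least one sample is occupied.
    return 1.0 - np.prod(1.0 - occ_along_arc)

occ = np.full(16, 0.05)           # mostly free space along the arc
print(echo_probability(occ))      # low echo probability
occ[8] = 0.95                     # one occupied cell on the arc
print(echo_probability(occ))      # near 1; compared against the measured echo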
|
|
17:10-17:15, Paper WeDT30.7 | |
TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation |
|
Wen, Junjie | East China Normal University |
Zhu, Yichen | Midea Group |
Li, Jinming | Shanghai University |
Zhu, MinJie | East China Normal University |
Tang, Zhibin | Shanghai Jingzhi |
Wu, Kun | Syracuse University |
Xu, Zhiyuan | Midea Group |
Liu, Ning | Beijing Innovation Center of Humanoid Robotics |
Cheng, Ran | Midea Robozone |
Shen, Chaomin | East China Normal University |
Peng, Yaxin | Shanghai University |
Feng, Feifei | Midea Group |
Tang, Jian | Midea Group (Shanghai) Co., Ltd |
Keywords: AI-Based Methods, Deep Learning in Grasping and Manipulation
Abstract: Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. However, current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data, making real-world deployment difficult. In this paper, we introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models: (1) faster inference speeds, and (2) improved data efficiency, eliminating the need for a pre-training stage. Our framework incorporates two essential components to build TinyVLA: (1) initializing the policy backbone with robust, high-speed multimodal models, and (2) integrating a diffusion policy decoder during fine-tuning to enable precise robot actions. We conducted extensive evaluations of TinyVLA in simulation and on real robots, demonstrating that our approach significantly outperforms the state-of-the-art VLA model, OpenVLA, in terms of speed and data efficiency, while delivering comparable or superior performance. Additionally, TinyVLA exhibits strong generalization capabilities across various dimensions, including language instructions, novel objects, unseen positions, changes in object appearance, background variations, and environmental shifts, often matching or exceeding the performance of OpenVLA. We believe that TinyVLA offers an interesting perspective on utilizing pre-trained multimodal models for policy learning.
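The second component admits a small sketch: a denoising head that predicts action noise conditioned on the pretrained backbone's vision-language embedding. All dimensions, the conditioning scheme, and the missing noise schedule are placeholder assumptions, not TinyVLA's actual decoder.

import torch
import torch.nn as nn

class DiffusionActionHead(nn.Module):
    """Predicts the noise to strip from a noisy action, given VLM context."""
    def __init__(self, ctx_dim=512, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ctx_dim + act_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, act_dim))

    def forward(self, ctx, noisy_action, t):    # t: diffusion timestep in [0,1]
        return self.net(torch.cat([ctx, noisy_action, t], dim=-1))

head = DiffusionActionHead()
eps = head(torch.randn(2, 512), torch.randn(2, 7), torch.rand(2, 1))
print(eps.shape)   # torch.Size([2, 7])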
|
|
17:15-17:20, Paper WeDT30.8 | |
Generalized Modeling of Overactuated Aerial Manipulators: Theory and Application (I) |
|
Markovic, Lovro | University of Zagreb, Faculty of Electrical Engineering and Comp |
Car, Marko | Faculty of Electrical Engineering and Computing |
Orsag, Matko | University of Zagreb, Faculty of Electrical Engineering and Comp |
Bogdan, Stjepan | University of Zagreb |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control, Kinematics
Abstract: This paper proposes a unified dynamic model for aerial robots that encompasses all known actuation principles, including tilting propellers and centroid-variation methods such as moving masses or robotic manipulators. Of course, one can envision a wide variety of vehicles, all with different combinations of actuation principles; therefore, a generalized modeling methodology for such vehicles is developed and presented in this paper. The modeling approach is verified through a comparative analysis of MATLAB simulations and laboratory experiments with a tilting-propeller aerial manipulator vehicle named Toucan. Finally, in order to fully explore and exploit the capabilities of the designed aerial manipulator, contact-based experiments with force tracking are performed using the proposed adaptive impedance controller. To demonstrate the advantages of such a vehicle, the adaptive impedance control method is used to generate position and orientation commands and achieve end-effector force tracking on a flat surface while the aerial manipulator maintains a neutral attitude. This greatly increases the stability and safety of contact-based operations, as the vehicle does not require changes in attitude to meet the force requirements.
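Force tracking with an impedance controller on a flat surface reduces, along the contact normal, to second-order impedance dynamics plus an adapted reference. The one-axis Euler sketch below illustrates that scheme; the gains, the setpoint-adaptation law, and the stiff-wall contact model are illustrative assumptions, not the paper's adaptive impedance controller.

def impedance_step(x, dx, x_ref, f_meas, f_des, dt,
                   m=1.0, d=40.0, k=200.0, gamma=1e-4):
    x_ref += gamma * (f_des - f_meas)               # adapt setpoint from force error
    ddx = (-d * dx - k * (x - x_ref) - f_meas) / m  # impedance dynamics in contact
    dx += ddx * dt
    x += dx * dt
    return x, dx, x_ref

x = dx = x_ref = 0.0
for _ in range(2000):
    f_meas = max(0.0, 500.0 * x)     # assumed stiff flat surface at x = 0
    x, dx, x_ref = impedance_step(x, dx, x_ref, f_meas, f_des=5.0, dt=0.002)
print(round(max(0.0, 500.0 * x), 2))  # contact force settles near the 5 N target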
|
| |