Last updated on April 17, 2025. This conference program is tentative and subject to change.
Technical Program for Wednesday May 21, 2025
|
WeAT2 |
301 |
SLAM 3 |
Regular Session |
Chair: Civera, Javier | Universidad De Zaragoza |
Co-Chair: Kwon, Cheolhyeon | Ulsan National Institute of Science and Technology |
|
08:30-08:35, Paper WeAT2.1 | |
JPG-SLAM: Joint Point-Gaussian Splatting Representation for Dense Dynamic SLAM |
|
Huang, Kunrui | Wuhan University |
Yang, Wennan | Wuhan University |
Zhou, Pengwei | Affiliation (University, Organization, Company)* |
Li, Li | Wuhan University |
Yao, Jian | Wuhan University |
Keywords: SLAM, RGB-D Perception
Abstract: This paper presents a simultaneous localization and mapping (SLAM) system to provide accurate pose estimation and dynamic scene reconstruction. Our approach proposes a Joint Point-Gaussian Splatting representation, which fully integrates the robustness of isotropic feature points in pose estimation and the flexibility of anisotropic 3D Gaussians in scene representation. This system does not need to suppress the anisotropic representation of Gaussian elements, which enables the mapping module to achieve finer scene representation with lower memory consumption. Additionally, in order to enhance the adaptability of the system in dynamic environments, we introduced a dynamic region recognition module and utilized 3D Gaussian Splatting and 4D Gaussian Splatting representations to represent static and dynamic regions respectively. Furthermore, we developed a local map management strategy for Gaussian Splatting mapping, effectively reducing the memory and computational resource usage in the mapping process. Experiments on public datasets demonstrate that our system achieves state-of-the-art tracking and mapping accuracy compared to existing baselines.
|
|
08:35-08:40, Paper WeAT2.2 | |
FMCW-LIO: A Doppler LiDAR-Inertial Odometry |
|
Zhao, Mingle | University of Macau |
Wang, Jiahao | University of Macau |
Gao, Tianxiao | University of Macao |
Xu, Chengzhong | University of Macau |
Kong, Hui | University of Macau |
Keywords: Sensor Fusion, Localization, SLAM
Abstract: Conventional LiDAR-inertial odometry (LIO) or SLAM methods rely heavily on geometric features of environments, as LiDARs primarily provide range measurements rather than motion measurements. This situation changes, however, with the advent of novel Frequency Modulated Continuous Wave (FMCW) LiDARs, which not only offer high-resolution point ranges but also capture the instantaneous per-point Doppler velocity through the Doppler effect. In this letter, we propose FMCW-LIO, a novel and robust LIO that leverages the intrinsic Doppler measurements from FMCW LiDARs. To correctly exploit Doppler velocities, a motion compensation method is designed, and a Doppler-aided observation model is applied for on-manifold state estimation. Dynamic points can then be effectively removed by the Doppler criteria, yielding more consistent geometric observations. FMCW-LIO thus achieves accurate state estimation and static mapping, even in structure-degenerated environments. Extensive experiments in diverse scenes show that FMCW-LIO outperforms other algorithms in both accuracy and robustness.
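For readers unfamiliar with how per-point Doppler velocities constrain ego-motion, the sketch below illustrates the standard least-squares relation for a static scene: an FMCW LiDAR's radial velocity reading for a static point equals the negative projection of the sensor velocity onto the point's bearing. This is a generic illustration with made-up numbers and threshold, not the authors' observation model.

```python
import numpy as np

def estimate_ego_velocity(points, doppler, thresh=0.5):
    """Least-squares ego-velocity from per-point Doppler measurements.

    points  : (N, 3) point positions in the sensor frame [m]
    doppler : (N,)   measured radial (Doppler) velocities [m/s]
    thresh  : residual threshold [m/s] used to flag likely dynamic points

    For a static point p, the expected Doppler reading is
        d = -(p / ||p||) . v_sensor,
    so stacking the unit bearings gives a linear system in v_sensor.
    """
    bearings = points / np.linalg.norm(points, axis=1, keepdims=True)
    A, b = -bearings, doppler
    v_est, *_ = np.linalg.lstsq(A, b, rcond=None)
    residuals = np.abs(A @ v_est - b)
    static_mask = residuals < thresh          # points consistent with ego-motion
    return v_est, static_mask

# Toy example: sensor moving at 1 m/s along x, with one dynamic outlier.
rng = np.random.default_rng(0)
pts = rng.uniform(-10, 10, size=(100, 3))
v_true = np.array([1.0, 0.0, 0.0])
dop = -(pts / np.linalg.norm(pts, axis=1, keepdims=True)) @ v_true
dop[0] += 3.0                                 # a moving object corrupts one return
v_hat, mask = estimate_ego_velocity(pts, dop)
print(v_hat, mask[0])                         # ~[1, 0, 0]; False for the outlier
```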
|
|
08:40-08:45, Paper WeAT2.3 | |
Submodular Optimization for Keyframe Selection & Usage in SLAM |
|
Thorne, David | University of California, Los Angeles |
Chan, Nathan | University of California, Los Angeles |
Ma, Yanlong | University of California, Los Angeles |
Robison, Christopher, Christa | Army Research Laboratory |
Osteen, Philip | U.S. Army Research Laboratory |
Lopez, Brett | University of California, Los Angeles |
Keywords: SLAM, Optimization and Optimal Control, Field Robots
Abstract: Keyframes are LiDAR scans saved for future reference in Simultaneous Localization And Mapping (SLAM), but despite their central importance most algorithms leave choices of which scans to save and how to use them to wasteful heuristics. This work proposes two novel keyframe selection strategies for localization and map summarization, as well as a novel approach to submap generation which selects keyframes that best constrain localization. Our results show that online keyframe identification and submap generation reduce the number of saved keyframes and improve per scan computation time without compromising localization performance. We also present a map summarization feature for quickly capturing environments under strict map size constraints.
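As a point of reference for the submodular-selection idea, the sketch below runs the standard greedy algorithm (which carries the usual 1 - 1/e guarantee for monotone submodular objectives) on a toy coverage objective. The coverage sets and budget are invented for illustration; this is not the paper's selection criterion.

```python
def greedy_select(candidates, coverage, budget):
    """Greedy maximization of a set-coverage objective.

    candidates : list of keyframe ids
    coverage   : dict mapping keyframe id -> set of covered map cells
    budget     : maximum number of keyframes to keep
    """
    selected, covered = [], set()
    for _ in range(budget):
        best, best_gain = None, 0
        for k in candidates:
            if k in selected:
                continue
            gain = len(coverage[k] - covered)   # marginal gain of adding k
            if gain > best_gain:
                best, best_gain = k, gain
        if best is None:                        # no remaining keyframe adds coverage
            break
        selected.append(best)
        covered |= coverage[best]
    return selected, covered

# Toy example with hand-made coverage sets.
cov = {0: {1, 2, 3}, 1: {3, 4}, 2: {5, 6, 7, 8}, 3: {1, 5}}
keep, cells = greedy_select(list(cov), cov, budget=2)
print(keep, len(cells))   # picks the two keyframes with the largest marginal coverage
```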
|
|
08:45-08:50, Paper WeAT2.4 | |
Equivariant Filter Design for Range-Only SLAM |
|
Ge, Yixiao | Australian National University |
Pearce, Arthur | Australian National University |
van Goor, Pieter | University of Twente |
Mahony, Robert | Australian National University |
Keywords: SLAM, Range Sensing, Mapping
Abstract: Range-only Simultaneous Localisation and Mapping (RO-SLAM) is of interest in the robotics community due to its practical applications; for example, ultra-wideband (UWB) and Bluetooth Low Energy (BLE) localisation in terrestrial and aerial applications and acoustic beacon localisation in marine applications. In this work, we consider a mobile robot equipped with an inertial measurement unit (IMU) and a range sensor that measures distances to a collection of fixed landmarks. We derive an equivariant filter (EqF) for the RO-SLAM problem based on a symmetry Lie group that is compatible with the range measurements. The proposed filter does not require bootstrapping or initialisation of landmark positions, and demonstrates robustness to the no-prior situation. The filter is demonstrated on a real-world dataset, and it is shown to significantly outperform a state-of-the-art EKF alternative in terms of both accuracy and robustness.
|
|
08:50-08:55, Paper WeAT2.5 | |
Toward Globally Optimal State Estimation Using Automatically Tightened Semidefinite Relaxations |
|
Dümbgen, Frederike | ENS, PSL University |
Holmes, Connor | University of Toronto |
Agro, Ben | University of Toronto |
Barfoot, Timothy | University of Toronto |
Keywords: Optimization and Optimal Control, Localization, Robot Safety, Global Optimality
Abstract: In recent years, semidefinite relaxations of common optimization problems in robotics have attracted growing attention due to their ability to provide globally optimal solutions. In many cases, it was shown that specific handcrafted redundant constraints are required to obtain tight relaxations, and thus global optimality. These constraints are formulation-dependent and typically identified through a lengthy manual process. Instead, the present article suggests an automatic method to find a set of redundant constraints sufficient to obtain tightness, if they exist. We first propose an efficient feasibility check to determine if a given set of variables can lead to a tight formulation. Second, we show how to scale the method to larger problems. At no point in the process do we have to find redundant constraints manually. We showcase the effectiveness of the approach, in simulation and on real datasets, for range-based localization and stereo-based pose estimation. We also reproduce semidefinite relaxations presented in recent literature and show that our automatic method always finds a smaller set of constraints sufficient for tightness than previously considered.
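"Tightness" here refers to the rank of the optimal semidefinite matrix: a (numerically) rank-one solution certifies that the relaxation recovers the global optimum of the original problem. The snippet below is a minimal, generic illustration of that check on a toy quadratically constrained problem using cvxpy; it does not reproduce the paper's automatic constraint-generation procedure.

```python
import cvxpy as cp
import numpy as np

# Toy QCQP: minimize x^T Q x subject to ||x||^2 = 1 (smallest-eigenvalue problem).
np.random.seed(0)
A = np.random.randn(4, 4)
Q = A + A.T

# Shor relaxation: lift x x^T -> X (PSD), drop the rank-one constraint.
X = cp.Variable((4, 4), PSD=True)
prob = cp.Problem(cp.Minimize(cp.trace(Q @ X)), [cp.trace(X) == 1])
prob.solve()

# Tightness check: the relaxation is tight iff the optimal X is rank one.
eigvals = np.linalg.eigvalsh(X.value)
rank = int(np.sum(eigvals > 1e-6 * eigvals.max()))
print("relaxation value:", prob.value, "rank:", rank)
# Here rank == 1 and prob.value matches the smallest eigenvalue of Q.
```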
|
|
08:55-09:00, Paper WeAT2.6 | |
Viewpoint-Aware Visibility Scoring for Point Cloud Registration in Loop Closure |
|
Yoon, Ilseung | Ulsan National Institute of Science and Technology |
Islam, Tariq | Ulsan National Institute of Science and Technology |
Kim, Kwangrok | Ulsan National Institute of Science and Technology |
Kwon, Cheolhyeon | Ulsan National Institute of Science and Technology |
Keywords: SLAM, Autonomous Vehicle Navigation, Mapping
Abstract: Lidar-based Simultaneous Localization and Mapping (SLAM) encounters a substantial challenge in the form of accumulating errors, which can adversely impact its reliability. Loop closing techniques have been extensively employed to counteract this issue. Nonetheless, the loop closing problem remains difficult to resolve, as point clouds often exhibit only partial overlap due to disparities in scanning pose (viewpoint). This renders conventional point cloud registration, such as the Iterative Closest Point (ICP) algorithm, problematic. To overcome this challenge, this paper proposes a two-stage viewpoint-aware point cloud registration technique that assigns suitable weights to the correspondence pairs associating two point clouds from different viewpoints. The weights account for the visibility of points from their respective viewpoint as well as from the viewpoint of the counterpart point cloud, making the registration rely more on points that are commonly visible from both viewpoints. Experimental results on the KITTI and Apollo-SouthBay datasets indicate that the proposed technique delivers more precise and robust performance compared to the baseline techniques.
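To make the role of per-correspondence weights concrete, the snippet below solves a single weighted point-to-point alignment step with the weighted Kabsch/SVD solution, the building block inside a weighted ICP loop. The visibility weights are simply an input here; the paper's visibility-scoring scheme is not reproduced.

```python
import numpy as np

def weighted_rigid_align(src, dst, w):
    """One weighted point-to-point alignment step (Kabsch/SVD).

    src, dst : (N, 3) corresponding points; w : (N,) non-negative weights
    Returns R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.
    """
    w = w / w.sum()
    mu_s, mu_d = w @ src, w @ dst                        # weighted centroids
    S = (src - mu_s).T @ np.diag(w) @ (dst - mu_d)       # weighted cross-covariance
    U, _, Vt = np.linalg.svd(S)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Toy example: recover a known rotation; down-weight one corrupted correspondence.
rng = np.random.default_rng(1)
src = rng.normal(size=(50, 3))
angle = np.deg2rad(20.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([0.5, -0.2, 0.1])
dst[0] += 5.0                          # bad correspondence (e.g., occluded point)
weights = np.ones(50)
weights[0] = 1e-3                      # low visibility weight suppresses its influence
R_est, t_est = weighted_rigid_align(src, dst, weights)
print(np.allclose(R_est, R_true, atol=1e-2))
```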
|
|
WeAT3 |
303 |
Mechanism Design 1 |
Regular Session |
Chair: Whitney, John Peter | Northeastern University |
Co-Chair: Herneth, Christopher | Technical University Munich |
|
08:30-08:35, Paper WeAT3.1 | |
Tension Dependent Twisted String Actuator Modelling and Efficacy Benchmarking in Force and Impedance Control |
|
Herneth, Christopher | Technical University Munich |
Cheng, Yi | Technical University of Munich |
Ganguly, Amartya | Technical University of Munich |
Haddadin, Sami | Technical University of Munich |
Keywords: Actuation and Joint Mechanisms, Force Control, Tendon/Wire Mechanism
Abstract: This study presents a comprehensive experimental analysis of Twisted String Actuators (TSA), focused on enhancing contraction modelling accuracy and establishing a baseline for TSA tension and impedance control efficacy. A novel TSA string radius function is introduced, computing effective radii for multi-strand bundles based on axial actuator tension. The proposed model was validated in physical experiments, resulting in a reduction of maximal errors between measured and simulated actuator contraction trajectories from up to 60% in established models to around 10% in our work. Additionally, the tension-dependent radius modification effectively reduced errors between the estimated and the measured bundle tension by an order of magnitude, marking an essential step towards TSA control independent of bundle tension measurements. TSA tension control was assessed based on four metrics: accuracy, precision, impact stability, and bandwidth, following ISO 9283:1998 standards. The quality of tension control was found to be dependent on bundle tension, twisting angle and strand quantity, whereas impact stability was maintained in all configurations. Joint impedance control with TSA was evaluated for perturbation stability and position control bandwidth, where the latter was enhanced with increasing joint stiffness. The presented analysis informs designers about the capabilities of TSAs in different configurations, and their respective suitability for desired applications.
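For context, the widely used baseline TSA kinematics treat the twisted bundle as a helix of fixed radius, giving the contraction x(θ) = L - sqrt(L² - θ²r²); the paper's contribution is to replace the constant radius with a tension-dependent one. The sketch below implements the baseline model and leaves the radius function as a pluggable argument; the linear radius_of_tension used in the example is a made-up placeholder, not the authors' identified function.

```python
import numpy as np

def tsa_contraction(theta, length, radius):
    """Baseline twisted string actuator kinematics (helix model).

    theta  : motor twist angle [rad]
    length : untwisted string length [m]
    radius : effective string/bundle radius [m]
    Returns the axial contraction x = L - sqrt(L^2 - theta^2 r^2) [m].
    """
    return length - np.sqrt(length**2 - (theta * radius) ** 2)

def radius_of_tension(tension, r0=0.6e-3, k=-2.0e-6):
    """Placeholder tension-dependent effective radius (illustrative only)."""
    return r0 + k * tension            # e.g., the bundle compacts slightly under load

L = 0.30                               # 30 cm string (example value)
theta = np.linspace(0.0, 80.0, 5)      # twist angles [rad]
for T in (5.0, 50.0):                  # axial tension [N]
    x = tsa_contraction(theta, L, radius_of_tension(T))
    print(f"T = {T:4.1f} N, contraction [mm]:", np.round(1e3 * x, 2))
```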
|
|
08:35-08:40, Paper WeAT3.2 | |
A Novel Twisted-Winching String Actuator for Robotic Applications: Design and Validation |
|
Poon, Ryan | Massachusetts Institute of Technology |
Padia, Vineet | MIT |
Hunter, Ian | MIT |
Keywords: Tendon/Wire Mechanism, Mechanism Design, Actuation and Joint Mechanisms
Abstract: This paper presents a novel actuator system combining a twisted string actuator (TSA) with a winch mechanism. Relative to traditional hydraulic and pneumatic systems in robotics, TSAs are compact and lightweight but face limitations in stroke length and force-transmission ratios. Our integrated TSA-winch system overcomes these constraints by providing variable transmission ratios through dynamic adjustment. It increases actuator stroke by winching instead of overtwisting, and it improves force output by twisting. The design features a rotating turret that houses a winch, which is mounted on a bevel gear assembly driven by a through-hole drive shaft. Mathematical models are developed for the combined displacement and velocity control of this system. Experimental validation demonstrates the actuator's ability to achieve a wide range of transmission ratios and precise movement control. We present performance data on movement precision and generated forces, discussing the results in the context of existing literature. This research contributes to the development of more versatile and efficient actuation systems for advanced robotic applications and improved automation solutions.
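A minimal way to see why combining twisting and winching extends both stroke and force range is to write the output displacement as the sum of the two contributions, each with its own configuration-dependent transmission ratio. The sketch below uses the textbook helix model for the twisted stage and an ideal drum for the winch; the dimensions are invented and this is not the paper's model.

```python
import numpy as np

def combined_stroke(theta_twist, theta_winch, L, r_string, r_drum):
    """Displacement of a twisted-string stage in series with a winch drum.

    theta_twist : twist angle of the string bundle [rad]
    theta_winch : winch drum rotation [rad]
    L, r_string : untwisted string length and effective radius [m]
    r_drum      : winch drum radius [m]
    """
    x_twist = L - np.sqrt(L**2 - (theta_twist * r_string) ** 2)
    x_winch = r_drum * theta_winch
    return x_twist + x_winch

def transmission_ratios(theta_twist, L, r_string, r_drum):
    """Instantaneous displacement per radian (dx/dtheta) for each stage."""
    dx_twist = (theta_twist * r_string**2) / np.sqrt(L**2 - (theta_twist * r_string) ** 2)
    return dx_twist, r_drum    # twisting ratio grows with twist; the winch ratio is constant

# Illustrative numbers only.
L, r_s, r_d = 0.25, 0.5e-3, 8e-3
print(combined_stroke(60.0, 4 * np.pi, L, r_s, r_d))       # total stroke [m]
print(transmission_ratios(60.0, L, r_s, r_d))              # [m/rad] per stage
```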
|
|
08:40-08:45, Paper WeAT3.3 | |
Design and Evaluation of High-Performance Motion-Decoupled Cable Transmission Modules |
|
Takei, Ryo | Northeastern University |
Frishman, Samuel | Stanford University |
Whitney, John Peter | Northeastern University |
Keywords: Tendon/Wire Mechanism, Actuation and Joint Mechanisms, Medical Robots and Systems
Abstract: Cable transmissions are commonly used in robotics for remote force transmission, offering a lightweight, compact, and efficient solution for transmitting high forces between input and output. However, cables in flexible compression housings (Bowden cables) exhibit high static friction, which increases exponentially with total bend angle. Alternatively, internally routed, ball-bearing-supported cable capstan transmissions are low friction but complex, and they present challenges in routing multiple sets of cables. In this paper, we propose motion-decoupled cable transmission modules that address these challenges, occupying the middle ground and functioning as discrete-joint, ball-bearing-supported Bowden cables. Our rolling-plus-twist joint design decouples pairs of routed cables from changing significantly in tension, length, or friction during large-angle motion of the linked transmission. Using sub-1 mm diameter high-strength synthetic cable, the transmission exhibits a maximum coupling motion of only 0.15 mm over the full range of motion of the cable-transmission mechanism, approximately 10% of pretension in combined hysteresis and friction, a transmission stiffness of 10 N/mm, and a weight of just 9 g per rolling joint and 5 g per twist joint. Two applications are demonstrated: cable routing alongside a robot arm, for example for remote gripper actuation, and remote needle advancement for an MRI-safe needle biopsy robot.
|
|
08:45-08:50, Paper WeAT3.4 | |
Advanced Xθ Reluctance Electromagnetic Micropositioning System for Precision Motion Control |
|
Pumphrey, Michael Joseph | University of Guelph |
Alatawneh, Natheer | University of Guelph |
Al Janaideh, Mohammad | University of Guelph |
Keywords: Actuation and Joint Mechanisms
Abstract: This study examines a novel micropositioning trajectory manipulator in Xθ, energized by a reluctance actuator (RA) and two accompanying moving magnet actuators (MMA). The design is characterized by a C-core RA, which features asymmetrical air gaps between the mover and the stator elements under angular θ rotation. When the stator coil is energized, a magnetic flux induces a force in the mover. Two MMAs can add force and torque dynamics to the system via solenoid and permanent magnet (PM) pairs to offer additional corrective actions, facilitating control of a translational x and rotational θ two-degree-of-freedom (2DOF) actuation system. Flexure hinges aid the retraction force of the mover element and provide the needed stiffness to the system without frictional effects. The system was modeled analytically and optimized to achieve the outlined performance objectives, and it was validated experimentally through triangle and sinusoidal trajectories in open-loop control. The most relevant application is scanning mirror systems, where specific targeted rotational and translational trajectories can benefit light-beam positioning. This system allows both the translation and rotation specifications of a selected trajectory to be realized in one actuation unit, opening up more design possibilities for controlling precision positioning systems.
|
|
08:50-08:55, Paper WeAT3.5 | |
Cycloidal Quasi-Direct Drive Actuator Designs with Learning-Based Torque Estimation for Legged Robotics |
|
Zhu, Alvin | University of California Los Angeles |
Tanaka, Yusuke | University of California, Los Angeles |
Rafeedi, Fadi | University of California, Los Angeles |
Hong, Dennis | UCLA |
Keywords: Machine Learning for Robot Control, Actuation and Joint Mechanisms, Legged Robots
Abstract: This paper presents a novel approach through the design and implementation of Cycloidal Quasi-Direct Drive actuators for legged robotics. The cycloidal gear mechanism, with its inherent high torque density and mechanical robustness, offers significant advantages over conventional designs. By integrating cycloidal gears into the Quasi-Direct Drive framework, we aim to enhance the performance of legged robots, particularly in tasks demanding high torque and dynamic loads, while still keeping them lightweight. Additionally, we develop a torque estimation framework for the actuator using an Actuator Network, which effectively reduces the sim-to-real gap introduced by the cycloidal drive’s complex dynamics. This integration is crucial for capturing the complex dynamics of a cycloidal drive, which contributes to improved learning efficiency, agility, and adaptability for reinforcement learning.
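Actuator networks of the kind mentioned here are typically small MLPs that map a short history of joint position errors and velocities to the realized joint torque. The PyTorch sketch below shows that general structure with made-up input dimensions and synthetic data; it is not the authors' network architecture or training setup.

```python
import torch
import torch.nn as nn

# Input: (position error, velocity) over the last 3 control steps -> 6 features.
actuator_net = nn.Sequential(
    nn.Linear(6, 32), nn.Softsign(),
    nn.Linear(32, 32), nn.Softsign(),
    nn.Linear(32, 1),                  # predicted output torque [Nm]
)

optimizer = torch.optim.Adam(actuator_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Synthetic stand-in data; in practice these come from logged hardware torques.
features = torch.randn(1024, 6)
torques = torch.randn(1024, 1)

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(actuator_net(features), torques)
    loss.backward()
    optimizer.step()
print(float(loss))                     # training loss on the synthetic data
```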
|
|
08:55-09:00, Paper WeAT3.6 | |
Compact Modular Robotic Wrist with Variable Stiffness Capability |
|
Sun, Hyunsoo | Korea Institute of Science and Technology |
Park, Sungwoo | Korea University, KIST |
Hwang, Donghyun | Korea Institute of Science and Technology |
Keywords: Mechanism Design, Compliant Joint/Mechanism, Robotic Wrist, Grasping
Abstract: We have developed a two-degree-of-freedom robotic wrist with variable stiffness capability, designed for situations where collisions between the end-effector and the environment are inevitable. To enhance environmental adaptability and prevent physical damage, the wrist can operate in a low-stiffness mode. However, the flexibility of this mode might negatively impact stable and precise manipulation. To address this, we proposed a robotic wrist that switches between a passive low-stiffness mode for environmental adaptation and an active high-stiffness mode for precise manipulation. Initially, we developed a functional prototype that could manually switch between these modes, demonstrating the wrist's passive low-stiffness and active high-stiffness states. This prototype was designed as a lightweight, flat-type modular device, incorporating a sheet-type flexure as the motion guide and embedding all essential components, including actuators, sensors, and a control unit, into the wrist module. Based on the functional prototype, we developed an improved version to enhance durability and functionality. The resulting wrist module incorporates a three-axis F/T sensor and an impedance control system to control the stiffness. It measures 55 mm in height, weighs 200 g, and offers a 232.4-fold active stiffness variation.
|
|
WeAT4 |
304 |
Vision Applications |
Regular Session |
Co-Chair: Wang, Zhenzhou | Huaibei Normal University |
|
08:30-08:35, Paper WeAT4.1 | |
A Natural-Neighbor-Interpolant-Based Pattern Modeling Method for Robust Decoding of the Structured Light Pattern (I) |
|
Wang, Zhenzhou | Huaibei Normal University |
Liu, Shuo | Fujian Normal University |
Keywords: Computer Vision for Automation, Computer Vision for Manufacturing, Recognition
Abstract: Active stereo vision (ASV) computes the parallax and depth information from coded structured light patterns; thus, it can overcome the difficulty of measuring objects without textures and colors. However, decoding of the structured light patterns at locations with color crosstalk, specular reflection, and occlusion remains challenging. In this paper, we propose a natural-neighbor-interpolant-based pattern modeling method to decode the structured light point pattern robustly. The robustness is achieved in the sense of one-hundred-percent point segmentation completeness. Owing to this completeness, the points in the corresponding blocks are matched directly according to their indices. Experimental results verify the effectiveness of the proposed method.
|
|
08:35-08:40, Paper WeAT4.2 | |
Automated Video Object Detection of Motile Cells under Microscopy |
|
Song, Haocong | University of Toronto |
Chen, Wenyuan | University of Toronto |
Shan, Guanqiao | Dalian University of Technology |
Sun, Chen | University of Toronto |
Wan, Bingqing | University of Toronto |
Dai, Changsheng | Dalian University of Technology |
Liu, Hang | University of Toronto |
Wang, Shanshan | Nanjing Drum Tower Hospital, Affiliated Hospital of Medical School |
Sun, Yu | University of Toronto |
Keywords: Computer Vision for Automation
Abstract: Video object detection (VOD) of motile cells (e.g., bacteria and sperm) under microscopy is challenging due to motion blur, sporadic out-of-focus, and pose variations. Compared with VOD in generic scenes, the lower contrast and smaller color space of microscopy imaging further introduce feature overlap between the foreground objects and the background objects (e.g., impurity cells and contaminants). Transformer-based methods have achieved great success in the VOD of generic scenes by utilizing object queries to model the inner-frame objects and the inter-frame objects. However, the appearance overlap problem in microscopy video frames significantly compromises the inter-frame query aggregation by introducing background features into the object query. To tackle this challenge, this paper reports a static-dynamic query-based VOD network that treats object queries of the current video frame and reference video frames differently. Specifically, a two-stage framework is implemented that first generates high-quality object queries of reference frames with a static Transformer decoder pre-trained on a still image dataset. The network is then trained on a per-frame annotated dataset using a dynamic Transformer decoder to model the object queries of the current frame. A Reference Query Relation Module is further proposed to enhance the reference queries for more effective aggregation with the current query. Experiments on clinically collected biopsied sperm datasets validated the effectiveness of the proposed method.
|
|
08:40-08:45, Paper WeAT4.3 | |
Vision-Based Movement Primitives for Lunar Hazard Avoidance |
|
Cloud, Joseph | NASA Kennedy Space Center |
Beksi, William J. | The University of Texas at Arlington |
Schuler, Jason | NASA Kennedy Space Center |
Keywords: Space Robotics and Automation, Mining Robotics, Learning from Demonstration
Abstract: To support sustainable infrastructure on the Moon, NASA is developing the In-Situ Resource Utilization (ISRU) Pilot Excavator (IPEx) to extract and transport lunar regolith for processing and construction. During its mission, IPEx will execute various driving patterns, primarily cycling between excavation and unloading sites, with additional maneuvers such as circular traverses around the lander and raster scans for environmental mapping. In this work, dynamic movement primitives (DMPs) are used to represent these patterns. We augment the DMPs with a vision-based real-time obstacle avoidance system to navigate surface hazards, such as rocks, encountered during traversal. Our approach is evaluated in a high-fidelity simulation replicating the challenging environment of the lunar south pole to demonstrate IPEx’s ability to adapt to surface hazards while fulfilling its operational tasks.
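For readers less familiar with DMPs, the sketch below integrates a minimal 2-D discrete DMP whose transformation system is perturbed by a simple, bounded repulsive coupling term when an obstacle is near. The gains, the omitted forcing term, and the obstacle model are illustrative choices, not the paper's vision-based formulation.

```python
import numpy as np

def run_dmp(start, goal, obstacle, T=3.0, dt=0.01,
            alpha=25.0, beta=25.0 / 4.0, gamma=50.0, rho=0.1):
    """Minimal 2-D DMP rollout with a repulsive obstacle coupling term."""
    y, dy = np.array(start, float), np.zeros(2)
    g, obs = np.array(goal, float), np.array(obstacle, float)
    traj = [y.copy()]
    for _ in range(int(T / dt)):
        diff = y - obs
        d = max(np.linalg.norm(diff), 1e-6)
        avoid = gamma * np.exp(-d / rho) * diff / d   # repulsion, decays with distance
        ddy = alpha * (beta * (g - y) - dy) + avoid   # transformation system (no forcing term)
        dy += ddy * dt
        y = y + dy * dt
        traj.append(y.copy())
    return np.array(traj)

path = run_dmp(start=(0.0, 0.0), goal=(1.0, 0.0), obstacle=(0.5, 0.02))
print("end point:", path[-1])                                       # close to the goal
print("min clearance:", np.linalg.norm(path - [0.5, 0.02], axis=1).min())
```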
|
|
08:45-08:50, Paper WeAT4.4 | |
LAFNET: Lightweight Aerial Fire Detection Model for Onboard Edge Computing |
|
Zhai, Haozhou | Sun Yat-Sen University |
Yan, Weiming | Sun Yat-Sen University |
Wang, Xiaohan | Sun Yat-Sen University |
Zhao, Tuhao | Sun Yat-Sen University |
Hu, Tianjiang | Sun Yat-Sen University |
Keywords: Deep Learning for Visual Perception, Aerial Systems: Perception and Autonomy, Recognition
Abstract: Fire poses significant threats to life and property, necessitating efficient inspection and accurate identification. Although aerial computer vision algorithms hold great promise, the computational limitations of onboard platforms prevent existing algorithms from meeting high standards of accuracy and real-time performance. To address this challenge, we propose a lightweight aerial fire detection model, LAFNET. This model incorporates the EffiDarknetLight backbone, optimized for lightweight design, and integrates specially designed LG block components within the LG PAN neck, resulting in only 1.3 M parameters. Experimental results demonstrate that our method attains a good trade-off between lightweight design and detection accuracy. Compared to YOLOv5n, the smallest standard YOLO-series model, LAFNET improves mAP by 2.1% while reducing parameters and FLOPs by 27.8% and 29.3%, respectively, and improves inference speed on the Nvidia Orin Nano edge computing platform by 24.8%. These experiments indicate that LAFNET offers a highly efficient solution for aerial fire detection, combining speed and accuracy.
|
|
08:50-08:55, Paper WeAT4.5 | |
UDSV: Unsupervised Deep Stitching for Tractor-Trailer Surround View |
|
Sun, Leyao | Beijing Institute of Technology |
Liang, Hao | Beijing Institute of Technology |
Dong, Zhipeng | Beijing Institute of Technology |
Yang, Yi | Beijing Institute of Technology |
Fu, Mengyin | Beijing Institute of Technology |
Keywords: Omnidirectional Vision, Computer Vision for Transportation, Intelligent Transportation Systems
Abstract: In recent years, with the rapid development of Advanced Driver Assistance Systems (ADAS), the demand for precise and efficient surround view stitching systems has significantly increased. Traditional stitching methods perform well on small single-unit vehicles with stable camera poses. However, the stitching quality degrades sharply when applied to large tractor-trailers due to the continuous pose changes caused by the non-rigid connection between the tractor and trailer. First, the extended length of tractor-trailers results in low overlap between cameras, making feature extraction and matching challenging. Additionally, the stitched images often appear irregular, detracting from visual quality. Besides, even if static stitching looks natural, it causes jitter in dynamic scenarios due to random feature extraction. In this paper, we propose an unsupervised deep stitching method for tractor-trailer surround view systems. We introduce a feature extraction module for tractor-trailer scenarios (FMT) to enhance feature extraction in low-overlap situations. Besides, we design a spatiotemporally consistent control point constraint strategy (STCC) to achieve spatial shape preservation and temporal smoothing effects, resulting in visually consistent and stable stitched sequences. Experimental results on both public and real-world datasets show that our method efficiently completes tractor-trailer surround view stitching, producing well-aligned and natural panoramic images compared to previous methods.
|
|
08:55-09:00, Paper WeAT4.6 | |
Think Step by Step: Chain-Of-Gesture Prompting for Error Detection in Robotic Surgical Videos |
|
Shao, Zhimin | Tsinghua University |
Xu, Jialang | University College London |
Stoyanov, Danail | University College London |
Mazomenos, Evangelos | UCL |
Jin, Yueming | National University of Singapore |
Keywords: Computer Vision for Medical Robotics, Surgical Robotics: Laparoscopy, Visual Learning
Abstract: Despite advancements in robotic systems and surgical data science, ensuring safe execution in robot-assisted minimally invasive surgery (RMIS) remains challenging. Current methods for surgical error detection typically involve two parts: identifying gestures and then detecting errors within each gesture clip. These methods often overlook the rich contextual and semantic information inherent in surgical videos, with limited performance due to reliance on accurate gesture identification. Inspired by the chain-of-thought prompting in natural language processing, this letter presents a novel and real-time end-to-end error detection framework, Chain-of-Gesture (COG) prompting, integrating contextual information from surgical videos step by step. This encompasses two reasoning modules that simulate expert surgeons' decision-making: a Gestural-Visual Reasoning module using transformer and attention architectures for gesture prompting and a Multi-Scale Temporal Reasoning module employing a multi-stage temporal convolutional network with slow and fast paths for temporal information extraction. We validate our method on the JIGSAWS dataset and show improvements over the state-of-the-art, achieving 4.6% higher F1 score, 4.6% higher Accuracy, and 5.9% higher Jaccard index, with an average frame processing time of 6.69 milliseconds. This demonstrates our approach's potential to enhance RMIS safety and surgical education efficacy. The code is available at https://github.com/jinlab-imvr/Chain-of-Gesture.
|
|
WeAT5 |
305 |
Aerial Manipulation 1 |
Regular Session |
Chair: Loianno, Giuseppe | New York University |
|
08:30-08:35, Paper WeAT5.1 | |
The Palletrone Cart: Human-Robot Interaction-Based Aerial Cargo Transportation |
|
Park, Geonwoo | Seoul National University of Science and Technology |
Park, Hyungeun | Seoul National University of Science and Technology |
Park, Wooyong | Seoul National University of Science and Technology |
Lee, Dongjae | Seoul National University |
Kim, Murim | Korea Institute of Robot and Convergence |
Lee, Seung Jae | Seoul National University of Science and Technology |
Keywords: Aerial Systems: Mechanics and Control, Physical Human-Robot Interaction, Aerial Systems: Applications
Abstract: This paper presents a new cargo transportation solution based on physical human-robot interaction utilizing a novel fully-actuated multirotor platform called Palletrone. The platform is designed with a spacious upper flat surface for easy cargo loading, complemented by a rear-mounted handle reminiscent of a shopping cart. Flight trajectory control is achieved by a human operator gripping the handle and applying three-dimensional forces and torques while maintaining a stable cargo transport with zero roll and pitch attitude throughout the flight. To facilitate physical human-robot interaction, we employ an admittance control technique. Instead of relying on complex force estimation methods, like in most admittance control implementations, we introduce a simple yet effective estimation technique based on a disturbance observer robust control algorithm. We conducted an analysis of the flight stability and performance in response to changes in system mass resulting from arbitrary cargo loading. Ultimately, we demonstrate that individuals can effectively control the system trajectory by applying appropriate interactive forces and torques. Furthermore, we showcase the performance of the system through various experimental scenarios.
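As background on the admittance-control idea used here, the sketch below integrates the standard virtual mass-damper admittance law, turning an estimated interaction force into a velocity reference for the flight controller. The single-axis gains and synthetic force profile are illustrative values, not the Palletrone's parameters.

```python
import numpy as np

def admittance_reference(forces, dt=0.01, M=2.0, D=8.0):
    """Map an estimated interaction force to a velocity reference (one axis).

    Integrates the virtual dynamics  M * dv/dt + D * v = f_ext,
    so pushing on the handle produces a proportional, smoothly decaying motion.
    """
    v, refs = 0.0, []
    for f in forces:
        dv = (f - D * v) / M
        v += dv * dt
        refs.append(v)
    return np.array(refs)

# A 1 s push of 4 N followed by release.
t = np.arange(0.0, 3.0, 0.01)
f_ext = np.where(t < 1.0, 4.0, 0.0)
v_ref = admittance_reference(f_ext)
print(v_ref.max(), v_ref[-1])          # approaches f/D = 0.5 m/s, then decays toward 0
```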
|
|
08:35-08:40, Paper WeAT5.2 | |
Design of a Suspended Manipulator with Aerial Elliptic Winding |
|
Niddam, Ethan | University of Strasbourg, ICube |
Dumon, Jonathan | GIPSA-LAB |
Cuvillon, Loic | University of Strasbourg |
Durand, Sylvain | INSA Strasbourg & ICube |
Querry, Stephane | Polyvionics |
Hably, Ahmad | Grenoble-Inp |
Gangloff, Jacques | University of Strasbourg |
Keywords: Aerial Systems: Mechanics and Control, Art and Entertainment Robotics, Tendon/Wire Mechanism
Abstract: Art is one of the oldest forms of human expression, constantly evolving, taking new forms and using new techniques. With their increased accuracy and versatility, robots can be considered as a new class of tools to perform works of art. The STRAD (STReet Art Drone) project aims to perform a 10-meter-high painting on a vertical surface with sub-centimetric precision. To achieve this goal we introduce a new design for an aerial manipulator with elastic suspension capable of moving from one equilibrium position to another using only its thrusters and an elliptic pulley-counterweight system. A feedback linearization control law is implemented to perform fast and accurate winding and unwinding of an elastic cable.
|
|
08:40-08:45, Paper WeAT5.3 | |
Autonomous Heavy Object Pushing Using a Coaxial Tiltrotor (I) |
|
Hwang, Sunwoo | Seoul National University |
Lee, Dongjae | Seoul National University |
Kim, Changhyeon | Seoul National University |
Kim, H. Jin | Seoul National University |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications, Mobile Manipulation
Abstract: Aerial physical interaction (APhI) with a multirotor-based platform, such as pushing a heavy object, demands generation of a sufficiently large interaction force while maintaining stability. Such a requirement can cause rotor saturation, because the rotor thrust enlarged for the interaction force may leave a reduced margin for attitude stabilization. We first design an H-shaped coaxial tiltrotor that can generate a larger interaction force than a conventional multirotor. We then propose an overall framework composed of a high-level robust controller and low-level control allocation for the coaxial tiltrotor to ensure robustness against the uncertain motion of the unknown interacting object and to overcome the saturation issue. To guarantee robustness at all times, we design a controller based on a nonlinear disturbance observer (DOB). Then, we formulate the problem of computing low-level actuator inputs that avoid rotor saturation as a tractable nonlinear optimization problem, which can be solved in real time. The proposed framework is validated in extensive real-world experiments where the 3.3 kg tiltrotor successfully pushes a cart weighing up to 60 kg. An ablation study with the tiltrotor shows the effectiveness of the proposed control allocation law in avoiding rotor saturation. Furthermore, a comparative experiment with a conventional multirotor shows failure in the same setting, which validates the use of the coaxial tiltrotor. An experimental video can be found at htt
|
|
08:45-08:50, Paper WeAT5.4 | |
Aerial Grasping by Multi-Limbed Flying Robot SPIDAR Based on Vectored Thrust Control |
|
Zhao, Moju | The University of Tokyo |
Keywords: Aerial Systems: Applications, Grasping, Motion Control
Abstract: Delivery by aerial robots is an emerging topic in many scenarios, such as logistics, the construction industry, and disaster response. Compared to standard styles that deploy a cage or sling, a grasping style using a gripper can handle objects of various shapes. A multi-limbed structure with distributed vectorable rotors called SPIDAR shows a higher potential to grasp large objects in a three-dimensional manner. Therefore, in this paper, we focus on the advanced usage of the vectored thrust forces to achieve aerial grasping by this robot. First, a vectored thrust control to avoid aerodynamic interference on the underwind segments (e.g., the grasped object) during flight is proposed. Then, an optimization-based planning method that utilizes redundant vectored thrust forces for firm grasping is developed. Finally, we demonstrate the feasibility of the proposed flight control and grasp planning by performing a challenging grasping and transporting motion with a spherical object 0.6 m in diameter. To the best of our knowledge, this work is the first to achieve multi-finger-like grasping to carry a large object in midair.
|
|
08:50-08:55, Paper WeAT5.5 | |
Hook-Based Aerial Payload Grasping from a Moving Platform |
|
Antal, Peter | Institute for Computer Science and Control (SZTAKI) |
Péni, Tamás | SZTAKI Institute for Computer Science and Control |
Toth, Roland | Eindhoven University of Technology (TU/e) |
Keywords: Aerial Systems: Applications, Motion and Path Planning, Planning under Uncertainty
Abstract: This paper investigates payload grasping from a moving platform using a hook-equipped aerial manipulator. First, a computationally efficient trajectory optimization based on complementarity constraints is proposed to determine the optimal grasping time. To enable application in complex, dynamically changing environments, the future motion of the payload is predicted using a physics simulator-based model. The success of payload grasping under model uncertainties and external disturbances is formally verified through a robustness analysis method based on integral quadratic constraints. The proposed algorithms are evaluated in a high-fidelity physical simulator, and in real flight experiments using a custom-designed aerial manipulator platform.
|
|
08:55-09:00, Paper WeAT5.6 | |
Human-Aware Physical Human-Robot Collaborative Transportation and Manipulation with Multiple Aerial Robots |
|
Li, Guanrui | Worcester Polytechnic Institute |
Xinyang, Liu | New York University |
Loianno, Giuseppe | New York University |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control, Cooperating Robots, Physical Human-Robot Interaction
Abstract: Human-robot interaction will play an essential role in various industries and daily tasks, enabling robots to effectively collaborate with humans and reduce their physical workload. This paper proposes a novel approach for physical human-robot collaborative transportation and manipulation of a cable-suspended payload with multiple aerial robots. The proposed method enables smooth and intuitive interaction between the transported objects and a human worker. At the same time, we consider distance constraints during the operations by exploiting the internal redundancy of the multi-robot transportation system. We validate the approach through extensive simulation and real-world experiments. These include scenarios where the robot team assists the human in transporting and manipulating a load, or where the human helps the robot team navigate the environment. We experimentally demonstrate for the first time, to the best of our knowledge, that our approach enables a quadrotor team to physically collaborate with a human in manipulating a payload in all 6 DoF in collaborative human-robot transportation and manipulation tasks.
|
|
WeAT6 |
307 |
Vision-Based Navigation 1 |
Regular Session |
Chair: Zhang, Fumin | Hong Kong University of Science and Technology |
|
08:30-08:35, Paper WeAT6.1 | |
VLN-KHVR: Knowledge-And-History Aware Visual Representation for Continuous Vision-And-Language Navigation |
|
Kong, Ping | Tianjin University |
Liu, Ruonan | Shanghai Jiao Tong University |
Xie, Zongxia | Tianjin University |
Pang, Zhibo | KTH Royal Institute of Technology |
Keywords: Vision-Based Navigation
Abstract: Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to navigate with low-level actions following natural language instructions in 3D environments. Most existing approaches utilize observation features from the current step to represent the viewpoint. However, these representations often conflate redundant and essential information for navigation, introducing ambiguity into the agent's action prediction. To address the problem of inadequate representation, we propose a Knowledge-and-History Aware Visual Representation for Continuous Vision-and-Language Navigation (VLN-KHVR). The proposed approach constructs enriched visual representations tailored to navigation instructions, enhancing agents’ navigation performance. Specifically, VLN-KHVR extracts image features from the current observation, retrieves relevant knowledge in the knowledge base, and obtains the history of the navigation episode. Subsequently, the knowledge and history features are filtered to eliminate the information irrelevant to navigation instruction. These refined features are integrated with the instruction for further interaction. Finally, the aggregated features are used to guide navigation. Our model outperforms previous methods on the VLN-CE benchmark, demonstrating the effectiveness of the proposed method.
|
|
08:35-08:40, Paper WeAT6.2 | |
LiteVLoc: Map-Lite Visual Localization for Image Goal Navigation |
|
Jiao, Jianhao | University College London |
He, Jinhao | The Hong Kong University of Science and Technology (Guangzhou) |
Liu, Changkun | The Hong Kong University of Science and Technology |
Aegidius, Sebastian | University College London |
Hu, Xiangcheng | Hong Kong University of Science and Technology |
Braud, Tristan | HKUST |
Kanoulas, Dimitrios | University College London |
Keywords: Localization, Vision-Based Navigation, SLAM
Abstract: This paper presents LiteVLoc, a hierarchical visual localization framework that uses a lightweight topometric map to represent the environment. The method consists of three sequential modules that estimate camera poses in a coarse-to-fine manner. Unlike dense 3D mapping methods, LiteVLoc reduces storage by avoiding geometric reconstruction. It uses a learning-based feature matcher to establish dense correspondences between sparse keyframes and observations, and then refines poses with a geometric solver, enabling robustness to viewpoint changes. The system assumes a depth sensor or stereo camera for deployment. A novel dataset for the map-free relocalization task is also introduced. Extensive experiments, including localization and navigation in both simulated and real-world scenarios, have validated the system's performance and demonstrated its precision and efficiency for large-scale deployment. Code and data will be made publicly available at the webpage: https://rpl-cs-ucl.github.io/LiteVLoc.
|
|
08:40-08:45, Paper WeAT6.3 | |
BEINGS: Bayesian Embodied Image-Goal Navigation with Gaussian Splatting |
|
Meng, Wugang | Hong Kong University of Science and Technology |
Wu, Tianfu | Hong Kong University of Science and Technology |
Yin, Huan | Hong Kong University of Science and Technology |
Zhang, Fumin | Hong Kong University of Science and Technology |
Keywords: Vision-Based Navigation, Search and Rescue Robots, Probabilistic Inference
Abstract: Image-goal navigation enables a robot to reach the location where a target image was captured, using visual cues for guidance. However, current methods either rely heavily on data and computationally expensive learning-based approaches or lack efficiency in complex environments due to insufficient exploration strategies. To address these limitations, we propose Bayesian Embodied Image-goal Navigation Using Gaussian Splatting, a novel method that formulates ImageNav as an optimal control problem within a model predictive control framework. BEINGS leverages 3D Gaussian Splatting as a scene prior to predict future observations, enabling efficient, real-time navigation decisions grounded in the robot’s sensory experiences. By integrating Bayesian updates, our method dynamically refines the robot's strategy without requiring extensive prior experience or data. Our algorithm is validated through extensive simulations and physical experiments, showcasing its potential for embodied robot systems in visually complex scenarios. Project Page: www.mwg.ink/BEINGS-web.
|
|
08:45-08:50, Paper WeAT6.4 | |
FLAF: Focal Line and Feature-Constrained Active View Planning for Visual Teach and Repeat |
|
Fu, Changfei | SUSTech |
Chen, Weinan | Guangdong University of Technology |
Xu, Wenjun | Peng Cheng Laboratory |
Zhang, Hong | SUSTech |
Keywords: View Planning for SLAM, Vision-Based Navigation, SLAM
Abstract: This paper presents FLAF, a focal line and feature-constrained active view planning method for tracking failure avoidance in feature-based visual navigation of mobile robots. FLAF is built on a feature-based visual teach and repeat (VT&R) framework, which supports robotic applications by teaching robots to cruise various paths that fulfill many daily autonomous navigation requirements. However, tracking failures in feature-based Visual Simultaneous Localization and Mapping (VSLAM), particularly in textureless regions common in human-made environments, pose a significant challenge to the real-world deployment of VT&R. To address this problem, the proposed view planner is integrated into a feature-based VSLAM system, creating an active VT&R solution that mitigates tracking failures. Our system features a Pan-Tilt Unit (PTU)-based active camera mounted on a mobile robot. Using FLAF, the active camera-based VSLAM (AC-SLAM) operates during the teaching phase to construct a complete path map and in the repeating phase to maintain stable localization. FLAF actively directs the camera toward more map points to avoid mapping failures during path learning and toward more feature-identifiable map points while following the learned trajectory. Experimental results in real scenarios show that FLAF significantly outperforms existing methods by accounting for feature identifiability, particularly the view angle of the features. While effectively dealing with low-texture regions in active view planning, considering feature identifiability enables our active VT&R system to perform well in challenging environments.
|
|
08:50-08:55, Paper WeAT6.5 | |
Ground-Level Viewpoint Vision-And-Language Navigation in Continuous Environments |
|
Li, Zerui | Adelaide University |
Zhou, Gengze | University of Adelaide |
Hong, Haodong | The University of Queensland |
Shao, Yanyan | Zhejiang University of Technology |
Lyu, Wenqi | The University of Adelaide |
Qiao, Yanyuan | The University of Adelaide |
Wu, Qi | University of Adelaide |
Keywords: Deep Learning Methods, Vision-Based Navigation
Abstract: Vision-and-Language Navigation (VLN) empowers agents to associate time-sequenced visual observations with corresponding instructions to make sequential decisions. However, dealing with visually diverse scenes or transitioning from simulated environments to real-world deployment is still challenging. In this paper, we address the mismatch between human-centric instructions and quadruped robots with a low-height field of view, proposing a Ground-level Viewpoint Navigation (GVNav) approach to mitigate this issue. This work represents the first attempt to highlight the generalization gap in VLN across varying heights of visual observation in realistic robot deployments. Our approach leverages weighted historical observations as enriched spatiotemporal contexts for instruction following, effectively managing feature collisions within cells by assigning appropriate weights to identical features across different viewpoints. This enables low-height robots to overcome challenges such as visual obstructions and perceptual mismatches. Additionally, we transfer the connectivity graph from the HM3D and Gibson datasets as an extra resource to enhance spatial priors and provide a more comprehensive representation of real-world scenarios, leading to improved performance and generalizability of the waypoint predictor in real-world environments. Extensive experiments demonstrate that our Ground-level Viewpoint Navigation (GVNav) approach significantly improves performance in both simulated environments and real-world deployments with quadruped robots.
|
|
08:55-09:00, Paper WeAT6.6 | |
NavTr: Object-Goal Navigation with Learnable Transformer Queries |
|
Mao, Qiuyu | University of Science and Technology of China |
Jikai, Wang | University of Science and Technology of China, Department of Automation |
Xu, Meng | University of Science and Technology of China |
Chen, Zonghai | University of Sciences and Technology of China |
Keywords: Vision-Based Navigation, Representation Learning, Reinforcement Learning
Abstract: This paper introduces Navigation Transformer (NavTr), a novel framework for object-goal navigation using Transformer queries to enhance the learning and representation of environment states. By integrating semantic information, object positions, and neighborhood information, NavTr creates a unified, comprehensive, and extensible state representation for the object-goal navigating task. In the framework, the Transformer queries implicitly learn inter-object relationships, which facilitates high-level understanding of the environment. Additionally, NavTr implements target-oriented supervisory signals, such as rotation rewards and spatial loss, which improve exploration efficiency in the reinforcement learning framework. NavTr outperforms popular graph-based and Attention-based methods by a large margin in terms of success rate (SR) and success weighted by path length (SPL). Extensive experiments on the AI2-THOR dataset demonstrate the effectiveness of our approach.
|
|
WeAT7 |
309 |
Marine Robotics 3 |
Regular Session |
Chair: Rekleitis, Ioannis | University of Delaware |
Co-Chair: Drupt, Juliette | University of Montpellier |
|
08:30-08:35, Paper WeAT7.1 | |
Shape BoW: Generalized Bag of Words for Appearance-Based Loop Closure Detection in Bathymetric SLAM |
|
Zhang, Qianyi | Korea Advanced Institute of Science and Technology |
Kim, Jinwhan | KAIST |
Keywords: Marine Robotics, Autonomous Vehicle Navigation, SLAM
Abstract: Existing bathymetric simultaneous localization and mapping (SLAM) methods predominantly rely on odometry information for loop closure detection, whose performance deteriorates when handling unreliable odometry data or conducting large-scale mapping missions. This letter introduces a novel generalized Bag of Words (BoW) named Shape BoW (S-BoW) for appearance-based loop closure detection in bathymetric SLAM. S-BoW is trained on a collection of terrain gradient features extracted from existing bathymetric datasets and can be used in various bathymetric scenarios. We integrated the loop closure detection method using S-BoW into a feature-based bathymetric SLAM method called TTT SLAM, and we evaluated its performance against three existing bathymetric SLAM methods using two datasets. The results indicate that S-BoW not only serves as a generalized BoW but also enhances the efficiency of the integrated SLAM method, achieving accuracy comparable to the original TTT SLAM while offering a 37% speed improvement on a large-scale sea trial dataset. To the best of our knowledge, S-BoW is the first generalized BoW that can be used to realize effective appearance-based loop closure detection in bathymetric SLAM.
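For readers unfamiliar with the Bag-of-Words machinery, the snippet below shows the generic pipeline: quantize per-scan descriptors against a fixed vocabulary, build TF-IDF histograms, and score loop-closure candidates by cosine similarity. The vocabulary and descriptors are random stand-ins; the terrain-gradient features and vocabulary training of S-BoW are not reproduced here.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize descriptors to their nearest vocabulary word and count occurrences."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    return np.bincount(words, minlength=len(vocabulary)).astype(float)

def tfidf_similarity(hists):
    """Pairwise cosine similarity of TF-IDF weighted BoW histograms."""
    H = np.stack(hists)
    df = (H > 0).sum(axis=0)                          # scans containing each word
    idf = np.log((1 + len(hists)) / (1 + df)) + 1.0   # smoothed IDF (never zero)
    X = H * idf
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    return X @ X.T

# Random stand-ins: a 64-word vocabulary and descriptors for three "scans",
# where scan 2 revisits scan 0's features (a loop-closure candidate).
rng = np.random.default_rng(2)
vocab = rng.normal(size=(64, 8))
scan0 = rng.normal(size=(200, 8))
scans = [scan0, rng.normal(size=(200, 8)), scan0 + 0.05 * rng.normal(size=(200, 8))]
sims = tfidf_similarity([bow_histogram(s, vocab) for s in scans])
print(np.round(sims, 2))   # entry (0, 2) should be the largest off-diagonal score
```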
|
|
08:35-08:40, Paper WeAT7.2 | |
ODYSSEE: Oyster Detection Yielded by Sensor Systems on Edge Electronics |
|
Lin, Xiaomin | University of Maryland |
Mange, Vivek Dharmesh | University of Delaware |
Suresh, Arjun | University of Maryland, College Park |
Palnitkar, Aadi | University of Maryland College Park |
Neuberger, Bernhard | TU Wien |
Campbell, Brendan | University of Delaware School of Marine Science and Policy |
Williams, Alan | University of Maryland Center for Environmental Science |
Baxevani, Kleio | University of Delaware |
Mallette, Jeremy | Independent Robotics |
Vera Gonzalez, Alhim Adonai | University of Cincinnati |
Vincze, Markus | Vienna University of Technology |
Rekleitis, Ioannis | University of Delaware |
Tanner, Herbert G. | University of Delaware |
Aloimonos, Yiannis | University of Maryland |
Keywords: Marine Robotics, Recognition, Data Sets for Robot Learning
Abstract: Oysters are an important keystone species in coastal ecosystems that provide several economic, environmental, and cultural benefits. Given the array of utilities derived from oysters, the application of autonomous robotic systems for oyster detection and monitoring grows increasingly relevant. However, current monitoring strategies for assessing oyster assemblages are mostly destructive. While manually identifying and monitoring oysters from video footage is nondestructive, it is tedious and requires expert input. An alternative to human monitoring is deploying trained object detection models on edge devices, such as the Aqua2 robot, to enable real-time monitoring of oysters directly in the field. Yet training these models to maximum efficacy requires an extensive dataset that accurately represents the domain, and it is difficult to obtain such high-quality training data due to the complications inherent to underwater environments. To address these complications, we introduce a novel method leveraging Stable Diffusion to generate high-quality synthetic data for the marine domain. We exploit diffusion models to create photorealistic oyster imagery, using ControlNet inputs to ensure consistency with the segmentation ground-truth mask, the geometry of the scene, and the target domain of real oyster images. This large dataset is used to train a vision model, specifically based on YOLOv10. The trained model is then deployed and tested on an edge platform, the Aqua2, in an underwater robotics system. We achieve state-of-the-art performance (0.657 mAP@50) for oyster detection, which can pave the way for autonomous oyster habitat monitoring and increase the efficiency of on-bottom oyster aquaculture.
|
|
08:40-08:45, Paper WeAT7.3 | |
IBURD: Image Blending for Underwater Robotic Detection |
|
Hong, Jungseok | MIT |
Singh, Sakshi | University of Minnesota |
Sattar, Junaed | University of Minnesota |
Keywords: Marine Robotics, Data Sets for Robotic Vision, Visual Learning
Abstract: We present an image blending pipeline, IBURD, that creates realistic synthetic images to assist in the training of deep detectors for use on underwater autonomous vehicles (AUVs) for marine debris detection tasks. Specifically, IBURD generates both images of underwater debris and their pixel-level annotations, using source images of debris objects, their annotations, and target background images of marine environments. With Poisson editing and style transfer techniques, IBURD is even able to robustly blend transparent objects into arbitrary backgrounds and automatically adjust the style of blended images using the blurriness metric of target background images. These generated images of marine debris in actual underwater backgrounds address the data scarcity and data variety problems faced by deep-learned vision algorithms in challenging underwater conditions, and can enable the use of AUVs for environmental cleanup missions. Both quantitative and robotic evaluations of IBURD demonstrate the efficacy of the proposed approach for robotic detection of marine debris.
|
|
08:45-08:50, Paper WeAT7.4 | |
3DSSDF: Underwater 3D Sonar Reconstruction Using Signed Distance Functions |
|
Archieri, Simon | Heriot-Watt University |
Drupt, Juliette | University of Montpellier |
Cinar, Ahmet Fatih | Frontier Robotics |
Grimaldi, Michele | University of Girona |
Carlucho, Ignacio | University of Edinburgh |
Scharff Willners, Jonatan | Heriot-Watt University |
Petillot, Yvan R. | Heriot-Watt University |
Keywords: Marine Robotics, Mapping
Abstract: Underwater autonomous robotic operations require online localization and 3D mapping. Because of the absence of absolute positioning underwater, these tasks rely strongly on embedded sensors, including proprioceptive or navigation sensors, which can be fused for odometry, and exteroceptive sensors. One of the most popular exteroceptive sensors underwater is the imaging sonar, which emits a large fan-shaped acoustic signal and estimates the position of the surrounding obstacles from a measure of the reflected signal. This paper addresses underwater online localization and 3D mapping using a forward-looking, wide-aperture imaging sonar and the vehicle's intrinsic navigation estimates. We introduce 3DSSDF (3D Sonar Reconstruction Using Signed Distance Functions), a new localization and 3D mapping algorithm based on signed distance functions, which is evaluated in simulation and on real data, in man-made and natural environments. Comparisons to reference trajectories and maps demonstrate that, in our tests, 3DSSDF efficiently corrects navigation drift and that the trajectory and map error is always below 1 m and below 1% of the distance travelled, which can be sufficient for the safe inspection of natural or artificial underwater structures.
|
|
08:50-08:55, Paper WeAT7.5 | |
Cascade IPG Observer for Underwater Robot State Estimation |
|
Joshi, Kaustubh | University of Maryland College Park |
Liu, Tianchen | University of Maryland, College Park |
Chopra, Nikhil | University of Maryland, College Park |
Keywords: Marine Robotics, Localization, Sensor Fusion
Abstract: This paper presents a novel cascade nonlinear observer framework for inertial state estimation. It tackles the problem of intermediate state estimation when external localization is unavailable or in the event of a sensor outage. The proposed framework comprises two nonlinear observers based on a recently developed iteratively preconditioned gradient descent (IPG) algorithm. The observers take inputs via an IMU preintegration model; the first observer is a quaternion-based IPG observer. The output of the first observer is the input to the second, which estimates the velocity and, consequently, the position. The proposed observer is validated on a public underwater dataset and in a real-world experiment using our robot platform. The estimation is compared with an extended Kalman filter (EKF) and an invariant extended Kalman filter (InEKF). Results demonstrate that our method outperforms these baselines, with better positional accuracy and lower variance.
|
|
08:55-09:00, Paper WeAT7.6 | |
ResiVis: A Holistic Underwater Motion Planning Approach for Robust Active Perception under Uncertainties |
|
Xanthidis, Marios | SINTEF Ocean |
Skaldebø, Martin | SINTEF Ocean |
Haugaløkken, Bent | SINTEF Ocean |
Evjemo, Linn Danielsen | SINTEF Ocean AS |
Alexis, Kostas | NTNU - Norwegian University of Science and Technology |
Kelasidi, Eleni | NTNU |
Keywords: Marine Robotics, Planning under Uncertainty, Collision Avoidance
Abstract: Motion planning for autonomous active perception in cluttered environments remains a challenging problem, requiring real-time solutions that both maximize safety and achieve a desired behavior. In dynamic underwater environments, such as in aquaculture operations, the robots are additionally expected to deal with state and motion uncertainty and errors, dynamic and deformable obstacles, currents, and disturbances. Previous work has introduced real-time frameworks that provided safe navigation in cluttered environments, active perception in static environments, and robust navigation in uncertain dynamic environments. This paper introduces a new real-time approach called ResiVis, which leverages the best aspects of the aforementioned techniques along with a new formulation that further enhances underwater autonomy by enabling active perception of static and dynamic target objects from desired distances. The proposed method utilizes path-optimization for real-time response with constraints guaranteeing continuous collision safety, and computes paths with clearance adaptive to both the conditions of the environment and the performance of the path follower. An improved new constraint encourages observations of dynamic objects, with the planner adapting to satisfy desired observation distances and their projected future positions. ResiVis is validated with challenging simulation experiments and with hardware-in-the-loop trials in real industrial-scale aquaculture facilities.
|
|
WeAT8 |
311 |
Planning and Control for Legged Robots 1 |
Regular Session |
Chair: Gan, Zhenyu | Syracuse University |
Co-Chair: Remy, C. David | University of Stuttgart |
|
08:30-08:35, Paper WeAT8.1 | |
Energy-Optimal Asymmetrical Gait Selection for Quadrupedal Robots |
|
Alqaham, Yasser G. | Syracuse University |
Cheng, Jing | Syracuse University |
Gan, Zhenyu | Syracuse University |
Keywords: Legged Robots, Optimization and Optimal Control, Dynamics
Abstract: Symmetrical gaits, such as trotting, are commonly employed in quadrupedal robots for their simplicity and stability. However, the potential of asymmetrical gaits, such as bounding and galloping—which are prevalent in their natural counterparts at high speeds or over long distances—is less clear in the design of locomotion controllers for legged machines. In these asymmetrical gaits, the system dynamics are more complex because the front and rear leg pairs exhibit different motions, which are coupled by the rotational motion of the torso. This study systematically examines five distinct asymmetrical quadrupedal gaits on a legged robot, aiming to uncover the fundamental differences in footfall sequences and the consequent energetics across a broad range of speeds. Utilizing a full-body model of a quadrupedal robot (Unitree A1), we developed a hybrid system for each gait, incorporating the desired footfall sequence and rigid impacts. To identify the most energy-optimal gait, we applied optimal control methods, framing it as a trajectory optimization problem with specific constraints and a work-based cost of transport as an objective function. Our results show that, in the context of asymmetrical gaits, when minimizing cost of transport across the entire stride, the front leg pair primarily propels the system forward, while the rear leg pair acts more like an inverted pendulum, contributing significantly less to the energetic output. Additionally, while bounding—characterized by two aerial phases per cycle—is the most energy-optimal gait at higher speeds, the energy expenditure of gaits at speeds below 1 m/s depends heavily on the robot’s specific design.
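For readers unfamiliar with the objective, a work-based cost of transport can be computed along the lines of the hedged sketch below; the exact cost term and constraints used in the paper may differ, and the mass, sampling, and joint counts here are illustrative assumptions only:

```python
# Hedged sketch of a work-based cost of transport (CoT), one common form of the
# objective the abstract describes; the paper's exact cost function may differ.
import numpy as np

def cost_of_transport(tau, qdot, dt, mass_kg, distance_m, g=9.81):
    """tau, qdot: (T, n_joints) torque and joint-velocity trajectories."""
    power = tau * qdot                               # mechanical power per joint
    positive_work = np.sum(np.clip(power, 0.0, None)) * dt
    return positive_work / (mass_kg * g * distance_m)

# Example: 12-joint robot, a 2 s stride sampled at 1 kHz, 1.5 m travelled.
T, n = 2000, 12
rng = np.random.default_rng(0)
cot = cost_of_transport(rng.normal(0, 5, (T, n)), rng.normal(0, 2, (T, n)),
                        dt=1e-3, mass_kg=12.0, distance_m=1.5)
print(f"work-based CoT = {cot:.3f}")
```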
|
|
08:35-08:40, Paper WeAT8.2 | |
Bipedal Walking with Continuously Compliant Robotic Legs |
|
Bendfeld, Robin | University of Stuttgart |
Remy, C. David | University of Stuttgart |
Keywords: Legged Robots, Compliant Joints and Mechanisms, Motion Control
Abstract: In biomechanics and robotics, elasticity plays a crucial role in enhancing locomotion efficiency and stability. Traditional approaches in legged robots often employ series elastic actuators (SEA) with discrete rigid components, which, while effective, add weight and complexity. This paper presents an innovative alternative by integrating continuously compliant structures into the lower legs of a bipedal robot, fundamentally transforming the SEA concept. Our approach replaces traditional rigid segments with lightweight, deformable materials, reducing overall mass and simplifying the actuation design. This novel design introduces unique challenges in modeling, sensing, and control, due to the infinite dimensionality of continuously compliant elements. We address these challenges through effective approximations and control strategies. The paper details the design and modeling of the compliant leg structure, presents low-level force and kinematics controllers, and introduces a high-level posture controller with a gait scheduler. Experimental results demonstrate successful bipedal walking using this new design.
|
|
08:40-08:45, Paper WeAT8.3 | |
Optimal Torque Distribution Via Dynamic Adaptation for Quadrupedal Locomotion on Slippery Terrains |
|
Argiropoulos, Despina-Ekaterini | (a) Institute of Computer Science Foundation for Research and T |
Maravgakis, Michael | Foundation for Research and Technology - Hellas (FORTH) |
Tian, Changda | FORTH |
Papageorgiou, Dimitrios | Hellenic Mediterranean University |
Trahanias, Panos | Foundation for Research and Technology – Hellas (FORTH) |
Keywords: Legged Robots, Robust/Adaptive Control, Multi-Contact Whole-Body Motion Planning and Control
Abstract: As legged robots continue to evolve, new control methods are being developed to provide fast, robust, accurate and computationally efficient algorithms for traversing challenging environments. This paper presents a real-time adaptive locomotion controller for quadrupeds, designed to maintain stability and controllability on various surfaces, including highly slippery terrains. The proposed approach optimizes control effort distribution based on the probability of slippage by utilizing a surface-independent adaptation layer. By balancing the robot's redundant kinematic system through rank relaxation—similar to loosening constraints in optimization problems—this method demonstrates significant performance improvements. Unlike Reinforcement Learning (RL) approaches, which depend on pre-trained policies and may struggle to adapt velocity tracking control across different terrains, our method rapidly adjusts to changing conditions, as validated by extensive simulation experiments.
|
|
08:45-08:50, Paper WeAT8.4 | |
Adaptive Energy Regularization for Autonomous Gait Transition and Energy-Efficient Quadruped Locomotion |
|
Liang, Boyuan | University of California, Berkeley |
Sun, Lingfeng | University of California, Berkeley |
Zhu, Xinghao | University of California, Berkeley |
Zhang, Bike | University of California, Berkeley |
Xiong, Ziyin | Peking University |
Wang, Yixiao | University of California, Berkeley |
Li, Chenran | University of California, Berkeley |
Sreenath, Koushil | University of California, Berkeley |
Tomizuka, Masayoshi | University of California |
Keywords: Legged Robots, Reinforcement Learning, Natural Machine Motion
Abstract: In reinforcement learning for legged robot locomotion, crafting effective reward strategies is crucial. Predefined gait patterns and complex reward systems are widely used to stabilize policy training. Drawing from the natural locomotion behaviors of humans and animals, which adapt their gaits to minimize energy consumption, we investigate the impact of incorporating an energy-efficient reward term that prioritizes distance-averaged energy consumption into the reinforcement learning framework. Our findings demonstrate that this simple addition enables quadruped robots to autonomously select appropriate gaits—such as four-beat walking at lower speeds and trotting at higher speeds—without the need for explicit gait regularizations. Furthermore, we provide a guideline for tuning the weight of this energy-efficient reward, facilitating its application in real-world scenarios. The effectiveness of our approach is validated through simulations and on a real Unitree Go1 robot. This research highlights the potential of energy-centric reward functions to simplify and enhance the learning of adaptive and efficient locomotion in quadruped robots. Videos and more details are at https://sites.google.com/berkeley.edu/efficient-locomotion.
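A minimal sketch of the distance-averaged energy idea described in the abstract follows; the weights, the tracking term, and the function names are illustrative assumptions, not the authors' released reward code:

```python
# Hedged sketch: an energy reward that penalizes energy per distance travelled
# rather than per time step, which is the "distance-averaged" idea in the abstract.
import numpy as np

def energy_reward(tau, qdot, base_vel_x, w_energy=0.02, eps=0.1):
    power = np.sum(np.abs(tau * qdot))            # instantaneous actuation power
    distance_rate = max(abs(base_vel_x), eps)     # avoid dividing by near-zero speed
    return -w_energy * power / distance_rate      # energy per meter, weighted

# Combined with a velocity-tracking term inside the usual RL step reward:
def step_reward(tau, qdot, base_vel_x, cmd_vel_x):
    r_track = np.exp(-4.0 * (base_vel_x - cmd_vel_x) ** 2)
    return r_track + energy_reward(tau, qdot, base_vel_x)
```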
|
|
08:50-08:55, Paper WeAT8.5 | |
Music-Driven Legged Robots: Synchronized Walking to Rhythmic Beats |
|
Hou, Taixian | FuDan University |
Zhang, Yueqi | Fudan University |
Wei, Xiaoyi | Fudan University |
Dong, Zhiyan | Fudan University |
Yi, Jiafu | Hainan University |
Zhai, Peng | Fudan University |
Zhang, Lihua | Fudan University |
Keywords: Legged Robots, Reinforcement Learning, Biomimetics
Abstract: We address the challenge of effectively controlling the locomotion of legged robots by incorporating precise frequency and phase characteristics, which is often ignored in locomotion policies that do not account for the periodic nature of walking. We propose a hierarchical architecture that integrates a low-level phase tracker, oscillators, and a high-level phase modulator. This controller allows quadruped robots to walk in a natural manner that is synchronized with external musical rhythms. Our method generates diverse gaits across different frequencies and achieves real-time synchronization with music in the physical world. This research establishes a foundational framework for enabling real-time execution of accurate rhythmic motions in legged robots. The video and code are available at https://music-walker.github.io/.
|
|
08:55-09:00, Paper WeAT8.6 | |
Mobile-TeleVision: Predictive Motion Priors for Humanoid Whole-Body Control |
|
Lu, Chenhao | Tsinghua University |
Cheng, Xuxin | University of California, San Diego |
Li, Jialong | UCSD |
Yang, Shiqi | The Chinese University of Hong Kong, Shenzhen |
Ji, Mazeyu | UCSD |
Yuan, Chengjing | University of California, San Diego |
Yang, Ge | Massachusetts Institute of Technology |
Yi, Sha | UC San Diego |
Wang, Xiaolong | UC San Diego |
Keywords: Humanoid Robot Systems, Sensorimotor Learning, Representation Learning
Abstract: Humanoid robots require both robust lower-body locomotion and precise upper-body manipulation. While recent Reinforcement Learning (RL) approaches provide whole-body loco-manipulation policies, they lack precise manipulation with high DoF arms. In this paper, we propose decoupling upper-body control from locomotion, using inverse kinematics (IK) and motion retargeting for precise manipulation, while RL focuses on robust lower-body locomotion. We introduce PMP (Predictive Motion Priors), trained with Conditional Variational Autoencoder (CVAE) to effectively represent upper-body motions. The locomotion policy is trained and conditioned on this upper-body motion representation, ensuring that the system remains robust with both manipulation and locomotion. We show that CVAE features are crucial for stability and robustness, and significantly outperforms RL-based whole-body control in precise manipulation. With precise upper-body motion and robust lower-body locomotion control, operators can remotely control the humanoid to walk around and explore different environments, while performing diverse manipulation tasks.
|
|
WeAT9 |
312 |
Multi-Robot Planning and Navigation |
Regular Session |
Chair: Nieto-Granda, Carlos | DEVCOM U.S. Army Research Laboratory |
|
08:30-08:35, Paper WeAT9.1 | |
Distributed Safe Navigation of Multi-Agent Systems Using Control Barrier Function-Based Controllers |
|
Mestres, Pol | University of California, San Diego |
Nieto-Granda, Carlos | DEVCOM U.S. Army Research Laboratory |
Cortes, Jorge | University of California, San Diego |
Keywords: Multi-Robot Systems, Collision Avoidance, Optimization and Optimal Control
Abstract: This paper proposes a distributed controller synthesis framework for safe navigation of multi-agent systems. We leverage control barrier functions to formulate collision avoidance with obstacles and teammates as constraints on the control input for a state-dependent network optimization problem that encodes team formation and the navigation task. Our algorithmic solution is valid under general assumptions for nonlinear dynamics and state-dependent network optimization problems with convex constraints and strongly convex objectives. The resulting controller is distributed, satisfies the safety constraints at all times, and asymptotically converges to the solution of the state-dependent network optimization problem. We illustrate its performance in a team of differential-drive robots in a variety of complex environments, both in simulation and in hardware.
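The per-obstacle building block of such controllers is a CBF constraint on the input; the sketch below shows a single-integrator, single-constraint safety filter with a closed-form minimum-norm correction. The paper's method is distributed and handles a network-level objective, so this is only an illustrative fragment with assumed parameters:

```python
# Hedged sketch of the CBF safety-filter idea: keep h(x) = ||x - x_obs||^2 - r^2 >= 0
# by enforcing dh/dt + alpha * h >= 0 on a single-integrator robot.
import numpy as np

def cbf_filter(x, u_nom, x_obs, r, alpha=1.0):
    """Closed-form projection of u_nom onto {u : grad_h . u + alpha * h >= 0}."""
    h = np.dot(x - x_obs, x - x_obs) - r**2
    grad_h = 2.0 * (x - x_obs)                 # dh/dx for single-integrator dynamics
    slack = grad_h @ u_nom + alpha * h
    if slack >= 0.0:                           # nominal input is already safe
        return u_nom
    # Minimum-norm correction onto the constraint boundary.
    return u_nom - slack * grad_h / (grad_h @ grad_h)

u_safe = cbf_filter(x=np.array([0.0, 0.0]), u_nom=np.array([1.0, 0.0]),
                    x_obs=np.array([1.5, 0.0]), r=1.0)
print(u_safe)   # the commanded motion toward the obstacle is scaled back
```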
|
|
08:35-08:40, Paper WeAT9.2 | |
Hybrid Decision Making for Scalable Multi-Agent Navigation: Integrating Semantic Maps, Discrete Coordination, and Model Predictive Control |
|
de Vos, Koen | Eindhoven University of Technology |
Torta, Elena | Eindhoven University of Technology |
Bruyninckx, Herman | KU Leuven |
López Martínez, César Augusto | Eindhoven University of Technology |
van de Molengraft, Marinus Jacobus Gerardus | University of Technology Eindhoven |
Keywords: Multi-Robot Systems, Cooperating Robots, Constrained Motion Planning
Abstract: This paper presents a framework for multi-agent navigation in structured but dynamic environments, integrating three key components: a shared semantic map encoding metric and semantic environmental knowledge, a claim policy for coordinating access to areas within the environment, and a Model Predictive Controller for generating motion trajectories that respect environmental and coordination constraints. The main advantages of this approach include: (i) enforcing area occupancy constraints derived from specific task requirements; (ii) enhancing computational scalability by eliminating the need for collision avoidance constraints between robotic agents; and (iii) the ability to anticipate and avoid deadlocks between agents. The paper includes both simulations and physical experiments demonstrating the framework’s effectiveness in various representative scenarios.
|
|
08:40-08:45, Paper WeAT9.3 | |
Decentralized Nonlinear Model Predictive Control for Safe Collision Avoidance in Quadrotor Teams with Limited Detection Range |
|
Goarin, Manohari | New York University, Tandon School of Engineering |
Li, Guanrui | Worcester Polytechnic Institute |
Saviolo, Alessandro | New York University |
Loianno, Giuseppe | New York University |
Keywords: Aerial Systems: Applications, Distributed Robot Systems, Collision Avoidance
Abstract: Multi-quadrotor systems face significant challenges in decentralized control, particularly with safety and coordination under sensing and communication limitations. State-of-the-art methods leverage Control Barrier Functions (CBFs) to provide safety guarantees but often neglect actuation constraints and limited detection range. To address these gaps, we propose a novel decentralized Nonlinear Model Predictive Control (NMPC) that integrates Exponential CBFs (ECBFs) to enhance safety and optimality in multi-quadrotor systems. We provide both conservative and practical minimum bounds of the range that preserve the safety guarantees of the ECBFs. We validate our approach through extensive simulations with up to 10 quadrotors and 20 obstacles, as well as real-world experiments with 3 quadrotors. Results demonstrate the effectiveness of the proposed framework in realistic settings, highlighting its potential for reliable quadrotor team operations.
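For reference, an exponential CBF for a relative-degree-two output (for example, a position constraint under acceleration-level control) is typically enforced through a condition of the generic form below; the paper's exact formulation, including the detection-range bounds, is more specific than this template:

```latex
\ddot{h}(x,u) + k_1\,\dot{h}(x) + k_0\,h(x) \ge 0, \qquad k_0,\ k_1 > 0,
```

with the gains chosen so that the polynomial s^2 + k_1 s + k_0 has stable roots; each such inequality then enters the NMPC as a per-step constraint.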
|
|
08:45-08:50, Paper WeAT9.4 | |
SIGMA: Sheaf-Informed Geometric Multi-Agent Pathfinding |
|
Liao, Shuhao | Beihang University |
Xia, Weihang | Zijin Mining |
Cao, Yuhong | National University of Singapore |
Dai, Weiheng | National University of Singapore |
He, Chengyang | National University Singapore |
Wu, Wenjun | Beihang University |
Sartoretti, Guillaume Adrien | National University of Singapore (NUS) |
Keywords: Deep Learning Methods, Path Planning for Multiple Mobile Robots or Agents, Reinforcement Learning
Abstract: The Multi-Agent Path Finding (MAPF) problem aims to determine the shortest and collision-free paths for multiple agents in a known, potentially obstacle-ridden environment. It is the core challenge for robotic deployments in large-scale logistics and transportation. Decentralized learning-based approaches have shown great potential for addressing the MAPF problem, offering more reactive and scalable solutions. However, existing learning-based MAPF methods usually rely on agents making decisions based on a limited field of view (FOV), resulting in short-sighted policies and inefficient cooperation in complex scenarios. Here, a critical challenge is to achieve consensus on potential movements between agents based on limited observations and communications. To tackle this challenge, we introduce a new framework that applies sheaf theory to decentralized deep reinforcement learning, enabling agents to learn geometric cross-dependencies between each other through local consensus and utilize them for tightly cooperative decision-making. In particular, sheaf theory provides a mathematical proof of conditions for achieving global consensus through local observation. Inspired by this, we incorporate a neural network to approximately model the consensus in latent space based on sheaf theory and train it through self-supervised learning. During the task, in addition to normal features for MAPF as in previous works, each agent distributedly reasons about a learned consensus feature, leading to efficient cooperation on pathfinding and collision avoidance. As a result, our proposed method demonstrates significant improvements over state-of-the-art learning-based MAPF planners, especially in relatively large and complex scenarios, and shows its superiority over baselines in various simulations and real-world robot experiments.
|
|
08:50-08:55, Paper WeAT9.5 | |
An Efficient NSGA-II-Based Algorithm for Multi-Robot Coverage Path Planning |
|
Foster, Ashley | University of Plymouth |
Gianni, Mario | University of Liverpool |
Aly, Amir | University of Plymouth |
Samani, Hooman | University of the Arts London |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Distributed Robot Systems
Abstract: This work presents an algorithm based on the Nondominated Sorting Genetic Algorithm II (NSGA-II) to solve multi-objective offline Multi-Robot Coverage Path Planning (MCPP) problems. The proposed algorithm embeds a donation-mutation operator and a multiple-parent crossover that generates solutions which maintain the longest path while minimizing the average path length. The algorithm also uses a library of elitism-selected high-fitness robot paths, and tournament-selected high min-max fitness paths, to construct high multi-objective fitness offspring. We evaluate the performance of our proposed algorithm against the state-of-the-art NSGA-II extended with an improved Heuristic Genetic Algorithm Crossover, and we demonstrate that for different instances of the MCPP problem, the Pareto-fronts of our proposed algorithm are not dominated by any of the points of the fronts generated by the state-of-the-art NSGA-II. A comparison has also been performed in a virtual environment simulating five drones inspecting three wind turbines. Results show that our approach exhibits a higher convergence rate for higher values of the ratio between the number of points to visit and the number of drones.
|
|
08:55-09:00, Paper WeAT9.6 | |
An Iterative Approach for Heterogeneous Multi-Agent Route Planning with Resource Transportation Uncertainty and Temporal Logic Goals |
|
Cardona, Gustavo A. | Lehigh University |
Liang, Kaier | Lehigh University |
Vasile, Cristian Ioan | Lehigh University |
Keywords: Formal Methods in Robotics and Automation, Planning, Scheduling and Coordination, Multi-Robot Systems
Abstract: This paper presents an iterative approach for heterogeneous multi-agent route planning in environments with unknown resource distributions. We focus on a team of robots with diverse capabilities tasked with executing missions specified using Capability Temporal Logic (CaTL), a formal framework built on Signal Temporal Logic to handle spatial, temporal, capability, and resource constraints. The key challenge arises from the uncertainty in the initial distribution and quantity of resources in the environment. To address this, we introduce an iterative algorithm that dynamically balances exploration and task fulfillment. Robots are guided to explore the environment, identifying resource locations and quantities while progressively refining their understanding of the resource landscape. At the same time, they aim to maximally satisfy the mission objectives based on the current information, adapting their strategies as new data is uncovered. This approach provides a robust solution for planning in dynamic, resource-constrained environments, enabling efficient coordination of heterogeneous teams even under conditions of uncertainty. Our method's effectiveness and performance are demonstrated through simulated case studies.
|
|
WeAT10 |
313 |
Multi-Robot Path Planning 1 |
Regular Session |
Chair: Akella, Srinivas | University of North Carolina at Charlotte |
Co-Chair: Delgado, Carmen | I2CAT Foundation |
|
08:30-08:35, Paper WeAT10.1 | |
Connectivity-Preserving Distributed Informative Path Planning for Mobile Robot Networks |
|
Nguyen, Thanh Binh | TAMUCC |
Nghiem, Truong Xuan | University of Central Florida |
Nguyen, Linh | Federation University Australia |
La, Hung | University of Nevada at Reno |
Nguyen, Thang | Texas A&M University-Corpus Christi |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Integrated Planning and Learning, Distributed Robot Systems
Abstract: This letter addresses the distributed informative path planning (IPP) problem for a mobile robot network to optimally explore a spatial field. Each robot is able to gather noisy environmental measurements while navigating the environment and build its own model of a spatial phenomenon using a Gaussian process and local data. The IPP optimization problem is formulated in an informative way through a multi-step prediction scheme constrained by connectivity preservation and collision avoidance. The shared hyperparameters of the local Gaussian process models are also arranged to be optimally computed in the path planning optimization problem. By using the proximal alternating direction method of multipliers, the optimization problem can be effectively solved in a distributed manner. We theoretically prove that the connectivity of the network is maintained over time whilst the solution of the optimization problem converges to a stationary point. The effectiveness of the proposed approach is verified in synthetic experiments by utilizing a real-world dataset.
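A minimal Gaussian process posterior of the kind each robot maintains locally can be sketched as below; the kernel and hyperparameters are fixed placeholders here, whereas the paper treats the shared hyperparameters as decision variables inside the distributed planning problem:

```python
# Hedged sketch of a local GP field model (RBF kernel, exact posterior); purely
# illustrative, not the paper's implementation.
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, Xq, noise=0.01):
    """Posterior mean and covariance of the field at query locations Xq."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks, Kss = rbf(X, Xq), rbf(Xq, Xq)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    cov = Kss - v.T @ v
    return mean, cov

X = np.random.rand(30, 2); y = np.sin(3 * X[:, 0]) + 0.1 * np.random.randn(30)
mean, cov = gp_posterior(X, y, np.random.rand(5, 2))
```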
|
|
08:35-08:40, Paper WeAT10.2 | |
A Hierarchical Framework for Solving the Constrained Multiple Depot Traveling Salesman Problem |
|
Yang, Ruixiao | Massachusetts Institute of Technology |
Fan, Chuchu | Massachusetts Institute of Technology |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Planning, Scheduling and Coordination, Task Planning
Abstract: The Multiple Depot Traveling Salesman Problem (MDTSP) is a variant of the NP-hard Traveling Salesman Problem (TSP) with more than one salesman to jointly visit all destinations, commonly found in task planning in multi-agent robotic systems. Traditional MDTSP overlooks practical constraints like limited battery level and inter-agent conflicts, often leading to infeasible or unsafe solutions in reality. In this work, we incorporate energy and resource consumption constraints to form the Constrained MDTSP (CMDTSP). We design a novel hierarchical framework to obtain high-quality solutions with low computational complexity. The framework decomposes a given CMDTSP instance into manageable sub-problems, each handled individually via a TSP solver and heuristic search to generate tours. The tours are then aggregated and processed through a Mixed-Integer Linear Program (MILP), which contains significantly fewer variables and constraints than the MILP for the exact CMDTSP, to form a feasible solution efficiently. We demonstrate the performance of our framework on both real-world and synthetic datasets. It reaches a mean 12.48% optimality gap and 41.7x speedup over the exact method on common instances and a 5.22% to 14.84% solution quality increase with more than 79.8x speedup over the best baseline on large instances where the exact method times out.
|
|
08:40-08:45, Paper WeAT10.3 | |
Fully Differentiable Adaptive Informative Path Planning |
|
Jakkala, Kalvik | University of North Carolina at Charlotte |
Akella, Srinivas | University of North Carolina at Charlotte |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Environment Monitoring and Management, Integrated Planning and Learning
Abstract: Autonomous robots can survey and monitor large environments. However, these robots often have limited computational and power resources, making it crucial to develop an efficient and adaptive informative path planning (IPP) algorithm. Such an algorithm must quickly adapt to environmental data to maximize the information collected while accommodating path constraints, such as distance budgets and boundary limitations. Current approaches to this problem often rely on maximizing mutual information using methods such as greedy algorithms, Bayesian optimization, and genetic algorithms. These methods can be slow and do not scale well to large or 3D environments. We present an adaptive IPP approach that is fully differentiable, significantly faster than previous methods, and scalable to 3D spaces. Our approach also supports continuous sensing robots, which collect data continuously along the entire path, by leveraging streaming sparse Gaussian processes. Benchmark results on two real-world datasets demonstrate that our approach yields solutions that are on par with or better than baseline methods while being up to two orders of magnitude faster. Additionally, we showcase our adaptive IPP approach in a 3D space using a system-on-chip embedded computer with minimal computational resources. Our code is available in the SGP-Tools Python library with a companion ROS 2 package for deployment on ArduPilot-based robots.
|
|
08:45-08:50, Paper WeAT10.4 | |
Online Informative Motion Planning for Active Information Gathering of a Non-Stationary Gaussian Process |
|
Mao, Kexiang | Shanghai Jiao Tong University |
He, Jianping | Shanghai Jiao Tong University |
Duan, Xiaoming | Shanghai Jiao Tong University |
Keywords: Environment Monitoring and Management, Motion and Path Planning, Reactive and Sensor-Based Planning
Abstract: Information gathering focuses on designing strategies for a robot to collect data about a physical process, aiming for accurate field reconstruction. While many recent methods have been proposed to address this problem, they often assume the model of the physical process is a priori known and stationary—assumptions that rarely hold in practice. This paper presents a novel informative motion planning approach for online information gathering of a non-stationary Gaussian process. Our approach comprises two key components: an informative path planner that explores the physical field and an adaptive velocity planner that adjusts the robot's velocity profile exploiting the field's spatial variability. Additionally, we propose a path smoothing and tracking strategy to ensure continuous robot motion. Extensive simulations on a bathymetric mapping task demonstrate the effectiveness of our approach, showing superior performance in reconstructing non-stationary physical fields compared to several baseline methods.
|
|
08:50-08:55, Paper WeAT10.5 | |
REACT: Multi Robot Energy-Aware Orchestrator for Indoor Search and Rescue Critical Tasks |
|
Maresca, Fabio | NEC Laboratories Europe GmbH |
Romero, Arnau | I2CAT Foundation |
Delgado, Carmen | I2CAT Foundation |
Sciancalepore, Vincenzo | NEC Laboratories Europe GmbH |
Paradells, Josep | Universitat Politecnica De Catalunya |
Costa-Perez, Xavier | NEC Laboratories Europe |
Keywords: Search and Rescue Robots, Path Planning for Multiple Mobile Robots or Agents, Robotics in Under-Resourced Settings
Abstract: Smart factories enhance production efficiency and sustainability, but emergencies like human errors, machinery failures and natural disasters pose significant risks. In critical situations, such as fires or earthquakes, collaborative robots can assist first-responders by entering damaged buildings and locating missing persons, mitigating potential losses. Unlike previous solutions that overlook the critical aspect of energy management, in this paper we propose REACT, a smart energy-aware orchestrator that optimizes the exploration phase, ensuring prolonged operational time and effective area coverage. Our solution leverages a fleet of collaborative robots equipped with advanced sensors and communication capabilities to explore and navigate unknown indoor environments, such as smart factories affected by fires or earthquakes, with a high density of obstacles. By leveraging real-time data exchange and cooperative algorithms, the robots dynamically adjust their paths, minimize redundant movements and reduce energy consumption. Extensive simulations confirm that our approach significantly improves the efficiency and reliability of search and rescue missions in complex indoor environments, improving the exploration rate by 10% over existing methods and reaching a map coverage of 97% under time-critical operations, and up to nearly 100% under a relaxed time constraint.
|
|
08:55-09:00, Paper WeAT10.6 | |
Multi-Agent Ergodic Exploration under Smoke-Based Time-Varying Visibility Constraints |
|
Wittemyer, Elena | Yale University |
Rao, Ananya | Carnegie Mellon University |
Abraham, Ian | Yale University |
Choset, Howie | Carnegie Mellon University |
Keywords: Aerial Systems: Perception and Autonomy, Vision-Based Navigation, Path Planning for Multiple Mobile Robots or Agents
Abstract: In this work, we consider the problem of multi-agent informative path planning (IPP) for robots whose sensor visibility evolves over time as a consequence of a time-varying natural phenomenon. We leverage ergodic trajectory optimization (ETO), which generates paths such that the amount of time an agent spends in an area is proportional to the expected information in that area. We focus specifically on the problem of multi-agent drone search of a wildfire, where we use the time-varying environmental process of smoke diffusion to construct a sensor visibility model. This sensor visibility model is used to repeatedly calculate an expected information distribution (EID) to be used in the ETO algorithm. Our experiments show that our exploration method achieves improved information gathering over both baseline search methods and naive ergodic search formulations.
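Ergodic trajectory optimization is commonly posed with a spectral ergodic metric of the form below (a standard formulation, not necessarily the exact variant used in this work), where the expected information distribution (EID) enters through its basis coefficients:

```latex
\mathcal{E}\big(x(\cdot)\big)=\sum_{k}\Lambda_k\,\big(c_k-\phi_k\big)^2,\qquad
c_k=\frac{1}{T}\int_0^{T} F_k\big(x(t)\big)\,dt,\qquad
\phi_k=\int_{\mathcal{W}} \mathrm{EID}(w)\,F_k(w)\,dw,
```

with F_k a Fourier basis over the workspace W and Lambda_k decaying weights; minimizing this metric drives the time each agent spends in a region toward the expected information contained in that region.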
|
|
WeAT11 |
314 |
Safe Control 1 |
Regular Session |
Chair: Francis, Jonathan | Bosch Center for Artificial Intelligence |
|
08:30-08:35, Paper WeAT11.1 | |
DiffTune-MPC: Closed-Loop Learning for Model Predictive Control |
|
Tao, Ran | University of Illinois Urbana-Champaign |
Cheng, Sheng | University of Illinois Urbana-Champaign |
Wang, Xiaofeng | University of South Carolina |
Wang, Shenlong | University of Illinois at Urbana-Champaign |
Hovakimyan, Naira | University of Illinois at Urbana-Champaign |
Keywords: Optimization and Optimal Control, Machine Learning for Robot Control, Model Learning for Control
Abstract: Model predictive control (MPC) has been applied to many platforms in robotics and autonomous systems for its capability to predict a system's future behavior while incorporating constraints that a system may have. To enhance the performance of a system with an MPC controller, one can manually tune the MPC's cost function. However, it can be challenging due to the possibly high dimension of the parameter space as well as the potential difference between the open-loop cost function in MPC and the overall closed-loop performance metric function. This paper presents DiffTune-MPC, a novel learning method, to learn the cost function of an MPC in a closed-loop manner. The proposed framework is compatible with the scenario where the time interval for performance evaluation and MPC's planning horizon have different lengths. We show the auxiliary problem whose solution admits the analytical gradients of MPC and discuss its variations in different MPC settings, including nonlinear MPCs that are solved using sequential quadratic programming. Simulation results demonstrate the learning capability of DiffTune-MPC and the generalization capability of the learned MPC parameters.
|
|
08:35-08:40, Paper WeAT11.2 | |
Combined Modal Robust Cascade Control for Wheeled Self-Reconfigurable Robots under Drive Failure and Safety Threat |
|
Jiang, Tao | Chongqing University |
Wang, Jianxiang | Chongqing University |
Zheng, Zhi | Chongqing University |
Mo, Rongqin | Chongqing University |
Sun, Yizhuo | Harbin Institute of Technology |
Keywords: Robot Safety, Motion Control, Robust/Adaptive Control
Abstract: Wheeled self-reconfigurable robots (WSRRs) are a novel class of multi-robot systems with flexible configurations and task adaptability, offering broad application prospects in unstructured task environments. Based on nonholonomic constraints and the Lagrangian method, this paper establishes the combined modal kinematics and dynamics of WSRRs with arbitrary reconfiguration scales. At the kinematic level, a smooth obstacle-avoidance strategy based on safety geofencing is designed under the nonholonomic constraints to ensure safety. At the dynamic level, an adaptive fault-tolerant mechanism is introduced to guarantee reasonable torque allocation and avoid degradation of tracking performance. In addition, an improved extended state observer (IESO) is presented, which suppresses high-frequency oscillations caused by measurement noise and the peaking phenomenon due to initial observer error, achieving robust velocity tracking control under unknown lumped disturbances.
|
|
08:40-08:45, Paper WeAT11.3 | |
CaDRE: Controllable and Diverse Generation of Safety-Critical Driving Scenarios Using Real-World Trajectories |
|
Huang, Peide | Apple Inc |
Ding, Wenhao | Carnegie Mellon University |
Stoler, Benjamin | Carnegie Mellon University |
Francis, Jonathan | Bosch Center for Artificial Intelligence |
Chen, Bingqing | Bosch Center for AI |
Zhao, Ding | Carnegie Mellon University |
Keywords: Robot Safety, Intelligent Transportation Systems, Autonomous Vehicle Navigation
Abstract: Simulation is an indispensable tool in the development and testing of autonomous vehicles (AVs), offering an efficient and safe alternative to road testing. An outstanding challenge with simulation-based testing is the generation of safety-critical scenarios, which are essential to ensure that AVs can handle rare but potentially fatal situations. This paper addresses this challenge by introducing a novel framework, CaDRE, to generate realistic, diverse, and controllable safety-critical scenarios. Our approach optimizes for both the quality and diversity of scenarios by employing a unique formulation and algorithm that integrates real-world scenarios, domain knowledge, and black-box optimization. We validate the effectiveness of our framework through extensive testing in three representative types of traffic scenarios. The results demonstrate superior performance in generating diverse and high-quality scenarios with greater sample efficiency than existing reinforcement learning (RL) and sampling-based methods.
|
|
08:45-08:50, Paper WeAT11.4 | |
Certificated Actor-Critic: Hierarchical Reinforcement Learning with Control Barrier Functions for Safe Navigation |
|
Xie, Junjun | Harbin Institute of Technology, Shenzhen, China |
Zhao, Shuhao | School of Mechanical Engineering and Automation Harbin Institute |
Hu, Liang | Harbin Institute of Technology, Shenzhen |
Gao, Huijun | Harbin Institute of Technology |
Keywords: Robot Safety, Reinforcement Learning, Machine Learning for Robot Control
Abstract: Control Barrier Functions (CBFs) have emerged as a prominent approach to designing safe navigation systems for robots. Despite their popularity, current CBF-based methods exhibit some limitations: optimization-based safe control techniques tend to be either myopic or computationally intensive, and they rely on simplified system models; conversely, the learning-based methods suffer from the lack of quantitative indication in terms of navigation performance and safety. In this paper, we present a new model-free reinforcement learning algorithm called Certificated Actor-Critic (CAC), which introduces a hierarchical reinforcement learning framework and well-defined reward functions derived from CBFs. We carry out theoretical analysis and provide proofs for our algorithm, and propose several improvements to the algorithm's implementation. Our analysis is validated by two simulation experiments, showing the effectiveness of our proposed CAC algorithm.
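One simple way to derive a safety reward from a CBF, in the spirit of what the abstract describes (the paper's exact reward and hierarchy may differ), is to penalize violations of the discrete-time CBF condition:

```python
# Hedged sketch: a safety reward built from a CBF h(x), penalizing violations of the
# discrete-time condition h_{t+1} >= (1 - alpha) * h_t. Weights are assumptions.
def cbf_reward(h_now, h_next, alpha=0.9):
    margin = h_next - (1.0 - alpha) * h_now
    return min(0.0, margin)          # zero when the condition holds, negative otherwise

def total_reward(r_task, h_now, h_next, w_safe=5.0):
    return r_task + w_safe * cbf_reward(h_now, h_next)
```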
|
|
08:50-08:55, Paper WeAT11.5 | |
Exact Imposition of Safety Boundary Conditions in Neural Reachable Tubes |
|
Singh, Aditya | Indian Institute of Technology, Patna |
Feng, Zeyuan | Stanford University |
Bansal, Somil | Stanford University |
Keywords: Robot Safety, Machine Learning for Robot Control
Abstract: Hamilton-Jacobi (HJ) reachability analysis is a widely adopted verification tool to provide safety and performance guarantees for autonomous systems. However, it involves solving a partial differential equation (PDE) to compute a safety value function, whose computational and memory complexity scales exponentially with the state dimension, making its direct application to large-scale systems intractable. To overcome these challenges, DeepReach, a recently proposed learning-based approach, approximates high-dimensional reachable tubes using neural networks (NNs). While shown to be effective, the accuracy of the learned solution decreases with system complexity. One of the reasons for this degradation is a soft imposition of safety constraints during the learning process, which correspond to the boundary conditions of the PDE, resulting in inaccurate value functions. In this work, we propose ExactBC, a variant of DeepReach that imposes safety constraints exactly during the learning process by restructuring the overall value function as a weighted sum of the boundary condition and the NN output. Moreover, the proposed variant no longer needs a boundary loss term during the training process, thus eliminating the need to balance different loss terms. We demonstrate the efficacy of the proposed approach in significantly improving the accuracy of the learned value function for four challenging reachability tasks: a rimless wheel system with state resets, collision avoidance in a cluttered environment, autonomous rocket landing, and multi-aircraft collision avoidance.
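The exact-imposition idea can be illustrated with the hedged sketch below: the value network only ever contributes through a factor that vanishes at the terminal time, so the boundary condition holds by construction. The specific linear weighting used here is an assumption, not necessarily the paper's weighted sum:

```python
# Hedged sketch of exact boundary-condition imposition for a learned value function:
# V(x, t) = l(x) + (t - T) * NN(x, t), so V(x, T) = l(x) exactly, with no boundary loss.
import torch
import torch.nn as nn

class ExactBCValue(nn.Module):
    def __init__(self, state_dim, horizon_T, hidden=64):
        super().__init__()
        self.T = horizon_T
        self.net = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, x, t, l_of_x):
        """x: (B, n) states, t: (B, 1) times, l_of_x: (B, 1) boundary values l(x)."""
        correction = self.net(torch.cat([x, t], dim=-1))
        return l_of_x + (t - self.T) * correction   # equals l(x) exactly at t = T

model = ExactBCValue(state_dim=3, horizon_T=1.0)
x = torch.randn(8, 3); t = torch.full((8, 1), 1.0)
assert torch.allclose(model(x, t, x.norm(dim=-1, keepdim=True)),
                      x.norm(dim=-1, keepdim=True))
```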
|
|
08:55-09:00, Paper WeAT11.6 | |
RelAIBotiX: Reliability Assessment for AI-Controlled Robotic Systems |
|
Grimmeisen, Philipp | University of Stuttgart |
Golwalkar, Rucha | University of Lübeck |
Sautter, Friedrich | IAS, Uni Stuttgart |
Morozov, Andrey | University of Stuttgart |
Keywords: Robot Safety, AI-Based Methods, Probability and Statistical Methods
Abstract: AI-controlled robotic systems can introduce significant risks to both humans and the environment. Traditional reliability assessment methods fall short in addressing the complexities of these systems, particularly when dealing with black-box or dynamically changing control policies. The traditional approaches are applied manually and do not consider frequent software updates. In this paper, we present RelAIBotiX, a new methodology that enables dynamic and continuous reliability assessment, specifically tailored for robotic systems controlled by AI algorithms. RelAIBotiX is a dynamic reliability assessment framework that combines four methods: (i) Skill Detection that automatically identifies executed skills using deep learning techniques, (ii) Behavioral Analysis that creates an operational profile of the robotic system containing information about the skill execution sequence, active components for each skill, and their utilization intensity, which influences their failure rates, (iii) Reliability Model Generation that automatically transforms the operational profile and reliability data of robotic hardware components into quantitative hybrid reliability models, and (iv) Reliability Model Solver for the numerical evaluation of the generated reliability models. Our evaluation included computing the reliability of the system, the probability of failure of individual skills, and component sensitivity analysis. We validated the applicability of the proposed framework in five simulated and real-world setups.
|
|
WeAT12 |
315 |
Human-Robot Interaction 3 |
Regular Session |
Chair: Fitter, Naomi T. | Oregon State University |
Co-Chair: Yuan, Wenzhen | University of Illinois |
|
08:30-08:35, Paper WeAT12.1 | |
Adaptive Emotional Expression in Social Robots: A Multimodal Approach to Dynamic Emotion Modeling |
|
Park, Haeun | Ulsan National Institute of Science and Technology |
Lee, Jiyeon | Ulsan National Institute of Science and Technology |
Lee, Hui Sung | UNIST (Ulsan National Institute of Science and Technology) |
Keywords: Emotional Robotics, Gesture, Posture and Facial Expressions, Robot Companions
Abstract: Social robots have been extensively studied in recent decades, with many researchers exploring the use of modalities such as facial expressions to achieve more natural emotions in robots. Various methods have been attempted to generate and express robot emotions, including computational models that define an affect space and show dynamic emotion changes. However, the implementation of multimodal expression in previous models is ambiguous, and the generation of emotions in response to stimuli relies on heuristic methods. In this paper, we present a framework that enables robots to naturally express their emotions in a multimodal way, where the emotion can change over time based on the given stimulus values. By representing the robot’s emotion as a position in an affect space of a computational emotion model, we consider the given stimulus values as driving forces that can shift the emotion position dynamically. In order to examine the feasibility of our proposed method, a mobile robot prototype was implemented that can recognize touch and express different emotions with facial expressions and movements. The experiment demonstrated that the emotion elicited by a given stimulus is contingent upon the robot’s previous state, thereby imparting the impression that the robot possesses a distinctive emotion model. Furthermore, the Godspeed survey results indicated that our model was rated significantly higher than the baseline, which did not include a computational emotion model, in terms of anthropomorphism, animacy, and perceived intelligence. Notably, the unpredictability of emotion switching contributed to a perception of greater lifelikeness, which in turn enhanced the overall interaction experience.
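A minimal sketch of the dynamic-affect idea follows: the emotion is a point in a valence-arousal space, stimuli act as driving forces, and the state decays toward a baseline, so the response to a given touch depends on the previous emotional state. All parameter values and the decay form are illustrative assumptions:

```python
# Hedged sketch of a dynamic affect-space update; not the paper's emotion model.
import numpy as np

def emotion_step(e, stimulus_force, dt=0.1, decay=0.5, baseline=np.zeros(2)):
    """e, stimulus_force: 2D vectors in valence-arousal coordinates, kept in [-1, 1]^2."""
    e_next = e + dt * (stimulus_force - decay * (e - baseline))
    return np.clip(e_next, -1.0, 1.0)

# A gentle stroke (positive valence, mild arousal) applied repeatedly: the emotion
# drifts toward a content state, and its trajectory depends on where it started.
e = np.array([-0.4, 0.6])                  # previously annoyed / aroused
for _ in range(20):
    e = emotion_step(e, stimulus_force=np.array([0.8, -0.2]))
print(e)
```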
|
|
08:35-08:40, Paper WeAT12.2 | |
CAS: Fusing DNN Optimization & Adaptive Sensing for Energy-Efficient Multi-Modal Inference |
|
Weerakoon Mudiyanselage, Dulanga Kaveesha Weerakoon | Singapore-MIT Alliance for Research & Technology |
Subbaraju, Vigneshwaran | Agency for Science Technology and Research (A*STAR) |
Lim, Joo Hwee | I2R A*STAR |
Misra, Archan | Singapore Management University |
Keywords: Human-Robot Collaboration, Multi-Modal Perception for HRI, Embedded Systems for Robotic and Automation
Abstract: Intelligent virtual agents are used to accomplish complex multi-modal tasks such as human instruction comprehension in mixed-reality environments by increasingly adopting richer, energy-intensive sensors and processing pipelines. In such applications, the context for activating sensors and processing blocks required to accomplish a given task instance is usually manifested via multiple sensing modes. Based on this observation, we introduce a novel Commit-and-Switch (CAS) paradigm that simultaneously seeks to reduce both sensing and processing energy. In CAS, we first commit to a low-energy computational pipeline with a subset of available sensors. Then, the task context estimated by this pipeline is used to optionally switch to another energy-intensive DNN pipeline and activate additional sensors. We demonstrate how CAS’s paradigm of interweaving DNN computation and sensor triggering can be instantiated principally by constructing multi-head DNN models and jointly optimizing the accuracy and sensing costs associated with different heads. We exemplify CAS via the development of the RealGIN-MH model for multi-modal target acquisition tasks, a core enabler of immersive human-agent interaction. RealGIN-MH achieves 12.9x reduction in energy overheads, while outperforming baseline dynamic model optimization approaches.
|
|
08:40-08:45, Paper WeAT12.3 | |
"Oh! It's Fun Chatting with You!" a Humor-Aware Social Robot Chat Framework |
|
Zhang, Heng | ENSTA Paris, Institut Polytechnique De Paris |
Saood, Adnan | ENSTA Paris - Institute Polytechnique De Paris |
García Cárdenas, Juan José | ENSTA - Institute Polytechinique De Paris |
Hei, Xiaoxuan | ENSTA Paris, Institut Polytechnique De Paris |
Tapus, Adriana | ENSTA Paris, Institut Polytechnique De Paris |
Keywords: Social HRI, Physical Human-Robot Interaction
Abstract: Humor is a key element in human interactions, essential for building connections and rapport. To enhance human-robot communication, we developed a humor-aware chat framework that enables robots to deliver contextually appropriate humor. This framework takes into account the interaction environment, and user’s profile as well as emotional state. Two GPT models are used to generate responses. The initial one, named sensor-GPT, processes contextual data from the sensor along with the user’s response and conversation history to create prompts for the second one, chat-GPT. These prompts can guide the model on how to integrate appropriate humor elements into the conversation, ensuring that the dialogue is both contextually relevant and humorous. Our experiment compared the effectiveness of humor expression between our framework and the GPT-4o model. The results demonstrate that robots using our framework significantly outperform those using GPT-4o in humor expression, extending conversations, and improving overall interaction quality.
|
|
08:45-08:50, Paper WeAT12.4 | |
Social Gesture Recognition in SpHRI: Leveraging Fabric-Based Tactile Sensing on Humanoid Robots |
|
Crowder, Dakarai | University of Illinois Urbana Champaign |
Vandyck, Kojo Egyir | University of Illinois Urbana-Champaign |
Sun, Xiping | University of Illinois Urbana-Champaign Champaign, IL ‧ Pu |
McCann, James | Carnegie Mellon University |
Yuan, Wenzhen | University of Illinois |
Keywords: Physical Human-Robot Interaction, Touch in HRI
Abstract: Humans are able to convey different messages using only touch. Equipping robots with the ability to understand social touch adds another modality in which humans and robots can communicate. In this paper, we present a social gesture recognition system using a fabric-based, large-scale tactile sensor integrated onto the arms of a humanoid robot. We built a social gesture dataset using multiple participants and extracted temporal features for classification. By collecting real-world data on a humanoid robot, our system provides valuable insights into human-robot social touch, further advancing the development of spHRI systems for more natural and effective communication.
|
|
08:50-08:55, Paper WeAT12.5 | |
Seeing Eye to Eye: Design and Evaluation of a Custom Expressive Eye Display Module for the Stretch Mobile Manipulator |
|
Morales Mayoral, Rafael | Oregon State University |
Buchmeier, Sean | Oregon State University |
Mockel, Stayce | Oregon State University |
Chavez, Courtney J. | Oregon State University |
Fitter, Naomi T. | Oregon State University |
Keywords: Gesture, Posture and Facial Expressions, Intention Recognition, Human-Robot Collaboration
Abstract: Mobile manipulators - robots with a moving base and an arm for grasping objects - are becoming more common in human-populated environments, such as hospitals, warehouses, and even homes. Yet most mobile manipulators lack clear ways to communicate intent to human interlocutors in a continuous, socially acceptable, and easy-to-interpret way. One possible solution for improving mobile manipulator communication is the addition of expressive eyes. This paper presents the design and evaluation of a custom expressive LED eye module for mobile manipulators, which can display both gaze and emotional expressions. Our evaluation study (N = 32) involved a mock teamwork task alongside a Hello Robot Stretch RE2 mobile manipulator with the custom LED eye module. The results showed that both gaze and emotional expressions supported better participant performance in the task and more feelings of social closeness. Emotional eye expressions also yielded higher ratings of robot social warmth and competence. This work can inform mobile manipulator design for smoother integration into human-populated spaces.
|
|
08:55-09:00, Paper WeAT12.6 | |
UGotMe: An Embodied System for Affective Human-Robot Interaction |
|
Li, Peizhen | Macquarie University |
Cao, Longbing | Macquarie University |
Wu, Xiao-Ming | Sun Yat-Sen University |
Yu, Xiaohan | Macquarie University |
Runze, Yang | Macquarie University |
Keywords: Social HRI, Gesture, Posture and Facial Expressions, Emotional Robotics
Abstract: Equipping humanoid robots with the capability to understand emotional states of human interactants and express emotions appropriately according to situations is essential for affective human-robot interaction. However, enabling current vision-aware multimodal emotion recognition models for affective human-robot interaction in the real world raises embodiment challenges: addressing the environmental noise issue and meeting real-time requirements. First, in multiparty conversation scenarios, the noise inherent in the visual observation of the robot, which may come from either 1) distracting objects in the scene or 2) inactive speakers appearing in the field of view of the robot, hinders the models from extracting emotional cues from vision inputs. Second, real-time response, a desired feature for an interactive system, is also challenging to achieve. To tackle both challenges, we introduce an affective human-robot interaction system called UGotMe designed specifically for multiparty conversations. Two denoising strategies are proposed and incorporated into the system to solve the first issue. Specifically, to filter out distracting objects in the scene, we propose extracting face images of the speakers from the raw images and introduce a customized active face extraction strategy to rule out inactive speakers. As for the second issue, we employ efficient data transmission from the robot to the local server to improve real-time response capability. We deploy UGotMe on a humanoid robot named Ameca to validate its real-time inference capabilities in practical scenarios. Videos demonstrating real-world deployment are available at https://lipzh5.github.io/HumanoidVLE/
|
|
WeAT13 |
316 |
Soft Robotic Grasping 1 |
Regular Session |
Chair: Ichnowski, Jeffrey | Carnegie Mellon University |
Co-Chair: Stewart-Height, Abriana | Massachusetts Institute of Technology |
|
08:30-08:35, Paper WeAT13.1 | |
SCU-Hand: Soft Conical Universal Robotic Hand for Scooping Granular Media from Containers of Various Sizes |
|
Takahashi, Tomoya | OMRON SINIC X Corporation |
Beltran-Hernandez, Cristian Camilo | OMRON SINIC X Corporation |
Kuroda, Yuki | OMRON SINIC X Corporation |
Tanaka, Kazutoshi | OMRON SINIC X Corporation |
Hamaya, Masashi | OMRON SINIC X Corporation |
Ushiku, Yoshitaka | OMRON SINIC X Corporation |
Keywords: Soft Robot Applications, Soft Robot Materials and Design, Robotics and Automation in Life Sciences
Abstract: Automating small-scale experiments in materials science presents challenges due to the heterogeneous nature of experimental setups. This study introduces the SCU-Hand (Soft Conical Universal Robot Hand), a novel end-effector designed to automate the task of scooping powdered samples from various container sizes using a robotic arm. The SCU-Hand employs a flexible, conical structure that adapts to different container geometries through deformation, maintaining consistent contact without complex force sensing or machine learning-based control methods. Its reconfigurable mechanism allows for size adjustment, enabling efficient scooping from diverse container types. By combining soft robotics principles with a sheet-morphing design, our end-effector achieves high flexibility while retaining the necessary rigidity for effective powder manipulation. We detail the design principles, fabrication process, and experimental validation of the SCU-Hand. Experimental validation showed that the scooping capacity is about 20% higher than that of a commercial tool, with a scooping performance of more than 95% for containers of sizes between 67 mm and 110 mm. This research contributes to laboratory automation by offering a cost-effective, easily implementable solution for automating tasks such as materials synthesis and characterization processes.
|
|
08:35-08:40, Paper WeAT13.2 | |
VSB - Variable Stiffness Based on Bowden Cables: A Simple Mechanism for Soft Robotic Hands |
|
Puhlmann, Steffen | TU Berlin |
Albu-Schäffer, Alin | DLR - German Aerospace Center |
Höppner, Hannes | Berliner Hochschule Für Technik, BHT |
Keywords: Compliant Joints and Mechanisms, Multifingered Hands
Abstract: Soft robotic hands compensate for uncertainty in perception and actuation by leveraging passive deformation in their intrinsically compliant hardware, facilitating robust and dexterous interactions with their environment. The ability to adjust the level of compliance during operation has the potential to further improve the performance of these hands by enabling novel interaction strategies. However, achieving variable stiffness mechanically typically requires significant engineering complexity, making these systems difficult to manufacture, prone to error, and expensive. We present a novel, very simple mechanism for achieving variable stiffness. This mechanism employs tendon-driven antagonistic actuation, with Bowden cables connecting elastic elements to servomotors. It supports compact actuator designs, while the Bowden cables facilitate flexible component placement within a robotic system. Following our approach, variable stiffness actuators can be easily manufactured at low cost from readily available materials. Despite its simplicity, we demonstrate that our mechanism provides consistent and precise control over stiffness levels and contact torques, showcasing its potential for a broad range of applications in soft robotic systems.
|
|
08:40-08:45, Paper WeAT13.3 | |
Design and Experimental Validation of Woodwork-Inspired Soft Pneumatic Grippers |
|
Stewart-Height, Abriana | Massachusetts Institute of Technology |
Bolli, Roberto | MIT |
Kamienski, Emily | Massachusetts Institute of Technology |
Asada, Harry | MIT |
Keywords: Soft Robot Applications, Physical Human-Robot Interaction, Grippers and Other End-Effectors
Abstract: This paper presents a novel design concept of a pair of soft gripper hands that can establish a secure connection between them for bearing a large load with a low air pressure. The design was inspired by dovetail joints in carpentry that enable a tight, strong connection between two pieces of wood. We propose to mimic the dovetail joint mechanism by using soft robotic fingers that interlace to each other for secure connection. The work was motivated by the need for securing a connection between two soft robotic arms for holding a balance-impaired older adult in case of losing balance. First, the design principle of dovetail-like secure soft finger connection is presented, and its potential application to a portable fall prevention system is described. Details of the dovetail soft finger design, its rapid inflation method, and other implementation issues are then discussed. Through experiments of a proof-of-concept prototype, it is validated that the dovetail soft fingers can bear at least 18 kg of load with only 52 kPa of air chamber pressure filled in 250 ms of charging time. At the end, the proposed method is compared to alternative methods using a Pugh chart.
|
|
08:45-08:50, Paper WeAT13.4 | |
A Variable Stiffness and Transformable Entanglement Soft Robotic Gripper |
|
Zhang, Huayu | The Chinese University of Hong Kong |
Pan, Tianle Flippy | The Chinese University of Hong Kong |
Zhou, Jianshu | University of California, Berkeley |
Liang, Boyuan | University of California, Berkeley |
Shu, Jing | The Chinese University of Hong Kong |
Zhu, Puchen | The Chinese University of Hong Kong |
An, Jiajun | The Chinese University of Hong Kong |
Liu, Yunhui | Chinese University of Hong Kong |
Ma, Xin | Chinese University of Hong Kong |
Keywords: Soft Robot Applications, Grippers and Other End-Effectors, Grasping
Abstract: For objects with complex topological and geometrical features, stochastic topological grasping can be executed without the necessity for feedback or precise planning. However, this grasping method has two significant limitations. First, the technique’s effectiveness is reduced when interacting with topologically and geometrically simple objects like spheres, cubes, and cylinders, due to the inherent variability in grasping patterns. Additionally, the method’s low stiffness restricts its ability to securely handle heavier objects. To address these challenges, this paper proposes an entanglement soft robotic gripper with variable stiffness and two transformable grasping modes (entanglement and clamping modes). The gripper contains three filaments, which can enhance the stiffness through the mechanism of layer jamming. Furthermore, the gripper can be switched between the entanglement mode and the clamping mode by adjusting the working length of the filaments. A grasping performance comparison with and without variable stiffness was carried out, and the results indicated that the implementation of variable stiffness led to a 149% increase in payload weight. Through experimental validation, we successfully employed the gripper in variable stiffness and transformed modes to grasp items with various shapes and weights. Demonstrations of grasping heavier objects and of transforming between the two grasping modes were also conducted to showcase the adaptability and versatility of the gripper.
|
|
08:50-08:55, Paper WeAT13.5 | |
Soft Robotic Dynamic In-Hand Pen Spinning |
|
Yao, Yunchao | Carnegie Mellon University |
Yoo, Uksang | Carnegie Mellon University |
Oh, Jean | Carnegie Mellon University |
Atkeson, Christopher | CMU |
Ichnowski, Jeffrey | Carnegie Mellon University |
Keywords: In-Hand Manipulation, Modeling, Control, and Learning for Soft Robots, Dexterous Manipulation
Abstract: Dynamic in-hand manipulation remains a challenging task for soft robotic systems, which have demonstrated advantages in safe compliant interactions but struggle with high-speed dynamic tasks. In this work, we present SoftSpin, a system for dynamic pen spinning using a soft and compliant robotic hand. Unlike previous works that rely on quasi-static actions and precise object models, the proposed system learns to spin a pen through trial-and-error using only real-world data, without requiring explicit prior knowledge of the pen’s physical attributes. With self-labeled trials sampled from the real world, the system discovers the set of pen grasping and spinning primitive parameters that enables a soft hand to spin the pen robustly and reliably. After 130 sampled actions, SoftSpin achieves a 100% success rate across three pens with different weights and weight distributions, demonstrating the system’s generalizability and robustness to changes in object properties. The results highlight the potential for soft robotic end-effectors to perform dynamic tasks, including rapid in-hand manipulation. We also demonstrate that SoftSpin generalizes to spinning tools with different shapes and weights, such as a brush and a screwdriver, which we spin with 10/10 and 5/10 success rates, respectively. Videos, data, and code are available at https://soft-spin.github.io
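For illustration, the self-labeled trial-and-error loop described above can be pictured as a sampling search over primitive parameters. The sketch below is schematic only: the parameter dimensionality, the sampler, and the score returned by the hypothetical execute_spin_trial stand-in are not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def execute_spin_trial(params):
        # Stand-in for executing one grasp-and-spin primitive on hardware
        # and self-labeling its outcome (e.g., from onboard vision).
        return float(-np.sum((params - 0.6) ** 2))  # toy score

    best_params, best_score = None, -np.inf
    for trial in range(130):                      # the paper's trial budget
        params = rng.uniform(0.0, 1.0, size=4)    # grasp + spin parameters
        score = execute_spin_trial(params)
        if score > best_score:
            best_params, best_score = params, score
    print(best_params)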
|
|
08:55-09:00, Paper WeAT13.6 | |
Kinetostatics and Retention Force Analysis of Soft Robot Grippers with External Tendon Routing |
|
Gunderman, Anthony | University of Arkansas |
Wang, Yifan | Georgia Institute of Technology |
Gunderman, Benjamin | University of Arkansas |
Qiu, Alex | Georgia Institute of Technology |
Azizkhani, Milad | Georgia Institute of Technology |
Sommer, Joseph | Georgia Institute of Technology |
Chen, Yue | Georgia Institute of Technology |
Keywords: Soft Robot Applications, Modeling, Control, and Learning for Soft Robots, Grippers and Other End-Effectors
Abstract: Soft robots (SR) are a class of continuum robots that enable safe human interaction with task versatility beyond rigid robots. This has resulted in their rapid adoption in a number of applications that require manipulation of delicate and irregular objects. Despite their advantages, SR grippers typically require case-specific experimental characterization for shape and gripper retention force estimation. This letter presents a kinetostatic modeling approach based on strain energy minimization subject to mechanics and geometric constraints for shape estimation of SR grippers with external tendon routing (ETR), including those with composite structures. Additionally, Castigliano's First Theorem is used to estimate the retention force of the gripper. These models are evaluated across four different ETR SR grippers. The mechanics model predicted the fingertip position and orientation with an accuracy of 1.06±0.62 mm (1.79%±1.05% of length) and 3.58°±2.82° with respect to tendon force and 0.72±0.45 mm (1.22%±0.76% of length) and 2.86°±2.11° with respect to tendon retraction. The retention force of the gripper was predicted with an average error of 0.20±0.12 N.
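As background for the retention-force estimate mentioned above, Castigliano's First Theorem (stated here in its general textbook form, not the paper's specific derivation) relates the strain energy U of an elastic structure, expressed as a function of generalized displacements q_i, to the corresponding generalized forces:

    Q_i = \frac{\partial U}{\partial q_i}

Applied to the gripper, the retention force follows from differentiating the stored strain energy with respect to the displacement of the grasped object along the pull-out direction.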
|
|
WeAT14 |
402 |
Teleoperation and Human-Robot Interaction |
Regular Session |
|
08:30-08:35, Paper WeAT14.1 | |
Ego-A3: Adaptive Fusion-Based Disentangled Transformer for Egocentric Action Anticipation |
|
Kim, Min Hyuk | Chonnam National University |
Jung, JongWon | CHONNAM University |
Lee, Eungi | Chonnam National University |
Yoo, Seok Bong | Chonnam National University |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception, Wearable Robotics
Abstract: Recently, egocentric action anticipation for wearable robotics cameras has gained considerable attention due to its capability to analyze nouns and verbs from a first-person view. However, this field encounters challenges due to various uncertainties, such as action-irrelevant information and semantically fused representations of verbs and nouns. To overcome these issues, we introduce Ego-A3, designed to improve the robustness and reliability of egocentric action anticipation systems. Ego-A3 adaptively extracts action-relevant data to efficiently utilize additional information beyond visual data. Additionally, Ego-A3 produces effective disentangled representations for verbs and nouns by employing learnable verb and noun queries. Experiments on the EpicKitchens-100 and EGTEA Gaze+ datasets demonstrate that Ego-A3 outperforms existing methods in top-1 accuracy and mean top-5 recall. Our code is publicly available at https://github.com/alsgur0720/egocentric_anticipation.
|
|
08:35-08:40, Paper WeAT14.2 | |
A New Variable-Gain Sliding Mode Filter and Its Application to Velocity Filtering |
|
Aung, Myo Thant Sin | Yangon Technological University, Myanmar |
Kikuuwe, Ryo | Hiroshima University |
Paing, Soe Lin | North Carolina State University |
Yang, Jun | National University of Singapore |
Yu, Haoyong | National University of Singapore |
Keywords: Haptics and Haptic Interfaces, Motion Control, Robust/Adaptive Control
Abstract: This paper proposes a new variable-gain sliding mode filter augmented by variable windowing for achieving a smooth and reactive response over a broad range of input frequencies. The proposed filter can be seen as a synergistic combination of Kikuuwe et al.’s [1] sliding mode filter with varying gain and sliding surfaces and a novel varying-length moving-window algorithm. In all schemes, the estimated input speed is employed to adjust the filter parameters between low and high settings. The discrete-time algorithm of the proposed filter does not suffer from chattering, owing to the implicit (backward) Euler method. The effectiveness of the proposed filter in achieving a better trade-off between noise attenuation and signal preservation is validated in both simulation and experimental scenarios using the velocity signal obtained by differentiation of quantized position data.
|
|
08:40-08:45, Paper WeAT14.3 | |
A Comparative Study between a Virtual Wand and a One-To-One Approach for the Teleoperation of a Nearby Robotic Manipulator |
|
Poignant, Alexis | Sorbonne Université, ISIR UMR 7222 CNRS |
Morel, Guillaume | Sorbonne Université, CNRS, INSERM |
Jarrassé, Nathanael | Sorbonne Université, ISIR UMR 7222 CNRS |
Keywords: Telerobotics and Teleoperation, Physically Assistive Devices
Abstract: The prevailing and most effective approach to teleoperate a robotic arm involves a direct position-to-position mapping, imposing robotic end-effector movements that mirror those of the user. However, due to this one-to-one mapping, the robot's motions are limited by the user's capability, particularly in translation. Drawing inspiration from head pointers utilized in the 1980s, originally designed to enable drawing with limited head motions for tetraplegic individuals, we propose a "virtual wand" mapping that can be used by participants with reduced mobility. This mapping employs a virtual rigid linkage between the hand and the robot's end-effector. With this approach, rotations produce amplified translations through a lever arm, creating a "rotation-to-position" coupling and expanding the translation workspace at the expense of a reduced rotation space. In this study, we compare the virtual wand approach to the one-to-one position mapping through the realization of 6-DoF reaching tasks. Results indicate that the two different mappings perform comparably well, are equally well-received by users, and exhibit similar motor control behaviors. Nevertheless, the virtual wand mapping is anticipated to outperform in tasks characterized by large translations and minimal effector rotations, whereas direct mapping is expected to demonstrate advantages in large rotations with minimal translations. These results pave the way for new interactions and interfaces, particularly in disability assistance utilizing residual body movements (instead of hands) as control input. Leveraging body parts with substantial rotations could enable the accomplishment of tasks previously deemed infeasible with standard direct coupling interfaces.
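As a rough illustration of the two mappings being compared, the sketch below contrasts a one-to-one pose mapping with a virtual-wand mapping, in which hand rotations are converted into amplified end-effector translations through a lever arm. The function names and the wand length value are illustrative assumptions, not values taken from the paper.

    import numpy as np
    from scipy.spatial.transform import Rotation as R

    def one_to_one(hand_pos, hand_rot):
        # Direct mapping: the end-effector mirrors the user's hand pose.
        return hand_pos, hand_rot

    def virtual_wand(hand_pos, hand_rot, wand_length=0.5):
        # Virtual rigid link attached to the hand: rotating the hand sweeps
        # the wand tip, producing amplified end-effector translations.
        tip_offset = hand_rot.apply([0.0, 0.0, wand_length])
        return hand_pos + tip_offset, hand_rot

    hand_pos = np.array([0.10, 0.00, 0.20])
    hand_rot = R.from_euler("xyz", [0.0, 20.0, 0.0], degrees=True)
    print(virtual_wand(hand_pos, hand_rot)[0])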
|
|
08:45-08:50, Paper WeAT14.4 | |
A Novel Telelocomotion Framework with CoM Estimation for Scalable Locomotion on Humanoid Robots |
|
He, An-Chi | Virginia Tech |
Li, Junheng | University of Southern California |
Park, Jungsoo | Virginia Tech |
Kolt, Omar | University of Southern California |
Beiter, Benjamin | Virginia Polytechnic Institute and State University |
Leonessa, Alexander | Virginia Tech |
Nguyen, Quan | University of Southern California |
Akbari Hamed, Kaveh | Virginia Tech |
Keywords: Telerobotics and Teleoperation, Haptics and Haptic Interfaces, Humanoid and Bipedal Locomotion
Abstract: Teleoperated humanoid robot systems have made substantial advancements in recent years, offering a physical avatar that harnesses human skills and decision-making while safeguarding users from hazardous environments. However, current telelocomotion interfaces often fail to accurately represent the robot's environment, limiting the user’s ability to effectively navigate the robot through unstructured terrain. This paper presents an initial telelocomotion framework that integrates the ForceBot locomotion interface with the small-sized humanoid robot, HECTOR V2. The framework utilizes ForceBot to simulate walking motion and estimate the user’s Center of Mass (CoM) trajectory, which serves as a tracking reference for the robot. On the robot side, a model predictive control (MPC) approach, based on a reduced-order single rigid body model, is employed to track the user’s scaled trajectory. We present experimental results on ForceBot’s CoM estimation and the robot’s tracking performance, demonstrating the feasibility of this approach.
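For context, the reduced-order single rigid body model that such MPC formulations typically build on (generic form; the paper's exact variant may differ) treats the robot as one body of mass m and rotational inertia I, driven by ground-reaction forces f_i applied at the foot positions r_i:

    m \ddot{p} = \sum_i f_i + m g, \qquad \frac{d}{dt}\,(I \omega) = \sum_i (r_i - p) \times f_i

where p is the CoM position and \omega the body angular velocity; the user's estimated CoM trajectory then enters as the reference that the MPC tracks.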
|
|
08:50-08:55, Paper WeAT14.5 | |
Stiffness Regulation Co-Pilot in Bilateral Teleimpedance Control: A Preliminary User Study |
|
Gomez Hernandez, Pedro | Aarhus University Herning |
Jakobsen, Jonas Mariager | SDU Robotics, the Maersk Mc-Kinney Moller Institute, University |
Pacchierotti, Claudio | Centre National De La Recherche Scientifique (CNRS) |
Chinello, Francesco | Aarhus University |
Fang, Cheng | University of Southern Denmark |
Keywords: Telerobotics and Teleoperation, Haptics and Haptic Interfaces, Physical Human-Robot Interaction
Abstract: Variable stiffness of a remote robot is crucial for a teleoperation system to deal with challenging tasks. External stiffness command interfaces have emerged as a promising solution for regulating the remote robot stiffness because of their accuracy, ergonomics, and avoidance of the "coupling effect" that usually exists in muscle activity-based stiffness interfaces. However, the use of an external stiffness command interface requires good coordination between the operator's two limbs, which simultaneously handle the teleoperation task and the stiffness regulation task, respectively; this is demanding for novice operators in dynamic situations that necessitate agile and timely stiffness adjustments. In this paper, a new concept of a Stiffness Regulation Co-pilot is proposed to facilitate the use of these interfaces. A co-pilot is a virtual agent that consists of a Stiffness Regulation Policy, which infers a reasonable stiffness regulation action from the task performance, and a feedback modality, which conveys the suggested stiffness regulation action to the operator. A preliminary user study was conducted to evaluate the efficacy of the co-pilot and the effect of its different feedback modalities. The results showed that cutaneous feedback, alone or combined with another modality, can potentially improve the task performance of the system and reduce the cognitive load of the operator compared to a teleoperation system that does not use the co-pilot.
|
|
08:55-09:00, Paper WeAT14.6 | |
Adaptive Neural Network Synchronous Tracking Control for Teleoperation Robots under Event-Triggered Mechanism |
|
Wang, Fujie | Dongguan University of Technology |
Yu, Yuanjia | Shenzhen University |
Li, Xing | School of Electrical Engineering & Intelligentization, Dongguan |
Luo, Junxuan | Dongguan University of Technology |
Zhong, Jinming | Shenzhen University |
Keywords: Motion Control, Human-Robot Collaboration, Grippers and Other End-Effectors
Abstract: This paper proposes an adaptive neural network synchronous tracking control strategy under an event-triggered mechanism to address modeling uncertainties and communication delays in bilateral teleoperation systems. By introducing the event-triggered mechanism to reduce the network communication frequency in the teleoperation system, the master and slave robots communicate with each other only when the triggering conditions are fulfilled, which enhances the efficiency of the network communication. This control strategy guarantees the exponential convergence of the position synchronization tracking error of the master-slave robot end-effectors. Moreover, the event-triggered conditions do not require any empirical design but can be derived inversely with the aid of Lyapunov stability theory, and the triggering time interval between two neighboring events is verified to be non-zero. It is further demonstrated using the Lyapunov principle that the presented adaptive neural network control strategy ensures asymptotic and exponential convergence of the position synchronization tracking error for the master-slave robots under the designed event-triggered mechanism. Finally, the feasibility and effectiveness of the developed control strategy are validated through comparative case studies.
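A minimal sketch of the kind of event-triggered transmission logic described above (illustrative only: the threshold rule and names here are hypothetical placeholders, whereas the paper derives its triggering condition from the Lyapunov analysis):

    import numpy as np

    def should_transmit(x_current, x_last_sent, sigma=0.05, eps=1e-3):
        # Transmit only when the deviation from the last transmitted state
        # exceeds a state-dependent threshold, reducing network traffic.
        error = np.linalg.norm(x_current - x_last_sent)
        return error > sigma * np.linalg.norm(x_current) + eps

    x_last_sent = np.zeros(3)
    for t in range(200):
        x = np.array([np.sin(0.05 * t), np.cos(0.05 * t), 0.01 * t])
        if should_transmit(x, x_last_sent):
            x_last_sent = x  # the master state is sent to the slave side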
|
|
WeAT15 |
403 |
Bimanual Manipulation 1 |
Regular Session |
|
08:30-08:35, Paper WeAT15.1 | |
Learning Visuotactile Skills with Two Multifingered Hands |
|
Lin, Toru | University of California, Berkeley |
Zhang, Yu | University of California Berkeley |
Li, Qiyang | University of California, Berkeley |
Qi, Haozhi | UC Berkeley |
Yi, Brent | University of California, Berkeley |
Levine, Sergey | UC Berkeley |
Malik, Jitendra | UC Berkeley |
Keywords: Bimanual Manipulation, Dexterous Manipulation, Learning from Demonstration
Abstract: Aiming to replicate human-like dexterity, perceptual experiences, and motion patterns, we explore learning from human demonstrations using a bimanual system with multifingered hands and visuotactile data. Two significant challenges exist: the lack of an affordable and accessible teleoperation system suitable for a dual-arm setup with multifingered hands, and the scarcity of multifingered hand hardware equipped with touch sensing. To tackle the first challenge, we develop HATO, a low-cost hands-arms teleoperation system that leverages off-the-shelf electronics, complemented with a software suite that enables efficient data collection; the comprehensive software suite also supports multimodal data processing, scalable policy learning, and smooth policy deployment. To tackle the latter challenge, we introduce a novel hardware adaptation by repurposing two prosthetic hands equipped with touch sensors for research. Using visuotactile data collected from our system, we learn skills to complete long-horizon, high-precision tasks which are difficult to achieve without multifingered dexterity and touch feedback. Furthermore, we empirically investigate the effects of dataset size, sensing modality, and visual input preprocessing on policy learning. Our results mark a promising step forward in bimanual multifingered manipulation from visuotactile data. Videos, code, and datasets can be found on: https://toruowo.github.io/hato
|
|
08:35-08:40, Paper WeAT15.2 | |
Learning Coordinated Bimanual Manipulation Policies Using State Diffusion and Inverse Dynamics Models |
|
Chen, Haonan | University of Illinois at Urbana-Champaign |
Xu, Jiaming | University of Illinois Urbana-Champaign |
Sheng, Lily | Tsinghua University |
Ji, Tianchen | University of Illinois at Urbana-Champaign |
Liu, Shuijing | The University of Texas at Austin |
Li, Yunzhu | Columbia University |
Driggs-Campbell, Katherine | University of Illinois at Urbana-Champaign |
Keywords: AI-Based Methods, Bimanual Manipulation, Imitation Learning
Abstract: When performing tasks like laundry, humans naturally coordinate both hands to manipulate objects and anticipate how their actions will change the state of the clothes. However, achieving such coordination in robotics remains challenging due to the need to model object movement, predict future states, and generate precise bimanual actions. In this work, we address these challenges by infusing the predictive nature of human manipulation strategies into robot imitation learning. Specifically, we disentangle task-related state transitions from agent-specific inverse dynamics modeling to enable effective bimanual coordination. Using a demonstration dataset, we train a diffusion model to predict future states given historical observations, envisioning how the scene evolves. Then, we use an inverse dynamics model to compute robot actions that achieve the predicted states. Our key insight is that modeling object movement helps in learning policies for coordinated bimanual manipulation tasks. Evaluating our framework across diverse simulation and real-world manipulation setups, including multimodal goal configurations, bimanual manipulation, deformable objects, and multi-object setups, we find that it consistently outperforms state-of-the-art state-to-action mapping policies. Our method demonstrates a remarkable capacity to navigate multimodal goal configurations and action distributions, maintain stability across different control modes, and synthesize a broader range of behaviors than those present in the demonstration dataset.
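The decomposition described above can be summarized schematically as follows; this is not the authors' architecture (the real state predictor is a diffusion model, and the dimensions and layer sizes below are placeholders):

    import torch
    import torch.nn as nn

    STATE_DIM, ACTION_DIM = 16, 14  # placeholder dimensions

    # Stand-in for the learned state-transition (diffusion) model.
    state_predictor = nn.Sequential(
        nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, STATE_DIM))

    # Agent-specific inverse dynamics: (s_t, s_{t+1}) -> a_t.
    inverse_dynamics = nn.Sequential(
        nn.Linear(2 * STATE_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM))

    s_t = torch.randn(1, STATE_DIM)
    s_next = state_predictor(s_t)                              # envision the next state
    a_t = inverse_dynamics(torch.cat([s_t, s_next], dim=-1))   # act to reach it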
|
|
08:40-08:45, Paper WeAT15.3 | |
BiFold: Bimanual Cloth Folding with Language Guidance |
|
Barbany, Oriol | IRI (CSIC-UPC) |
Colomé, Adrià | Institut De Robòtica I Informàtica Industrial (CSIC-UPC), Q28180 |
Torras, Carme | Csic - Upc |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Data Sets for Robot Learning
Abstract: Cloth folding is a complex task due to the inevitable self-occlusions of clothes, their complicated dynamics, and the disparate materials, geometries, and textures that garments can have. In this work, we learn folding actions conditioned on text commands. Translating high-level, abstract instructions into precise robotic actions requires sophisticated language understanding and manipulation capabilities. To do that, we leverage a pre-trained vision-language model and repurpose it to predict manipulation actions. Our model, BiFold, can take context into account and achieves state-of-the-art performance on an existing language-conditioned folding benchmark. To address the lack of annotated bimanual folding data, we introduce a novel dataset with automatically parsed actions and language-aligned instructions, enabling better learning of text-conditioned manipulation. BiFold attains the best performance on our dataset and demonstrates strong generalization to new instructions, garments, and environments.
|
|
08:45-08:50, Paper WeAT15.4 | |
One-Shot Dual-Arm Imitation Learning |
|
Wang, Yilong | Imperial College London |
Johns, Edward | Imperial College London |
Keywords: Dual Arm Manipulation, Imitation Learning, Visual Servoing
Abstract: We introduce One-Shot Dual-Arm Imitation Learning (ODIL), which enables dual-arm robots to learn precise and coordinated everyday tasks from just a single demonstration of the task. ODIL uses a new three-stage visual servoing (3-VS) method for precise alignment between the end-effector and target object, after which replay of the demonstration trajectory is sufficient to perform the task. This is achieved without requiring prior task or object knowledge, or additional data collection and training following the single demonstration. Furthermore, we propose a new dual-arm coordination paradigm for learning dual-arm tasks from a single demonstration. ODIL was tested on a real-world dual-arm robot, demonstrating state-of-the-art performance across six precise and coordinated tasks in both 4-DoF and 6-DoF settings, and showing robustness in the presence of distractor objects and partial occlusions. Videos are available at https://www.robot-learning.uk/one-shot-dual-arm.
|
|
08:50-08:55, Paper WeAT15.5 | |
In the Wild Ungraspable Object Picking with Bimanual Nonprehensile Manipulation |
|
Wu, Albert | Stanford University |
Kruse, Daniel | Rensselaer Polytechnic Institute |
Keywords: Dual Arm Manipulation, Mobile Manipulation, Grasping
Abstract: Picking diverse objects in the real world is a fundamental robotics skill. However, many objects in such settings are bulky, heavy, or irregularly shaped, making them ungraspable by conventional end effectors like suction grippers and parallel jaw grippers (PJGs). In this paper, we expand the range of pickable items without hardware modifications using bimanual nonprehensile manipulation. We focus on a grocery shopping scenario, where a bimanual mobile manipulator equipped with a suction gripper and a PJG is tasked with retrieving ungraspable items from tightly packed grocery shelves. From visual observations, our method first identifies optimal grasp points based on force closure and friction constraints. If the grasp points are occluded, a series of nonprehensile nudging motions are performed to clear the obstruction. A bimanual grasp utilizing contacts on the side of the end effectors is then executed to grasp the target item. In our replica grocery store, we achieved a 90% success rate over 102 trials in uncluttered scenes, and a 66% success rate over 45 trials in cluttered scenes. We also deployed our system to a real-world grocery store and successfully picked previously unseen items. Our results highlight the potential of bimanual nonprehensile manipulation for in-the-wild robotic picking tasks. A video summarizing this work can be found at youtu.be/g0hOrDuK8jM
|
|
08:55-09:00, Paper WeAT15.6 | |
Bimanual Grasp Synthesis for Dexterous Robot Hands |
|
Shao, Yanming | ShanghaiTech University |
Xiao, Chenxi | ShanghaiTech University |
Keywords: Bimanual Manipulation, Grasping, Dexterous Manipulation
Abstract: Humans naturally perform bimanual skills to handle large and heavy objects. To enhance a robot's object manipulation capabilities, generating effective bimanual grasp poses is essential. Nevertheless, bimanual grasp synthesis for dexterous hand manipulators remains underexplored. To bridge this gap, we propose the BimanGrasp algorithm for synthesizing bimanual grasps on 3D objects. The BimanGrasp algorithm generates grasp poses by optimizing an energy function that considers grasp stability and feasibility. Furthermore, the quality of the synthesized grasps is verified using the Isaac Gym physics simulation engine. These verified grasp poses form the BimanGrasp-Dataset, which is, to our knowledge, the first synthesized bimanual dexterous hand grasp pose dataset. The dataset comprises over 150k verified grasps on 900 objects, facilitating the synthesis of bimanual grasps through a data-driven approach. Lastly, we propose a diffusion model (BimanGrasp-DDPM) trained on the BimanGrasp-Dataset. This model achieved a grasp synthesis success rate of 69.87% and a significant acceleration in computational speed compared to the BimanGrasp algorithm.
|
|
WeAT16 |
404 |
Grasping 1 |
Regular Session |
Chair: Kasaei, Hamidreza | University of Groningen |
Co-Chair: Thondiyath, Asokan | IIT Madras |
|
08:30-08:35, Paper WeAT16.1 | |
Efficient 7-DoF Grasp for Target-Driven Object in Dense Cluttered Scenes |
|
Lei, Tianjiao | Chongqing University |
Sun, Yizhuo | Harbin Institute of Technology |
Huang, Yi | Chongqing University |
Huang, Jiangshuai | Nanyang Technological University |
Jiang, Tao | Chongqing University |
Keywords: Grasping, Perception for Grasping and Manipulation, Cyborgs
Abstract: Achieving a real-time precise grasp of a specified target object in densely cluttered environments is an essential capability for autonomous robot operation. Recently, considerable investigations on planar and spatial grasp have been carried out, and significant results have been obtained. However, these point cloud-based grasp prediction methods often fail to ensure that the generated grasp configurations meet the precise requirements of the task. Additionally, some of the existing grasp pipelines are too time-consuming to meet the demand for real-time robot response. In more challenging cluttered scenes, the quality of pose and gripper jaw opening estimation in high-dimensional space requires further improvement. Therefore, this paper introduces a data- and model-independent and efficient method to generate 7-DoF grasp configurations for arbitrary target objects from single-view point cloud data in dense cluttered scenes. In addition, this paper proposes a grasp framework that generates the grasp configuration for the target object while reducing the time consumed during the grasp process, to enable robots to efficiently grasp target objects for designated tasks. The grasp pipeline focuses on guided regions via target detection and rapidly adjusts grasp configurations through multi-region point cloud distribution perception. Extensive real-world robot experiments have demonstrated the effectiveness of the proposed method in grasping target objects in cluttered scenes, achieving higher success rates and reduced runtime compared to baseline methods. The realized code and video are available at https://github.com/L-tj/7DGCG.
|
|
08:35-08:40, Paper WeAT16.2 | |
Task-Oriented 6-DoF Grasp Pose Detection in Clutters |
|
Wang, An-Lan | Sun Yat-Sen University |
Chen, Nuo | Sun Yat-Sen University |
Lin, Kun-Yu | Sun Yat-Sen University |
Li, Yuan-Ming | Sun Yat-Sen University |
Zheng, Wei-Shi | Sun Yat-Sen University |
Keywords: Grasping
Abstract: In general, humans would grasp an object differently for different tasks, e.g., "grasping the handle of a knife to cut" vs. "grasping the blade to hand over". In the field of robotic grasp pose detection research, some existing works have considered this task-oriented grasping and made some progress, but they are generally constrained to low-DoF gripper types or non-cluttered settings, which are not applicable to human assistance in real life. With the aim of obtaining more general and practical grasp models, in this paper we investigate a new problem named Task-Oriented 6-DoF Grasp Pose Detection in Clutters (TO6DGC), which extends the task-oriented problem to the more general setting of 6-DoF grasp pose detection in cluttered (multi-object) scenarios. To this end, we construct a large-scale 6-DoF task-oriented grasping dataset, 6-DoF Task Grasp (6DTG), which features 4391 cluttered scenes with over 2 million 6-DoF grasp poses. Each grasp is annotated with a specific task, involving 6 tasks and 198 objects in total. Moreover, we propose One-Stage TaskGrasp (OSTG), a strong baseline to address the TO6DGC problem. Our OSTG adopts a task-oriented point selection strategy to detect where to grasp, and a task-oriented grasp generation module to decide how to grasp given a specific task. To evaluate the effectiveness of OSTG, extensive experiments are conducted on 6DTG. The results show that our method outperforms various baselines on multiple metrics. Real robot experiments also verify that our OSTG has a better perception of the task-oriented grasp points and 6-DoF grasp poses.
|
|
08:40-08:45, Paper WeAT16.3 | |
QuickGrasp: Lightweight Antipodal Grasp Planning with Point Clouds |
|
Ravie, Navin Sriram | Indian Institute of Technology Madras |
Murugan, Keerthi Vasan | Indian Institute of Technology Madras |
Thondiyath, Asokan | IIT Madras |
Sebastian, Bijo | IIT Madras |
Keywords: Grasping, Manipulation Planning, Perception for Grasping and Manipulation
Abstract: Grasping has been a long-standing challenge in facilitating the final interface between a robot and the environment. As environments and tasks become complicated, the need to embed higher intelligence to infer from the surroundings and act on them has become necessary. Although most methods estimate the grasp pose by treating the problem either via pure sampling-based approaches in the six-degree-of-freedom space or as a learning problem, they usually fail in real-life settings owing to poor generalization across domains. In addition, the time taken to generate the grasp plan and the lack of repeatability, owing to sampling inefficiency and the probabilistic nature of existing grasp planning approaches, severely limit their application in real-world tasks. This paper presents a lightweight analytical approach towards robotic grasp planning, particularly antipodal grasps, with little to no sampling in the six-degree-of-freedom space. The proposed grasp planning algorithm is formulated as an optimization problem towards estimating grasp points on the object surface instead of directly estimating the end-effector pose. To this extent, a soft-region-growing algorithm is presented for effective plane segmentation, even in the case of curved surfaces. An optimization-based quality metric is then used for the evaluation of grasp points to ensure indirect force closure. The proposed grasp framework is compared with the existing state-of-the-art grasp planning approach, Grasp Pose Detection (GPD), as a baseline over multiple simulated objects. The effectiveness of the proposed approach in comparison to GPD is also evaluated in a real-world setting using image and point-cloud data, with the planned grasps being executed using a ROBOTIQ gripper and a UR5 manipulator. The proposed approach shows better performance in terms of a higher probability of force closure with complete repeatability.
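A minimal numerical sketch of the antipodal condition underlying this style of grasp planning (a generic friction-cone test on a candidate contact pair; the paper's actual quality metric and region-growing segmentation are not reproduced here):

    import numpy as np

    def is_antipodal(p1, n1, p2, n2, mu=0.4):
        # n1, n2: inward-pointing unit surface normals at the two contacts.
        # The grasp axis must lie inside both friction cones (half-angle atan(mu)).
        axis = p2 - p1
        axis = axis / np.linalg.norm(axis)
        half_angle = np.arctan(mu)
        ok1 = np.arccos(np.clip(np.dot(n1, axis), -1.0, 1.0)) <= half_angle
        ok2 = np.arccos(np.clip(np.dot(n2, -axis), -1.0, 1.0)) <= half_angle
        return ok1 and ok2

    print(is_antipodal(np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]),
                       np.array([0.05, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0])))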
|
|
08:45-08:50, Paper WeAT16.4 | |
Behavioral Manifolds: Representing the Landscape of Grasp Affordances in Relative Pose Space |
|
Zechmair, Michael | Maastricht University |
Morel, Yannick | Maastricht University |
Keywords: Grasping, Grippers and Other End-Effectors, Manipulation Planning
Abstract: The use of machine learning to investigate grasp affordances has received extensive attention over the past several decades. The existing literature provides a robust basis to build upon, though a number of aspects may be improved. Results commonly work in terms of grasp configuration, with little consideration for the manner in which the grasp may be (re-)produced, from a reachability and trajectory planning perspective. We propose a different perspective on grasp affordance learning, explicitly accounting for grasp synthesis; that is, the manner in which manipulator kinematics are used to allow materialization of grasps. The approach allows the grasp policy space to be explicitly mapped in terms of generated grasp types and associated grasp quality. Results of application to a range of objects illustrate the merit of the method and highlight the manner in which it may promote a greater degree of explainability for otherwise opaque reinforcement processes.
|
|
08:50-08:55, Paper WeAT16.5 | |
NeRF-Based Transparent Object Grasping Enhanced by Shape Priors |
|
Han, Yi | Shenzhen Technology University |
Lin, Zixin | Shenzhen Technology University |
Li, DongJie | Shenzhen Technology University |
Chen, Lvping | Shenzhen Technology University |
Shi, Yongliang | Tsinghua University |
Ma, Gan | Shenzhen Technology University |
Keywords: Grasping
Abstract: Transparent object grasping remains a persistent challenge in robotics, largely due to the difficulty of acquiring precise 3D information. Conventional optical 3D sensors struggle to capture transparent objects, and machine learning methods are often hindered by their reliance on high-quality datasets. Leveraging NeRF’s capability for continuous spatial opacity modeling, our proposed architecture integrates a NeRF-based approach for reconstructing the 3D information of transparent objects. Despite this, certain portions of the reconstructed 3D information may remain incomplete. To address these deficiencies, we introduce a shape-prior-driven completion mechanism, further refined by a geometric pose estimation method we have developed. This allows us to obtain complete and reliable 3D information about transparent objects. Utilizing this refined data, we perform scene-level grasp prediction and deploy the results in real-world robotic systems. Experimental validation demonstrates the efficacy of our architecture, showcasing its capability to reliably capture 3D information of various transparent objects in cluttered scenes and, correspondingly, to achieve high-quality, stable, and executable grasp predictions.
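For reference, the continuous opacity model the abstract alludes to is the standard NeRF volume-rendering formulation (well-known background rather than a contribution of this paper): a ray is rendered by compositing per-sample densities \sigma_i and colors c_i over segment lengths \delta_i,

    \hat{C}(\mathbf{r}) = \sum_i T_i \,\bigl(1 - e^{-\sigma_i \delta_i}\bigr)\, \mathbf{c}_i, \qquad T_i = \exp\Bigl(-\sum_{j<i} \sigma_j \delta_j\Bigr),

and it is this per-point density field that supplies a usable geometric signal for transparent surfaces where conventional depth sensors fail.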
|
|
08:55-09:00, Paper WeAT16.6 | |
Center Direction Network for Grasping Point Localization on Cloths |
|
Tabernik, Domen | University of Ljubljana |
Muhovič, Jon | Faculty of Electrical Engineering, University of Ljubljana |
Urbas, Matej | University of Ljubljana, Faculty of Computer and Information Sci |
Skocaj, Danijel | University of Ljubljana |
Keywords: Deep Learning for Visual Perception, Data Sets for Robotic Vision, RGB-D Perception
Abstract: Object grasping is a fundamental challenge in robotics and computer vision, critical for advancing robotic manipulation capabilities. Deformable objects, like fabrics and cloths, pose additional challenges due to their non-rigid nature. In this work, we introduce CeDiRNet-3DoF, a deep-learning model for grasp point detection, with a particular focus on cloth objects. CeDiRNet-3DoF employs center direction regression alongside a localization network, attaining first place in the perception task of ICRA 2023's Cloth Manipulation Challenge. Recognizing the lack of standardized benchmarks in the literature that hinder effective method comparison, we present the ViCoS Towel Dataset. This extensive benchmark dataset comprises 8,000 real and 12,000 synthetic images, serving as a robust resource for training and evaluating contemporary data-driven deep-learning approaches. Extensive evaluation revealed CeDiRNet-3DoF's robustness in real-world performance, outperforming state-of-the-art methods, including the latest transformer-based models. Our work bridges a crucial gap, offering a robust solution and benchmark for cloth grasping in computer vision and robotics.
|
|
WeAT17 |
405 |
Localization 3 |
Regular Session |
Chair: Joerger, Mathieu | Virginia Tech |
Co-Chair: Halperin, Dan | Tel Aviv University |
|
08:30-08:35, Paper WeAT17.1 | |
How Safe Is Particle Filtering-Based Localization for Mobile Robots? An Integrity Monitoring Approach |
|
Abdul Hafez, Osama | American University of Madaba |
Joerger, Mathieu | Virginia Tech |
Spenko, Matthew | Illinois Institute of Technology |
Keywords: Localization, Probability and Statistical Methods, Robot Safety, Autonomous Vehicle Navigation
Abstract: Deriving safe bounds on particle filter estimates is a research problem that, if solved, could greatly benefit robots in life-critical applications, a field that is facing increasing interest as more robots are being deployed near humans. In response, this paper introduces a new fault detector and derives a performance measure for the particle filter: integrity risk. Integrity risk is defined as the probability of having large estimate errors without triggering an alarm, all while considering measurement faults, i.e., unknown deterministic errors that cannot be modeled via normal white noise. In this work, the faults come in the form of incorrectly associated features when using the local nearest neighbors. Simulations and experiments assess the efficiency of the introduced safety metric. The results show that safety improves as map density increases, as long as the number of particles is sufficient to shape the error distribution and the landmarks are well separated. Also, the results indicate that, when landmarks are poorly separated, the particle filter is safer than the Kalman filter, whereas, when landmarks are well separated, the particle filter is often, but not always, safer than the Kalman filter.
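In its generic navigation-integrity form (the paper's precise definition may differ in its conditioning on fault hypotheses), the integrity risk discussed above is the joint probability of a hazardously large estimate error and a missed detection,

    IR = P\bigl( \lVert \hat{x} - x \rVert > \ell \ \wedge\ q \le T \bigr),

where \ell is the alert limit on the estimate error and q \le T means the fault detector's test statistic stays below its alarm threshold.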
|
|
08:35-08:40, Paper WeAT17.2 | |
Lighthouse Localization of Miniature Wireless Robots |
|
Alvarado-Marin, Said | INRIA |
Huidobro-Marin, Cristobal | INRIA |
Balbi, Martina | INRIA |
Savic, Trifun | INRIA |
Watteyne, Thomas | Inria |
Maksimovic, Filip | INRIA |
Keywords: Localization, Multi-Robot Systems, Wheeled Robots
Abstract: In this paper, we apply lighthouse localization, originally designed for virtual reality motion tracking, to positioning and localization of indoor robots. We first present a lighthouse decoding and tracking algorithm on a low-power wireless microcontroller with hardware implemented in a cm-scale form factor. One-time scene solving is performed on a computer using a variety of standard computer vision techniques. Three different robotic localization scenarios are analyzed in this work. The first is a planar scene with a single lighthouse with a four-point pre-calibration. The second is a planar scene with two lighthouses that self-calibrates with either multiple robots in the experiment or a single robot in motion. The third extends to a 3D scene with two lighthouses and a self-calibration algorithm. The absolute accuracy, measured against a camera-based tracking system, was found to be 7.25 mm RMS for the 2D case and 11.2 mm RMS for the 3D case, respectively. This demonstrates the viability of lighthouse tracking both for small-scale robotics and as an inexpensive and compact alternative to camera-based setups.
|
|
08:40-08:45, Paper WeAT17.3 | |
EVLoc: Event-Based Visual Localization in LiDAR Maps Via Event-Depth Registration |
|
Chen, Kuangyi | Graz University of Technology |
Zhang, Jun | Graz University of Technology |
Fraundorfer, Friedrich | Graz University of Technology |
Keywords: Localization, Deep Learning for Visual Perception
Abstract: Event cameras are bioinspired sensors with some notable features, including high dynamic range and low latency, which makes them exceptionally suitable for perception in challenging scenarios such as high-speed motion and extreme lighting conditions. In this paper, we explore their potential for localization within pre-existing LiDAR maps, a critical task for applications that require precise navigation and mobile manipulation. Our framework follows a paradigm based on the refinement of an initial pose. Specifically, we first project LiDAR points into 2D space based on a rough initial pose to obtain depth maps, and then employ an optical flow estimation network to align events with LiDAR points in 2D space, followed by camera pose estimation using a PnP solver. To enhance geometric consistency between these two inherently different modalities, we develop a novel frame-based event representation that improves structural clarity. Additionally, given the varying degrees of bias observed in the ground truth poses, we design a module that predicts an auxiliary variable as a regularization term to mitigate the impact of this bias on network convergence. Experimental results on several public datasets demonstrate the effectiveness of our proposed method. To facilitate future research, both the code and the pre-trained models are made available online.
|
|
08:45-08:50, Paper WeAT17.4 | |
MambaGlue: Fast and Robust Local Feature Matching with Mamba |
|
Ryoo, Kihwan | Korea Advanced Institute of Science and Technology |
Lim, Hyungtae | Massachusetts Institute of Technology |
Myung, Hyun | KAIST (Korea Advanced Institute of Science and Technology) |
Keywords: Localization, Deep Learning for Visual Perception, Recognition
Abstract: In recent years, robust matching methods using deep learning-based approaches have been actively studied and improved in computer vision tasks. However, there remains a persistent demand for both robust and fast matching techniques. To address this, we propose a novel Mamba-based local feature matching approach, called MambaGlue, where Mamba is an emerging state-of-the-art architecture rapidly gaining recognition for its superior speed in both training and inference, and promising performance compared with Transformer architectures. In particular, we propose two modules: a) MambaAttention mixer to simultaneously and selectively understand the local and global context through the Mamba-based self-attention structure and b) deep confidence score regressor, which is a multi-layer perceptron (MLP)-based architecture that evaluates a score indicating how confidently matching predictions correspond to the ground-truth correspondences. Consequently, our MambaGlue achieves a balance between robustness and efficiency in real-world applications. As verified on various public datasets, we demonstrate that our MambaGlue yields a substantial performance improvement over baseline approaches while maintaining fast inference speed. Our code will be available on https://github.com/url-kaist/MambaGlue.
|
|
08:50-08:55, Paper WeAT17.5 | |
ULOC: Learning to Localize in Complex Large-Scale Environments with Ultra-Wideband Ranges |
|
Nguyen, Thien-Minh | Nanyang Technological University |
Yang, Yizhuo | Nanyang Technological University |
Nguyen, Tien-Dat | Ho Chi Minh City University of Technology (HCMUT), VNU-HCM |
Yuan, Shenghai | Nanyang Technological University |
Xie, Lihua | Nanyang Technological University |
Keywords: Localization, Range Sensing, Autonomous Vehicle Navigation
Abstract: While UWB-based methods can achieve high localization accuracy in small-scale areas, their accuracy and reliability are significantly challenged in large-scale environments. In this paper, we propose a learning-based framework for Ultra-Wideband (UWB) based localization in such complex large-scale environments, named ULOC. First, anchors are deployed in the environment without knowledge of their actual position. Then, UWB observations are collected when the vehicle travels in the environment. At the same time, map-consistent pose estimates are developed from registering (onboard self-localization) data with the prior map to provide the training labels. We then propose a recurrent neural network (RNN) based on MAMBA that learns the ranging patterns of UWBs over a complex large-scale environment. The experiment demonstrates that our solution can ensure high localization accuracy on a large scale compared to the state-of-the-art. We release our source code to benefit the community at https://github.com/brytsknguyen/uloc.
|
|
08:55-09:00, Paper WeAT17.6 | |
Indoor Localization of UAVs Using Only Few Measurements by Output-Sensitive Preimage Intersection |
|
Bilevich, Michael M. | Tel Aviv University |
Buber, Tomer | Tel Aviv University |
Halperin, Dan | Tel Aviv University |
Keywords: Localization
Abstract: We present a deterministic approach for the localization of an Unmanned Aerial Vehicle (UAV) in a known indoor environment by using only a few downward distance measurements and the corresponding odometries between measurements. For each distance measurement and odometry, we look at the preimage of that distance measurement under the downward distance function combined with the corresponding odometry, where the motion between every two measurements has four degrees of freedom: three of translation and one of azimuth change. The intersection of these preimages yields the set of all possible locations for the UAV. In this work, we present an efficient method for approximating that intersection of preimages. We perform a spatial subdivision search, which splits only voxels containing that intersection. We present a novel technique, based on geometric insights, for correctly evaluating whether a voxel indeed contains a true localization. This technique is also robust under different kinds of errors that might occur. Our method is guaranteed to return a set that contains the ground-truth location, and its runtime complexity is output-sensitive in the Hausdorff dimension and measure of the resulting intersection of preimages. We demonstrate the effectiveness of this method in various indoor scenarios, showing that it can be used to significantly decrease the uncertainty of localization when solving the kidnapped robot problem in simulation and on a physical drone. Our method can be performed in real time. Furthermore, our method requires only a map of the environment, odometry, and ToF sensors, which is advantageous in terms of cost, privacy, and transmission bandwidth. Our open-source software and supplementary materials are available at https://github.com/TAU-CGL/uav-fdml-public.
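A much-simplified sketch of the preimage-intersection idea (translation-only, a uniform grid instead of the paper's output-sensitive spatial subdivision, and a made-up toy height map and tolerance):

    import numpy as np

    def floor_height(x, y):
        # Toy indoor map: a 0.5 m-tall box occupying x, y in [2, 3] x [2, 3].
        return 0.5 if (2.0 <= x <= 3.0 and 2.0 <= y <= 3.0) else 0.0

    def consistent_cells(measurements, grid, tol=0.05):
        # measurements: list of ((dx, dy, dz) odometry offset, downward range d).
        # Keep candidate start poses whose implied downward range matches
        # every measurement, i.e. cells lying in all preimages.
        keep = []
        for (x, y, z) in grid:
            ok = all(abs((z + dz) - floor_height(x + dx, y + dy) - d) < tol
                     for (dx, dy, dz), d in measurements)
            if ok:
                keep.append((x, y, z))
        return keep

    grid = [(x, y, z) for x in np.arange(0.0, 5.0, 0.25)
                      for y in np.arange(0.0, 5.0, 0.25)
                      for z in np.arange(0.5, 2.5, 0.25)]
    meas = [((0.0, 0.0, 0.0), 1.0), ((1.0, 0.0, 0.0), 1.0), ((2.5, 2.5, 0.0), 0.5)]
    print(len(consistent_cells(meas, grid)))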
|
|
WeAT18 |
406 |
Software Tools 1 |
Regular Session |
|
08:30-08:35, Paper WeAT18.1 | |
Motion Comparator: Visual Comparison of Robot Motions |
|
Wang, Yeping | University of Wisconsin-Madison |
Peseckis, Alexander | University of Wisconsin -- Madison |
Jiang, Zelong | University of Wisconsin-Madison |
Gleicher, Michael | University of Wisconsin - Madison |
Keywords: Software Tools for Robot Programming, Software Tools for Benchmarking and Reproducibility
Abstract: Roboticists compare robot motions for tasks such as parameter tuning, troubleshooting, and deciding between possible motions. However, most existing visualization tools are designed for individual motions and lack the features necessary to facilitate robot motion comparison. In this paper, we follow a rigorous design process to create Motion Comparator, a web-based tool that facilitates the comprehension, comparison, and communication of robot motions. Our design process identified roboticists' needs, articulated design challenges, and provided corresponding strategies. Motion Comparator includes several key features such as multi-view coordination, quaternion visualization, time warping, and comparative designs. To demonstrate the applications of Motion Comparator, we discuss four case studies in which our tool is used for motion selection, troubleshooting, parameter tuning, and motion review.
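The time-warping feature mentioned above is, in spirit, the classic dynamic time warping alignment between two trajectories of different lengths; a minimal sketch of that underlying computation (illustrative, and not necessarily the tool's exact implementation):

    import numpy as np

    def dtw_distance(traj_a, traj_b):
        # traj_a: (N, D), traj_b: (M, D) arrays of joint configurations.
        n, m = len(traj_a), len(traj_b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(traj_a[i - 1] - traj_b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                     cost[i - 1, j - 1])
        return cost[n, m]

    a = np.random.rand(50, 7)   # two 7-DoF motions of different durations
    b = np.random.rand(80, 7)
    print(dtw_distance(a, b))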
|
|
08:35-08:40, Paper WeAT18.2 | |
Text2Robot: Evolutionary Robot Design from Text Descriptions |
|
Chen, Boyuan | Duke University |
Charlick, Zachary Samuel | Duke University |
Ringel, Ryan | Duke University |
Liu, Jiaxun | Duke University |
Xia, Boxi | Duke University |
Keywords: Methods and Tools for Robot System Design, Evolutionary Robotics
Abstract: Robot design has traditionally been costly and labor-intensive. Despite advancements in automated processes, it remains challenging to navigate a vast design space while producing physically manufacturable robots. We introduce Text2Robot, a framework that converts user text specifications and performance preferences into physical quadrupedal robots. Within minutes, Text2Robot can use text-to-3D models to provide strong initializations of diverse morphologies. Within a day, our geometric processing algorithms and body-control co-optimization produce a walking robot by explicitly considering real-world electronics and manufacturability. Text2Robot enables rapid prototyping and opens new opportunities for robot design with generative models.
|
|
08:40-08:45, Paper WeAT18.3 | |
QueryCAD: Grounded Question Answering for CAD Models |
|
Kienle, Claudius | ArtiMinds Robotics GmbH |
Alt, Benjamin | ArtiMinds Robotics |
Katic, Darko | HFT STUTTGART |
Jäkel, Rainer | Karlsruhe Institute of Technology |
Peters, Jan | Technische Universität Darmstadt |
Keywords: Deep Learning Methods, Engineering for Robotic Systems, Software Tools for Robot Programming
Abstract: CAD models are widely used in industry and are essential for robotic automation processes. However, these models are rarely considered in novel AI-based approaches, such as the automatic synthesis of robot programs, as there are no readily available methods that would allow CAD models to be incorporated for the analysis, interpretation, or extraction of information. To address these limitations, we propose QueryCAD, the first system designed for CAD question answering, enabling the extraction of precise information from CAD models using natural language queries. QueryCAD incorporates SegCAD, an open-vocabulary instance segmentation model we developed to identify and select specific parts of the CAD model based on part descriptions. We further propose a CAD question answering benchmark to evaluate QueryCAD and establish a foundation for future research. Lastly, we integrate QueryCAD within an automatic robot program synthesis framework, validating its ability to enhance deep-learning solutions for robotics by enabling them to process CAD models.
|
|
08:45-08:50, Paper WeAT18.4 | |
HeRo: A State Machine-Based, Fault-Tolerant Framework for Heterogeneous Multi-Robot Collaboration |
|
Tang, Ruijie | Institute of Software, Chinese Academy of Sciences |
Wu, Guoquan | Institute of Software, Chinese Academy of Sciences |
Wang, Tao | Institute of Software, Chinese Academy of Sciences |
Chen, Wei | Institute of Software, Chinese Academy of Sciences |
Wei, Jun | Institute of Software, Chinese Academy of Sciences |
Keywords: Software Tools for Robot Programming, Software, Middleware and Programming Environments, Multi-Robot Systems
Abstract: Heterogeneous robots can work together to accomplish a variety of complex tasks and have shown great potential in many fields. There are many efforts to make robot task orchestration more efficient. However, current methods still have some limitations, including the lack of a high-level programming abstraction and of a fault-handling mechanism. In this paper, we design a state machine-based, fault-tolerant framework for heterogeneous multi-robot collaboration named HeRo, to effectively support the development of heterogeneous multi-robot systems. HeRo has three key techniques: (1) a state machine-based programming language to flexibly model robot behaviors and tasks; (2) a state synchronization mechanism to achieve information exchange and maintain consistency among heterogeneous robots in distributed environments; (3) a fault detection and recovery mechanism to monitor the system's runtime states and use a Large Language Model (LLM) combined with the Planning Domain Definition Language (PDDL) to enable automated recovery. We evaluate the effectiveness and fault recovery capability of the framework by setting up manufacturing tasks and fault scenarios of varying difficulty in the ARIAC simulation environment, achieving a 100% task completion rate, with low system overhead and flexible scalability.
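An illustrative fragment of what a state machine-based task description with a fault-recovery state might look like; this is a hypothetical sketch, not HeRo's actual programming language, but it conveys the programming model described above:

    # States and transitions for one robot in a pick-and-place cell.
    task = {
        "initial": "IDLE",
        "states": ["IDLE", "PICK", "PLACE", "FAULT_RECOVERY"],
        "transitions": [
            {"from": "IDLE",  "to": "PICK",  "on": "part_available"},
            {"from": "PICK",  "to": "PLACE", "on": "grasp_success"},
            {"from": "PLACE", "to": "IDLE",  "on": "place_done"},
            # Any state may fall back to recovery when a fault is detected.
            {"from": "*", "to": "FAULT_RECOVERY", "on": "fault_detected"},
            {"from": "FAULT_RECOVERY", "to": "IDLE", "on": "recovery_done"},
        ],
    }

    def next_state(current, event, machine=task):
        for t in machine["transitions"]:
            if t["on"] == event and t["from"] in (current, "*"):
                return t["to"]
        return current  # no matching transition: stay in the current state

    print(next_state("PICK", "fault_detected"))  # -> FAULT_RECOVERY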
|
|
08:50-08:55, Paper WeAT18.5 | |
A Kinematics Optimization Framework with Improved Computational Efficiency for Task-Based Optimum Design of Serial Manipulators in Cluttered Environments |
|
Petkov, Nikola | United Kingdom Atomic Energy Authority |
Tokatli, Ozan | UKAEA |
Zhang, Kaiqiang | UK Atomic Energy Authority |
Wu, Huapeng | Lappeenranta University of Technology |
Skilton, Robert Mark | UK Atomic Energy Authority |
Keywords: Methods and Tools for Robot System Design, Engineering for Robotic Systems, Optimization and Optimal Control
Abstract: It is challenging to find optimum kinematic designs for non-standard robotic manipulators, e.g., medical, nuclear, and space manipulators, which must adapt to arbitrarily complex tasks under constraints. Such design optimization can be modelled as a multi-dimensional non-convex optimization problem with nonlinear constraints. However, it is non-trivial to ensure the essential reachability condition, i.e., the existence of continuous trajectories between demanded positions for serial articulated manipulators, given complex spatial constraints, like obstacles and boundaries. Traditional solutions integrate standard motion planning or inverse kinematics algorithms within a kinematic-design optimization process, resulting in significant demand for time and computing resources. To accelerate design optimization and improve efficiency, we design a novel robust design framework built on a new kinematic design synthesis, which allows for simultaneously optimizing the dimensions and topology of a serial manipulator's kinematics for arbitrary tasks in constrained environments, using a generalised parametric kinematic model. Significantly, in contrast to standard solutions, we develop a novel computationally effective reachability verification method, which rapidly aborts infeasible motions by exploiting efficient collision checks, based on the Rapidly-exploring Random Tree (RRT) algorithm. The effectiveness of the proposed design framework is verified and evaluated by comparing it to baseline benchmarks. Results demonstrate that the novel design framework can accelerate kinematic design optimization by an order of magnitude compared to the current state of the art, and can simultaneously optimise the link dimensions and joint types of serial robots for cluttered environments.
|
|
08:55-09:00, Paper WeAT18.6 | |
A Survey on Small-Scale Testbeds for Connected and Automated Vehicles and Robot Swarms |
|
Mokhtarian, Armin | RWTH Aachen University |
Xu, Jianye | Chair of Embedded Software (Informatik 11), RWTH Aachen Universi |
Scheffe, Patrick | RWTH Aachen University |
Kloock, Maximilian | RWTH Aachen University |
Schäfer, Simon | RWTH Aachen University |
Bang, Heeseung | University of Delaware |
Le, Viet-Anh | University of Delaware |
Ulhas, Sangeet | Arizona State University |
Betz, Johannes | Technical University of Munich |
Wilson, Sean | Georgia Institute of Technology, Georgia Tech Research Institute |
Berman, Spring | Arizona State University |
Paull, Liam | Université De Montréal |
Prorok, Amanda | University of Cambridge |
Alrifaee, Bassam | University of the Bundeswehr Munich |
Keywords: Embedded Systems for Robotic and Automation, Engineering for Robotic Systems, Methods and Tools for Robot System Design
Abstract: Connected and automated vehicles and robot swarms hold transformative potential for enhancing safety, efficiency, and sustainability in the transportation and manufacturing sectors. Extensive testing and validation of these technologies are crucial for their deployment in the real world. While simulations are essential for initial testing, they often have limitations in capturing the complex dynamics of real-world interactions. This limitation underscores the importance of small-scale testbeds. These testbeds provide a realistic, cost-effective, and controlled environment for testing and validating algorithms, acting as an essential intermediary between simulation and full-scale experiments. This work serves to facilitate researchers' efforts in identifying existing small-scale testbeds suitable for their experiments and provide insights for those who want to build their own. In addition, it delivers a comprehensive survey of the current landscape of these testbeds. We derive 62 characteristics of testbeds based on the well-known sense-plan-act paradigm and offer an online table comparing 23 small-scale testbeds based on these characteristics. The online table is hosted on our designated public webpage https://bassamlab.github.io/testbeds-survey, and we invite testbed creators and developers to contribute to it. We closely examine nine testbeds in this paper, demonstrating how the derived characteristics can be used to present testbeds. Furthermore, we discuss three ongoing challenges.
|
|
WeAT19 |
407 |
Tactile Sensing 2 |
Regular Session |
Chair: Spiers, Adam | Imperial College London |
|
08:30-08:35, Paper WeAT19.1 | |
ACROSS: A Deformation-Based Cross-Modal Representation for Robotic Tactile Perception |
|
Zai El Amri, Wadhah | L3S Research Center |
Kuhlmann, Malte Fabian | L3S Research Center |
Navarro-Guerrero, Nicolás | Leibniz Universität Hannover |
Keywords: Transfer Learning, Force and Tactile Sensing, Representation Learning
Abstract: Tactile perception is essential for human interaction with the environment and is becoming increasingly crucial in robotics. Tactile sensors like the BioTac mimic human fingertips and provide detailed interaction data. Despite its utility in applications like slip detection and object identification, this sensor is now deprecated, making many valuable datasets obsolete. However, recreating similar datasets with newer sensor technologies is both tedious and time-consuming. Therefore, adapting these existing datasets for use with new setups and modalities is crucial. In response, we introduce ACROSS, a novel framework for translating data between tactile sensors by exploiting sensor deformation information. We demonstrate the approach by translating BioTac signals into the DIGIT sensor. Our framework consists of first converting the input signals into 3D deformation meshes. We then transition from the 3D deformation mesh of one sensor to the mesh of another, and finally convert the generated 3D deformation mesh into the corresponding output space. We demonstrate our approach to the most challenging problem of going from a low-dimensional tactile representation to a high-dimensional one. In particular, we transfer the tactile signals of a BioTac sensor to DIGIT tactile images. Our approach enables the continued use of valuable datasets and data exchange between groups with different setups.
|
|
08:35-08:40, Paper WeAT19.2 | |
Learning to Double Guess: An Active Perception Approach for Estimating the Center of Mass of Arbitrary Object |
|
Jin, Shengmiao | University of Illinois Urbana-Champaign |
Mo, Yuchen | University of Illinois, Urbana-Champaign |
Yuan, Wenzhen | University of Illinois |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Perception-Action Coupling
Abstract: Manipulating arbitrary objects in unstructured environments is a significant challenge in robotics, primarily due to difficulties in determining an object's center of mass. This paper introduces U-GRAPH: Uncertainty-Guided Rotational Active Perception with Haptics, a novel framework to enhance the center of mass estimation using active perception. Traditional methods often rely on singular interactions and are limited by the inherent inaccuracies of Force-Torque (F/T) sensors. Our approach circumvents these limitations by integrating a Bayesian Neural Network (BNN) to quantify uncertainty and guide the robotic system through multiple, information-rich interactions via grid search and ActiveNet. We demonstrate the remarkable generalizability and transferability of our method: trained on a small dataset with limited variation, it still performs well on unseen complex real-world objects.
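The sketch below illustrates, under stated assumptions, how an uncertainty estimate can guide the choice of the next interaction: a dropout network stands in for the Bayesian Neural Network mentioned above (Monte-Carlo dropout is a common approximation, not the paper's exact model), and the candidate rotation with the largest predictive variance is selected. The class name, function names, and input dimensions are hypothetical.

import torch
import torch.nn as nn

class CoMNet(nn.Module):
    """Maps a wrench reading plus a candidate rotation angle to a CoM estimate."""
    def __init__(self, in_dim=7, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, 3),          # predicted center of mass (x, y, z)
        )

    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def choose_next_rotation(model, wrench, candidate_angles, n_samples=32):
    """Monte-Carlo dropout: keep dropout active at inference time and pick the
    candidate angle whose prediction is most uncertain (largest total variance)."""
    model.train()                          # keep dropout stochastic
    best_angle, best_score = None, -1.0
    for angle in candidate_angles:
        x = torch.cat([wrench, torch.tensor([angle])]).unsqueeze(0)
        samples = torch.stack([model(x) for _ in range(n_samples)])  # (S, 1, 3)
        score = samples.var(dim=0).sum().item()
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle, best_score

if __name__ == "__main__":
    model = CoMNet()
    wrench = torch.randn(6)                # fx, fy, fz, tx, ty, tz from the F/T sensor
    angles = [0.0, 0.5, 1.0, 1.5]          # candidate rotations in radians
    print(choose_next_rotation(model, wrench, angles))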
|
|
08:40-08:45, Paper WeAT19.3 | |
Learning In-Hand Translation Using Tactile Skin with Shear and Normal Force Sensing |
|
Yin, Jessica | University of Pennsylvania |
Qi, Haozhi | UC Berkeley |
Malik, Jitendra | UC Berkeley |
Pikul, James | University of Pennsylvania |
Yim, Mark | University of Pennsylvania |
Hellebrekers, Tess | Meta AI Research |
Keywords: Force and Tactile Sensing, In-Hand Manipulation, Reinforcement Learning
Abstract: Recent progress in reinforcement learning (RL) and tactile sensing has significantly advanced dexterous manipulation. However, these methods often utilize simplified tactile signals due to the gap between tactile simulation and the real world. We introduce a sensor model for tactile skin that enables zero-shot sim-to-real transfer of ternary shear and binary normal forces. Using this model, we develop an RL policy that leverages sliding contact for dexterous in-hand translation. We conduct extensive real-world experiments to assess how tactile sensing facilitates policy adaptation to various unseen object properties and robot hand orientations. We demonstrate that our 3-axis tactile policies consistently outperform baselines that use only shear forces, only normal forces, or only proprioception. Videos and details available on the project website.
|
|
08:45-08:50, Paper WeAT19.4 | |
Contrastive Touch-To-Touch Pretraining |
|
Rodriguez, Samanta | University of Michigan - Ann Arbor |
Dou, Yiming | University of Michigan |
van den Bogert, William | University of Michigan |
Oller, Miquel | University of Michigan |
So, Kevin | University of Michigan |
Owens, Andrew | University of Michigan |
Fazeli, Nima | University of Michigan |
Keywords: Representation Learning, Force and Tactile Sensing, Deep Learning in Grasping and Manipulation
Abstract: Tactile sensors differ greatly in design, making it challenging to develop general-purpose methods for processing tactile feedback. In this paper, we introduce a contrastive self-supervised learning approach that represents tactile feedback across different sensor types. Our method utilizes paired tactile data—where two distinct sensors, in our case Soft Bubbles and GelSlims, grasp the same object in the same configuration—to learn a unified latent representation. Unlike current approaches that focus on reconstruction or task-specific supervision, our method employs contrastive learning to create a latent space that captures shared information between sensors. By treating paired tactile signals as positives and unpaired signals as negatives, we show that our model effectively learns a rich, sensor-agnostic representation. Despite significant differences between Soft Bubble and GelSlim sensors, the learned representation enables strong downstream task performance, including zero-shot and few-shot classification and pose estimation. This work provides a scalable solution for integrating tactile data across diverse sensor modalities, advancing the development of generalizable tactile representations.
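A minimal sketch of the paired contrastive idea, assuming two sensor-specific encoders mapping into a shared space and a symmetric InfoNCE objective in which matched pairs within a batch are positives; this is a generic illustration, not the authors' training code, and the encoder sizes and names are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(in_dim, out_dim=128):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE: matched rows of z_a and z_b are positives,
    all other rows in the batch are negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    bubble_enc, gelslim_enc = make_encoder(512), make_encoder(1024)
    opt = torch.optim.Adam(list(bubble_enc.parameters()) +
                           list(gelslim_enc.parameters()), lr=1e-4)
    bubble = torch.randn(32, 512)                  # flattened readings, sensor A
    gelslim = torch.randn(32, 1024)                # paired readings, sensor B
    loss = info_nce(bubble_enc(bubble), gelslim_enc(gelslim))
    loss.backward()
    opt.step()
    print(float(loss))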
|
|
08:50-08:55, Paper WeAT19.5 | |
ViTract: Robust Object Shape Perception Via Active Visuo-Tactile Interaction |
|
Dutta, Anirvan | BMW Group and Imperial College London |
Burdet, Etienne | Imperial College London |
Kaboli, Mohsen | Eindhoven University of Technology ( TU/e) & BMW Group Research |
Keywords: Perception for Grasping and Manipulation, Force and Tactile Sensing
Abstract: An essential problem in robotic systems that are to be deployed in unstructured environments is the accurate and autonomous perception of the shapes of previously unseen objects. Existing methods for shape estimation or reconstruction have leveraged either visual or tactile interactive exploration techniques, or have relied on comprehensive visual or tactile information acquired in an offline manner. In this work, a novel visuo-tactile interactive perception framework - ViTract is introduced for shape estimation of unseen objects. Our framework estimates the shape of diverse objects robustly using low-dimensional, efficient, and generalizable shape primitives, which are superquadrics. The probabilistic formulation within our framework takes advantage of the complementary information provided by vision and tactile observations while accounting for associated noise. As part of our framework, we propose a novel modality-specific information gain to select the most informative and reliable exploratory action (using vision/tactile) to obtain iterative visuo/tactile information. Our real-robot experiments demonstrate superior and robust performance compared to state-of-the-art visuo-tactile-based shape estimation techniques.
|
|
08:55-09:00, Paper WeAT19.6 | |
Location and Orientation Super-Resolution Sensing with a Cost-Efficient and Repairable Barometric Tactile Sensor |
|
Hou, Jian | Imperial College London |
Zhou, Xin | Imperial College London |
Spiers, Adam | Imperial College London |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Grippers and Other End-Effectors, Barometric Sensing
Abstract: The adoption of tactile sensors in robotics is hindered by their high cost and fragility. We designed and validated a cost-effective and robust barometric tactile sensor array, whose material cost is below 80 USD. Unlike past work, we do not mold the rubber surface over the barometers but instead keep it as a separate element, leading to a design that is easy to fabricate and repair. Machine learning techniques are applied to enhance the sensor’s localization precision, increasing the effective resolution from 6 mm (the distance between adjacent barometers) to 0.284 mm. To investigate the localization model’s robustness, we utilized an E-TRoll robotic gripper to roll differently shaped prismatic objects across the sensing surface mounted on one finger. Under these uncontrolled settings, we achieved a satisfactory real-time localization resolution of within 2.68 mm. Furthermore, we demonstrate a novel practical application: the E-TRoll mimics a 1-DoF parallel gripper while inferring a cube’s orientation relative to the sensor. The range of orientations is split into 4 classes, which a trained CNN-LSTM model can predict with an 86.91% five-fold cross-validated accuracy.
|
|
WeAT20 |
408 |
Human Motion Sensing |
Regular Session |
Chair: Youcef-Toumi, Kamal | Massachusetts Institute of Technology |
Co-Chair: Cao, Muqing | Carnegie Mellon University |
|
08:30-08:35, Paper WeAT20.1 | |
Person Re-Identification for Robot Person Following with Online Continual Learning |
|
Ye, Hanjing | Southern University of Science and Technology |
Zhao, Jieting | Southern University of Science and Technology |
Zhan, Yu | Southern University of Science and Technology |
Chen, Weinan | Guangdong University of Technology |
He, Li | Southern University of Science and Technology |
Zhang, Hong | Southern University of Science and Technology |
Keywords: Human-Centered Automation, Computer Vision for Automation, Continual Learning
Abstract: Robot person following (RPF) is a crucial capability in human-robot interaction (HRI) applications, allowing a robot to persistently follow a designated person. In practical RPF scenarios, the person can often be occluded by other objects or people. Consequently, it is necessary to re-identify the person when he/she reappears within the robot's field of view. Previous person re-identification (ReID) approaches to person following rely on a fixed feature extractor. Such an approach often fails to generalize to different viewpoints and lighting conditions in practical RPF environments. In other words, it suffers from the so-called domain shift problem where it cannot re-identify the person when his re-appearance is out of the domain modeled by the fixed feature extractor. To mitigate this problem, we propose a ReID framework for RPF where we use a feature extractor that is optimized online with both short-term and long-term experiences (i.e., recently and previously observed samples during RPF) using the online continual learning (OCL) framework. The long-term experiences are maintained by a memory manager to enable OCL to update the feature extractor. Our experiments demonstrate that even in the presence of severe appearance changes and distractions from visually similar people, the proposed method can still re-identify the person more accurately than the state-of-the-art methods.
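A common way to maintain the kind of bounded long-term experience buffer described above is reservoir sampling; the sketch below is a generic illustration of such a memory combined with short-term replay for an online update, not the paper's memory manager. Class and method names are illustrative assumptions.

import random

class ReservoirMemory:
    def __init__(self, capacity=512, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        """Standard reservoir sampling: every sample ever seen has equal
        probability of residing in the bounded buffer."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(sample)
        else:
            j = self.rng.randint(0, self.seen - 1)
            if j < self.capacity:
                self.buffer[j] = sample

    def replay(self, k):
        k = min(k, len(self.buffer))
        return self.rng.sample(self.buffer, k)

if __name__ == "__main__":
    memory = ReservoirMemory(capacity=4)
    for t in range(20):
        sample = {"frame": t, "label": "target" if t % 3 == 0 else "distractor"}
        memory.add(sample)                       # long-term experience
        recent_batch = [sample]                  # short-term experience
        train_batch = recent_batch + memory.replay(2)
        # feature_extractor.update(train_batch)  # placeholder for the online update step
    print([s["frame"] for s in memory.buffer])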
|
|
08:35-08:40, Paper WeAT20.2 | |
HelmetPoser: A Helmet-Mounted IMU Dataset for Data-Driven Estimation of Human Head Motion in Diverse Conditions |
|
Li, Jianping | Nanyang Technological University |
Leng, Qiutong | Nanyang Technological University |
Liu, Jinxin | Nanyang Technological University |
Xu, Xinhang | Nanyang Technological University |
Jin, Tongxing | Nanyang Technological University |
Cao, Muqing | Carnegie Mellon University |
Nguyen, Thien-Minh | Nanyang Technological University |
Yuan, Shenghai | Nanyang Technological University |
Cao, Kun | Nanyang Technological University |
Xie, Lihua | Nanyang Technological University |
Keywords: Datasets for Human Motion, Wearable Robotics, SLAM
Abstract: Helmet-mounted wearable positioning systems are crucial for enhancing safety and facilitating coordination in industrial, construction, and emergency rescue environments. These systems, including LiDAR-Inertial Odometry (LIO) and Visual-Inertial Odometry (VIO), often face challenges in localization due to adverse environmental conditions such as dust, smoke, and limited visual features. To address these limitations, we propose a novel head-mounted Inertial Measurement Unit (IMU) dataset with ground truth, aimed at advancing data-driven IMU pose estimation. Our dataset captures human head motion patterns using a helmet-mounted system, with data from ten participants performing various activities. We explore the application of neural networks, specifically Long Short-Term Memory (LSTM) and Transformer networks, to correct IMU biases and improve localization accuracy. Additionally, we evaluate the performance of these methods across different IMU data window dimensions, motion patterns, and sensor types. We release a publicly available dataset, demonstrate the feasibility of advanced neural network approaches for helmet-based localization, and provide evaluation metrics to establish a baseline for future studies in this field. Data and code can be found at https://lqiutong.github.io/HelmetPoser.github.io/
|
|
08:40-08:45, Paper WeAT20.3 | |
Relevance-Driven Decision Making for Safer and More Efficient Human Robot Collaboration |
|
Zhang, Xiaotong | Massachusetts Institute of Technology |
Huang, Dingcheng | Massachusetts Institute of Technology |
Youcef-Toumi, Kamal | Massachusetts Institute of Technology |
Keywords: Human-Robot Collaboration, Cognitive Modeling, Collision Avoidance
Abstract: The human brain possesses the ability to effectively focus on important environmental components, which enhances perception, learning, reasoning, and decision-making. Inspired by this cognitive mechanism, we introduce a novel concept termed relevance for Human-Robot Collaboration (HRC). Relevance is a dimensionality reduction process that incorporates a continuously operating perception module, evaluates cue sufficiency within the scene, and applies a flexible formulation and computation framework. In this paper, we present an enhanced two-loop framework that integrates real-time and asynchronous processing to quantify relevance and leverage it for safer and more efficient HRC. The two-loop framework integrates an asynchronous loop, which leverages an LLM’s world knowledge to quantify relevance, and a real-time loop, which performs scene understanding, human intent prediction, and decision-making based on relevance. HRC decision-making is enhanced by a relevance-based task allocation method, as well as a motion generation and collision avoidance approach that incorporates human trajectory prediction. Simulations and experiments show that our methodology for relevance quantification can accurately and robustly predict the human objective and relevance, with an average accuracy of up to 0.90 for objective prediction and up to 0.96 for relevance prediction. Moreover, our motion generation methodology reduces collision cases by 63.76% and collision frames by 44.74% when compared with a state-of-the-art (SOTA) collision avoidance method. Built on relevance, our framework and methodologies guide the robot on how best to assist humans and generate safer and more efficient actions for HRC.
|
|
08:45-08:50, Paper WeAT20.4 | |
Back to the Cartesian: Pilot Study for Assessing Human Stiffness in 3D Cartesian Space by Transforming from Muscle Space in a Peg-In-Hole Scenario for Tele-Impedance |
|
Thuerauf, Sabine | Friedrich-Alexander-University Erlangen-Nuremberg |
Mehrkens, Florian | FAU Erlangen-Nuernberg |
Castellini, Claudio | Friedrich-Alexander-Universität Erlangen-Nürnberg |
Sierotowicz, Marek | Friedrich-Alexander Universität Erlangen Nürnberg |
Keywords: Telerobotics and Teleoperation, Compliance and Impedance Control, Intention Recognition
Abstract: For various teleoperation tasks, position-based control is not practical. Impedance-based control is superior, e.g., for handling fragile objects, as when harvesting fruit or grasping a paper cup. However, only a few researchers have focused on impedance control for teleoperation. In tele-impedance, the stiffness of a human is measured and transferred to a controller of a robot. Until now, human stiffness has mostly been measured either for specific joints or in 2D Cartesian space. We introduce a new way of measuring Cartesian stiffness in 3D using electromyography. Users were asked to perform a peg-in-hole task in three different orientations (0°, 45°, 90°). Meanwhile, electromyography measurements were recorded at shoulder and elbow muscle groups. In a proof-of-concept study, we showed that the measured stiffness matrix in Cartesian space differed significantly across the three differently oriented peg-in-hole scenarios. This demonstrates that human stiffness could be predicted in 3D Cartesian space based on the type of task at hand.
|
|
08:50-08:55, Paper WeAT20.5 | |
Systematic Comparison of Projection Methods for Monocular 3D Human Pose Estimation on Fisheye Images |
|
Käs, Stephanie | RWTH Aachen University |
Peter, Sven | RWTH Aachen University |
Thillmann, Henrik | Chair for Computer Vision, RWTH Aachen University |
Burenko, Anton | RWTH Aachen |
Adrian, David Benjamin | Bosch Corporate Research & Ulm University |
Mack, Dennis | Robert Bosch GmbH |
Linder, Timm | Robert Bosch GmbH |
Leibe, Bastian | RWTH Aachen University |
Keywords: Gesture, Posture and Facial Expressions, Human Detection and Tracking, Omnidirectional Vision
Abstract: Fisheye cameras offer robots the ability to capture human movements across a wider field of view (FOV) than standard pinhole cameras, making them particularly useful for applications in human-robot interaction and automotive contexts. However, accurately detecting human poses in fisheye images is challenging due to the curved distortions inherent to fisheye optics. While various methods for undistorting fisheye images have been proposed, their effectiveness and limitations for poses that cover a wide FOV have not been systematically evaluated in the context of absolute human pose estimation from monocular fisheye images. To address this gap, we evaluate the impact of pinhole, equidistant and double sphere camera models, as well as cylindrical projection methods, on 3D human pose estimation accuracy. We find that in close-up scenarios, pinhole projection is inadequate, and the optimal projection method varies with the FOV covered by the human pose. The usage of advanced fisheye models like the double sphere model significantly enhances 3D human pose estimation accuracy. We propose a heuristic for selecting the appropriate projection model based on the detection bounding box to enhance prediction quality. Additionally, we introduce and evaluate our approach on the novel FISHnCHIPS dataset, which features 3D human skeleton annotations in fisheye images, including images from unconventional angles, such as extreme close-ups, ground-mounted cameras, and wide-FOV poses.
|
|
08:55-09:00, Paper WeAT20.6 | |
HuMAn – the Human Motion Anticipation Algorithm Based on Recurrent Neural Networks |
|
Noppeney, Victor | University of São Paulo |
Escalante, Felix M | São Paulo State University |
Maggi, Lucas | University of Sao Paulo |
Boaventura, Thiago | University of Sao Paulo |
Keywords: Modeling and Simulating Humans, Human and Humanoid Motion Analysis and Synthesis, Intention Recognition
Abstract: Predicting human motion may lead to considerable advantages for human-robot interaction, particularly when precise synchronization between the robot’s motion and the user’s movement is imperative. The inherent stochastic nature of human behavior, combined with the restricted window of response, can give rise to residual and undesirable forces during interactions, potentially harming the user. Therefore, efficient prediction of human joint movements may enhance the performance of various interaction control frameworks used in wearable robots. This paper proposes the HuMAn algorithm for predicting human joint motion based on a recurrent neural network. This algorithm consists of a long-term memory network, used to interpret sequences of poses, and a prediction layer, employed to build the most likely future user poses within a specified time horizon. Network training was performed using datasets encompassing various subjects and types of motion. The results demonstrate the effectiveness of the proposed algorithm, as evidenced by average general prediction errors below 0.1 radians for predictive horizons of up to 500 milliseconds. Furthermore, a mean absolute error of 0.026 radians was achieved for a periodic treadmill walk. Simulation results demonstrate a large improvement in transparency control performance in a case study with an upper limb exoskeleton robot.
|
|
WeAT21 |
410 |
Robot Foundation Models 1 |
Regular Session |
Chair: Li, Hui | Autodesk Research |
Co-Chair: Nguyen, Anh | University of Liverpool |
|
08:30-08:35, Paper WeAT21.1 | |
Robotic-CLIP: Fine-Tuning CLIP on Action Data for Robotic Applications |
|
Nguyen, Nghia | FPT Software Company Limited |
Vu, Minh Nhat | TU Wien, Austria |
Ta, Tung D. | The University of Tokyo |
Huang, Baoru | Imperial College London |
Vo, Thieu | National University of Singapore |
Le, Ngan | University of Arkansas |
Nguyen, Anh | University of Liverpool |
Keywords: Perception-Action Coupling, Representation Learning
Abstract: Vision language models have played a key role in extracting meaningful features for various robotic applications. Among these, Contrastive Language-Image Pretraining (CLIP) is widely used in robotic tasks that require both vision and natural language understanding. However, CLIP was trained solely on static images paired with text prompts and has not yet been fully adapted for robotic tasks involving dynamic actions. In this paper, we introduce Robotic-CLIP to enhance robotic perception capabilities. We first gather and label large-scale action data, and then build our Robotic-CLIP by fine-tuning CLIP on 309,433 videos (~7.4 million frames) of action data using contrastive learning. By leveraging action data, Robotic-CLIP inherits CLIP's strong image performance while gaining the ability to understand actions in robotic contexts. Intensive experiments show that our Robotic-CLIP outperforms other CLIP-based models across various language-driven robotic tasks. Additionally, we demonstrate the practical effectiveness of Robotic-CLIP in real-world grasping applications.
|
|
08:35-08:40, Paper WeAT21.2 | |
In-Context Imitation Learning Via Next-Token Prediction |
|
Fu, Letian | UC Berkeley |
Huang, Huang | University of California at Berkeley |
Datta, Gaurav | UC Berkeley |
Chen, Lawrence Yunliang | UC Berkeley |
Panitch, William | University of California, Berkeley |
Liu, Fangchen | University of California, Berkeley |
Li, Hui | Autodesk Research |
Goldberg, Ken | UC Berkeley |
Keywords: Learning from Demonstration, Imitation Learning, Data Sets for Robot Learning
Abstract: In-context imitation learning is the capability to perform novel tasks when prompted with task demonstration examples. In-Context Robot Transformer (ICRT) is a causal transformer that performs autoregressive prediction on sensorimotor trajectories, which include images, proprioceptive states, and actions. This approach supports flexible and training-free execution of new tasks at test time. Experiments with a Franka Emika robot demonstrate that ICRT can adapt to new environment configurations that differ from both the prompt and the training data. In a multi-task environment setup, ICRT significantly outperforms current state-of-the-art robot foundation models on generalization to unseen tasks. Code, data, and appendix are available on https://icrt.dev.
|
|
08:40-08:45, Paper WeAT21.3 | |
Data Augmentation for NeRFs in the Low Data Limit |
|
Gaggar, Ayush | Northwestern University |
Murphey, Todd | Northwestern University |
Keywords: Incremental Learning, Deep Learning for Visual Perception, Planning under Uncertainty
Abstract: Current methods based on Neural Radiance Fields fail in the low data limit, particularly when training on incomplete scene data. Prior works augment training data only in next-best-view applications, leading to hallucinations and model collapse with sparse data. In contrast, we propose adding a set of views during training by rejection sampling from a posterior uncertainty distribution, generated by combining a volumetric uncertainty estimator with spatial coverage. We validate our results on partially observed scenes; on average, our method performs 39.9% better with 87.5% less variability across established scene reconstruction benchmarks, compared to state-of-the-art baselines. We further demonstrate that augmenting the training set by sampling from any distribution leads to better, more consistent scene reconstruction in sparse environments. This work is foundational for robotic tasks where augmenting a dataset with informative data is critical in resource-constrained, a priori unknown environments. Videos and source code are available at https://murpheylab.github.io/low-data-nerf.
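As a rough sketch of the view-selection idea described above (not the authors' implementation), candidate views below are drawn by rejection sampling from an unnormalised score that combines a per-view uncertainty term with a spatial-coverage term. The scoring functions and the parameterisation of a view are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def uncertainty(view):
    """Stand-in for a volumetric uncertainty estimate along a view direction."""
    return 0.5 + 0.5 * np.sin(3.0 * view[0]) ** 2

def coverage_gain(view, chosen):
    """Stand-in for spatial coverage: reward views far from those already chosen."""
    if not chosen:
        return 1.0
    d = min(np.linalg.norm(view - c) for c in chosen)
    return min(1.0, d / np.pi)

def sample_views(n_views, max_tries=10000):
    """Rejection sampling: accept a uniformly drawn view with probability
    proportional to uncertainty(view) * coverage_gain(view)."""
    chosen = []
    tries = 0
    while len(chosen) < n_views and tries < max_tries:
        tries += 1
        view = rng.uniform([0.0, -0.5], [2.0 * np.pi, 0.5])   # (azimuth, elevation)
        score = uncertainty(view) * coverage_gain(view, chosen)
        if rng.uniform() < score:            # scores are already within [0, 1] here
            chosen.append(view)
    return np.array(chosen)

if __name__ == "__main__":
    print(sample_views(5).round(2))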
|
|
08:45-08:50, Paper WeAT21.4 | |
Generalizable Imitation Learning through Pre-Trained Representations |
|
Chang, Wei-Di | McGill University |
Hogan, Francois | Massachusetts Institute of Technology |
Fujimoto, Scott | McGill University |
Meger, David Paul | McGill University |
Dudek, Gregory | McGill University |
Keywords: Imitation Learning, Learning from Demonstration, Representation Learning
Abstract: In this paper, we leverage self-supervised vision transformer models and their emergent semantic abilities to improve the generalization abilities of imitation learning policies. We introduce DVK, an imitation learning algorithm that leverages rich pre-trained Visual Transformer patch-level embeddings to obtain better generalization when learning through demonstrations. Our learner sees the world by clustering appearance features into groups associated with semantic concepts, forming stable keypoints that generalize across a wide range of appearance variations and object types. We demonstrate how this representation enables generalized behaviour by evaluating imitation learning across a diverse dataset of object manipulation tasks. To facilitate further study of generalization in Imitation Learning, all of our code for the method and evaluation, as well as the dataset, is made available.
|
|
08:50-08:55, Paper WeAT21.5 | |
Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors Via Language Grounding |
|
Jones, Joshua | University of California, Berkeley |
Mees, Oier | University of California, Berkeley |
Sferrazza, Carmelo | UC Berkeley |
Stachowicz, Kyle | University of California, Berkeley |
Abbeel, Pieter | UC Berkeley |
Levine, Sergey | UC Berkeley |
Keywords: Big Data in Robotics and Automation, Sensorimotor Learning, Learning from Demonstration
Abstract: Interacting with the world is a multi-sensory experience: achieving effective general-purpose interaction requires making use of all available modalities -- including vision, touch, and audio -- to fill in gaps from partial observation. For example, when vision is occluded while reaching into a bag, a robot should rely on its senses of touch and sound. However, state-of-the-art generalist robot policies are typically trained on large datasets to predict robot actions solely from visual and proprioceptive observations. In this work, we propose FuSe, a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities for which large datasets are not readily available by leveraging natural language as a common cross-modal grounding. We combine a multimodal contrastive loss with a sensory-grounded language generation loss to encode high-level semantics. In the context of robot manipulation, we show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound in a zero-shot setting, such as multimodal prompting, compositional cross-modal prompting, and descriptions of objects it interacts with. We show that the same recipe is applicable to widely different generalist policies, including both diffusion-based generalist policies and large vision-language-action (VLA) models. Extensive experiments in the real world show that FuSe is able to increase success rates by over 20% compared to all considered baselines.
|
|
08:55-09:00, Paper WeAT21.6 | |
Simultaneous Geometry and Pose Estimation of Held Objects Via 3D Foundation Models |
|
Zhi, Weiming | Carnegie Mellon University |
Tang, Haozhan | Carnegie Mellon University |
Zhang, Tianyi | Carnegie Mellon University |
Johnson-Roberson, Matthew | Carnegie Mellon University |
Keywords: Deep Learning for Visual Perception, Perception for Grasping and Manipulation, Deep Learning Methods
Abstract: Humans have the remarkable ability to use held objects as tools to interact with their environment. For this to occur, humans internally estimate how hand movements affect the object's movement. We wish to endow robots with this capability. We contribute methodology to jointly estimate the geometry and pose of objects grasped by a robot, from RGB images captured by an external camera. Notably, our method transforms the estimated geometry into the robot's coordinate frame, while not requiring the extrinsic parameters of the external camera to be calibrated. Our approach leverages 3D foundation models, large models pre-trained on huge datasets for 3D vision tasks, to produce initial estimates of the in-hand object. These initial estimations do not have physically correct scales and are in the camera's frame. Then, we formulate, and efficiently solve, a coordinate-alignment problem to recover accurate scales, along with a transformation of the objects to the coordinate frame of the robot. Forward kinematics mappings can subsequently be defined from the manipulator's joint angles to specified points on the object. These mappings enable the estimation of points on the held object at arbitrary configurations, enabling robot motion to be designed with respect to coordinates on the grasped objects. We empirically evaluate our approach on a robot manipulator holding a diverse set of real-world objects.
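The coordinate-alignment step described above recovers a scale and a rigid transform from point correspondences; a standard closed-form tool for this is the Umeyama method, sketched below as a generic illustration rather than the authors' specific formulation. The synthetic correspondences in the usage example are assumptions.

import numpy as np

def umeyama_alignment(src, dst):
    """Return (s, R, t) such that dst ~= s * R @ src_i + t, for (N, 3) arrays."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    var_src = (src_c ** 2).sum() / src.shape[0]
    cov = dst_c.T @ src_c / src.shape[0]
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    src = rng.normal(size=(50, 3))                 # points in the camera/model frame
    angle = 0.4
    R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                       [np.sin(angle),  np.cos(angle), 0],
                       [0, 0, 1]])
    dst = 2.5 * src @ R_true.T + np.array([0.1, -0.3, 0.7])   # robot-frame points
    s, R, t = umeyama_alignment(src, dst)
    print(round(s, 3), np.allclose(R, R_true, atol=1e-6))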
|
|
WeAT22 |
411 |
Learning for Robot Control |
Regular Session |
|
08:30-08:35, Paper WeAT22.1 | |
Gradient Descent-Based Task-Orientation Robot Control Enhanced with Gaussian Process Predictions |
|
Roveda, Loris | SUPSI-IDSIA |
Pavone, Marco | Stanford University |
Keywords: Machine Learning for Robot Control, Model Learning for Control, Compliance and Impedance Control
Abstract: This paper proposes a novel force-based task-orientation controller for interaction tasks with environmental orientation uncertainties. The main aim of the controller is to align the robot tool along the main task direction (e.g., along a screwing, insertion, or polishing direction) without the use of any external sensors (e.g., vision systems), relying only on end-effector wrench measurements or estimates. We propose a gradient descent-based orientation controller, enhancing its performance with the orientation predictions provided by a Gaussian Process model. The derivation of the controller is presented, together with simulation results (considering a probing task) and experimental results involving various re-orientation scenarios, i.e., i) a task with the robot in interaction with a soft environment, ii) a task with the robot in interaction with a stiff and inclined environment, and iii) a task to enable the assembly of a gear into its shaft. The proposed controller is compared against a state-of-the-art approach, highlighting its ability to re-orient the robot tool even in complex tasks (where the state-of-the-art method fails).
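A hypothetical 1-DoF sketch of the general idea, assuming a wrench-derived misalignment cost, a finite-difference gradient step, and a Gaussian Process fitted on the interaction history whose predicted minimiser biases the step. The cost model, gains, and blending rule are illustrative, not the authors' controller.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

TRUE_TASK_ANGLE = 0.6          # unknown task direction the tool should align to

def lateral_force_cost(theta):
    """Proxy cost: lateral force grows with misalignment (plus sensor noise)."""
    return (theta - TRUE_TASK_ANGLE) ** 2 + 0.01 * np.random.randn()

def align(theta0=0.0, steps=30, lr=0.4, eps=0.05, blend=0.3):
    theta = theta0
    history_x, history_y = [], []
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-2)
    for _ in range(steps):
        # finite-difference gradient of the measured cost
        grad = (lateral_force_cost(theta + eps) - lateral_force_cost(theta - eps)) / (2 * eps)
        step = -lr * grad
        history_x.append([theta])
        history_y.append(lateral_force_cost(theta))
        if len(history_x) >= 5:
            gp.fit(np.array(history_x), np.array(history_y))
            # GP suggestion: the lowest predicted cost on a coarse candidate grid
            grid = np.linspace(theta - 0.5, theta + 0.5, 41).reshape(-1, 1)
            theta_gp = grid[np.argmin(gp.predict(grid)), 0]
            step = (1 - blend) * step + blend * (theta_gp - theta)
        theta += step
    return theta

if __name__ == "__main__":
    np.random.seed(0)
    print(round(align(), 3), "target:", TRUE_TASK_ANGLE)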
|
|
08:35-08:40, Paper WeAT22.2 | |
Model-Free Inverse H-Infinity Control for Imitation Learning (I) |
|
Xue, Wenqian | University of Florida |
Lian, Bosen | Auburn University |
Kartal, Yusuf | Turkish Aerospace |
Fan, Jialu | Northeastern University |
Chai, Tianyou | Northeastern University, Shenyang, China |
Lewis, Frank | The University of Texas at Arlington |
Keywords: Reinforcement Learning, Imitation Learning, Machine Learning for Robot Control
Abstract: This paper proposes a data-driven model-free inverse reinforcement learning (IRL) algorithm tailored for solving an inverse H-infinity control problem. In this problem, both an expert and a learner engage in H-infinity control to reject disturbances, and the learner's objective is to imitate the expert's behavior by reconstructing the expert's performance function through IRL techniques. Introducing zero-sum game principles, we first formulate a model-based single-loop IRL policy iteration algorithm that includes three key steps: updating the policy, action, and performance function using a new correction formula and the standard inverse optimal control principles. Building upon the model-based approach, we propose a model-free single-loop off-policy IRL algorithm that eliminates the need for initial stabilizing policies and prior knowledge of the dynamics of expert and learner. Also, we provide rigorous proof of convergence, stability, and Nash optimality to guarantee the effectiveness and reliability of the proposed algorithms. Furthermore, we showcase the efficiency of our algorithm through simulations and experiments, highlighting its advantages compared to the existing methods.
|
|
08:40-08:45, Paper WeAT22.3 | |
Learning Object Properties Using Robot Proprioception Via Differentiable Robot-Object Interaction |
|
Chen, Peter Yichen | MIT |
Liu, Chao | Massachusetts Institute of Technology |
Ma, Pingchuan | MIT CSAIL |
Eastman, John | Massachusetts Institute of Technology |
Rus, Daniela | MIT |
Randle, Dylan Labatt | Amazon Robotics |
Ivanov, Yuri | Amazon |
Matusik, Wojciech | MIT |
Keywords: Machine Learning for Robot Control, Sensorimotor Learning, Learning from Demonstration
Abstract: Differentiable simulation has become a powerful tool for system identification. While prior work has focused on identifying robot properties using robot-specific data or object properties using object-specific data, our approach calibrates object properties by using information from the robot, without relying on data from the object itself. Specifically, we utilize robot joint encoder information, which is commonly available in standard robotic systems. Our key observation is that by analyzing the robot's reactions to manipulated objects, we can infer properties of those objects, such as inertia and softness. Leveraging this insight, we develop differentiable simulations of robot-object interactions to inversely identify the properties of the manipulated objects. Our approach relies solely on proprioception — the robot’s internal sensing capabilities — and does not require external measurement tools or vision-based tracking systems. This general method is applicable to any articulated robot and requires only joint position information. We demonstrate the effectiveness of our method on a low-cost robotic platform, achieving accurate mass and elastic modulus estimations of manipulated objects with just a few seconds of computation on a laptop.
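The following toy sketch shows the underlying mechanism of identifying a physical parameter by backpropagating through a differentiable rollout, here a torque-driven damped pendulum whose mass is recovered from an observed joint trajectory. The dynamics, constants, and optimizer settings are illustrative assumptions; the paper applies the idea to full robot-object interactions using joint-encoder data.

import torch

G, L, B, DT, STEPS = 9.81, 0.5, 0.05, 0.01, 300
TORQUE = 0.4                                     # constant applied joint torque

def rollout(mass):
    """Semi-implicit Euler rollout of theta'' = (tau - b*w)/(m*L^2) - (g/L)*sin(theta)."""
    theta = torch.tensor(0.0)
    omega = torch.tensor(0.0)
    trajectory = []
    for _ in range(STEPS):
        alpha = (TORQUE - B * omega) / (mass * L ** 2) - (G / L) * torch.sin(theta)
        omega = omega + DT * alpha
        theta = theta + DT * omega
        trajectory.append(theta)
    return torch.stack(trajectory)

if __name__ == "__main__":
    true_mass = torch.tensor(1.7)
    observed = rollout(true_mass).detach()       # stands in for joint-encoder data

    mass = torch.tensor(1.0, requires_grad=True) # initial guess
    opt = torch.optim.Adam([mass], lr=0.05)
    for it in range(200):
        opt.zero_grad()
        loss = torch.mean((rollout(mass) - observed) ** 2)
        loss.backward()
        opt.step()
    print(f"estimated mass: {mass.item():.3f}  (true 1.700)")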
|
|
08:45-08:50, Paper WeAT22.4 | |
Reservoir Computing Encodes Physical Adaptations for Reinforcement Learning |
|
Giannetto, Cross | CCIR |
Ibragim, Atadjanov | Kyung Hee University |
Iida, Fumiya | University of Cambridge |
Abdulali, Arsen | Cambridge University |
Keywords: Machine Learning for Robot Control, Deep Learning Methods, Reinforcement Learning
Abstract: Adapting reinforcement learning (RL) policies to various robot body configurations is a significant challenge for creating flexible autonomous systems. This study presents a novel framework that integrates Reservoir Computing (RC) with the First-Order Reduced and Controlled Error (FORCE) learning rule to enhance policy adaptability in RL. The RC serves as a dynamic feature extractor, capturing temporal dependencies by pre-training on state transitions generated through random actions. This pre-training acts as regularization, reducing variance and preventing overfitting to specific configurations. Subsequently, the control policy network is trained on a limited set of body variations using the enriched features from the RC. Experimental results across three distinct environments demonstrate that the proposed RC+FORCE framework significantly improves policy performance and adaptability to unseen robot configurations compared to traditional reinforcement learning through domain randomization. These findings highlight the effectiveness of combining RC-based feature extraction with FORCE-based training in developing robust RL agents.
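A minimal sketch of reservoir computing with FORCE learning, assuming a fixed random recurrent reservoir and a linear readout adapted online with the recursive-least-squares rule on a toy signal-tracking task. Sizes, spectral radius, and the target signal are illustrative; the paper couples reservoir features with an RL policy network rather than a simple readout.

import numpy as np

rng = np.random.default_rng(0)

N, IN_DIM = 200, 1
W_in = rng.uniform(-0.5, 0.5, size=(N, IN_DIM))
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # scale spectral radius to 0.9

w_out = np.zeros(N)                               # trainable readout weights
P = np.eye(N)                                     # inverse correlation estimate
x = np.zeros(N)                                   # reservoir state

T = 2000
t = np.arange(T) * 0.01
u = np.sin(2 * np.pi * t)                         # input drive
target = np.sin(2 * np.pi * t + 0.5)              # phase-shifted target signal

errors = []
for k in range(T):
    x = np.tanh(W @ x + W_in @ np.array([u[k]]))  # reservoir update
    y = w_out @ x
    e = y - target[k]
    # FORCE / recursive-least-squares update of the readout only
    Px = P @ x
    gain = Px / (1.0 + x @ Px)
    P -= np.outer(gain, Px)
    w_out -= e * gain
    errors.append(abs(e))

print("mean |error| first 100 steps:", round(float(np.mean(errors[:100])), 3))
print("mean |error| last 100 steps:", round(float(np.mean(errors[-100:])), 3))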
|
|
08:50-08:55, Paper WeAT22.5 | |
Self-Supervised Meta-Learning for All-Layer DNN-Based Adaptive Control with Stability Guarantees |
|
He, Guanqi | Carnegie Mellon University |
Choudhary, Yogita | Carnegie Mellon University |
Shi, Guanya | Carnegie Mellon University |
Keywords: Machine Learning for Robot Control, Aerial Systems: Mechanics and Control, Robust/Adaptive Control
Abstract: A critical goal of adaptive control is enabling robots to rapidly adapt in dynamic environments. Recent studies have developed a meta-learning-based adaptive control scheme, which uses meta-learning to extract nonlinear features (represented by Deep Neural Networks (DNNs)) from offline data, and uses adaptive control to update linear coefficients online. However, such a scheme is fundamentally limited by the linear parameterization of uncertainties and does not fully unleash the capability of DNNs. This paper introduces a novel learning-based adaptive control framework that pretrains a DNN via self-supervised meta-learning (SSML) from offline trajectories and adapts the full DNN online via composite adaptation. In particular, the offline SSML stage leverages the time consistency in trajectory data to train the DNN to predict future disturbances from history, in a self-supervised manner without environment condition labels. The online stage carefully designs a control law and an adaptation law to update the full DNN with stability guarantees. Empirically, the proposed framework significantly outperforms (19-39%) various classic and learning-based adaptive control baselines in challenging real-world quadrotor tracking problems under large dynamic wind disturbance.
|
|
08:55-09:00, Paper WeAT22.6 | |
Residual Policy Learning for Perceptive Quadruped Control Using Differentiable Simulation |
|
Luo, Jing Yuan | ETH Zurich |
Song, Yunlong | University of Zurich |
Klemm, Victor | ETH Zurich |
Shi, Fan | National University of Singapore |
Scaramuzza, Davide | University of Zurich |
Hutter, Marco | ETH Zurich |
Keywords: Machine Learning for Robot Control, Vision-Based Navigation, Legged Robots
Abstract: First-order Policy Gradient (FoPG) algorithms such as Backpropagation through Time and Analytical Policy Gradients leverage local simulation physics to accelerate policy search, significantly improving sample efficiency in robot control compared to standard model-free reinforcement learning. However, FoPG algorithms can exhibit poor learning dynamics in contact-rich tasks like locomotion. Previous approaches address this issue by alleviating contact dynamics via algorithmic or simulation innovations. In contrast, we propose guiding the policy search by learning a residual over a simple baseline policy. For quadruped locomotion, we find that the role of residual policy learning in FoPG-based training (FoPG RPL) is primarily to improve asymptotic rewards, compared to improving sample efficiency for model-free RL. Additionally, we provide insights on applying FoPGs to pixel-based local navigation, training a point-mass robot to convergence within seconds. Finally, we showcase the versatility of FoPG RPL by using it to train locomotion and perceptive navigation end-to-end on a quadruped in minutes.
|
|
WeAT23 |
412 |
Autonomous Vehicle Perception 3 |
Regular Session |
Chair: Wang, Shenlong | University of Illinois at Urbana-Champaign |
Co-Chair: Chen, Yong-Sheng | National Yang Ming Chiao Tung University |
|
08:30-08:35, Paper WeAT23.1 | |
METDrive: Multimodal End-To-End Autonomous Driving with Temporal Guidance |
|
Guo, Ziang | Skolkovo Institute of Science and Technology |
Lin, Xinhao | Institute of Automation, Qilu University of Technology (Shandong Academy of Sciences) |
Yagudin, Zakhar | Skolkovo Institute of Science and Technology |
Lykov, Artem | Skolkovo Institute of Science and Technology |
Wang, Yong | Institute of Automation, Qilu University of Technology (Shandong Academy of Sciences) |
Li, Yanqiang | Institute of Automation, Qilu University of Technology (Shandong Academy of Sciences) |
Tsetserukou, Dzmitry | Skolkovo Institute of Science and Technology |
Keywords: Imitation Learning, Integrated Planning and Learning, Sensor Fusion
Abstract: Multimodal end-to-end autonomous driving has shown promising advancements in recent work. By embedding more modalities into end-to-end networks, the system’s understanding of both static and dynamic aspects of the driving environment is enhanced, thereby improving the safety of autonomous driving. In this paper, we introduce METDrive, an end-to-end system that leverages temporal guidance from the embedded time series features of ego states, including rotation angles, steering, throttle signals, and waypoint vectors. The geometric features derived from perception sensor data and the time series features of ego state data jointly guide the waypoint prediction with the proposed temporal guidance loss function. We evaluated METDrive on the CARLA leaderboard benchmarks, achieving a driving score of 70%, a route completion score of 94%, and an infraction score of 0.78.
|
|
08:35-08:40, Paper WeAT23.2 | |
Generalizing Motion Planners with Mixture of Experts for Autonomous Driving |
|
Sun, Qiao | Shanghai QiZhi Institute |
Wang, Huimin | Li Auto |
Zhan, Jiahao | Fudan University |
Nie, Fan | Stanford University |
Wen, Xin | Li Auto |
Xu, Leimeng | Li Auto |
Zhan, Kun | LiAuto |
Jia, Peng | Li Auto |
Lang, Xianpeng | LiAuto |
Zhao, Hang | Tsinghua University |
Keywords: Learning from Demonstration, Representation Learning, Imitation Learning
Abstract: Large real-world driving datasets have sparked significant research into various aspects of learning-based motion planners for autonomous driving. These include data augmentation, model architecture, reward design, training strategies, and planner pipelines. In this paper, we review and benchmark previous methods. Experiments show that many of these approaches have limited generalization abilities in planning performance due to overly complex designs or training paradigms. Experiments further reveal that as models are appropriately scaled, many designs become redundant. Therefore, we introduce StateTransformer-2 (STR2), a scalable, decoder-only motion planner. STR2 uses a Vision Transformer (ViT) encoder and a mixture-of-experts (MoE) causal transformer architecture. The MoE backbone addresses modality collapse and reward balancing by expert routing during training. Extensive experiments on the NuPlan dataset show that our method generalizes better than previous approaches across different test sets and closed-loop simulations. We evaluate its scalability on billions of real-world urban driving scenarios, demonstrating consistent accuracy improvements as both data and model size grow.
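The sketch below shows a generic top-k mixture-of-experts feed-forward block, the kind of routing layer an MoE transformer backbone builds on; it is not the STR2 architecture, and the expert count, hidden sizes, and the omission of a load-balancing loss are illustrative simplifications.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=128, n_experts=4, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: (batch, tokens, d_model)
        b, t, d = x.shape
        flat = x.reshape(-1, d)                    # route each token independently
        logits = self.gate(flat)                   # (b*t, n_experts)
        top_vals, top_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)      # renormalise over the chosen experts
        out = torch.zeros_like(flat)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(flat[mask])
        return out.reshape(b, t, d)

if __name__ == "__main__":
    layer = TopKMoE()
    tokens = torch.randn(2, 10, 64)                # e.g. encoded scene and ego tokens
    print(layer(tokens).shape)                     # torch.Size([2, 10, 64])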
|
|
08:40-08:45, Paper WeAT23.3 | |
Low-Rank Adaptation-Based All-Weather Removal for Autonomous Navigation |
|
Rajagopalan, Sudarshan | Johns Hopkins University |
Patel, Vishal | Johns Hopkins University |
Keywords: Computer Vision for Automation, Autonomous Vehicle Navigation
Abstract: All-weather image restoration (AWIR) is crucial for reliable autonomous navigation under adverse weather conditions. AWIR models are trained to address a specific set of weather conditions such as fog, rain, and snow. But this causes them to often struggle with out-of-distribution (OoD) samples or unseen degradations which limits their effectiveness for real-world autonomous navigation. To overcome this issue, existing models must either be retrained or fine-tuned, both of which are inefficient and impractical, with retraining needing access to large datasets, and fine-tuning involving many parameters. In this paper, we propose using Low-Rank Adaptation (LoRA) to efficiently adapt a pre-trained all-weather model to novel weather restoration tasks. Furthermore, we observe that LoRA lowers the performance of the adapted model on the pre-trained restoration tasks. To address this issue, we introduce a LoRA-based fine-tuning method called LoRA-Align (LoRA-A) which seeks to align the singular vectors of the fine-tuned and pre-trained weight matrices using Singular Value Decomposition (SVD). This alignment helps preserve the model's knowledge of its original tasks while adapting it to unseen tasks. We show that images restored with LoRA and LoRA-A can be effectively used for computer vision tasks in autonomous navigation, such as semantic segmentation and depth estimation. Project page: https://sudraj2002.github.io/loraapage/.
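For reference, a minimal sketch of Low-Rank Adaptation on a single frozen linear layer: the pre-trained weight stays fixed and only the low-rank update B @ A is trained, so the effective weight is W + (alpha / r) * B @ A. This is generic LoRA, not the LoRA-Align variant described above, and the layer sizes are placeholders.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():           # freeze the pre-trained weights
            p.requires_grad_(False)
        self.scaling = alpha / r
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # update starts at zero

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())

if __name__ == "__main__":
    pretrained = nn.Linear(256, 256)               # stands in for a restoration-model layer
    adapted = LoRALinear(pretrained, r=8)
    trainable = [n for n, p in adapted.named_parameters() if p.requires_grad]
    print(trainable)                               # only ['A', 'B'] are updated
    print(adapted(torch.randn(4, 256)).shape)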
|
|
08:45-08:50, Paper WeAT23.4 | |
Stands on Shoulders of Giants: Learning to Lift 2D Detection to 3D with Geometry-Driven Objectives |
|
Chen, Jhih Rong | National Yang Ming Chiao Tung University |
Chang, Che Yuan | National Yang Ming Chiao Tung University |
Tseng, Szu Han | Elan Microelectronics Corporation |
Huang, Chih Sheng | Elan Microelectronics Corporation |
Chen, Yong-Sheng | National Yang Ming Chiao Tung University |
Chiu, Wei-Chen | National Chiao Tung University |
Keywords: Computer Vision for Automation, AI-Based Methods, Vision-Based Navigation
Abstract: 3D detection of vehicles is an essential component for autonomous driving applications. Nevertheless, collecting the supervised training data for learning 3D vehicle detectors would be costly (e.g. utilization of expensive LiDAR sensors) and labor-intensive (for human annotation). In comparison to 3D detection, 2D object detection has achieved a well-developed status, boasting stable and robust performance with widespread application in numerous fields, thanks to the large scale (i.e. amount of samples) of existing training datasets of 2D object detection. Hence, in our work, we propose to realize 3D detection via leveraging the robustness of 2D detectors and developing a network that lifts 2D detections to 3D. With the flexibility of building upon various backbone models (e.g. the models which take image regions detected by a 2D detector as inputs to predict their corresponding 3D bounding boxes, or the existing monocular 3D detection models which have the intermediate output of 2D bounding boxes), we propose several geometry-driven objectives, including projection consistency loss, geometry depth loss, and opposite bin loss, to improve the training upon 2D-to-3D lifting. Our extensive experimental results demonstrate that our proposed geometry-driven objectives not only contribute to the superior results of 3D detection but also provide better generalizability across datasets.
|
|
08:50-08:55, Paper WeAT23.5 | |
LidarDM: Generative LiDAR Simulation in a Generated World |
|
Zyrianov, Vlas | University of Illinois at Urbana-Champaign |
Che, Henry | University of Illinois, Urbana-Champaign |
Liu, Zhijian | Massachusetts Institute of Technology |
Wang, Shenlong | University of Illinois at Urbana-Champaign |
Keywords: Autonomous Vehicle Navigation, Simulation and Animation, AI-Based Methods
Abstract: We present LidarDM, a novel LiDAR generative model capable of producing realistic, layout-aware, physically plausible, and temporally coherent LiDAR videos. LidarDM stands out with two unprecedented capabilities in LiDAR generative modeling: (i) LiDAR generation guided by driving scenarios, offering significant potential for autonomous driving simulations, and (ii) 4D LiDAR point cloud generation, enabling the creation of realistic and temporally coherent sequences. At the heart of our model is a novel integrated 4D world generation framework. Specifically, we employ latent diffusion models to generate the 3D scene, combine it with dynamic actors to form the underlying 4D world, and subsequently produce realistic sensory observations within this virtual environment. Our experiments indicate that our approach outperforms competing algorithms in realism, temporal coherency, and layout consistency. We additionally show that LidarDM can be used as a generative world model simulator for training and testing perception models.
|
|
08:55-09:00, Paper WeAT23.6 | |
RenderWorld: World Model with Self-Supervised 3D Label |
|
Yan, Ziyang | University of Trento |
Dong, Wenzhen | The Chinese University of HongKong |
Shao, Yihua | University of Science and Technology Beijing |
Lu, Yuhang | ShanghaiTech University |
Liu, Haiyang | University of Science and Technology Beijing |
Liu, Jingwen | University of Science and Technology Beijing |
Wang, Haozhe | Hong Kong University of Science and Technology |
Wang, Zhe | Institute for AI Industry Research, Tsinghua University |
Wang, Yan | Tsinghua University |
Remondino, Fabio | FBK |
Ma, Yuexin | ShanghaiTech University |
Keywords: Computer Vision for Automation, Planning, Scheduling and Coordination, Object Detection, Segmentation and Categorization
Abstract: End-to-end autonomous driving with vision only is not only more cost-effective compared to LiDAR-vision fusion but also more reliable than traditional methods. To achieve an economical and robust purely visual autonomous driving system, we propose RenderWorld, a vision-only end-to-end autonomous driving framework, which generates 3D occupancy labels using a self-supervised Gaussian-based Img2Occ module, encodes the labels with an AM-VAE, and uses a world model for forecasting and planning. RenderWorld employs Gaussian Splatting to represent 3D scenes and render 2D images, which greatly improves segmentation accuracy and reduces GPU memory consumption compared with NeRF-based methods. By applying the AM-VAE to encode air and non-air separately, RenderWorld achieves a more fine-grained representation of scene elements, leading to state-of-the-art performance in both 4D occupancy forecasting and motion planning from an autoregressive world model.
|
|
WeBT2 |
301 |
SLAM 4 |
Regular Session |
Chair: Rosen, David | Northeastern University |
Co-Chair: De Cristóforis, Pablo | University of Buenos Aires |
|
09:55-10:00, Paper WeBT2.1 | |
Introspective Loop Closure for SLAM with 4D Imaging Radar |
|
Hilger, Maximilian | Technical University of Munich |
Kubelka, Vladimir | Örebro University |
Adolfsson, Daniel | Örebro University |
Becker, Ralf | Company Bosch Rexroth |
Andreasson, Henrik | Örebro University |
Lilienthal, Achim J. | Orebro University |
Keywords: SLAM, Mapping, Localization
Abstract: Simultaneous Localization and Mapping (SLAM) allows mobile robots to navigate without external positioning systems or pre-existing maps. Radar is emerging as a valuable sensing tool, especially in vision-obstructed environments, as it is less affected by particles than lidars or cameras. Modern 4D imaging radars provide three-dimensional geometric information and relative velocity measurements, but they bring challenges such as a small field of view and sparse, noisy point clouds. Detecting loop closures in SLAM is critical for reducing trajectory drift and maintaining map accuracy. However, the directional nature of 4D radar data makes identifying loop closures, especially from reverse viewpoints, difficult due to limited scan overlap. This article explores using 4D radar for loop closure in SLAM, focusing on similar and opposing viewpoints. We generate submaps for a denser environment representation and use introspective measures to reject false detections in feature-degenerate environments. Our experiments show accurate loop closure detection in geometrically diverse settings for both similar and opposing viewpoints, improving trajectory estimation by up to 82% in ATE and rejecting false positives in self-similar environments.
|
|
10:00-10:05, Paper WeBT2.2 | |
Range-Based 6-DoF Monte Carlo SLAM with Gradient-Guided Particle Filter on GPU |
|
Nakao, Takumi | Nagoya University |
Koide, Kenji | National Institute of Advanced Industrial Science and Technology |
Takanose, Aoki | National Institute of Advanced Industrial Science and Technology |
Oishi, Shuji | National Institute of Advanced Industrial Science and Technology |
Yokozuka, Masashi | Nat. Inst. of Advanced Industrial Science and Technology |
Date, Hisashi | University of Tsukuba |
Keywords: SLAM, Mapping, Range Sensing
Abstract: This paper presents range-based 6-DoF Monte Carlo SLAM with a gradient-guided particle update strategy. While non-parametric state estimation methods, such as particle filters, are robust in situations with high ambiguity, they are known to be unsuitable for high-dimensional problems due to the curse of dimensionality. To address this issue, we propose a particle update strategy that improves the sampling efficiency by using the gradient information of the likelihood function to guide particles toward its local maxima. Additionally, we introduce a keyframe-based map representation that represents the global map as a set of past frames (i.e., keyframes) to mitigate memory consumption. The keyframe poses for each particle are corrected using a simple loop closure method to maintain trajectory consistency. The combination of gradient information and keyframe-based map representation significantly enhances sampling efficiency and reduces memory usage compared to traditional RBPF approaches. To process a large number of particles (e.g., 100,000) in real time, the proposed framework is designed to fully exploit GPU parallel processing. Experimental results demonstrate that the proposed method exhibits extreme robustness to state ambiguity and can even deal with kidnapping situations, such as when the sensor moves to different floors via an elevator, with minimal heuristics.
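A toy 2D sketch of the gradient-guided particle update idea, under stated assumptions: after a diffusion step, each particle takes a few small steps up the gradient of the observation log-likelihood before importance weighting and resampling. The Gaussian likelihood and step sizes are illustrative; the paper applies the idea to 6-DoF range-based SLAM on a GPU.

import numpy as np

rng = np.random.default_rng(0)
TRUE_POSE = np.array([2.0, -1.0])
OBS_STD = 0.3

def log_likelihood(p):
    return -0.5 * np.sum((p - TRUE_POSE) ** 2, axis=-1) / OBS_STD ** 2

def grad_log_likelihood(p):
    return -(p - TRUE_POSE) / OBS_STD ** 2

def gradient_guided_update(particles, guide_steps=3, step=0.02, noise=0.05):
    # motion / diffusion update
    particles = particles + rng.normal(scale=noise, size=particles.shape)
    # guide particles toward local maxima of the likelihood
    for _ in range(guide_steps):
        particles = particles + step * grad_log_likelihood(particles)
    # importance weights and multinomial resampling
    ll = log_likelihood(particles)
    w = np.exp(ll - ll.max())
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

if __name__ == "__main__":
    particles = rng.uniform(-5, 5, size=(5000, 2))
    for _ in range(10):
        particles = gradient_guided_update(particles)
    print("estimate:", particles.mean(axis=0).round(2), "true:", TRUE_POSE)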
|
|
10:05-10:10, Paper WeBT2.3 | |
Distributed Certifiably Correct Range-Aided SLAM |
|
Thoms, Alexander | University of California Los Angeles |
Papalia, Alan | Massachusetts Institute of Technology |
Velasquez, Jared | University of California, Los Angeles |
Rosen, David | Northeastern University |
Narasimhan, Sriram | University of California, Los Angeles |
Keywords: Multi-Robot SLAM, Range Sensing
Abstract: Reliable simultaneous localization and mapping (SLAM) algorithms are necessary for safety-critical autonomous navigation. In the communication-constrained multi-agent setting, navigation systems increasingly use point-to-point range sensors as they afford measurements with low bandwidth requirements and known data association. The state estimation problem for these systems takes the form of range-aided (RA) SLAM. However, distributed algorithms for solving the RA-SLAM problem lack formal guarantees on the quality of the returned estimate. To this end, we present the first distributed algorithm for RA-SLAM that can efficiently recover certifiably globally optimal solutions. Our algorithm, distributed certifiably correct RA-SLAM (DCORA), achieves this via the Riemannian Staircase method, where computational procedures developed for distributed certifiably correct pose graph optimization are generalized to the RA-SLAM problem. We demonstrate DCORA's efficacy on real-world multi-agent datasets by achieving absolute trajectory errors comparable to those of a state-of-the-art centralized certifiably correct RA-SLAM algorithm. Additionally, we perform a parametric study on the structure of the RA-SLAM problem using synthetic data, revealing how common parameters affect DCORA's performance.
|
|
10:10-10:15, Paper WeBT2.4 | |
CoVoxSLAM: GPU Accelerated Globally Consistent Dense SLAM |
|
Hoss, Emiliano | University of Buenos Aires |
De Cristóforis, Pablo | University of Buenos Aires |
Keywords: SLAM, Mapping, Embedded Systems for Robotic and Automation
Abstract: A dense SLAM system is essential for mobile robots, as it provides localization and allows navigation, path planning, obstacle avoidance, and decision making in unstructured environments. Due to increasing computational demands, the use of GPUs in dense SLAM is expanding. In this work, we present coVoxSLAM, a novel GPU-accelerated volumetric SLAM system that takes full advantage of the parallel processing power of the GPU to build globally consistent maps even in large-scale environments. It was deployed on different platforms (discrete and embedded GPUs) and compared with the state of the art. The results obtained using public datasets show that coVoxSLAM delivers a significant improvement in execution times while maintaining accurate localization. The system is available open source on GitHub: https://github.com/lrse-uba/coVoxSLAM.
|
|
10:15-10:20, Paper WeBT2.5 | |
Radar4VoxMap: Accurate Odometry from Blurred Radar Observations |
|
Seok, Jiwon | Hanyang University |
Kim, Soyeong | Hanyang University |
Jo, Jaeyoung | Konkuk University, Smart Vehicle Engineering |
Lee, Jaehwan | Hanyang University |
Minseo, Jung | Hanyang |
Jo, Kichun | Hanyang University |
Keywords: SLAM, Mapping, Range Sensing
Abstract: Compared to conventional 3D radar, the 4D imaging radar provides additional height data and finer resolution measurements. Moreover, compared to LiDAR sensors, 4D imaging radar is more cost-effective and offers enhanced durability against challenging weather conditions. Despite these advantages, radar-based localization systems face several challenges, including limited resolution, leading to scattered object recognition and less precise localization. Additionally, existing methods that form submaps from filtered results can accumulate errors, leading to blurred submaps and reducing the accuracy of SLAM and odometry. To address these challenges, this paper introduces Radar4VoxMap, a novel approach designed to enhance radar-only odometry. The method includes an RCS-weighted voxel distribution map that improves registration accuracy. Furthermore, fixed-lag graph optimization is used to optimize both the submap and the pose, effectively reducing cumulative errors. The proposed method has shown strong performance on open datasets. The code is available at: https://github.com/ailab-hanyang/Radar4VoxMap.
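As a rough illustration (not the authors' implementation), an RCS-weighted voxel distribution map can be built by accumulating a weighted mean and covariance per voxel, with the radar cross-section acting as the weight; the function name, voxel size, and weighting details below are assumptions:

import numpy as np
from collections import defaultdict

def rcs_weighted_voxel_map(points, rcs, voxel_size=1.0):
    """Accumulate an RCS-weighted mean and covariance per voxel.
    points: (N, 3) radar points; rcs: (N,) radar cross-section values used as weights."""
    voxels = defaultdict(lambda: {"w": 0.0, "mean": np.zeros(3), "M2": np.zeros((3, 3))})
    for p, w in zip(points, rcs):
        key = tuple(np.floor(p / voxel_size).astype(int))
        v = voxels[key]
        v["w"] += w
        delta = p - v["mean"]
        v["mean"] += (w / v["w"]) * delta                # weighted incremental mean
        v["M2"] += w * np.outer(delta, p - v["mean"])    # weighted scatter (West's update)
    return {k: (v["mean"], v["M2"] / v["w"]) for k, v in voxels.items()}

Registering against per-voxel Gaussian statistics rather than raw, scattered returns is what lets strong (high-RCS) reflectors dominate the alignment.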
|
|
10:20-10:25, Paper WeBT2.6 | |
GenZ-ICP: Generalizable and Degeneracy-Robust LiDAR Odometry Using an Adaptive Weighting |
|
Lee, Daehan | Pohang University of Science and Technology |
Lim, Hyungtae | Massachusetts Institute of Technology |
Han, Soohee | Pohang University of Science and Technology ( POSTECH ) |
Keywords: SLAM, Localization, Mapping
Abstract: Light detection and ranging (LiDAR)-based odometry has been widely utilized for pose estimation due to its use of high-accuracy range measurements and immunity to ambient light conditions. However, the performance of LiDAR odometry varies depending on the environment and deteriorates in degenerative environments such as long corridors. This issue stems from the dependence on a single error metric, which has different strengths and weaknesses depending on the geometrical characteristics of the surroundings. To address these problems, this study proposes a novel iterative closest point (ICP) method called GenZ-ICP. We revisit both point-to-plane and point-to-point error metrics and propose a method that leverages their strengths in a complementary manner. Moreover, adaptability to diverse environments is enhanced by utilizing an adaptive weight that is adjusted based on the geometrical characteristics of the surroundings. As demonstrated in our experimental evaluation, the proposed GenZ-ICP exhibits high adaptability to various environments and resilience to optimization degradation in corridor-like degenerative scenarios by preventing ill-posed problems during the optimization process. Our code is available at https://github.com/cocel-postech/genz-icp.
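As a sketch of how the two error metrics can be combined in a complementary manner, the residuals can be stacked under a single blending weight; how alpha is adapted from the scene's geometric characteristics is the paper's contribution and is not reproduced here, so alpha is taken as given:

import numpy as np

def blended_icp_residuals(src, tgt, tgt_normals, alpha):
    """Blend point-to-plane and point-to-point residuals with weight alpha in [0, 1].
    src, tgt: (N, 3) corresponding points; tgt_normals: (N, 3) unit normals at tgt."""
    diff = src - tgt
    r_plane = np.sum(diff * tgt_normals, axis=1)   # (N,) signed distances to the tangent planes
    r_point = diff.ravel()                         # (3N,) raw point-to-point differences
    # Stacking so that a least-squares solver minimizes
    #   alpha * ||r_plane||^2 + (1 - alpha) * ||r_point||^2
    return np.concatenate([np.sqrt(alpha) * r_plane, np.sqrt(1.0 - alpha) * r_point])

In a corridor, point-to-plane residuals alone leave the motion along the corridor axis poorly constrained, which is exactly the kind of ill-posedness an adaptive weight is meant to prevent.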
|
|
10:25-10:30, Paper WeBT2.7 | |
Free-Init: Scan-Free, Motion-Free, and Correspondence-Free Initialization for Doppler LiDAR-Inertial Systems |
|
Zhao, Mingle | University of Macau |
Wang, Jiahao | University of Macau |
Gao, Tianxiao | University of Macao |
Xu, Chengzhong | University of Macau |
Kong, Hui | University of Macau |
Keywords: SLAM, Localization, Mapping
Abstract: Robust initialization is crucial for online systems. In the letter, a high-frequency and resilient initialization framework is designed for LiDAR-inertial systems, leveraging both inertial sensors and Doppler LiDAR. The innovative FMCW Doppler LiDAR opens up a novel avenue for robotic sensing by capturing not only point range but also Doppler velocity via the intrinsic Doppler effect. By fusing point-wise Doppler velocity with inertial measurements under non-inertial kinematics, the proposed framework, Free-Init, eliminates reliance on motion undistortion of LiDAR scans, excitation motions, and map correspondences during the initialization phase. Free-Init is also plug-and-play compatible with typical LiDAR-inertial systems and is versatile to handle a wide range of initial motions when the system starts, including stationary, dynamic, and even violent motions. The embedded Doppler-inertial velocimeter ensures fast convergence and high-frequency performance, delivering outputs exceeding 10 kHz. Comprehensive experiments on diverse platforms and across myriad motion scenes validate the framework's effectiveness. The results demonstrate the superior performance of Free-Init, highlighting the necessity of fast, resilient, and dynamic initialization for online systems.
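The relation that makes a scan-free, motion-free, correspondence-free start possible is that, for a static point, the measured Doppler (radial) velocity is the projection of the sensor's velocity onto the ray direction. A least-squares sketch under that assumption (the sign convention and function name are mine):

import numpy as np

def sensor_velocity_from_doppler(points, doppler):
    """Estimate the sensor's linear velocity from per-point Doppler returns.
    points: (N, 3) points in the sensor frame; doppler: (N,) radial velocities in m/s."""
    dirs = points / np.linalg.norm(points, axis=1, keepdims=True)   # unit ray directions d_i
    # Static-world model: measured radial velocity = -d_i . v_sensor
    A, b = -dirs, doppler
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v   # (3,) velocity; dynamic points violate the model and show up as large residuals

Because every scan, or even a fraction of one, yields such a velocity estimate, it can be fused with the IMU at a very high rate, which is what the embedded Doppler-inertial velocimeter exploits.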
|
|
WeBT3 |
303 |
Mechanism Design 2 |
Regular Session |
Chair: Yim, Justin K. | University of Illinois Urbana-Champaign |
Co-Chair: Santin, Marco | Aalen University |
|
09:55-10:00, Paper WeBT3.1 | |
Development of a 2-DOF Singularity-Free Spherical Parallel Remote Center of Motion Mechanism with Extensive Range of Motion |
|
Liu, Chun | National Taiwan University |
Lin, Pei-Chun | National Taiwan University |
Keywords: Actuation and Joint Mechanisms, Mechanism Design
Abstract: In this paper, we report the development of an innovative two-degrees-of-freedom (2-DOF) spherical parallel remote center of motion mechanism (SPRCMM), which can offer a wide range of movement in both DOFs without encountering singularities. To facilitate the design process, the paper briefly reviews the existing spherical joints, including serial and parallel structures with and without the remote center of motion (RCM). Aiming at combining the advantages of these existing spherical joints, this paper proposes a novel design that utilizes the parallelogram mechanism to form a parallel RCM mechanism without using universal or spherical joints. Forward and inverse kinematics were constructed using the product of the exponentials. Moreover, space and closed Jacobians were derived, accompanied by manipulability in the available workspace for the mechanism. The prototype of the 2-DOF SPRCMM was built and experimentally evaluated. The experimental results confirm that the singularity-free motion of the two DOFs of the mechanism in a wide range is feasible, and the root mean squared errors in the trajectory tracking of the mechanism in most states were less than 10% of the motion range.
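The abstract mentions forward and inverse kinematics built from the product of exponentials; a generic PoE forward-kinematics sketch is given below (the specific screw axes of the 2-DOF SPRCMM, which would intersect at the remote center of motion, are not reproduced):

import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]])

def exp_twist(S, theta):
    """Matrix exponential of a unit twist S = (w, v) over angle theta (Rodrigues' formula)."""
    w, v = S[:3], S[3:]
    W = hat(w)
    R = np.eye(3) + np.sin(theta) * W + (1.0 - np.cos(theta)) * W @ W
    G = np.eye(3) * theta + (1.0 - np.cos(theta)) * W + (theta - np.sin(theta)) * W @ W
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = G @ v
    return T

def fk_poe(screws, thetas, M):
    """Product-of-exponentials forward kinematics: T = exp(S1*t1) ... exp(Sn*tn) @ M."""
    T = np.eye(4)
    for S, th in zip(screws, thetas):
        T = T @ exp_twist(np.asarray(S, dtype=float), th)
    return T @ M

For a spherical RCM mechanism, the equivalent output motion is rotation about axes that intersect at the remote center, so the tool frame pivots about that fixed point by construction.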
|
|
10:00-10:05, Paper WeBT3.2 | |
Highly Dynamic Physical Interaction for Robotics: Design and Control of an Active Remote Center of Compliance |
|
Friedrich, Christian | Karlsruhe University of Applied Sciences |
Frank, Patrick | Hochschule Karlsruhe - University of Applied Sciences (HKA) |
Santin, Marco | Aalen University |
Haag, Carl Matthias | Hochschule Aalen |
Keywords: Mechanism Design, Force Control, Industrial Robots
Abstract: Robot interaction control is often limited to low dynamics or low flexibility, depending on whether an active or passive approach is chosen. In this work, we introduce a hybrid control scheme that combines the advantages of active and passive interaction control. To accomplish this, we propose the design of a novel Active Remote Center of Compliance (ARCC), which is based on a passive and an active element that can be used to directly control the interaction forces. We introduce surrogate models for a dynamic comparison against purely robot-based interaction schemes. In a comparative validation, ARCC drastically improves the interaction dynamics, leading to an increase in the motion bandwidth of up to 31 times. We further introduce our control approach as well as its integration in the robot controller. Finally, we analyze ARCC on different industrial benchmarks such as peg-in-hole, top-hat rail assembly, and contour-following problems and compare it against the state of the art to highlight its dynamics and flexibility. The proposed system is especially suited if the application requires a low cycle time combined with sensitive manipulation.
|
|
10:05-10:10, Paper WeBT3.3 | |
Pinto: A Latched Spring Actuated Robot for Jumping and Perching |
|
Xu, Christopher | University of Illinois Urbana-Champaign |
Yan, Huihan | University of Illinois at Urbana-Champaign |
Yim, Justin K. | University of Illinois Urbana-Champaign |
Keywords: Mechanism Design, Legged Robots, Compliant Joints and Mechanisms
Abstract: Arboreal environments challenge current robots but are deftly traversed by many familiar animals such as squirrels. We present a small, 450 g robot "Pinto" developed for tree-jumping, a behavior seen in squirrels but rarely in legged robots: jumping from the ground onto a vertical tree trunk. We develop a powerful and lightweight latched series-elastic actuator using a twisted string and carbon fiber springs. We consider the effects of scaling down conventional quadrupeds and experimentally show how storing energy in a parallel-elastic fashion using a latch increases jump energy compared to series-elastic or springless strategies. By switching between series and parallel-elastic modes with our latched 5-bar leg mechanism, Pinto executes energetic jumps as well as maintains continuous control during shorter bounding motions. We also develop sprung 2-DoF arms equipped with spined grippers to grasp tree bark for high-speed perching following a jump.
|
|
10:10-10:15, Paper WeBT3.4 | |
D3-ARM: High-Dynamic, Dexterous and Fully Decoupled Cable-Driven Robotic Arm |
|
Luo, Hong | Tsinghua University |
Xu, Jianle | Tsinghua University |
Li, Shoujie | Tsinghua Shenzhen International Graduate School |
Liang, Huayue | Tsinghua University |
Chen, Yanbo | Tsinghua University |
Xia, Chongkun | Sun Yat-Sen University |
Wang, Xueqian | Center for Artificial Intelligence and Robotics, Graduate School |
Keywords: Tendon/Wire Mechanism, Mechanism Design, Robot Safety
Abstract: Cable transmission enables the motors of a robotic arm to operate lightweight and low-inertia joints remotely in various environments, but it also creates issues with motion coupling and cable routing that can reduce the arm's control precision and performance. In this paper, we present a novel low-friction motion decoupling mechanism to align the cables and efficiently transmit the motor's power. By arranging these mechanisms at the joints, we fabricate a fully decoupled and lightweight cable-driven robotic arm called D3-Arm, with all the electrical components placed at the base. Its 776 mm-long moving part boasts six degrees of freedom (DOF) and weighs only 1.6 kg. To address the issue of cable slack, a cable-pretension mechanism is integrated to enhance the stability of long-distance cable transmission. Through a series of comprehensive tests, D3-Arm demonstrated a 1.29 mm average positioning error and a 2 kg payload capacity, proving the practicality of the proposed decoupling mechanisms in cable-driven robotic arms.
|
|
10:15-10:20, Paper WeBT3.5 | |
Design of an Articulated Modular Caterpillar Using Spherical Linkages |
|
O'Connor, Sam | University of Notre Dame |
Plecnik, Mark | University of Notre Dame |
Keywords: Mechanism Design, Kinematics, Multi-Robot Systems
Abstract: Articulation between body segments of small insects and animals is a three degree-of-freedom (DOF) motion. Implementing this kind of motion in a compact robot is usually not tractable due to limitations in small actuator technologies. In this work, we concede full 3-DOF control and instead select a one degree-of-freedom curve in SO(3) to articulate segments of a caterpillar robot. The curve is approximated with a spherical four-bar, which is synthesized through optimal rigid body guidance. We specify the desired SO(3) motion using discrete task positions, then solve for candidate mechanisms by computing all roots of the stationary conditions using numerical homotopy continuation. A caterpillar robot prototype demonstrates the utility of this approach. This synthesis procedure is also used to design prolegs for the caterpillar robot. Each segment contains two DC motors and a shape memory alloy, which is used for latching and unlatching between segments. The caterpillar robot is capable of walking, steering, object manipulation, body articulation, and climbing.
|
|
10:20-10:25, Paper WeBT3.6 | |
Informed Repurposing of Quadruped Legs for New Tasks |
|
Chen, Fuchen | Arizona State University |
Aukes, Daniel | Arizona State University |
Keywords: Mechanism Design, Legged Robots, Compliant Joints and Mechanisms
Abstract: Redesigning and remanufacturing robots are infeasible for resource-constrained environments like space or undersea. This work thus studies how to evaluate and repurpose existing, complementary, quadruped legs for new tasks. We implement this approach on 15 robot designs generated from combining six pre-selected leg designs. The performance maps for force-based locomotion tasks like pulling, pushing, and carrying objects are constructed via a learned policy that works across all designs and adapts to the limits of each. Performance predictions agree well with real-world validation results. The robot can locomote at 0.5 body lengths per second while exerting a force that is almost 60% of its weight.
|
|
10:25-10:30, Paper WeBT3.7 | |
Generative-AI-Driven Jumping Robot Design Using Diffusion Models |
|
Kim, Byungchul | MIT |
Wang, Tsun-Hsuan | Massachusetts Institute of Technology |
Rus, Daniela | MIT |
Keywords: Mechanism Design, Methods and Tools for Robot System Design, Deep Learning Methods
Abstract: Recent advances in foundation models are significantly expanding the capabilities of AI models. As part of this progress, this paper introduces a robot design framework that uses a diffusion model approach for generating 3D mesh structures. Specifically, we focus on generating directly fabricable robot structures that require no post-processing guided by human-imposed design constraints. Our approach can find the optimal design of the robot by optimizing or composing embedding vectors of the model. The efficacy of the framework is validated through an application to design, fabricate, and evaluate a jumping robot. Our solution is an optimized jumping robot with a 41% increase in jump height compared to the state-of-the-art design. Additionally, when the robot is augmented with an optimized foot, it can land reliably with a success ratio of 88% in contrast to the 4% success ratio of the base robot.
|
|
WeBT4 |
304 |
Sensor Fusion 1 |
Regular Session |
Co-Chair: Forbes, James Richard | McGill University |
|
09:55-10:00, Paper WeBT4.1 | |
A Hessian for Gaussian Mixture Likelihoods in Nonlinear Least Squares |
|
Korotkine, Vassili | McGill University |
Cohen, Mitchell | McGill University |
Forbes, James Richard | McGill University |
Keywords: Sensor Fusion, Probabilistic Inference, SLAM
Abstract: This paper proposes a novel Hessian approximation for Maximum a Posteriori estimation problems in robotics involving Gaussian mixture likelihoods. Previous approaches manipulate the Gaussian mixture likelihood into a form that allows the problem to be represented as a nonlinear least squares (NLS) problem. The resulting Hessian approximation used within NLS solvers from these approaches neglects certain nonlinearities. The proposed Hessian approximation is derived by setting the Hessians of the Gaussian mixture component errors to zero, which is the same starting point as for the Gauss-Newton Hessian approximation for NLS, and using the chain rule to account for additional nonlinearities. The proposed Hessian approximation results in improved convergence speed and uncertainty characterization for simulated experiments, and similar performance to the state of the art on real-world experiments. A method to maintain compatibility with existing solvers, such as ceres, is also presented. Accompanying software and supplementary material can be found at https://github.com/decargroup/hessian_sum_mixtures.
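For orientation, here is a sketch of the quantities involved, in simplified notation with whitened component errors e_k(x) and mixture weights w_k; this is a derivation under the abstract's stated assumption (component-error Hessians set to zero), not necessarily the paper's final expression:

\[
e(\mathbf{x}) = -\log \sum_{k=1}^{K} w_k \exp\!\Big(-\tfrac{1}{2}\,\|\mathbf{e}_k(\mathbf{x})\|^2\Big),
\qquad
\alpha_k = \frac{w_k \exp\big(-\tfrac{1}{2}\|\mathbf{e}_k\|^2\big)}{\sum_j w_j \exp\big(-\tfrac{1}{2}\|\mathbf{e}_j\|^2\big)},
\]
\[
\nabla e = \sum_k \alpha_k\, \mathbf{J}_k^{\top}\mathbf{e}_k,
\qquad
\nabla^2 e \approx \sum_k \alpha_k\, \mathbf{J}_k^{\top}\mathbf{J}_k
- \sum_k \alpha_k\, \big(\mathbf{J}_k^{\top}\mathbf{e}_k\big)\big(\mathbf{J}_k^{\top}\mathbf{e}_k\big)^{\top}
+ (\nabla e)(\nabla e)^{\top},
\qquad
\mathbf{J}_k = \frac{\partial \mathbf{e}_k}{\partial \mathbf{x}}.
\]

With a single component (K = 1) this collapses to the familiar Gauss-Newton approximation J^T J, which is the expected consistency check; the additional chain-rule terms capture the nonlinearity of the log-sum-exp that the standard NLS reformulations leave out.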
|
|
10:00-10:05, Paper WeBT4.2 | |
Unveiling the Depths: A Multi-Modal Fusion Framework for Challenging Scenarios |
|
Xu, Jialei | Harbin Institute of Technology |
Li, Rui | Northwestern Polytechnical University |
Cheng, Kai | USTC |
Jiang, Junjun | Harbin Institute of Technology |
Liu, Xianming | Harbin Institute of Technology |
Keywords: Deep Learning for Visual Perception, Sensor Fusion, RGB-D Perception
Abstract: Monocular depth estimation from RGB images plays a pivotal role in 3D vision. However, its accuracy can deteriorate in challenging environments such as nighttime or adverse weather conditions. While long-wave infrared cameras offer stable imaging in such challenging conditions, they are inherently low-resolution, lacking the rich texture and semantics delivered by the RGB image. Current methods focus solely on a single modality due to the difficulty of identifying and integrating faithful depth cues from both sources. To address these issues, this paper presents a novel approach that identifies and integrates dominant cross-modality depth features with a learning-based framework. Concretely, we independently compute the coarse depth maps with separate networks by fully utilizing the individual depth cues from each modality. Because advantageous depth cues are spread across both modalities, we propose a novel confidence loss steering a confidence predictor network to yield a confidence map specifying latent potential depth areas. With the resulting confidence map, we propose a multi-modal fusion network that fuses the final depth in an end-to-end manner. Harnessing the proposed pipeline, our method demonstrates robust depth estimation in a variety of difficult scenarios. Experimental results on the challenging MS^2 and ViViD++ datasets demonstrate the effectiveness and robustness of our method.
|
|
10:05-10:10, Paper WeBT4.3 | |
Explore the LiDAR-Camera Dynamic Adjustment Fusion for 3D Object Detection |
|
Yang, Yiran | University of Chinese Academy of Sciences |
Gao, Xu | Baidu |
Wang, Tong | Baidu |
Hao, Xin | Baidu |
Shi, Yifeng | BAIDU.INC |
Tan, Xiao | Baidu |
Ye, Xiaoqing | Baidu Inc |
Keywords: Computer Vision for Automation, Sensor Fusion
Abstract: Camera and LiDAR serve as informative sensors for accurate and robust autonomous driving systems. However, these sensors often exhibit heterogeneous natures, resulting in distributional modality gaps that present significant challenges for fusion. To address this, a robust fusion technique is crucial, particularly for enhancing 3D object detection. In this paper, we introduce a dynamic adjustment technology aimed at aligning modal distributions and learning effective modality representations to enhance the fusion process. Specifically, we propose a triphase domain aligning module. This module adjusts the feature distributions from both the camera and LiDAR, bringing them closer to the ground truth domain and minimizing differences. Additionally, we explore improved representation acquisition methods for dynamic fusion, which include modal interaction and specialty enhancement. Finally, an adaptive learning technique merges the semantic and geometric information for dynamic instance optimization. Extensive experiments on the nuScenes dataset demonstrate competitive performance with state-of-the-art approaches. Our code will be released in the future.
|
|
10:10-10:15, Paper WeBT4.4 | |
Bridging Spectral-Wise and Multi-Spectral Depth Estimation Via Geometry-Guided Contrastive Learning |
|
Shin, Ukcheol | CMU(Carnegie Mellon University) |
Lee, Kyunghyun | KAIST |
Oh, Jean | Carnegie Mellon University |
Keywords: Computer Vision for Transportation, Sensor Fusion, Deep Learning for Visual Perception
Abstract: Deploying depth estimation networks in the real world requires high-level robustness against various adverse conditions to ensure safe and reliable autonomy. For this purpose, many autonomous vehicles employ multi-modal sensor systems, including an RGB camera, NIR camera, thermal camera, LiDAR, or Radar. They mainly adopt two strategies to use multiple sensors: modality-wise and multi-modal fused inference. The former method is flexible but memory-inefficient, unreliable, and vulnerable. Multi-modal fusion can provide high-level reliability, yet it needs a specialized architecture. In this paper, we propose an effective solution, named align-and-fuse strategy, for the depth estimation from multi-spectral images. In the align stage, we align embedding spaces between multiple spectrum bands to learn shareable representation across multi-spectral images by minimizing contrastive loss of global and spatially aligned local features with geometry cues. After that, in the fuse stage, we train an attachable feature fusion module that can selectively aggregate the multi-spectral features for reliable and robust prediction results. Based on the proposed method, a single-depth network can achieve both spectral-invariant and multi-spectral fused depth estimation while preserving reliability, memory efficiency, and flexibility.
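The "align" stage is contrastive; below is a generic symmetric InfoNCE sketch between embeddings from two spectral bands (the paper's loss additionally uses geometry-guided, spatially aligned local features, which this toy version omits):

import torch
import torch.nn.functional as F

def infonce_align(feat_a, feat_b, temperature=0.07):
    """Symmetric InfoNCE between two batches of corresponding embeddings.
    feat_a, feat_b: (B, C) features from, e.g., RGB and thermal views of the same scenes."""
    z1 = F.normalize(feat_a, dim=1)
    z2 = F.normalize(feat_b, dim=1)
    logits = z1 @ z2.t() / temperature                      # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)    # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

Once the embedding spaces agree, a single depth head can consume either spectrum, which is what allows the later "fuse" stage to be an attachable module rather than a separate network.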
|
|
10:15-10:20, Paper WeBT4.5 | |
VAIR: Visuo-Acoustic Implicit Representations for Low-Cost, Multi-Modal Transparent Surface Reconstruction in Indoor Scenes |
|
Venkatramanan Sethuraman, Advaith | University of Michigan |
Bagoren, Onur | University of Michigan |
Seetharaman, Harikrishnan | University of Michigan - Ann Arbor |
Richardson, Dalton | University of Michigan |
Taylor, Joseph | University of Michigan, Ann Arbor |
Skinner, Katherine | University of Michigan |
Keywords: RGB-D Perception, Deep Learning for Visual Perception, Sensor Fusion
Abstract: Mobile robots operating indoors must be prepared to navigate challenging scenes that contain transparent surfaces. This paper proposes a novel method for the fusion of acoustic and visual sensing modalities through implicit neural representations to enable dense reconstruction of transparent surfaces in indoor scenes. We propose a novel model that leverages generative latent optimization to learn an implicit representation of indoor scenes consisting of transparent surfaces. We demonstrate that we can query the implicit representation to enable volumetric rendering in image space or 3D geometry reconstruction (point clouds or mesh) with transparent surface prediction. We evaluate our method’s effectiveness qualitatively and quantitatively on a new dataset collected using a custom, low-cost sensing platform featuring RGB-D cameras and ultrasonic sensors. Our method exhibits significant improvement over the state of the art for transparent surface reconstruction.
|
|
10:20-10:25, Paper WeBT4.6 | |
CDMFusion: RGB-T Image Fusion Based on Conditional Diffusion Models Via Few Denoising Steps in Open Environments |
|
Yang, Luojie | Beijing Institute of Technology |
Yu, Meng | Beijing Institute of Technology |
Fang, Lijin | Beijing Institute of Technology |
Yang, Yi | Beijing Institute of Technology |
Yue, Yufeng | Beijing Institute of Technology |
Keywords: Sensor Fusion, Deep Learning for Visual Perception
Abstract: Multi-modal fusion can improve perceptual robustness and accuracy by fully utilizing multi-source sensor data. Current RGB-T fusion methods still falter with adverse illumination and weather. Recent advances in generative methods have shown the ability to enhance and restore visible images in adverse conditions. However, the fusion of RGB-T based on generative methods has not been studied in depth, due to limited attention given to the degradation of multi-modal features under challenging circumstances. Motivated by this observation, we propose CDMFusion, a three-branch conditional diffusion model that achieves fusion with dynamically enhancing multi-modal features and suppressing high-frequency interference. Specifically, we achieve feature-preserving fusion through three branches and establish a dynamic gating prediction module to adjust the enhancement of multi-modal features adaptively. In addition, considering the high time cost of existing diffusion models for generating fused images, we propose a skip patrol mechanism to achieve accelerated high-quality generation with no need for additional training. Experiments demonstrate our method achieves excellent performance in multiple datasets. The code and datasets are available at https://github.com/yangluojie/CDMFusion.
|
|
10:25-10:30, Paper WeBT4.7 | |
UniBEVFusion: Unified Radar-Vision BEVFusion for 3D Object Detection |
|
Zhao, Haocheng | Xi'an Jiaotong-Liverpool University |
Guan, Runwei | University of Liverpool |
Wu, Taoyu | Xi'an Jiaotong-Liverpool University |
Man, Ka Lok | Xi'an Jiaotong-Liverpool University |
Yu, Limin | Xi'an Jiaotong-Liverpool University |
Yue, Yutao | Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Sensor Fusion, Object Detection, Segmentation and Categorization, AI-Based Methods
Abstract: 4D millimeter-wave (MMW) radar, which provides both height information and denser point cloud data than 3D MMW radar, has become increasingly popular in 3D object detection. In recent years, radar-vision fusion models have demonstrated performance close to that of LiDAR-based models, offering advantages in terms of lower hardware costs and better resilience in extreme conditions. However, many radar-vision fusion models treat radar as a sparse LiDAR, underutilizing radar-specific information. Additionally, these multi-modal networks are often sensitive to the failure of a single modality, particularly vision. To address these challenges, we propose the Radar Depth Lift-Splat-Shoot (RDL) module, which integrates radar-specific data into the depth prediction process, enhancing the quality of visual Bird’s-Eye View (BEV) features. We further introduce a Unified Feature Fusion (UFF) approach that extracts BEV features across different modalities using a shared module. To assess the robustness of multi-modal models, we develop a novel Failure Test (FT) ablation experiment, which simulates vision modality failure by injecting Gaussian noise. We conduct extensive experiments on the View-of-Delft (VoD) and TJ4D datasets. The results demonstrate that our proposed Unified BEVFusion (UniBEVFusion) network significantly outperforms state-of-the-art models on the TJ4D dataset, with improvements of 3.96% in 3D and 4.17% in BEV object detection accuracy.
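A minimal version of the Failure Test idea, corrupting the camera input with Gaussian noise; the tensor layout, severity, and failure probability below are assumptions, not the paper's exact protocol:

import torch

def failure_test_images(images, severity=1.0, p_fail=1.0):
    """Simulate vision-modality failure by injecting Gaussian noise into the camera input.
    images: (B, N_cams, 3, H, W) tensor normalized to [0, 1]."""
    fail = (torch.rand(images.shape[0]) < p_fail).float().view(-1, 1, 1, 1, 1)
    noisy = images + severity * torch.randn_like(images)
    return (fail * noisy + (1.0 - fail) * images).clamp(0.0, 1.0)

Running the detector with such corrupted images while leaving the radar stream intact isolates how much the fused model silently depends on vision.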
|
|
WeBT5 |
305 |
Aerial Robots: Mechanics and Control 1 |
Regular Session |
Chair: Yamamoto, Ko | University of Tokyo |
Co-Chair: Saldaña, David | Lehigh University |
|
09:55-10:00, Paper WeBT5.1 | |
A Generalized Thrust Estimation and Control Approach for Multirotors Micro Aerial Vehicles |
|
Santos, Davi Henrique dos | Universidade Federal Da Paraíba |
Saska, Martin | Czech Technical University in Prague |
Nascimento, Tiago | Universidade Federal Da Paraiba |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications, Motion Control
Abstract: This paper addresses the problem of thrust estimation and control for the rotors of small-sized multirotor Uncrewed Aerial Vehicles (UAVs). Accurate control of the thrust generated by each rotor during flight is one of the main challenges for robust control of quadrotors. The most common approach is to approximate the mapping of rotor speed to thrust with a simple quadratic model. This model is known to fail under non-hovering flight conditions, introducing errors into the control pipeline. One of the approaches to modeling the aerodynamics around the propellers is the Blade Element Momentum Theory (BEMT). Here, we propose a novel BEMT-based closed-loop thrust estimator and control to eliminate the laborious calibration step of finding several aerodynamic coefficients. We aim to reuse known values as a baseline and fit the thrust estimate to values closest to the real ones with a simple test bench experiment, resulting in a single scaling value. A feedforward PID thrust control was implemented for each rotor, and the methods were validated by outdoor experiments with two multirotor UAV platforms: 250 mm and 500 mm. A statistical analysis of the results showed that the thrust estimation and control provided better robustness under aerodynamically varying flight conditions compared to the quadratic model.
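For context, the "simple quadratic model" the abstract refers to is the static map from rotor speed to thrust; the coefficient below is purely illustrative:

def quadratic_thrust(omega, k_f=1.3e-6):
    """Common baseline rotor model: thrust T = k_f * omega^2 (omega in rad/s, T in newtons).
    k_f is an illustrative value; this static map is what breaks down away from hover,
    motivating the BEMT-based closed-loop estimator described above."""
    return k_f * omega ** 2

A BEMT-based estimator instead accounts for blade geometry and inflow, so forward flight and climb or descent change the predicted thrust rather than silently invalidating the model.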
|
|
10:00-10:05, Paper WeBT5.2 | |
Trajectory Planning and Control for Differentially Flat Fixed-Wing Aerial Systems |
|
Morando, Luca | New York University |
Salunkhe, Sanket Ankush | Colorado School of Mines |
Bobbili, Nishanth | New York University |
Mao, Jeffrey | New York University |
Masci, Luca | New York University |
Hung, Nguyen | Instituto Superior Técnico |
De Souza Jr., Cristino | Technology Innovation Institute |
Loianno, Giuseppe | New York University |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications
Abstract: Efficient real-time trajectory planning and control for fixed-wing unmanned aerial vehicles is challenging due to their non-holonomic nature, complex dynamics, and the additional uncertainties introduced by unknown aerodynamic effects. In this paper, we present a fast and efficient real-time trajectory planning and control approach for fixed-wing unmanned aerial vehicles, leveraging the differential flatness property of fixed-wing aircraft in coordinated flight conditions to generate dynamically feasible trajectories. The approach provides the ability to continuously replan trajectories, which we show is useful to dynamically account for the curvature constraint as the aircraft advances along its path. Extensive simulations and real-world experiments validate our approach, showcasing its effectiveness in generating trajectories across various flight conditions, including wind disturbances.
|
|
10:05-10:10, Paper WeBT5.3 | |
Safe Quadrotor Navigation Using Composite Control Barrier Functions |
|
Harms, Marvin Chayton | NTNU |
Jacquet, Martin | NTNU |
Alexis, Kostas | NTNU - Norwegian University of Science and Technology |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Perception and Autonomy
Abstract: This paper introduces a safety filter to ensure collision avoidance for multirotor aerial robots. The proposed formalism leverages a single Composite Control Barrier Function from all position constraints acting on a third-order nonlinear representation of the robot's dynamics. We analyze the recursive feasibility of the safety filter under the composite constraint and demonstrate that the infeasible set is negligible. The proposed method allows computational scalability against thousands of constraints and, thus, complex scenes with numerous obstacles. We experimentally demonstrate its ability to guarantee the safety of a quadrotor with an onboard LiDAR, operating in both indoor and outdoor cluttered environments against both naive and adversarial nominal policies.
|
|
10:10-10:15, Paper WeBT5.4 | |
The Spinning Blimp: Design and Control of a Novel Minimalist Aerial Vehicle Leveraging Rotational Dynamics and Locomotion |
|
Santens, Leonardo | Lehigh University |
S. D'Antonio, Diego | Lehigh University |
Hou, Shuhang | Lehigh University |
Saldaña, David | Lehigh University |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications
Abstract: This paper presents the Spinning Blimp, a novel lighter-than-air (LTA) aerial vehicle designed for low-energy stable flight. Using an oblate spheroid helium balloon for buoyancy, the vehicle achieves minimal energy consumption while maintaining prolonged airborne states. The unique and low-cost design employs a passively arranged wing coupled with a propeller to induce a spinning behavior, providing inherent pendulum-like stabilization. We propose a control strategy that takes advantage of the continuous revolving nature of the spinning blimp to control translational motion. The cost-effectiveness of the vehicle makes it highly suitable for a variety of applications, such as patrolling, localization, air and turbulence monitoring, and domestic surveillance. Experimental evaluations affirm the design's efficacy and underscore its potential as a versatile and economically viable solution for aerial applications.
|
|
10:15-10:20, Paper WeBT5.5 | |
One Net to Rule Them All: Domain Randomization in Quadcopter Racing across Different Platforms |
|
Ferede, Robin | TU Delft |
Blaha, Till Martin | Delft University of Technology |
Lucassen, Erin | Delft University of Technology |
De Wagter, Christophe | Delft University of Technology |
de Croon, Guido | TU Delft |
Keywords: Aerial Systems: Mechanics and Control, Reinforcement Learning, Robust/Adaptive Control
Abstract: In high-speed quadcopter racing, finding a single controller that works well across different platforms remains challenging. This work presents the first neural network controller for drone racing that generalizes across physically distinct quadcopters. We demonstrate that a single network, trained with domain randomization, can robustly control various types of quadcopters. The network relies solely on the current state to directly compute motor commands. The effectiveness of this generalized controller is validated through real-world tests on two substantially different crafts (3-inch and 5-inch race quadcopters). We further compare the performance of this generalized controller with controllers specifically trained for the 3-inch and 5-inch drones, using their identified model parameters with varying levels of domain randomization (0%, 10%, 20%, 30%). While the generalized controller shows slightly slower speeds compared to the fine-tuned models, it excels in adaptability across different platforms. Our results show that sim-to-real transfer fails without randomization, while increasing randomization improves robustness but reduces speed. Despite this trade-off, our findings highlight the potential of domain randomization for generalizing controllers, paving the way for universal AI controllers that can adapt to any platform.
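Domain randomization here amounts to drawing a fresh physical model for each training episode; the parameter ranges below are illustrative assumptions spanning roughly the 3-inch to 5-inch class, not the paper's values:

import numpy as np

def sample_quadcopter_params(rng):
    """Sample one randomized quadcopter model for a training rollout (illustrative ranges)."""
    return {
        "mass_kg":       rng.uniform(0.10, 0.80),
        "arm_length_m":  rng.uniform(0.04, 0.12),
        "thrust_coeff":  rng.uniform(0.5e-6, 3.0e-6),
        "drag_coeff":    rng.uniform(0.5e-8, 3.0e-8),
        "motor_tau_s":   rng.uniform(0.01, 0.06),   # first-order motor time constant
        "voltage_scale": rng.uniform(0.85, 1.00),   # battery sag on available thrust
    }

rng = np.random.default_rng(42)
episode_models = [sample_quadcopter_params(rng) for _ in range(1000)]   # one model per rollout

The width of these ranges is the knob the 0-30% study sweeps: wider ranges force the policy to rely less on any single platform's dynamics, improving transfer at the cost of peak speed.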
|
|
10:20-10:25, Paper WeBT5.6 | |
Modeling and Control of Aerial Robot SERPENT: A Soft Structure Incorporated Multirotor Aerial Robot Capable of In-Flight Flexible Deformation |
|
Itahara, Shotaro | The University of Tokyo |
Nishio, Takuzumi | The University of Tokyo |
Ishigaki, Taiki | The University of Tokyo |
Sugihara, Junichiro | The University of Tokyo |
Zhao, Moju | The University of Tokyo |
Yamamoto, Ko | University of Tokyo |
Keywords: Aerial Systems: Mechanics and Control
Abstract: This paper introduces a novel method for controlling multirotor aerial robots connected by passive flexible elements. Despite the growing popularity of multirotor aerial robots, their real-world applications remain limited due to difficulties adapting to complex environments. Soft robotics, due to their inherent flexibility, offer a potential solution, though research on integrating flexible elements into aerial robots is still in the early stages. In this study, we propose control methods for a system where multiple aerial robots are interconnected with passive flexible elements. These robotic systems enhance adaptability, enabling tasks like object manipulation. We model the flexible parts using the piecewise constant strain (PCS) model, which allows for model-based closed-loop control and stabilizes various configurations of the system. Through simulations and experiments, we validated that the proposed method achieves both stable flight and flexible deformation. Notably, we succeeded in maintaining stable flight, which was not possible with traditional methods, and demonstrated both positional controllability and the ability of the flexible parts to bend dynamically during flight.
|
|
10:25-10:30, Paper WeBT5.7 | |
Embodying Compliant Touch on Drones for Aerial Tactile Navigation |
|
Bredenbeck, Anton | TU Delft |
Della Santina, Cosimo | TU Delft |
Hamaza, Salua | TU Delft |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Perception and Autonomy, Compliant Joints and Mechanisms
Abstract: Aerial robots are a well-established solution for environmental surveying, exploration, and inspection, thanks to their superior maneuverability and agility. Nowadays, the algorithms that provide these capabilities rely on GNSS and Vision, which are obstructed in some environments of interest, e.g., indoors and underground or in smoke and dust. In similar conditions, animals rely on the sense of touch and compliant responses to interactions embodied in the body morphology. This way, they can navigate safely using tactile cues by feeling the environment surrounding them. In this work, we take inspiration from the natural example and propose an approach that allows a quadrotor to navigate using tactile information from the environment. We propose to endow a conventional quadrotor with a novel robotic finger that embodies compliance and sensing capabilities. We complete this design with a navigation approach that generates new waypoints based on the robotic finger's contact information to follow the unknown environment. The overall system's evaluation shows successful, repeatable results in 36 flight experiments with various relative angles between the drone and a planar surface.
|
|
WeBT6 |
307 |
Vision-Based Navigation 2 |
Regular Session |
|
09:55-10:00, Paper WeBT6.1 | |
Adaptive Learning for Hybrid Visual Odometry |
|
Liu, Ziming | INRIA |
Malis, Ezio | Inria |
Martinet, Philippe | INRIA |
Keywords: Deep Learning for Visual Perception, Visual Learning, Computer Vision for Transportation
Abstract: Hybrid visual odometry methods achieve state-of-the-art performance by fusing both data-based deep learning networks and rule-based localization approaches. However, these methods also suffer from the deep learning domain gap problem, which leads to an accuracy drop of the hybrid visual odometry approach when a new type of data is considered. This paper is the first to explore a practical solution to this problem. Indeed, the deep learning network in hybrid visual odometry predicts the stereo disparity within a fixed search space. However, the disparity distribution is unbalanced in stereo images acquired in different environments. We propose an adaptive network structure to overcome this problem. Secondly, the rule-based localization module achieves robust performance by optimizing the camera pose online on test data, which motivates us to introduce a test-time training method for improving the data-based part of the hybrid visual odometry.
|
|
10:00-10:05, Paper WeBT6.2 | |
SOLVR: Submap Oriented LiDAR-Visual Re-Localisation |
|
Knights, Joshua Barton | Queensland University of Technology |
Barbas Laina, Sebastián | TU Munich |
Moghadam, Peyman | CSIRO |
Leutenegger, Stefan | Technical University of Munich |
Keywords: Deep Learning Methods, Deep Learning for Visual Perception, Recognition
Abstract: This paper proposes SOLVR, a unified pipeline for learning based LiDAR-Visual re-localisation which performs place recognition and 6-DoF registration across sensor modalities. We propose a strategy to align the input sensor modalities by leveraging stereo image streams to produce metric depth predictions with pose information, followed by fusing multiple scene views from a local window using a probabilistic occupancy framework to expand the limited field-of-view of the camera. Additionally, SOLVR adopts a flexible definition of what constitutes positive examples for different training losses, allowing us to simultaneously optimise place recognition and registration performance. Furthermore, we replace RANSAC with a registration function that weights a simple least-squares fitting with the estimated inlier likelihood of sparse keypoint correspondences, improving performance in scenarios with a low inlier ratio between the query and retrieved place.
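The registration function that replaces RANSAC can be read as a weighted least-squares rigid fit; a standard weighted Kabsch-style sketch is shown below (the per-correspondence weights are the estimated inlier likelihoods, assumed given here):

import numpy as np

def weighted_rigid_fit(src, tgt, w):
    """Find R, t minimizing sum_i w_i ||R @ src_i + t - tgt_i||^2 (weighted Kabsch/Umeyama).
    src, tgt: (N, 3) corresponding keypoints; w: (N,) non-negative inlier likelihoods."""
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_t = (w[:, None] * tgt).sum(axis=0)
    H = (w[:, None] * (src - mu_s)).T @ (tgt - mu_t)       # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])     # guard against a reflection solution
    R = Vt.T @ D @ U.T
    t = mu_t - R @ mu_s
    return R, t

Because every correspondence contributes in proportion to its predicted inlier likelihood, low-inlier-ratio cases degrade gracefully instead of hinging on RANSAC finding a clean minimal sample.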
|
|
10:05-10:10, Paper WeBT6.3 | |
SSF: Sparse Long-Range Scene Flow for Autonomous Driving |
|
Khoche, Ajinkya | KTH Royal Institute of Technology Stockholm, SCANIA CV AB |
Zhang, Qingwen | KTH Royal Institute of Technology |
Pereira Sanchez, Laura | Stanford University |
Asefaw, Aron | Royal Institute of Technology |
Sharif Mansouri, Sina | Scania |
Jensfelt, Patric | KTH - Royal Institute of Technology |
Keywords: Deep Learning Methods, Computer Vision for Transportation, Object Detection, Segmentation and Categorization
Abstract: Scene flow enables an understanding of the motion characteristics of the environment in the 3D world. It gains particular significance in the long-range, where object-based perception methods might fail due to sparse observations far away. Although significant advancements have been made in scene flow pipelines to handle large-scale point clouds, a gap remains in scalability with respect to long-range. We attribute this limitation to the common design choice of using dense feature grids, which scale quadratically with range. In this paper, we propose Sparse Scene Flow (SSF), a general pipeline for long-range scene flow, adopting a sparse convolution based backbone for feature extraction. This approach introduces a new challenge: a mismatch in size and ordering of sparse feature maps between time-sequential point scans. To address this, we propose a sparse feature fusion scheme, that augments the feature maps with virtual voxels at missing locations. Additionally, we propose a range-wise metric that implicitly gives greater importance to faraway points. Our method, SSF, achieves state-of-the-art results on the Argoverse2 dataset, demonstrating strong performance in long-range scene flow estimation. Our source code is open-sourced at https://github.com/KTH-RPL/SSF.
|
|
10:10-10:15, Paper WeBT6.4 | |
BoxMap: Efficient Structural Mapping and Navigation |
|
Wang, Zili | Boston University |
Allum, Christopher | Boston University |
Andersson, Sean | Boston University |
Tron, Roberto | Boston University |
Keywords: Deep Learning Methods, Autonomous Agents, Task and Motion Planning
Abstract: While humans can successfully navigate using abstractions, ignoring details that are irrelevant to the task at hand, most of the existing approaches in robotics require detailed environment representations which consume a significant amount of sensing, computing, and storage; these issues become particularly important in resource-constrained settings with limited power budgets. Deep learning methods can learn from prior experience to abstract knowledge from novel environments, and use it to more efficiently execute tasks such as frontier exploration, object search, or scene understanding. We propose BoxMap, a Detection-Transformer-based architecture that takes advantage of the structure of the sensed partial environment to update a topological graph of the environment as a set of semantic entities (rooms and doors) and their relations (connectivity). The predictions from low-level measurements can be leveraged to achieve high-level goals with lower computational costs than methods based on detailed representations. As an example application, we consider a robot equipped with a 2-D laser scanner tasked with exploring a residential building. Our BoxMap representation scales quadratically with the number of rooms (with a small constant), resulting in significant savings over a full geometric map. Moreover, our high-level topological representation results in 30.9% shorter trajectories in the exploration task with respect to a standard method. Code is available at: bit.ly/3F6w2Yl.
|
|
10:15-10:20, Paper WeBT6.5 | |
UncAD: Towards Safe End-To-End Autonomous Driving Via Online Map Uncertainty |
|
Yang, Pengxuan | University of Chinese Academy of Sciences (UCAS) |
Zheng, Yupeng | School of Artificial Intelligence, University of Chinese Academy |
Zhang, Qichao | Institute of Automation, Chinese Academy of Sciences |
Zhu, Kefei | UCAS |
Xing, Zebin | UCAS |
Lin, Qiao | EACON Technology Co., Ltd |
Liu, Yun-Fu | Eacon |
Su, Zhiguo | EACON Technology Co., Ltd |
Zhao, Dongbin | Chinese Academy of Sciences |
Keywords: Vision-Based Navigation, Integrated Planning and Learning, Computer Vision for Transportation
Abstract: End-to-end autonomous driving aims to produce planning trajectories from raw sensors directly. Currently, most approaches integrate perception, prediction, and planning modules into a fully differentiable network, promising great scalability. However, these methods typically rely on deterministic modeling of online maps in the perception module for guiding or constraining vehicle planning, which may incorporate erroneous perception information and further compromise planning safety. To address this issue, we delve into the importance of online map uncertainty for enhancing autonomous driving safety and propose a novel paradigm named UncAD. Specifically, UncAD first estimates the uncertainty of the online map in the perception module. It then leverages the uncertainty to guide motion prediction and planning modules to produce multi-modal trajectories. Finally, to achieve safer autonomous driving, UncAD proposes an uncertainty-collision-aware planning selection strategy according to the online map uncertainty to evaluate and select the best trajectory. In this study, we incorporate UncAD into various state-of-the-art (SOTA) end-to-end methods. Experiments on the nuScenes dataset show that integrating UncAD, with only a 1.9% increase in parameters, can reduce collision rates by up to 26% and drivable area conflict rate by up to 42%. Codes, pre-trained models, and demo videos can be accessed at https://github.com/pengxuanyang/UncAD.
|
|
10:20-10:25, Paper WeBT6.6 | |
Multi-Floor Zero-Shot Object Navigation Policy |
|
Zhang, Lingfeng | The Hong Kong University of Science and Technology (Guangzhou) |
Wang, Hao | Hong Kong University of Science and Technology(Guang Zhou) |
Xiao, Erjia | The Hong Kong University of Science and Technology (Guangzhou) |
Zhang, Xinyao | Hong Kong University of Science and Technology (GUANGZHOU) |
Zhang, Qiang | The Hong Kong University of Science and Technology (Guangzhou) |
Jiang, Zixuan | HKUST(GZ) |
Xu, Renjing | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Vision-Based Navigation, Embodied Cognitive Science, Visual Learning
Abstract: Object navigation in multi-floor environments presents a formidable challenge in robotics, requiring sophisticated spatial reasoning and adaptive exploration strategies. Traditional approaches have primarily focused on single-floor scenarios, overlooking the complexities introduced by multi-floor structures. To address these challenges, we first propose a Multi-floor Navigation Policy (MFNP) and implement it in Zero-Shot object navigation tasks. Our framework comprises three key components: (i) Multi-floor Navigation Policy, which enables an agent to explore across multiple floors; (ii) Multi-modal Large Language Models (MLLMs) for reasoning in the navigation process; and (iii) Inter-Floor Navigation, ensuring efficient floor transitions. We evaluate MFNP on the Habitat-Matterport 3D (HM3D) and Matterport 3D (MP3D) datasets, both include multi-floor scenes. Our experiment results demonstrate that MFNP significantly outperforms all the existing methods in Zero-Shot object navigation, achieving higher success rates and improved exploration efficiency. Ablation studies further highlight the effectiveness of each component in addressing the unique challenges of multi-floor navigation. Meanwhile, we conducted real-world experiments to evaluate the feasibility of our policy. Upon deployment of MFNP, the Unitree quadruped robot demonstrated successful multi-floor navigation and found the target object in a completely unseen environment. By introducing MFNP, we offer a new paradigm for tackling complex, multi-floor environments in object navigation tasks, opening avenues for future research in visual-based navigation in realistic, multi-floor settings.
|
|
10:25-10:30, Paper WeBT6.7 | |
Fed-EC: Bandwidth-Efficient Clustering-Based Federated Learning for Autonomous Visual Robot Navigation |
|
Gummadi, Shreya | University of Illinois at Urbana-Champaign |
Valverde Gasparino, Mateus | University of Illinois at Urbana-Champaign |
Vasisht, Deepak | University of Illinois at Urbana Champaign |
Chowdhary, Girish | University of Illinois at Urbana Champaign |
Keywords: Distributed Robot Systems, Vision-Based Navigation, Field Robots
Abstract: Centralized learning requires data to be aggregated at a central server, which poses significant challenges in terms of data privacy and bandwidth consumption. Federated learning presents a compelling alternative, however, vanilla Federated Learning methods deployed in robotics aim to learn a single global model across robots that works ideally for all. But in practice one model may not be well suited for robots deployed in various environments. This paper proposes Federated-EmbedCluster (Fed-EC), a clustering-based federated learning framework that is deployed with vision based autonomous robot navigation in diverse outdoor environments. The framework addresses the key federated learning challenge of deteriorating model performance of a single global model due to the presence of non-IID data across real-world robots. Extensive real-world experiments validate that Fed-EC reduces the communication size by 23x for each robot while matching the performance of centralized learning for goal-oriented navigation and outperforms local learning. Fed-EC can transfer previously learnt models to new robots that join the cluster.
|
|
WeBT7 |
309 |
Perception 1 |
Regular Session |
Co-Chair: Cho, Younggun | Inha University |
|
09:55-10:00, Paper WeBT7.1 | |
Using a Distance Sensor to Detect Deviations in a Planar Surface |
|
Sifferman, Carter | University of Wisconsin-Madison |
Sun, William | UW-Madison |
Gupta, Mohit | University of Wisconsin-Madison |
Gleicher, Michael | University of Wisconsin - Madison |
Keywords: Range Sensing, Deep Learning for Visual Perception, Vision-Based Navigation
Abstract: We investigate methods for determining if a planar surface contains geometric deviations (e.g. protrusions, objects, divots, or cliffs) using only an instantaneous measurement from a miniature optical time-of-flight sensor. The key to our method is to utilize the entirety of information encoded in raw time-of-flight data captured by off-the-shelf distance sensors. We provide an analysis of the problem in which we identify the key ambiguity between geometry and surface photometrics. To overcome this challenging ambiguity, we fit a Gaussian mixture model to a small dataset of planar surface measurements. This model implicitly captures the expected geometry and distribution of photometrics of the planar surface and is used to identify measurements that are likely to contain deviations. We characterize our method on a variety of surfaces and planar deviations across a range of scenarios. We find that our method utilizing raw time-of-flight data outperforms baselines which use only derived distance estimates. We build an example application in which our method enables mobile robot obstacle and cliff avoidance over a wide field-of-view.
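A bare-bones sketch of the fit-then-threshold idea using scikit-learn; the feature layout, file name, component count, and threshold are assumptions, and the actual method operates on the sensor's raw time-of-flight data rather than derived features:

import numpy as np
from sklearn.mixture import GaussianMixture

planar_features = np.load("planar_tof_features.npy")   # (N, D) measurements of known-flat surfaces
gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
gmm.fit(planar_features)
threshold = np.percentile(gmm.score_samples(planar_features), 1)   # 1st-percentile log-likelihood

def contains_deviation(measurement):
    """Flag a single (D,) measurement whose likelihood under the planar model is too low."""
    return gmm.score_samples(measurement[None, :])[0] < threshold

The mixture jointly models geometry and surface photometrics, which is how the method sidesteps the ambiguity between the two that a single derived distance estimate cannot resolve.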
|
|
10:00-10:05, Paper WeBT7.2 | |
Narrowing Your FOV with SOLiD: Spatially Organized and Lightweight Global Descriptor for FOV-Constrained LiDAR Place Recognition |
|
Kim, Hogyun | Inha University |
Choi, Jiwon | Inha University |
Sim, Taehu | Inha University |
Kim, Giseop | DGIST (Daegu Gyeongbuk Institute of Science and Technology) |
Cho, Younggun | Inha University |
Keywords: Localization, SLAM, Range Sensing
Abstract: We often encounter limited FOV situations due to various factors such as sensor fusion or sensor mounting in real-world robot navigation. However, the limited FOV interrupts the generation of descriptors and adversely impacts place recognition. As a result, it is difficult to correct accumulated drift errors in a consistent map using LiDAR-based place recognition with a limited FOV. Thus, in this paper, we propose a robust LiDAR-based place recognition method for handling narrow FOV scenarios. The proposed method establishes spatial organization based on the range-elevation bin and azimuth-elevation bin to represent places. In addition, we achieve a robust place description through reweighting based on vertical direction information. Based on these representations, our method enables addressing rotational changes and determining the initial heading. Additionally, we designed a lightweight and fast approach for the robot's onboard autonomy. For rigorous validation, the proposed method was tested across various LiDAR place recognition scenarios (i.e., single-session, multi-session, and multi-robot scenarios). To the best of our knowledge, we report the first method to cope with the restricted FOV. Our place description and SLAM codes will be released. Also, the supplementary materials of our descriptor are available at https://sites.google.com/view/lidar-solid.
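A simplified sketch of the spatial organization described above, histogramming a scan into range-elevation and azimuth-elevation bins; the bin counts are assumptions, and the vertical-direction reweighting and heading search are omitted:

import numpy as np

def solid_like_descriptor(points, n_range=40, n_azim=60, n_elev=16, max_range=80.0):
    """Build range-elevation and azimuth-elevation occupancy histograms from an (N, 3) scan."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rng = np.linalg.norm(points, axis=1)
    azim = np.arctan2(y, x)                                              # [-pi, pi)
    elev = np.arcsin(np.clip(z / np.maximum(rng, 1e-6), -1.0, 1.0))
    r_idx = np.clip((rng / max_range * n_range).astype(int), 0, n_range - 1)
    a_idx = ((azim + np.pi) / (2.0 * np.pi) * n_azim).astype(int) % n_azim
    e_idx = np.clip(((elev + np.pi / 2.0) / np.pi * n_elev).astype(int), 0, n_elev - 1)
    re_hist = np.zeros((n_range, n_elev))
    ae_hist = np.zeros((n_azim, n_elev))
    np.add.at(re_hist, (r_idx, e_idx), 1)
    np.add.at(ae_hist, (a_idx, e_idx), 1)
    return re_hist, ae_hist

A narrow FOV simply leaves some azimuth columns empty; comparing circularly shifted azimuth-elevation histograms is one way such a descriptor can recover relative heading, which is the rotation handling the abstract alludes to.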
|
|
10:05-10:10, Paper WeBT7.3 | |
Towards Survivability in Complex Motion Scenarios: RGB-Event Object Tracking Via Historical Trajectory Prompting |
|
Xia, Wenhao | Dalian University of Technology |
Zhu, Jiawen | Dalian University of Technology |
He, You | Tsinghua University |
Qi, Jinqing | Dalian University of Technology |
Huang, Zihao | Dalian University of Technology |
Jia, Xu | Dalian University of Technology |
Keywords: Visual Tracking, Deep Learning for Visual Perception, Data Sets for Robotic Vision
Abstract: Event data have recently emerged as a valuable complement to object tracking, providing cues with dense temporal resolution and high dynamic range. Existing RGB-Event trackers often struggle with complex motion trajectories, where RGB features alone cannot provide sufficient discriminative power. To address this issue, we propose an innovative RGB-Event tracking framework, termed EventTPT, which leverages prompts embedded in historical trajectories. Specifically, EventTPT aggregates the trajectories of multiple adjacent frames into a single event image using temporally weighted aggregation and subsequently feeds it into the tracker as a visual prompt for current-frame localization. A cross-modal adaptive fusion module is further designed for cases of photometric inconsistency. In addition, we present a novel and challenging RGB-Event tracking benchmark, EventUAV, containing sequences with high motion complexity.
|
|
10:10-10:15, Paper WeBT7.4 | |
Spatially Constrained and Deeply Learned Bilateral Structural Intensity-Depth Registration Autonomously Navigates a Flexible Endoscope |
|
Fang, Hao | Xiamen University |
Wu, Ming | Xiamen University |
Fan, Wenkang | Xiamen University |
Luo, Guangcheng | Zhongshan Hospital Xiamen University |
Luo, Xiongbiao | Xiamen University |
Keywords: Vision-Based Navigation, Visual Tracking
Abstract: Endoscope tracking is commonly utilized to provide surgeons with in-body camera poses and visual fields during invasive procedures. The fundamental aspect of endoscopic navigation lies in precisely and continuously tracing the position and orientation of the endoscope within monocular endoscopic video sequences in a preoperative data space. This work proposes a new spatially constrained and deeply learned bilateral structural intensity-depth 2D-3D registration framework for autonomously navigating a flexible endoscope. Concretely, a novel bilateral structural intensity-depth similarity function is defined to tackle the deficiency of using image intensity alone, while a cross-domain monocular depth estimation model trained on virtual image data is used to accurately predict real-image dense depth. Additionally, a spatial constraint is introduced to precisely reinitialize an optimizer to reduce accumulative tracking errors. We validate our method on clinical data, with the experimental results showing that our method significantly outperforms current vision-based navigation methods. In particular, the average position and orientation errors were reduced from (4.59 mm, 9.22°) to (1.65 mm, 4.67°).
|
|
10:15-10:20, Paper WeBT7.5 | |
E2B: A Single Modality Point-Based Tracker with Event Cameras |
|
Ren, Hongwei | The Hong Kong University of Science and Technology (Guangzhou) |
Li, Zhuo | Peking University |
Tuerhong, Aiersi | Chongqing University |
Liu, Haobo | The University of Electronic Science and Technology of China |
Liang, Fei | Huawei Technologies Company Ltd |
Feng, Yongxiang | Huawei Technologies Company Ltd |
Wang, Wenhui | Tsinghua University |
Wang, Yaoyuan | Huawei |
Zhang, Ziyang | Huawei, China |
He, Weihua | Tsinghua University |
Cheng, Bojun | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Visual Tracking, Representation Learning, Deep Learning Methods
Abstract: High-speed object tracking holds significant relevance across robotic domains, such as drones and autonomous driving. Compared to conventional cameras, event cameras are equipped with the ability to capture object motion information at exceptionally high temporal resolution with relatively low power consumption, and they remain immune to motion-blurring effects. Regrettably, many existing methods adopt a frame-based approach by stacking events into Event Frames, which overlooks the sparsity and high temporal resolution of events. This approach relies on pre-trained backbones and reaches a performance plateau while demanding unrealistically large networks and high power consumption, rendering it impractical for real-time applications in battery-constrained scenarios. In this paper, we propose an efficient and effective single-modality tracker using a Point Cloud representation named E2B (Event to Box). By directly handling the raw output of event cameras without data-format transformation, E2B leverages events' coordinate guidance to accurately map Event Cloud features to 2D bounding boxes. Moreover, E2B incorporates a pyramid structure into the multi-stage feature extraction architecture to effectively track objects across diverse scales. In the experiments, E2B performs outstandingly on two large-scale and one synthetic event-based tracking datasets, covering both indoor and outdoor environments, as well as rigid and non-rigid objects.
|
|
10:20-10:25, Paper WeBT7.6 | |
F²R²: Frequency Filtering-Based Rectification Robustness Method for Stereo Matching |
|
Zhou, Haolong | Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences |
Zhu, Dongchen | Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences |
Zhang, Guanghui | Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences |
Wang, Lei | Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences |
Li, Jiamao | Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences |
Keywords: Deep Learning for Visual Perception
Abstract: Most stereo matching networks assume that the stereo images are perfectly rectified, ignoring the perturbation of extrinsic parameters due to collisions, mechanical vibrations, and thermal expansion. This leads to poor rectification robustness in real-world stereo systems. That is, even minor rectification errors can lead to failure, making stereo systems unreliable for long-term autonomous operation in complex environments. In this paper, we are the first to propose a frequency filtering-based rectification robustness (F²R²) method for stereo matching, which aims to enhance the robustness of existing stereo networks to rectification errors. Specifically, we propose a sensitive frequency filter (SFF) to remove components susceptible to rectification errors within the frequency domain. SFF achieves the filtering through the learning-based adaptive filtering mask (AFM) guided by the spatial-frequency mapping modulation mask (SFM). Moreover, we build the matching feature reconstruction module (MFRM) to recover the features lost during filtering to benefit cost aggregation. Comprehensive experiments on simulated datasets and self-collected data validate that our method can significantly enhance the rectification robustness of stereo matching networks.
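A minimal sketch of the core idea of filtering in the frequency domain is given below: a feature map is transformed with a 2D real FFT, attenuated by a learnable mask, and transformed back. The module name, mask parameterization, and sigmoid gating are assumptions, not the F²R² implementation itself.

import torch
import torch.nn as nn

class SensitiveFrequencyFilter(nn.Module):
    def __init__(self, height, width):
        super().__init__()
        # One learnable value per real-FFT frequency bin, initialized to pass-through.
        self.mask = nn.Parameter(torch.ones(height, width // 2 + 1))

    def forward(self, feat):  # feat: (B, C, H, W)
        spec = torch.fft.rfft2(feat, norm="ortho")
        spec = spec * torch.sigmoid(self.mask)  # attenuate rectification-sensitive bins
        return torch.fft.irfft2(spec, s=feat.shape[-2:], norm="ortho")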
|
|
10:25-10:30, Paper WeBT7.7 | |
VisTune: Auto-Tuner for UAVs Using Vision-Based Localization |
|
Humais, Muhammad Ahmed | Khalifa University |
Chehadeh, Mohamad | Khalifa University for Science and Technology |
Azzam, Rana | Khalifa University of Science and Technology |
Boiko, Igor | Khalifa University |
Zweiri, Yahya | Khalifa University |
Keywords: Vision-Based Navigation, Aerial Systems: Mechanics and Control, Aerial Systems: Perception and Autonomy
Abstract: This paper presents VisTune, a method for automatic controller tuning, specifically designed for UAVs using vision-based localization (VBL) for position control. In contrast to existing methods that involve flying the UAV manually to collect the data for system identification and tuning, our approach leverages relay-based system identification and tuning that autonomously generates stable oscillations, without the need for a stabilizing controller. The whole process concludes within a few seconds. Prior work in vision-based position control of UAVs often ignores the delay from the perception pipeline, which is quite significant and results in suboptimal tuning and poor control performance. Our approach accounts for perception delay and addresses practical issues, such as varying delays due to varying computation requirements and inevitable estimation errors, which pose challenges in applying relay-based identification and tuning. Typically, a VBL system introduces over 100 ms of delay, compared to less than 20 ms when a motion capture system is used. Moreover, we show that the perception delay identified by VisTune can be effectively used to temporally advance the feedforward acceleration signal to achieve better tracking performance. Finally, we demonstrate the robustness of the tuned controllers on a trajectory tracking task, reaching speeds up to 2.1 m/s with an RMS control error of only 0.054 m; under a wind disturbance of 5 m/s, we report an RMSE of 0.116 m. A video of experiments is available at https://youtu.be/hJoT8bn0K0o
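For readers unfamiliar with relay-based tuning, the sketch below shows a generic relay element and the describing-function estimate of the ultimate gain from the induced oscillation (in the spirit of Astrom-Hagglund relay feedback). The parameter values and this particular estimate are assumptions, not VisTune's identification procedure, which additionally accounts for perception delay.

import numpy as np

def relay_output(error, amplitude=0.3, hysteresis=0.02, prev=0.3):
    # Ideal relay with hysteresis: returns +/- amplitude.
    if error > hysteresis:
        return amplitude
    if error < -hysteresis:
        return -amplitude
    return prev  # inside the hysteresis band, keep the previous output

def ultimate_gain_and_period(osc_amplitude, osc_period, relay_amplitude=0.3):
    # Describing-function estimate of the ultimate gain Ku and period Tu.
    ku = 4.0 * relay_amplitude / (np.pi * osc_amplitude)
    return ku, osc_period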
|
|
WeBT8 |
311 |
Representation Learning 2 |
Regular Session |
|
09:55-10:00, Paper WeBT8.1 | |
GeMuCo: Generalized Multisensory Correlational Model for Body Schema Learning |
|
Kawaharazuka, Kento | The University of Tokyo |
Okada, Kei | The University of Tokyo |
Inaba, Masayuki | The University of Tokyo |
Keywords: Learning from Experience, Software Architecture for Robotic and Automation, Cognitive Control Architectures
Abstract: Humans can autonomously learn the relationship between sensation and motion in their own bodies, estimate and control their own body states, and move while continuously adapting to the current environment. On the other hand, current robots control their bodies by learning the network structure described by humans from their experiences, making certain assumptions on the relationship between sensors and actuators. In addition, the network model does not adapt to changes in the robot's body, the tools that are grasped, or the environment, and there is no unified theory, not only for control but also for state estimation, anomaly detection, simulation, and so on. In this study, we propose a Generalized Multisensory Correlational Model (GeMuCo), in which the robot itself acquires a body schema describing the correlation between sensors and actuators from its own experience, including model structures such as network input/output. The robot adapts to the current environment by updating this body schema model online, estimates and controls its body state, and even performs anomaly detection and simulation. We demonstrate the effectiveness of this method by applying it to tool-use co
|
|
10:00-10:05, Paper WeBT8.2 | |
SplatSim: Zero-Shot Sim2Real Transfer of RGB Manipulation Policies Using Gaussian Splatting |
|
Qureshi, Mohammad Nomaan | Carnegie Mellon University |
Garg, Sparsh | Carnegie Mellon University |
Yandun, Francisco | Carnegie Mellon University |
Held, David | Carnegie Mellon University |
Kantor, George | Carnegie Mellon University |
Silwal, Abhisesh | Carnegie Mellon University |
Keywords: Sensorimotor Learning, Learning from Demonstration, Data Sets for Robot Learning
Abstract: Sim2Real transfer, particularly for manipulation policies relying on RGB images, remains a critical challenge in robotics due to the significant domain shift between synthetic and real-world visual data. In this paper, we propose SplatSim, a novel framework that leverages Gaussian Splatting as the primary rendering primitive to reduce the Sim2Real gap for RGB-based manipulation policies. By replacing traditional mesh representations with Gaussian Splats in simulators, SplatSim produces highly photorealistic synthetic data while maintaining the scalability and cost-efficiency of simulation. We demonstrate the effectiveness of our framework by training manipulation policies within SplatSim and deploying them in the real world in a zero-shot manner, achieving an average success rate of 86.25%, compared to 97.5% for policies trained on real-world data.
|
|
10:05-10:10, Paper WeBT8.3 | |
SR-AIF: Solving Sparse-Reward Robotic Tasks from Pixels with Active Inference and World Models |
|
Nguyen, Viet Dung | Rochester Institute of Technology |
Yang, Zhizhuo | Rochester Institute of Technology |
Buckley, Christopher | Verses AI |
Ororbia, Alexander | Rochester Institute of Technology |
Keywords: Reinforcement Learning, Deep Learning Methods, Bioinspired Robot Learning
Abstract: Although research has produced promising results demonstrating the utility of active inference (AIF) in Markov decision processes (MDPs), there is relatively less work that builds AIF models in the context of environments and problems that take the form of partially observable Markov decision processes (POMDPs). In POMDP scenarios, the agent must infer the unobserved environmental state from raw sensory observations, e.g., pixels in an image. Additionally, less work exists in examining the most difficult form of POMDP-centered control: continuous action space POMDPs under sparse reward signals. In this work, we address issues facing the AIF modeling paradigm by introducing novel prior preference learning techniques and self-revision schedules to help the agent excel in sparse-reward, continuous action, goal-based robotic control POMDP environments. Empirically, we show that our agents offer improved performance over state-of-the-art models in terms of cumulative rewards, relative stability, and success rate.
|
|
10:10-10:15, Paper WeBT8.4 | |
Neuro-Symbolic Imitation Learning: Discovering Symbolic Abstractions for Skill Learning |
|
Keller, Leon | TU Darmstadt |
Tanneberg, Daniel | Honda Research Institute Europe |
Peters, Jan | Technische Universität Darmstadt |
Keywords: Imitation Learning, Representation Learning, Task and Motion Planning
Abstract: Imitation learning is a popular method for teaching robots new behaviors. However, most existing methods focus on teaching short, isolated skills rather than long, multi-step tasks. To bridge this gap, imitation learning algorithms must not only learn individual skills but also an abstract understanding of how to sequence these skills to perform extended tasks effectively. This paper addresses this challenge by proposing a neuro-symbolic imitation learning framework. Using task demonstrations, the system first learns a symbolic representation that abstracts the low-level state-action space. The learned representation decomposes a task into easier subtasks and allows the system to leverage symbolic planning to generate abstract plans. Subsequently, the system utilizes this task decomposition to learn a set of neural skills capable of refining abstract plans into actionable robot commands. Experimental results in three simulated robotic environments demonstrate that, compared to baselines, our neuro-symbolic approach increases data efficiency, improves generalization capabilities, and facilitates interpretability.
|
|
10:15-10:20, Paper WeBT8.5 | |
Chain-Of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models |
|
Wang, Chen | Stanford University |
Xia, Fei | Google Inc |
Yu, Wenhao | Google |
Zhang, Tingnan | Google |
Zhang, Ruohan | Stanford University |
Liu, Karen | Stanford University |
Fei-Fei, Li | Stanford University |
Tan, Jie | Google |
Liang, Jacky | Google |
Keywords: Machine Learning for Robot Control, Embodied Cognitive Science, AI-Enabled Robotics
Abstract: Learning to perform manipulation tasks from human videos is a promising approach for teaching robots. However, many manipulation tasks require changing control parameters during task execution, such as force, which visual data alone cannot capture. In this work, we leverage sensing devices such as armbands that measure human muscle activities and microphones that record sound, to capture the details in the human manipulation process, and enable robots to extract task plans and control parameters to perform the same task. To achieve this, we introduce Chain-of-Modality (CoM), a prompting strategy that enables Vision Language Models to reason about multimodal human demonstration data --- videos coupled with muscle or audio signals. By progressively integrating information from each modality, CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt. Our experiments show that CoM delivers a threefold improvement in accuracy for extracting task plans and control parameters compared to baselines, with strong generalization to new task setups and objects in real-world robot experiments.
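The progressive integration described above can be pictured as a short prompting loop; in the sketch below, query_vlm is a placeholder for whatever vision-language-model client is available, and the prompt wording is illustrative rather than the paper's templates.

def chain_of_modality(query_vlm, video_frames, muscle_summary, audio_summary):
    # Step 1: draft a task plan from vision alone.
    plan = query_vlm(images=video_frames,
                     text="Describe the step-by-step manipulation task in this video.")
    # Step 2: refine the plan with the muscle-activity modality.
    plan = query_vlm(images=video_frames,
                     text="Refine this plan using the muscle-activity summary:\n"
                          + muscle_summary + "\n\nCurrent plan:\n" + plan)
    # Step 3: add control parameters (e.g., grasp force) with the audio modality.
    plan = query_vlm(images=video_frames,
                     text="Add control parameters using the audio summary:\n"
                          + audio_summary + "\n\nCurrent plan:\n" + plan)
    return plan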
|
|
10:20-10:25, Paper WeBT8.6 | |
VertiCoder: Self-Supervised Kinodynamic Representation Learning on Vertically Challenging Terrain |
|
Nazeri, Mohammad | George Mason University |
Datar, Aniket | George Mason University |
Pokhrel, Anuj | George Mason University |
Pan, Chenhui | George Mason University |
Warnell, Garrett | U.S. Army Research Laboratory |
Xiao, Xuesu | George Mason University |
Keywords: Representation Learning, Learning from Experience, Wheeled Robots
Abstract: We present VertiCoder, a self-supervised representation learning approach for robot mobility on vertically challenging terrain. Using the same pre-training process, VertiCoder can handle four different downstream tasks, including forward kinodynamics learning, inverse kinodynamics learning, behavior cloning, and patch reconstruction with a single representation. VertiCoder uses a TransformerEncoder to learn the local context of its surroundings by random masking and next patch reconstruction. We show that VertiCoder achieves better performance across all four different tasks compared to specialized End-to-End models with 77% fewer parameters. We also show VertiCoder's comparable performance against state-of-the-art kinodynamic modeling and planning approaches in real-world robot deployment. These results underscore the efficacy of VertiCoder in mitigating overfitting and fostering more robust generalization across diverse environmental contexts and downstream vehicle kinodynamic tasks.
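A minimal sketch of masked-reconstruction pretraining with a TransformerEncoder is shown below; the dimensions, masking ratio, and loss are assumptions rather than VertiCoder's configuration.

import torch
import torch.nn as nn

class MaskedPatchEncoder(nn.Module):
    def __init__(self, patch_dim=256, d_model=128, nhead=4, num_layers=4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.decode = nn.Linear(d_model, patch_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, patches, mask_ratio=0.3):  # patches: (B, T, patch_dim)
        tokens = self.embed(patches)
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        recon = self.decode(self.encoder(tokens))
        loss = ((recon - patches) ** 2)[mask].mean()  # reconstruct only masked patches
        return loss, recon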
|
|
10:25-10:30, Paper WeBT8.7 | |
Correspondence Learning between Morphologically Different Robots Via Task Demonstrations |
|
Aktas, Hakan | The University of Cambridge |
Nagai, Yukie | The University of Tokyo |
Asada, Minoru | Open and Transdisciplinary Research Initiatives, Osaka University |
Oztop, Erhan | Osaka University / Ozyegin University |
Ugur, Emre | Bogazici University |
Keywords: Developmental Robotics, Imitation Learning, Deep Learning Methods
Abstract: We observe a large variety of robots in terms of their bodies, sensors, and actuators. Given the commonalities in the skill sets, teaching each skill to each different robot independently is inefficient and not scalable when the large variety in the robotic landscape is considered. If we can learn the correspondences between the sensorimotor spaces of different robots, we can expect a skill that is learned in one robot to be more directly and easily transferred to other robots. In this paper, we propose a method to learn correspondences among two or more robots that may have different morphologies. To be specific, besides robots with similar morphologies with different degrees of freedom, we show that a fixed-base manipulator robot with joint control and a differential-drive mobile robot can be addressed within the proposed framework. To set up the correspondence among the robots considered, an initial base task is demonstrated to the robots to achieve the same goal. Then, a common latent representation is learned along with the individual robot policies for achieving the goal. After the initial learning stage, the observation of a new task execution by one robot becomes sufficient to generate a latent space representation pertaining to the other robots to achieve the same task. We verified our system in a set of experiments where the correspondence between robots is learned (1) when the robots need to follow the same paths to achieve the same task, (2) when the robots need to follow different trajectories to achieve the same task, and (3) when complexities of the required sensorimotor trajectories are different for the robots. We also provide a proof-of-concept realization of correspondence learning between a real manipulator robot and a simulated mobile robot.
|
|
WeBT9 |
312 |
Multi-Robot Exploration |
Regular Session |
Co-Chair: Pedram, Ali Reza | Georgia Institute of Technology |
|
09:55-10:00, Paper WeBT9.1 | |
Planning-Oriented Cooperative Perception among Heterogeneous Vehicles |
|
Zheng, Han | Stony Brook University |
Ye, Fan | Stony Brook University |
Yang, Yuanyuan | Stony Brook University |
Keywords: Multi-Robot Systems, Cooperating Robots, Collision Avoidance
Abstract: Vehicle-to-vehicle (V2V) based cooperative perception enhances autonomous driving by overcoming single-agent perception limitations such as occlusions, without relying on extensive infrastructure. However, most existing methods have two key limitations. They treat cooperative perception in isolation, with little consideration for downstream tasks such as planning, leading to poor coordination and inefficient planning decisions. They also assume perception model homogeneity across all vehicles, which can be impractical among vehicles from different manufacturers. To bridge such gaps, we propose Scout, an early-fusion framework for planning-oriented cooperative perception among vehicles of heterogeneous models. Specifically, we formalize a notion of Δθ-Risk Increment Distribution (RID) to capture the distribution of the risk increment that incomplete perception induces on the current trajectory plan, and define a Priority Index (PI) metric for prioritizing cooperative perception on riskier regions. We develop algorithms to estimate Δθ-RID and PI at run-time with theoretical bounds. Empirical results demonstrate that Scout surpasses state-of-the-art methods and strong baselines on challenging benchmarks, achieving higher success rates with only 3-10% of their communication volume.
|
|
10:00-10:05, Paper WeBT9.2 | |
TaskExp: Enhancing Generalization of Multi-Robot Exploration with Multi-Task Pre-Training |
|
Zhu, Shaohao | Zhejiang University |
Zhao, Yixian | Zhejiang University |
Xu, Yang | Zhejiang University |
Chen, Anjun | Zhejiang University |
Chen, Jiming | Zhejiang University |
Xu, Jinming | Zhejiang University |
Keywords: Reinforcement Learning, Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents
Abstract: We aim to develop a general multi-agent reinforcement learning (MARL) policy that enables a group of robots to efficiently explore large-scale, unknown environments with random pose initialization. Existing MARL-based multi-robot exploration methods face challenges in reliably mapping observations to actions in large-scale scenarios and lack of zero-shot generalization to unknown environments. To this end, we propose a generic multi-task pre-training algorithm (termed TaskExp) to enhance the generalization of learning-based policies. In particular, we design a decision-related task to guide the policy to focus on valuable subspaces of the action space, improving the reliability of policy mapping. Moreover, two perception-related tasks--Location Estimation and Map Prediction--are designed to enhance the zero-shot capability of the policy by guiding it to extract general invariant features from unknown environments. With TaskExp pre-training, our policy significantly outperforms state-of-the-art planning-based methods in large-scale scenarios and demonstrates strong zero-shot performance in unseen environments. Furthermore, TaskExp can also be easily integrated to improve the existing learning-based multi-robot exploration methods.
|
|
10:05-10:10, Paper WeBT9.3 | |
WcDT: World-Centric Diffusion Transformer for Traffic Scene Generation |
|
Yang, Chen | Cardiff University |
He, Yangfan | University of Minnesota - Twin Cities |
Tian, Aaron Xuxiang | Independent Researcher |
Chen, Dong | Mississippi State University |
Wang, Jianhui | University of Electronic Science and Technology of China |
Shi, Tianyu | University of Toronto |
Heydarian, Arsalan | University of Virginia |
Liu, Pei | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Planning under Uncertainty, Deep Learning Methods
Abstract: In this paper, we introduce a novel approach for autonomous driving trajectory generation by harnessing the complementary strengths of diffusion probabilistic models (a.k.a., diffusion models) and transformers. Our proposed framework, termed the "World-Centric Diffusion Transformer" (WcDT), optimizes the entire trajectory generation process, from feature extraction to model inference. To enhance the scene diversity and stochasticity, the historical trajectory data is first preprocessed into "Agent Move Statement" and encoded into latent space using Denoising Diffusion Probabilistic Models (DDPM) enhanced with Diffusion with Transformer (DiT) blocks. Then, the latent features, historical trajectories, HD map features, and historical traffic signal information are fused with various transformer-based encoders that are used to enhance the interaction of agents with other elements in the traffic scene. The encoded traffic scenes are then decoded by a trajectory decoder to generate multimodal future trajectories. Comprehensive experimental results show that the proposed approach exhibits superior performance in generating both realistic and diverse trajectories, showing its potential for integration into autonomous driving simulation systems.
|
|
10:10-10:15, Paper WeBT9.4 | |
Hybrid Decentralization for Multi-Robot Orienteering with Mothership-Passenger Systems |
|
Butler, Nathan | Oregon State University |
Hollinger, Geoffrey | Oregon State University |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents, Marine Robotics
Abstract: We present a hybrid centralized-decentralized planning algorithm for a multi-robot system made up of a single Mothership robot and multiple Passenger robots. In this system, the Passenger robots execute tasks while the Mothership provides support. This paper addresses the challenge of planning Passenger robot movements, framing it as a Stochastic Multi-Agent Orienteering Problem (SMOP) complicated by factors like stochastic operational efforts and disruptive events. We optimize the task completion efficiency of the system by combining centralized solutions from the Mothership with local plans from Passengers to enhance system resilience. Our contributions include defining the SMOP, developing a distributed solution using decentralized Monte Carlo tree search, presenting a hybrid algorithm that integrates centralized plans into the distributed framework, and evaluating the algorithm’s performance in simulation using real-world data. Our results show that our hybrid approaches outperform fully centralized and fully distributed algorithms in dynamic and disruptive scenarios with up to 26.6% increase in task completion efficiency over baseline methods.
|
|
10:15-10:20, Paper WeBT9.5 | |
Communication-Aware Iterative Map Compression for Online Path-Planning |
|
Psomiadis, Evangelos | Georgia Institute of Technology |
Pedram, Ali Reza | Georgia Institute of Technology |
Maity, Dipankar | University of North Carolina at Charlotte |
Tsiotras, Panagiotis | Georgia Tech |
Keywords: Multi-Robot Systems, Mapping
Abstract: This paper addresses the problem of optimizing communicated information among heterogeneous, resource-aware robot teams to facilitate their navigation. In such operations, a mobile robot compresses its local map to assist another robot in reaching a target within an uncharted environment. The primary challenge lies in ensuring that the map compression step balances network load while transmitting only the most essential information for effective navigation. We propose a communication framework that sequentially selects the optimal map compression in a task-driven, communication-aware manner. It introduces a decoder capable of iterative map estimation, handling noise through Kalman filter techniques. The computational speed of our decoder allows for a larger compression template set compared to previous methods, and enables applications in more challenging environments. Specifically, our simulations demonstrate a remarkable 98% reduction in communicated information, compared to a framework that transmits the raw data, on a large Mars inclination map and an Earth map, all while maintaining similar planning costs. Furthermore, our method significantly reduces computational time compared to the state-of-the-art approach.
|
|
10:20-10:25, Paper WeBT9.6 | |
DiffCP: Ultra-Low Bit Collaborative Perception Via Diffusion Model |
|
Mao, Ruiqing | Tsinghua University |
Wu, Haotian | Imperial College London |
Jia, Yukuan | Tsinghua University |
Nan, Zhaojun | Tsinghua University |
Sun, Yuxuan | Beijing Jiaotong University |
Zhou, Sheng | Tsinghua University |
Gunduz, Deniz | Imperial College London |
Niu, Zhisheng | Tsinghua University |
Keywords: Cooperating Robots, Deep Learning for Visual Perception, Intelligent Transportation Systems
Abstract: Collaborative perception (CP) is emerging as a promising solution to the inherent limitations of stand-alone intelligence. However, current wireless communication systems are unable to support feature-level and raw-level collaborative algorithms due to their enormous bandwidth demands. In this paper, we propose DiffCP, a novel CP paradigm that utilizes a diffusion model to efficiently compress the sensing information of collaborators. By incorporating both geometric and semantic conditions into the generative model, DiffCP enables feature-level collaboration with an ultra-low communication cost, advancing the practical implementation of CP systems. This paradigm can be seamlessly integrated into existing CP algorithms to enhance a wide range of downstream tasks. Through extensive experimentation, we investigate the trade-offs between communication, computation, and performance. Numerical results demonstrate that DiffCP can significantly reduce communication costs by 14.5-fold while maintaining the same performance as the state-of-the-art algorithm.
|
|
WeBT10 |
313 |
Multi-Robot Path Planning 2 |
Regular Session |
Chair: Pierson, Alyssa | Boston University |
Co-Chair: Nam, Changjoo | Sogang University |
|
09:55-10:00, Paper WeBT10.1 | |
APF-CPP: An Artificial Potential Field Based Multi-Robot Online Coverage Path Planning Approach |
|
Wang, Zikai | Hong Kong University of Science and Technology |
Zhao, Xiaoqi | The Hong Kong University of Science and Technology |
Zhang, Jiekai | Hong Kong Applied Science and Technology Research Institute |
Yang, Nachuan | Hong Kong University of Science and Technology |
Wang, Pengyu | Hong Kong University of Science and Technology |
Tang, Jiawei | Hong Kong University of Science and Technology |
Zhang, Jiuzhou | Hong Kong University of Science and Technology |
Shi, Ling | The Hong Kong University of Science and Technology |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Planning, Scheduling and Coordination
Abstract: Multi-robot coverage planning has gained significant attention in recent years. In this paper, we introduce a novel approach called APF-CPP (Artificial Potential Field Based Multi-Robot Online Coverage Path Planning) to enhance the collaboration of multi-robot systems to accomplish coverage tasks in unknown dynamic environments. Our approach presents a unique coverage policy that leverages the concept of the artificial potential field (APF). In contrast to conventional APF-based path planning methods that directly generate paths based on the field gradient, we utilize the APF to derive coverage policies for individual robots within a multi-robot system to achieve efficient task allocation and maintain regular coverage patterns. We have developed a policy update mechanism that allows the system to adapt its task allocation policy based on real-time conditions while minimizing the impact caused by policy changes. To better handle dead-end conditions, we also use the APF concept to improve task allocation during the dead-end recovery process. We also show that our algorithm has low computational complexity and guarantees complete coverage in finite time. We conduct extensive comparisons with other state-of-the-art (SOTA) approaches and validate our method through simulations and real-world experiments. The experimental results demonstrate the advantages of our proposed method over existing approaches and confirm the effectiveness and robustness of real-world implementation.
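As a purely illustrative sketch of deriving per-robot guidance from a potential field, the function below attracts a robot toward uncovered cells and repels it from teammates; the gains and the inverse-square form are assumptions and do not reproduce the APF-CPP coverage policy or its policy-update mechanism.

import numpy as np

def coverage_force(robot_xy, uncovered_cells, other_robots, k_att=1.0, k_rep=2.0):
    # Attraction to uncovered grid cells plus repulsion from teammates.
    robot_xy = np.asarray(robot_xy, dtype=float)
    force = np.zeros(2)
    for cell in uncovered_cells:            # attractive term
        d = np.asarray(cell, dtype=float) - robot_xy
        force += k_att * d / (np.linalg.norm(d) ** 3 + 1e-6)
    for other in other_robots:              # repulsive term
        d = robot_xy - np.asarray(other, dtype=float)
        force += k_rep * d / (np.linalg.norm(d) ** 3 + 1e-6)
    return force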
|
|
10:00-10:05, Paper WeBT10.2 | |
Exact Wavefront Propagation for Globally Optimal One-To-All Path Planning on 2D Cartesian Grids |
|
Ibrahim, Ibrahim | KU Leuven |
Gillis, Joris | KU Leuven |
Decré, Wilm | Katholieke Universiteit Leuven |
Swevers, Jan | KU Leuven |
Keywords: Motion and Path Planning, Path Planning for Multiple Mobile Robots or Agents, Computational Geometry
Abstract: This paper introduces an efficient algorithm with O(n) compute and memory complexity for globally optimal path planning on 2D Cartesian grids. Unlike existing marching methods that rely on approximate discretized solutions to the Eikonal equation, our approach achieves exact wavefront propagation by pivoting the analytic distance function based on visibility. The algorithm leverages a dynamic-programming subroutine to efficiently evaluate visibility queries. Through benchmarking against state-of-the-art any-angle path planners, we demonstrate that our method outperforms existing approaches in both speed and accuracy, particularly in cluttered environments. Notably, our method inherently provides globally optimal paths to all grid points, eliminating the need for additional gradient descent steps per path query. The same capability extends to multiple starting positions. We also provide a greedy version of our algorithm as well as an open-source C++ implementation of our solver.
|
|
10:05-10:10, Paper WeBT10.3 | |
ICBSS: An Improved Algorithm for Multi-Agent Combinatorial Path Finding |
|
Chen, Zheng | Zhejiang University |
Chen, Changlin | University of Science and Technology of China |
Ni, Yiran | Zhejiang University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Collision Avoidance, Multi-Robot Systems
Abstract: The Multi-Agent Combinatorial Path Finding (MCPF) problem is a generalized version of the Multi-Agent Path Finding (MAPF) problem, in which each agent must collectively visit multiple intermediate target locations on the way to their final destinations. The state-of-the-art approach for addressing MCPF, known as Conflict-Based Steiner Search (CBSS) [Ren et al., IEEE T-RO 2023], leverages K-best joint sequences to create multiple search trees, and employs CBS-like search to resolve collisions for each tree. Despite its optimality guarantee, CBSS is computationally burdensome due to the duplicated collision resolutions across multiple trees and the computation of the K-best joint sequences. To address these challenges, we propose a novel algorithm called Improved Conflict-Based Steiner Search (ICBSS), aiming at expediting CBSS by replacing the multiple trees with a single conflict tree (CT), which is implemented by interleaving a time-dependent traveling salesman algorithm to compute the optimal joint path for agents under the newly generated constraints in each CT vertex. Additionally, we introduce a sub-optimal variant of ICBSS, which improves computational efficiency at the expense of solution optimality. Empirical results show that ICBSS outperforms state-of-the-art MCPF algorithms on a variety of MAPF instances.
|
|
10:10-10:15, Paper WeBT10.4 | |
Escaping Local Minima: Hybrid Artificial Potential Field with Wall-Follower for Decentralized Multi-Robot Navigation |
|
Kim, Joonkyung | Sogang University |
Park, Sangjin | Sogang University |
Lee, Wonjong | Sogang University |
Kim, Woojun | Carnegie Mellon University |
Choi, Hyunga | Korea University |
Doh, Nakju | Korea University |
Nam, Changjoo | Sogang University |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents, Reactive and Sensor-Based Planning
Abstract: We tackle the challenge of decentralized multi-robot navigation in environments with nonconvex obstacles, where complete environmental knowledge is unavailable. While reactive methods like Artificial Potential Field (APF) offer simplicity and efficiency, they suffer from local minima, causing robots to become trapped due to their lack of global environmental awareness. Other existing solutions either rely on inter-robot communication, are limited to single-robot scenarios, or struggle to navigate nonconvex obstacles effectively. Our proposed method enables collision-free navigation using only local sensor and state information without a map. By incorporating a wall-following (WF) behavior into the APF approach, our method allows robots to escape local minima, even in the presence of nonconvex and dynamic obstacles including other robots. We introduce two algorithms for switching between APF and WF: a rule-based system and an encoder network trained on expert demonstrations. Experimental results show that our approach achieves substantially higher success rates compared to state-of-the-art methods, highlighting its ability to overcome the limitations of local minima in complex environments.
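A rule-based switch of the kind described can be sketched as below: wall-following is triggered when the robot makes little progress toward its goal or the net APF force nearly vanishes. The window length, thresholds, and stuck test are assumptions, not the paper's exact rules or its learned encoder.

import numpy as np

def select_mode(apf_force, recent_positions, goal, progress_eps=0.05, window=20):
    # Return 'WF' when the robot appears trapped in a local minimum, else 'APF'.
    if len(recent_positions) >= window:
        start = np.asarray(recent_positions[-window], dtype=float)
        end = np.asarray(recent_positions[-1], dtype=float)
        goal = np.asarray(goal, dtype=float)
        progressed = np.linalg.norm(goal - start) - np.linalg.norm(goal - end)
        if progressed < progress_eps or np.linalg.norm(apf_force) < 1e-3:
            return "WF"
    return "APF"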
|
|
10:15-10:20, Paper WeBT10.5 | |
Heterogeneous Exploration and Monitoring with Online Free-Space Ellipsoid Graphs |
|
Brodt, Brennan | Boston University |
Pierson, Alyssa | Boston University |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents, Cooperating Robots
Abstract: This paper proposes a heterogeneous teaming solution to the problem of target discovery and monitoring in unknown, non-convex environments. The team consists of two types of agents: agile agents with sensors capable of mapping their surroundings and slower agents that are capable of monitoring or servicing discovered targets. We propose an exploration algorithm that utilizes the IRIS algorithm to generate a graph decomposition from collision-free ellipses contained within the environment. This graph is passed to the monitoring agents, who execute polynomial-complexity assignment and touring algorithms to generate high-quality path plans that service all discovered targets. Our algorithmic structure allows the team to solve the problems of exploration, target discovery, assignment, and monitoring within unknown, non-convex environments efficiently using limited information. The performance of our proposed method is verified through batch simulations and complexity analysis.
|
|
10:20-10:25, Paper WeBT10.6 | |
Wavelet-Based Distributed Coverage for Heterogeneous Agents |
|
Rao, Ananya | Carnegie Mellon University |
Choset, Howie | Carnegie Mellon University |
Wettergreen, David | Carnegie Mellon University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning, Field Robots
Abstract: We develop a coverage approach for heterogeneous agents that leverages the different sensing and motion capabilities of a team. Coverage performance is measured using ergodicity, which when optimized balances exploitation versus exploration, where areas of interest are indicated with an information metric. Prior work uses spectral decomposition of a spatial map of information to guide a set of heterogeneous agents, each with different sensor and motion models, to optimize coverage. This work leverages wavelet transforms to decompose the information map rather than the Fourier transform typically applied to ergodic search, and demonstrates the importance of selecting a suitable wavelet family based on the information map being explored. Further, a sequence of wavelets is used for decomposition to overcome the dependency on selecting one suitable wavelet family. Our experimental results show that using wavelet families well-suited to the specific information map for information map decomposition leads to, on average, a 43% improvement over a baseline method in terms of a standard coverage metric (ergodicity), while using a well-sequenced set of wavelets for decomposition leads to a 65% improvement in coverage performance across multiple types of information maps.
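To illustrate the wavelet decomposition of an information map, the sketch below keeps only the coarsest detail bands of a 2D discrete wavelet transform (using PyWavelets); the choice of 'db4' and the truncation rule are assumptions, not the paper's wavelet-sequencing strategy.

import numpy as np
import pywt

def lowpass_info_map(info_map, wavelet="db4", keep_levels=2, total_levels=4):
    # Keep the approximation plus the `keep_levels` coarsest detail bands.
    coeffs = pywt.wavedec2(info_map, wavelet, level=total_levels)
    for i in range(1 + keep_levels, len(coeffs)):  # zero out the finest details
        coeffs[i] = tuple(np.zeros_like(c) for c in coeffs[i])
    return pywt.waverec2(coeffs, wavelet)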
|
|
10:25-10:30, Paper WeBT10.7 | |
Multi-Agent Obstacle Avoidance Using Velocity Obstacles and Control Barrier Functions |
|
Sánchez Roncero, Alejandro | KTH Royal Institute of Technology |
Cabral Muchacho, Rafael Ignacio | KTH Royal Institute of Technology |
Ogren, Petter | Royal Institute of Technology (KTH) |
Keywords: Collision Avoidance, Multi-Robot Systems, Formal Methods in Robotics and Automation
Abstract: Velocity Obstacles (VO) methods form a paradigm for collision avoidance strategies among moving obstacles and agents. While VO methods perform well in simple multi-agent environments, they do not guarantee safety and can show overly conservative behavior in common situations. In this paper, we propose to combine a VO strategy for guidance with a Control Barrier Function approach for safety, which overcomes the overly conservative behavior of VOs and formally guarantees safety. We validate our method in a baseline comparison study, using second-order integrator and car-like dynamics. Results support that our method outperforms the baselines with respect to path smoothness, collision avoidance, and success rates.
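The CBF safety layer can be pictured with a single-integrator, single-obstacle sketch: the VO-suggested velocity is minimally modified so that the barrier condition dh/dt >= -alpha*h holds, which for one affine constraint reduces to a half-space projection. The dynamics, barrier choice, and gain are simplifying assumptions, not the paper's controller.

import numpy as np

def cbf_filter(u_desired, p_robot, p_obstacle, radius, alpha=1.0):
    # h(x) = ||p - p_o||^2 - r^2 must satisfy dh/dt >= -alpha * h.
    diff = np.asarray(p_robot, dtype=float) - np.asarray(p_obstacle, dtype=float)
    h = diff @ diff - radius ** 2
    grad_h = 2.0 * diff
    u_desired = np.asarray(u_desired, dtype=float)
    slack = grad_h @ u_desired + alpha * h
    if slack >= 0.0:                 # desired velocity already satisfies the CBF
        return u_desired
    # Project onto the half-space {u : grad_h @ u >= -alpha * h}.
    return u_desired - (slack / (grad_h @ grad_h)) * grad_h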
|
|
WeBT11 |
314 |
Micro/Nano Robots |
Regular Session |
Co-Chair: Yoon, Jungwon | Gwangju Institutue of Science and Technology |
|
09:55-10:00, Paper WeBT11.1 | |
VALG: Vision-Based Adaptive Laser Gripper for Model-Free Pose Control of Floating Objects at Air-Liquid Interface |
|
Hui, Xusheng | Northwestern Polytechnical University |
Luo, Jianjun | Northwestern Polytechnical University (P.R. China) |
You, Haonan | Northwestern Polytechnical University |
Keywords: Micro/Nano Robots, Robust/Adaptive Control, Grippers and Other End-Effectors
Abstract: Non-contact manipulation at the air-liquid interface holds significant potential for applications in microrobotics, non-invasive assembly, and biochemistry analysis. However, achieving simultaneous position and orientation (pose) control of floating objects remains a considerable challenge, particularly for adaptive control without prior modeling of the objects. Here, we introduce the Vision-based Adaptive Laser Gripper (VALG) system addressing these challenges. By leveraging the distributed thermocapillary flow induced by patterned laser scanning, a pose control strategy based on the equidistant contour scanning laser is proposed and validated. The proposed system relies solely on visual recognition to generate adaptive laser grippers, which achieve static equilibrium to simultaneously constrain the position and orientation of the floating objects. Experimental validation demonstrates the effectiveness of the VALG system in independent position and orientation control, coupled pose control, and path following. The VALG system facilitates smooth, precise, fast, and adaptive pose control of generalized floating objects, establishing it as a universal and versatile platform for non-contact manipulation at the air-liquid interface.
|
|
10:00-10:05, Paper WeBT11.2 | |
In-Plane Manipulation of Soft Micro-Fiber with Ultrasonic Transducer Array and Microscope |
|
Zou, Jieyun | ShanghaiTech University |
An, Siyuan | ShanghaiTech University |
Wang, Mingyue | ShanghaiTech University |
Li, Jiaqi | ShanghaiTech University |
Shi, Yalin | The School of Control Science and Engineering (CSE) of Shandong University |
Li, You-Fu | City University of Hong Kong |
Liu, Song | ShanghaiTech University |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots, Nanomanufacturing
Abstract: Noncontact manipulation of soft micro-fibers has great potential in advanced manufacturing, materials science, and biomedical engineering. However, current noncontact manipulation techniques primarily focus on objects with regular shapes, e.g., solid particles, cells, or droplets, with fewer solutions available for manipulating flexible and elongated structures. In this paper, an automated ultrasonic manipulation system is introduced for in-plane soft micro-fiber manipulation, which mainly consists of an ultrasonic transducer array and a microscope. A real-time trap generation algorithm is designed to manipulate the micro-fibers using visual feedback from the microscope. A theoretical analysis is also provided to explain the deformation behavior of the micro-fiber under external forces. The system is capable of precise in-plane positioning and motion trajectory planning of the micro-fiber end, as well as in-plane morphological reshaping of the micro-fiber. Experiments validated the effectiveness of the proposed system for the in-plane manipulation of soft micro-fibers. Finally, the system was showcased through the practical application of material property characterization.
|
|
10:05-10:10, Paper WeBT11.3 | |
Interactive OT Gym: A Reinforcement Learning-Based Interactive Optical Tweezer (OT)-Driven Microrobotics Simulation Platform |
|
Zongcai, Tan | Imperial College London |
Zhang, Dandan | Imperial College London |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots
Abstract: Optical tweezers (OT) offer unparalleled capabilities for micromanipulation with submicron precision in biomedical applications. However, controlling conventional multi-trap OT to achieve cooperative manipulation of multiple complex-shaped microrobots in dynamic environments poses a significant challenge. To address this, we introduce Interactive OT Gym, a reinforcement learning (RL)-based simulation platform designed for OT-driven microrobotics. Our platform supports complex physical field simulations and integrates haptic feedback interfaces, RL modules, and context-aware shared control strategies tailored for OT-driven microrobots in cooperative biological object manipulation tasks. This integration allows for an adaptive blend of manual and autonomous control, enabling seamless transitions between human input and autonomous operation. We evaluated the effectiveness of our platform using a cell manipulation task. Experimental results show that our shared control system significantly improves micromanipulation performance, reducing task completion time by approximately 67% compared to using pure human or RL control alone and achieving a 100% success rate. With its high fidelity, interactivity, low cost, and high-speed simulation capabilities, Interactive OT Gym serves as a user-friendly training and testing environment for the development of advanced interactive OT-driven micromanipulation systems and control algorithms.
|
|
10:10-10:15, Paper WeBT11.4 | |
Model-Based Robotic Cell Aspiration: Tackling the Impact of Air Segment |
|
Zheng, Jiachun | The Chinese University of Hong Kong, Shenzhen |
Zhang, Zhuoran | The Chinese University of Hong Kong, Shenzhen |
Keywords: Automation at Micro-Nano Scales, Biological Cell Manipulation
Abstract: Cell aspiration is a common micro-manipulation technique for cell transfer, particularly in in vitro fertilization (IVF) procedures. The minuscule volume of a cell (pL) and limited damping provided by the medium make it challenging to accurately and quickly aspirate a cell to the desired position inside the micropipette. Experienced clinicians intentionally insert an air segment inside the micropipette in advance to make the aspiration easier. Nevertheless, the unclear damping effects and the varying initial length of the air segment in each aspiration pose difficulties for most operators. Inadequate judgment and response may lead to overshoot or even loss of the cell. This paper constructs a nonlinear dynamics model to elucidate the cell motion inside a micropipette containing an inserted air segment. The model reveals the impact of the air segment. A model-based controller is designed to facilitate the accurate aspiration of human sperm to a desired position, incorporating an estimated initial length of the air segment. Experiments were conducted to quantitatively evaluate the performance of both the model and the controller involving various initial air segment lengths. The results demonstrated a 100% success rate in 50 sperm aspiration experiments, achieving an average positional accuracy within ±2 pixels and an average settling time of 5.89 seconds.
|
|
10:15-10:20, Paper WeBT11.5 | |
Efficient Optimization of a Permanent Magnet Array for a Stable 2D Trap |
|
Müller, Ann-Sophia | German Cancer Research Center (DKFZ) |
Jeong, Moonkwang | Deutsches Krebsforschungszentrum (DKFZ) |
Tian, Jiyuan | German Cancer Research Center |
Zhang, Meng | German Cancer Research Center (DKFZ) |
Qiu, Tian | German Cancer Research Center (DKFZ) |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots, Optimization and Optimal Control
Abstract: Untethered magnetic manipulation of biomedical millirobots has a high potential for minimally invasive surgical applications. However, it is still challenging to exert high actuation forces on the small robots over a large distance. Permanent magnets offer stronger magnetic torques and forces than electromagnetic coils; however, feedback control is more difficult. As proven by Earnshaw's theorem, it is not possible to achieve a stable magnetic trap in 3D by static permanent magnets. Here, we report a stable 2D magnetic force trap created by an array of permanent magnets to control a millirobot. The trap is located in an open space with a tunable distance to the magnet array in the range of 20-120 mm, which is relevant to human anatomical scales. The design is achieved by a novel GPU-accelerated optimization algorithm that uses mean squared error (MSE) and the Adam optimizer to efficiently compute the optimal angles for any number of magnets in the array. The algorithm is verified using numerical simulation and physical experiments with an array of two magnets. A millirobot is successfully trapped and controlled to follow a complex trajectory. The algorithm demonstrates high scalability by optimizing the angles for 100 magnets in under three seconds. Moreover, the optimization workflow can be adapted to optimize a permanent magnet array to achieve the desired force vector fields.
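A GPU-friendly version of the MSE-plus-Adam angle optimization can be sketched with PyTorch autograd as below; field_from_angles stands in for a differentiable magnetic field model and is an assumption, as are the learning rate and iteration count.

import torch

def optimize_angles(positions, target_field, field_from_angles, steps=500, lr=0.05):
    # positions: (N, 2) magnet locations; field_from_angles: differentiable field model.
    angles = torch.zeros(positions.shape[0], requires_grad=True)
    opt = torch.optim.Adam([angles], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((field_from_angles(angles, positions) - target_field) ** 2)
        loss.backward()
        opt.step()
    return angles.detach()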
|
|
10:20-10:25, Paper WeBT11.6 | |
Real-Time 3D MPI-Based Navigation Scheme for Microrobots with Flexible Field Free Point Trajectories and Virtual FFP Intuitive Manipulation |
|
Bui, Minh Phu | Gwangju Institute of Science and Technology |
Park, Myungjin | Gwangju Institute of Science and Technology |
Le, Tuan Anh | Gwangju Institute of Science and Technology |
Yoon, Jungwon | Gwangju Institutue of Science and Technology |
Keywords: Micro/Nano Robots, Medical Robots and Systems, Motion Control
Abstract: Magnetic Particle Imaging (MPI)-based navigation shows significant potential for accurately guiding microrobots to desired target locations. Existing MPI-based navigation systems have been limited to two-dimensional planar movements due to increased computational load and a lack of efficient 3D actuator schemes. Therefore, we introduce a real-time 3D MPI-based navigation scheme for microrobots, utilizing a flexible field-free point (FFP) trajectory scanning scheme and 3D virtual FFP (vFFP) intuitive manipulation. The FFP trajectory is chosen flexibly to enhance temporal resolution. A virtual FFP force model for the actuator function, with high potential for interactive manipulation, is used to linearize the magnetic force with respect to the relative positions of the microrobot and the actual FFP. The proposed concept has been validated using the available 3D amplitude modulation MPI system with a 90 mm bore size and a 4 T/m/µ0 gradient. By employing a flexible FFP trajectory, the MPI system can achieve an image sampling rate of up to 4 Hz for a 3D field of view of 60 × 40 × 60 mm³, enabling real-time MPI-based navigation. Furthermore, the proposed navigation control strategy can reach any target outlet within the 3D blood model with a low mean error in vFFP linearization of less than 5%.
|
|
10:25-10:30, Paper WeBT11.7 | |
3D Noncontact Micro-Particle Manipulation with Acoustic Robot End-Effector under Microscope |
|
Wang, Mingyue | ShanghaiTech University |
Li, Jiaqi | ShanghaiTech University |
Jia, Yuyu | ShanghaiTech University |
Sun, Zhenhuan | ShanghaiTech University |
Su, Hu | Institute of Automation, Chinese Academy of Science |
Liu, Song | ShanghaiTech University |
Keywords: Automation at Micro-Nano Scales, Visual Servoing, Grippers and Other End-Effectors
Abstract: As an essential component of noncontact manipulation, acoustic manipulation has achieved great success in multidisciplinary research and applications. Although acoustic tweezers have made advancements in manipulating particles in air, handling individual particles with high precision in water remains challenging and inadequately addressed due to the difficulty in precisely characterizing and calibrating acoustic robot end-effectors from a robotic perspective. In this paper, we present a vision-based automated noncontact particle manipulation approach using an acoustic robot end-effector, which achieves precise and reliable particle manipulation in 3D space. Specifically, visual feedback is incorporated for microparticle localization, and a dynamic acoustic field modulation method is proposed for controlling the end-effector. The invisible robot end-effector is localized and characterized through hydrophone scanning. The proposed vision solution is capable of automated trapping and precise translation of micro-particles suspended in a water-based environment and is applicable to particles with both negative and positive impedance contrast against the medium. Experimental results demonstrate the effectiveness of this approach towards automated noncontact particle manipulation with an acoustic robot end-effector.
|
|
WeBT12 |
315 |
Human-Robot Collaboration 2 |
Regular Session |
|
09:55-10:00, Paper WeBT12.1 | |
Dynamic Collaborative Workspace Based on Human Interference Estimation for Safe and Productive Human-Robot Collaboration |
|
Kamezaki, Mitsuhiro | The University of Tokyo |
Wada, Tomohiro | Waseda University |
Sugano, Shigeki | Waseda University |
Keywords: Human-Robot Collaboration, Human-Centered Automation, Industrial Robots
Abstract: Collaborative robots that operate safely close to workers without fences have attracted attention, but few examples of such human-robot collaboration (HRC) have been seen in factories. The main reason is the difficulty in balancing safety and productivity. Current fenceless HRC systems stop the robot when a human enters the collaborative workspace (C) where both human and robot can work, to ensure safety as regulated by ISO/TS 15066. The robot stops even when the human is far enough away, so productivity is drastically decreased (FCW, Fixed C). If a system could identify the human-work area, designate it as a no-entry space in C for the robot (C^P), and dynamically set a closed C (C^C) by shrinking C by C^P, productivity would improve because the robot could keep working in C^C, and safety would be ensured because the human could continue working in C^P. In this study, we propose a new concept of a dynamic collaborative workspace (DCW) that dynamically sets C^C and C^P based on the human's predicted trajectory. It also provides visual and auditory prompts to enable the human to understand DCW states, i.e., when a human enters C, when C is changed, and when the robot is in emergency mode. We compared four HRC systems using a real robot arm: two conventional FCW ones with and without fences and two proposed DCW ones with and without a state indicator, and found that the proposed system with a state indicator has the best productivity and ensures the same level of safety as the conventional system with fences.
|
|
10:00-10:05, Paper WeBT12.2 | |
Next-Best-Trajectory Planning of Robot Manipulators for Effective Observation and Exploration |
|
Renz, Heiko | TU Dortmund University |
Krämer, Maximilian | TU Dortmund University |
Hoffmann, Frank | Technische Universität Dortmund |
Bertram, Torsten | Technische Universität Dortmund |
Keywords: Human-Robot Collaboration, Reactive and Sensor-Based Planning, Optimization and Optimal Control
Abstract: Visual observation of objects is essential for many robotic applications, such as object reconstruction and manipulation, navigation, and scene understanding. Machine learning algorithms constitute the state-of-the-art in many fields but require vast data sets, which are costly and time-intensive to collect. Automated strategies for observation and exploration are crucial to enhance the efficiency of data gathering. Therefore, a novel strategy utilizing the Next-Best-Trajectory principle is developed for a robot manipulator operating in dynamic environments. Local trajectories are generated to maximize the information gained from observations along the path while avoiding collisions. We employ a voxel map for environment modeling and utilize raycasting from perspectives around a point of interest to estimate the information gain. A global ergodic trajectory planner provides an optional reference trajectory to the local planner, improving exploration and helping to avoid local minima. To enhance computational efficiency, raycasting for estimating the information gain in the environment is executed in parallel on the graphics processing unit. Benchmark results confirm the efficiency of the parallelization, while real-world experiments demonstrate the strategy’s effectiveness.
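The information-gain estimate can be illustrated with a simple ray-marching count of unknown voxels from a candidate viewpoint, as sketched below; the voxel labels, step size, and dictionary-based map are assumptions and omit the GPU parallelization described in the paper.

import numpy as np

UNKNOWN, FREE, OCCUPIED = 0, 1, 2

def information_gain(voxels, origin, directions, max_range=2.0, step=0.05):
    # voxels: dict mapping integer (i, j, k) keys to a label; origin: (3,) position.
    gain = 0
    origin = np.asarray(origin, dtype=float)
    for d in directions:
        d = np.asarray(d, dtype=float)
        d = d / np.linalg.norm(d)
        for r in np.arange(step, max_range, step):
            key = tuple(np.floor((origin + r * d) / step).astype(int))
            label = voxels.get(key, UNKNOWN)
            if label == UNKNOWN:
                gain += 1
            elif label == OCCUPIED:
                break                 # ray is blocked
    return gain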
|
|
10:05-10:10, Paper WeBT12.3 | |
TriHRCBot: A Robotic Architecture for Triadic Human-Robot Collaboration through Mediated Object Alignment |
|
Semeraro, Francesco | The University of Manchester |
Leadbetter, James Hugo | BAE Systems Ltd |
Cangelosi, Angelo | University of Manchester |
Keywords: Human-Robot Collaboration, Human-Aware Motion Planning, Cognitive Control Architectures
Abstract: Human-robot collaboration has great potential in enhancing robot deployment at close proximity with people, especially in non-dyadic collaborations with multiple users. However, autonomous systems that are capable of handling such interactions in a physical domain are rare. This work proposes TriHRCBot, a robotic architecture designed to handle a collaborative task that involves two concurrent users. The architecture is sensitive to position, orientation, body lengths and state of the users in the interaction, and uses this information to adjust the pose of a target object to enable both users to act on it at the same time. A robotic system equipped with the TriHRCBot architecture was deployed in a user study in which 30 participants from the BAE Systems Academy for Skills and Knowledge Centre interacted with it during such multi-user collaborative task. The study shows that the participants considered TriHRCBot acceptable for the task at hand.
|
|
10:10-10:15, Paper WeBT12.4 | |
Open-Nav: Exploring Zero-Shot Vision-And-Language Navigation in Continuous Environment with Open-Source LLMs |
|
Qiao, Yanyuan | The University of Adelaide |
Lyu, Wenqi | The University of Adelaide |
Wang, Hui | The University of Adelaide, AIML |
Wang, Zixu | South China University of Technology |
Li, Zerui | Adelaide University |
Zhang, Yuan | The University of Adelaide |
Tan, Mingkui | South China University of Technology |
Wu, Qi | University of Adelaide |
Keywords: Human-Robot Collaboration, AI-Enabled Robotics, AI-Based Methods
Abstract: Vision-and-Language Navigation (VLN) tasks require an agent to follow textual instructions to navigate through 3D environments. Traditional approaches use supervised learning methods, relying heavily on domain-specific datasets to train VLN models. Recent methods try to utilize closed-source large language models (LLMs) like GPT-4 to solve VLN tasks in zero-shot manners, but face challenges related to expensive token costs and potential data breaches in real-world applications. In this work, we introduce Open-Nav, a novel study that explores open-source LLMs for zero-shot VLN in the continuous environment. Open-Nav employs a spatial-temporal chain-of-thought (CoT) reasoning approach to break down tasks into instruction comprehension, progress estimation, and decision-making. It enhances scene perceptions with fine-grained object and spatial knowledge to improve LLM's reasoning in navigation. Our extensive experiments in both simulated and real-world environments demonstrate that Open-Nav achieves competitive performance compared to using closed-source LLMs.
|
|
10:15-10:20, Paper WeBT12.5 | |
Integrating Field of View in Human-Aware Collaborative Planning |
|
Hsu, Ya-Chuan | University of Southern California |
Michael, Defranco | University of Southern California |
Patel, Rutvik Rakeshbhai | University of Southern California |
Nikolaidis, Stefanos | University of Southern California |
Keywords: Human-Robot Collaboration, Planning under Uncertainty, Human-Aware Motion Planning
Abstract: In human-robot collaboration (HRC), it is crucial for robot agents to consider humans' knowledge of their surroundings. In reality, humans possess a narrow field of view (FOV), limiting their perception. However, research on HRC often overlooks this aspect and presumes an omniscient human collaborator. Our study addresses the challenge of adapting to the evolving subtask intent of humans while accounting for their limited FOV. We integrate FOV within the human-aware probabilistic planning framework. To account for the large state spaces that result from considering FOV, we propose a hierarchical online planner that efficiently finds approximate solutions while enabling the robot to explore low-level action trajectories that enter the human FOV, influencing their intended subtask. Through a user study with our adapted cooking domain, we demonstrate that our FOV-aware planner reduces human interruptions and redundant actions during collaboration by adapting to human perception limitations. We extend these findings to a virtual reality kitchen environment, where we observe similar collaborative behaviors.
|
|
10:20-10:25, Paper WeBT12.6 | |
PACE: Proactive Assistance in Human-Robot Collaboration through Action-Completion Estimation |
|
De Lazzari, Davide | University of Padua |
Terreran, Matteo | University of Padova |
Giacomuzzo, Giulio | University of Padova |
Jain, Siddarth | Mitsubishi Electric Research Laboratories (MERL) |
Falco, Pietro | University of Padova |
Carli, Ruggero | University of Padova |
Romeres, Diego | Mitsubishi Electric Research Laboratories |
Keywords: Human-Robot Collaboration, Assembly
Abstract: This paper introduces the Proactive Assistance through action-Completion Estimation (PACE) framework, designed to enhance human-robot collaboration through real-time monitoring of human progress. PACE incorporates a novel method that combines Dynamic Time Warping (DTW) with correlation analysis to track human task progression from hand movements. From limited demonstrations, PACE trains a reinforcement learning policy that provides proactive assistance, synchronizing robotic actions with human activities to minimize idle time and enhance collaboration efficiency. We validate the framework through user studies involving 12 participants, showing significant improvements in interaction fluency, reduced waiting times, and positive user feedback compared to traditional methods.
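The progress-estimation idea described above can be illustrated with a small sketch: align the partial hand trajectory observed so far against a reference demonstration using dynamic time warping (DTW) and read off the fraction of the reference that has been covered. All names and the Euclidean cost are assumptions; this is not the authors' implementation.

```python
# Minimal sketch: estimating task progress by aligning a partial hand trajectory
# to a reference demonstration with dynamic time warping.
import numpy as np

def dtw_align(partial, reference):
    """Return the DTW cost matrix between a partial trajectory and a full reference."""
    n, m = len(partial), len(reference)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(partial[i - 1] - reference[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D

def estimate_progress(partial, reference):
    """Fraction of the reference demonstration best explained by the observation so far."""
    D = dtw_align(partial, reference)
    j_star = int(np.argmin(D[len(partial), 1:])) + 1   # best matching reference index
    return j_star / len(reference)

# Example: 3-D hand positions; the observation covers roughly the first half of the task.
reference = np.cumsum(np.random.randn(100, 3) * 0.01, axis=0)
partial = reference[:50] + np.random.randn(50, 3) * 0.002
print(f"estimated completion: {estimate_progress(partial, reference):.2f}")
```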
|
|
10:25-10:30, Paper WeBT12.7 | |
Improving Human-Robot Collaboration Via Computational Design |
|
Zhi, Jixuan | George Mason University |
Lien, Jyh-Ming | George Mason University |
Keywords: Service Robotics, Human-Aware Motion Planning, Simulation and Animation
Abstract: As robots enter our day-to-day lives, the shared space surrounding humans and robots becomes critical for facilitating human-robot collaboration. The design of this shared space should satisfy humans' preferences and support robots' efficiency. This work uses the kitchen as an example to illustrate the importance of good space design in enhancing collaboration. Given the kitchen boundary, food stations, counters, and recipes, the proposed method determines the optimal placement of stations and counters that meets the requirements of kitchen design rules and improves performance. The key technical challenge is that the optimization method usually evaluates thousands of designs, and each evaluation analyzes the traffic flow of the space, which requires solving many motion planning problems. To address this challenge, we use a decentralized motion planner that can solve multi-agent motion planning efficiently. Our results indicate that optimized kitchen designs can provide noticeable performance improvements to human-robot collaboration.
|
|
WeBT13 |
316 |
Multifingered Hands |
Regular Session |
Chair: Schimmels, Joseph | Marquette University |
|
09:55-10:00, Paper WeBT13.1 | |
A Vision-Based Force/Position Fusion Actuation-Sensing Scheme for Tendon-Driven Mechanism |
|
Chen, Shiwei | Harbin Institute of Technology |
Deng, Zhiming | Harbin Institute of Technology |
Gu, Haiyu | Harbin Institute of Technology |
Wei, Cheng | Harbin Institute of Technology |
Keywords: Multifingered Hands, Tendon/Wire Mechanism, Computer Vision for Automation
Abstract: Current robotic sensing systems typically employ multiple sensors to obtain position and force information. This usually leads to challenges such as high cost and complex wiring. In this paper, a vision-based force/position fusion actuation-sensing scheme is proposed. The scheme can measure the angles and torques of all joints with only one low-cost camera. Through careful design of the actuation-sensing mechanism, the camera achieves high-resolution and high-bandwidth sensing. The proposed angle measurement model and external torque measurement model are evaluated through rigorous experiments. The experimental results indicate that the designed mechanism shows excellent repeatability and accuracy. The average error for all angles is less than 1 degree, and the average maximum relative error for torque is 4.43%.
|
|
10:00-10:05, Paper WeBT13.2 | |
BODex: Scalable and Efficient Robotic Dexterous Grasp Synthesis Using Bilevel Optimization |
|
Chen, Jiayi | Peking University |
Ke, Yubin | Peking University |
Wang, He | Peking University |
Keywords: Grasping, Multifingered Hands, Big Data in Robotics and Automation
Abstract: Robotic dexterous grasping is important for interacting with the environment. To unleash the potential of data-driven models for dexterous grasping, a large-scale, high-quality dataset is essential. While gradient-based optimization offers a promising way for constructing such datasets, previous works suffer from limitations, such as inefficiency, strong assumptions in the grasp quality energy, or limited object sets for experiments. Moreover, the lack of a standard benchmark for comparing different methods and datasets hinders progress in this field. To address these challenges, we develop a highly efficient synthesis system and a comprehensive benchmark with MuJoCo for dexterous grasping. We formulate grasp synthesis as a bilevel optimization problem, combining a novel lower-level quadratic programming (QP) with an upper-level gradient descent process. By leveraging recent advances in CUDA-accelerated robotic libraries and GPU-based QP solvers, our system can parallelize thousands of grasps and synthesize over 49 grasps per second on a single 3090 GPU. Our synthesized grasps for the Shadow, Allegro, and Leap hands all achieve a success rate above 75% in simulation, with a penetration depth under 1 mm, outperforming existing baselines on nearly all metrics. Compared to the previous large-scale dataset, DexGraspNet, our dataset significantly improves the performance of learning models, raising the simulated success rate from around 40% to 80%. Real-world testing of the trained model on the Shadow Hand achieves an 81% success rate across 20 diverse objects. The codes and datasets are released on our project page: https://pku-epic.github.io/BODex.
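The bilevel structure described above, a lower-level quadratic program nested inside an upper-level gradient descent, can be sketched on a toy problem. The code below is only a structural illustration under assumed toy data; BODex solves a constrained QP over contact forces in parallel on the GPU.

```python
# Structural sketch of bilevel optimization (toy problem, not the BODex formulation):
# the lower level solves a small regularized quadratic problem in closed form, and the
# upper level runs gradient descent (finite differences here) on its optimal value.
import numpy as np

rng = np.random.default_rng(0)
A0, A1 = rng.standard_normal((6, 4)), rng.standard_normal((6, 4))
b = rng.standard_normal(6)
lam = 0.1

def lower_level(theta):
    """Closed-form minimizer of 0.5*||A(theta) f - b||^2 + 0.5*lam*||f||^2 over f."""
    A = A0 + theta * A1
    f = np.linalg.solve(A.T @ A + lam * np.eye(4), A.T @ b)
    return A, f

def upper_objective(theta):
    """Value of the lower-level problem at its minimizer, as a function of theta."""
    A, f = lower_level(theta)
    return 0.5 * np.sum((A @ f - b) ** 2) + 0.5 * lam * np.sum(f ** 2)

theta, eps, lr = 0.0, 1e-5, 0.05
for _ in range(200):   # upper level: plain gradient descent on the bilevel objective
    grad = (upper_objective(theta + eps) - upper_objective(theta - eps)) / (2 * eps)
    theta -= lr * grad
print(f"theta = {theta:.3f}, objective = {upper_objective(theta):.4f}")
```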
|
|
10:05-10:10, Paper WeBT13.3 | |
DemoStart: Demonstration-Led Auto-Curriculum Applied to Sim-To-Real with Multi-Fingered Robots |
|
Bauza Villalonga, Maria | Massachusetts Institute of Technology |
Chen, Jose Enrique | DeepMind |
Dalibard, Valentin | Google DeepMind |
Gileadi, Nimrod | Google |
Hafner, Roland | Google DeepMind |
Martins, Murilo | DeepMind |
Moore, Joss | Google DeepMind |
Pevceviciute, Rugile | Deepmind |
Laurens, Antoine, Marin, Alix | EPFL |
Rao, Dushyant | Google DeepMind |
Zambelli, Martina | Google DeepMind |
Riedmiller, Martin | DeepMind |
Scholz, Jonathan | Google Deepmind |
Bousmalis, Konstantinos | DeepMind |
Nori, Francesco | Google DeepMind |
Heess, Nicolas | Google Deepmind |
Keywords: Multifingered Hands, Reinforcement Learning, Dexterous Manipulation
Abstract: We present DemoStart, a novel auto-curriculum reinforcement learning method capable of learning complex manipulation behaviors on an arm equipped with a three-fingered robotic hand, from only a sparse reward and a handful of demonstrations in simulation. Learning from simulation drastically reduces the development cycle of behavior generation, and domain randomization techniques are leveraged to achieve successful zero-shot sim-to-real transfer. Transferred policies are learned directly from raw pixels from multiple cameras and robot proprioception. Our approach outperforms policies learned from demonstrations on the real robot and requires 100 times fewer demonstrations, collected in simulation. More details and videos are available at https://sites.google.com/view/demostart.
|
|
10:10-10:15, Paper WeBT13.4 | |
Dexterous Assembly Using a Planar Hand Having Programmable Passive Compliance |
|
Frye, Jacob | Marquette University |
Schimmels, Joseph | Marquette University |
Keywords: Compliance and Impedance Control, Multifingered Hands, Dexterous Manipulation
Abstract: Special purpose compliant end-effectors are effective in realizing task-appropriate passive compliance. This paper presents a programmable, 3-fingered, antagonistic, compliant hand (P3ACH) capable of realizing a desired compliant behavior within a large space of multidirectional compliant behaviors. Manipulation dexterity is demonstrated by performing different assembly tasks faster, more robustly, and with lower contact forces than an active system realizing the same compliant behavior.
|
|
10:15-10:20, Paper WeBT13.5 | |
GAGrasp: Geometric Algebra Diffusion for Dexterous Grasping |
|
Zhong, Tao | Princeton University |
Allen-Blanchette, Christine | Princeton University |
Keywords: Multifingered Hands, Deep Learning in Grasping and Manipulation, Dexterous Manipulation
Abstract: We propose GAGrasp, a novel framework for dexterous grasp generation that leverages geometric algebra representations to enforce equivariance to SE(3) transformations. By encoding the SE(3) symmetry constraint directly into the architecture, our method improves data and parameter efficiency while enabling robust grasp generation across diverse object poses. Additionally, we incorporate a differentiable physics-informed refinement layer, which ensures that generated grasps are physically plausible and stable. Extensive experiments demonstrate the model's superior performance in generalization, stability, and adaptability compared to existing methods.
|
|
10:20-10:25, Paper WeBT13.6 | |
Model Q-II: An Underactuated Hand with Enhanced Grasping Modes and Primitives for Dexterous Manipulation |
|
Dong, Yinkai | Harvard University |
Kim, Jehyeok | Yale University |
Patel, Vatsal | Yale University |
Feng, Huijuan | Southern University of Science and Technology |
Dollar, Aaron | Yale University |
Keywords: Grippers and Other End-Effectors, Mechanism Design, Multifingered Hands
Abstract: This paper introduces Model Q-II, an enhanced underactuated robotic hand designed to improve dexterous manipulation through expanded grasping modes and manipulation primitives. The Model Q-II incorporates tripod and enhanced power grasping modes, achieving increased versatility without additional actuators. The design employs passive mechanisms, such as lateral contact walls and a finger-locking system, to facilitate seamless transitions between modes, enabling precise pinch-to-tripod and pinch-to-power gating. These enhancements allow the hand to perform complex in-hand manipulations, including multi-directional object positioning. Theoretical analysis, simulations, and experimental evaluations validate the hand's performance, demonstrating improved grasping force, range, and manipulation capabilities. The results highlight Model Q-II's ability to handle various tasks, offering a robust, cost-effective solution for applications requiring both precise and powerful grasping.
|
|
10:25-10:30, Paper WeBT13.7 | |
Canonical Representation and Force-Based Pretraining of 3D Tactile for Dexterous Visuo-Tactile Policy Learning |
|
Wu, Tianhao | Peking University |
Li, Jinzhou | Cornell University |
Zhang, Jiyao | Peking University |
Mingdong Wu, Aaron | Peking University |
Dong, Hao | Peking University |
Keywords: Dexterous Manipulation, Multifingered Hands, Force and Tactile Sensing
Abstract: Tactile sensing plays a vital role in enabling robots to perform fine-grained, contact-rich tasks. However, the high dimensionality of tactile data, due to the large coverage on dexterous hands, poses significant challenges for effective tactile feature learning, especially for 3D tactile data, as there are no large standardized datasets and no strong pretrained backbones. To address these challenges, we propose a novel canonical representation that reduces the difficulty of 3D tactile feature learning and further introduces a force-based self-supervised pretraining task to capture both local and net force features, which are crucial for dexterous manipulation. Our method achieves an average success rate of 78% across four fine-grained, contact-rich dexterous manipulation tasks in real-world experiments, demonstrating effectiveness and robustness compared to other methods. Further analysis shows that our method fully utilizes both spatial and force information from 3D tactile data to accomplish the tasks. The videos can be viewed at https://3dtacdex.github.io.
|
|
WeBT14 |
402 |
Tracking and Prediction 3 |
Regular Session |
Co-Chair: Vitzilaios, Nikolaos | University of South Carolina |
|
09:55-10:00, Paper WeBT14.1 | |
Dynamic Compact Consensus Tracking for Aerial Robots |
|
Sun, XiaoLou | Southeast University |
Quan, Zhibin | Southeast University |
Zhang, Feng | Nanjing University of Posts and Telecommunications |
Li, Yuntian | PML |
Wang, Chunyan | Purple Mountain Laboratories |
Si, Wufei | Purple Mountain Laboratories |
Ni, Wenhui | Purple Mountain Laboratory |
Guan, Runwei | University of Liverpool |
Wu, Yuan | Purple Mountain Lab |
Meng, Shen | Purple Mountain Laboratories |
Huang, YongMing | PML |
Keywords: Visual Tracking, Deep Learning Methods, Visual Learning
Abstract: Existing one-stream trackers have attracted widespread attention. However, they are not applicable to real-time UAV tracking systems due to substantial computational overhead, especially when dynamic templates are introduced. To address this issue, we propose a novel Dynamic Compact Consensus Tracker (DC2T), constructed by stacking modules, each consisting of a Compact Token Encoder (CTE) and Dynamic Consensus Attention (DCA). Unlike traditional methods that convert images into a large number of tokens, the CTE, inspired by "superpixels", extracts a compact set of representative tokens from both initial and dynamic templates, eliminating the need for a large token set. This strategic reduction in the number of compact tokens markedly decreases the computational load of the CTE, enhancing the efficiency of subsequent attention operations. To achieve near-linear complexity in the DCA, compact dynamic template tokens (as keys) are re-queried by search tokens (as queries) to perform dynamic consensus on the aggregated tokens (as values). This arrangement seamlessly incorporates dynamic spatio-temporal features into the DCA while avoiding the computational burden typically associated with dynamic templates. To further enhance the system's responsiveness and accuracy, a direct control network is crafted to seamlessly incorporate the prediction of high-level control values into the tracking network, ensuring a cohesive and efficient interaction with the controller. Comprehensive experiments and real-world evaluations demonstrate DC2T's superior performance, accompanied by a significant reduction in FLOPs. Furthermore, we have conducted experiments demonstrating the tracker's ability to integrate seamlessly with other technologies such as SLAM and detection, enabling precise tracking of arbitrary objects. The tracker code will be released at https://github.com/xiaolousun/refine-pytracking.git.
|
|
10:00-10:05, Paper WeBT14.2 | |
CGTrack: Cascade Gating Network with Hierarchical Feature Aggregation for UAV Tracking |
|
Li, Weihong | University of Chinese Academy of Sciences |
Liu, Xiaoqiong | University of North Texas |
Fan, Heng | University of North Texas |
Zhang, Libo | Iscas |
Keywords: Visual Tracking, Computer Vision for Automation, Visual Learning
Abstract: Recent advancements in visual object tracking have markedly improved the capabilities of unmanned aerial vehicle (UAV) tracking, which is a critical component in real-world robotics applications. While the integration of hierarchical lightweight networks has become a prevalent strategy for enhancing efficiency in UAV tracking, it often results in a significant drop in network capacity, which further exacerbates challenges in UAV scenarios, such as frequent occlusions and extreme changes in viewing angles. To address these issues, in this paper we introduce a novel family of UAV trackers, termed CGTrack, which combines both explicit and implicit techniques to expand network capacity within a coarse-to-fine framework. Specifically, we first introduce a Hierarchical Feature Cascade (HFC) module that leverages the spirit of feature reuse to increase network capacity by integrating deep semantic cues with rich spatial information, incurring minimal computational costs while enhancing feature representation. Based on this, we design a novel Lightweight Gated Center Head (LGCH) that utilizes gating mechanisms to decouple target-oriented coordinates from previously expanded features, which contain dense local discriminative information. Extensive experiments on three challenging UAV tracking benchmarks demonstrate that CGTrack achieves state-of-the-art performance while running fast. Code will be available at https://github.com/NightwatchFox11/CGTrack.
|
|
10:05-10:10, Paper WeBT14.3 | |
Tracking Everything in Robotic-Assisted Surgery |
|
Zhan, Bohan | Imperial College London |
Zhao, Wang | Tsinghua University |
Fang, Yi | New York University |
Du, Bo | Wuhan University |
Vasconcelos, Francisco | University College London |
Stoyanov, Danail | University College London |
Elson, Daniel | Imperial College London |
Huang, Baoru | Imperial College London |
Keywords: Computer Vision for Medical Robotics, Surgical Robotics: Laparoscopy
Abstract: Accurate tracking of tissues and instruments in videos is crucial for Robotic-Assisted Minimally Invasive Surgery (RAMIS), as it enables the robot to comprehend the surgical scene with precise locations and interactions of tissues and tools. Traditional keypoint-based sparse tracking is limited by feature points, while flow-based dense two-view matching suffers from long-term drift. Recently, the Tracking Any Point (TAP) algorithm was proposed to overcome these limitations and achieve dense, accurate long-term tracking. However, its efficacy in surgical scenarios remains untested, largely due to the lack of a comprehensive surgical tracking dataset for evaluation. To address this gap, we introduce a new annotated surgical tracking dataset for benchmarking tracking methods in surgical scenarios, comprising real-world surgical videos with complex tissue and instrument motions. We extensively evaluate state-of-the-art (SOTA) TAP-based algorithms on this dataset and reveal their limitations in challenging surgical scenarios, including fast instrument motion, severe occlusions, and motion blur. Furthermore, we propose a new tracking method, namely SurgMotion, to address these challenges and further improve tracking performance. Our proposed method outperforms most TAP-based algorithms in surgical instrument tracking, and in particular demonstrates significant improvements over baselines on challenging medical videos.
|
|
10:10-10:15, Paper WeBT14.4 | |
LaMOT: Language-Guided Multi-Object Tracking |
|
Li, Yunhao | University of Chinese Academy of Sciences |
Liu, Xiaoqiong | University of North Texas |
Liu, Luke | Centennial High School |
Fan, Heng | University of North Texas |
Zhang, Libo | Iscas |
Keywords: Visual Tracking, Computer Vision for Automation, Visual Learning
Abstract: Vision-Language MOT is a critical tracking problem that has recently garnered increasing attention. It aims to track objects based on human language commands, displacing the traditional use of templates or pre-set information from training sets in conventional tracking tasks. However, a key challenge remains in understanding why language is used for tracking, hindering further development. In this paper, we introduce Language-Guided MOT, a unified task framework, and LaMOT, a corresponding large-scale benchmark, which encompasses diverse scenarios and language descriptions and comprises 1,660 sequences from 4 different datasets. The purpose of LaMOT is to unify various Vision-Language MOT tasks while providing a standardized evaluation platform. To ensure high-quality annotations, we manually assign appropriate descriptive texts to each target in every video and conduct careful inspection and correction. To our knowledge, LaMOT is the first benchmark dedicated to Language-Guided MOT. Additionally, we propose a simple yet effective tracker, termed LaMOTer. By establishing a unified task framework, providing challenging benchmarks, and offering insights for future algorithm design and evaluation, we expect to contribute to the advancement of research in Vision-Language MOT. We will release the data at https://github.com/Nathan-Li123/LaMOT.
|
|
10:15-10:20, Paper WeBT14.5 | |
Real-Time UAV Tracking: A Comparative Study of YOLOv8 with Object Tracking Algorithms |
|
Russo, Tyler | University of South Carolina |
Vitzilaios, Nikolaos | University of South Carolina |
Keywords: Visual Tracking
Abstract: Unmanned Aerial Vehicle (UAV) usage has rapidly increased, leading to an effort to accurately and efficiently track UAVs. Many existing approaches utilize YOLO, a state-of-the-art object detection model, in conjunction with object tracking algorithms to detect and follow UAVs in real-time. However, these systems typically focus on a single method, without considering alternative tracking methods. In this paper, we present an experimental comparison of multiple object tracking algorithms integrated with YOLOv8, offering a comprehensive evaluation of their performance in UAV tracking scenarios. First, the model size was optimized to determine the best balance between speed and accuracy. Then, various tracking methods are tested to determine the most effective combination. The YOLOv8 model combined with a Kernelized Correlation Filter outperformed various other trackers in varying environmental scenarios, with a combined success rate and tracking accuracy of 0.8041. This approach was further implemented in real-time on a Jetson Orin Nano GPU, utilizing a pan-tilt gimbal and an Intel RealSense D435i camera. Running at 20 FPS, the system demonstrated robustness and stability during motion and various environmental scenarios, highlighting its potential for integration into applications such as ground-based UAV surveillance.
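The detect-then-track pattern evaluated in this paper can be sketched as follows, assuming the ultralytics YOLOv8 API and the KCF tracker from opencv-contrib-python; the model file, video name, and 30-frame re-detection interval are illustrative assumptions rather than the paper's settings.

```python
# Minimal sketch of the detect-then-track pattern: YOLOv8 (re-)detects the UAV
# periodically, and a fast KCF tracker follows it between detections.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                 # small model for real-time use (assumed weights file)
cap = cv2.VideoCapture("uav_clip.mp4")     # assumed input clip
tracker, frame_idx = None, 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if tracker is None or frame_idx % 30 == 0:      # (re-)detect every 30 frames
        result = model(frame, verbose=False)[0]
        if len(result.boxes) > 0:
            x1, y1, x2, y2 = result.boxes.xyxy[0].tolist()
            tracker = cv2.TrackerKCF_create()
            tracker.init(frame, (int(x1), int(y1), int(x2 - x1), int(y2 - y1)))
    elif tracker is not None:
        ok, bbox = tracker.update(frame)             # fast KCF update between detections
        if not ok:
            tracker = None                           # lost target: fall back to detection
    frame_idx += 1
cap.release()
```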
|
|
10:20-10:25, Paper WeBT14.6 | |
MoD-SLAM: Monocular Dense Mapping for Unbounded 3D Scene Reconstruction |
|
Zhou, Heng | Columbia University |
Guo, Zhetao | Cloudspace Technology Co., Ltd |
Yuxiang, Ren | Beijing Dianjing Ciyuan Culture Communication Co., Ltd. |
Liu, Shuhong | The University of Tokyo |
Zhang, Lechen | Columbia University |
Zhang, Kaidi | Columbia University |
Li, Mingrui | Dalian University of Technology |
Keywords: SLAM, Mapping, Localization
Abstract: Monocular SLAM has received a lot of attention due to its simple RGB inputs and freedom from complex sensor constraints. However, existing monocular SLAM systems lack accurate depth estimation, which limits tracking and mapping accuracy. To address this limitation, we propose MoD-SLAM, the first monocular NeRF-based dense mapping method that allows real-time 3D reconstruction in unbounded scenes. Specifically, we introduce a depth estimation module in the front-end to extract accurate prior depth values to supervise the mapping and tracking processes. This strategy is essential to improving SLAM performance. Moreover, a Gaussian-based unbounded scene representation is designed to address the challenge of mapping scenes without boundaries. By introducing a robust depth loss term into the tracking process, our SLAM system achieves more precise pose estimation in large-scale scenes. Our experiments on two standard datasets show that MoD-SLAM achieves competitive performance, improving the accuracy of 3D reconstruction and localization by up to 30% and 15%, respectively, compared with existing monocular SLAM systems.
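A robust depth term of the kind mentioned above can be sketched as a Huber-style penalty between rendered depth and the monocular depth prior, so that outlier pixels in the prior are down-weighted. The snippet below is an illustration under assumed names and values, not the MoD-SLAM loss.

```python
# Illustrative sketch: a robust (Huber-style) depth residual between rendered
# depth and a monocular depth prior, of the kind a tracking objective could add
# to down-weight outliers in the prior.
import numpy as np

def robust_depth_loss(rendered, prior, valid_mask, delta=0.1):
    """Huber loss over valid pixels; large residuals grow linearly, not quadratically."""
    r = (rendered - prior)[valid_mask]
    quad = 0.5 * r ** 2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return float(np.mean(np.where(np.abs(r) <= delta, quad, lin)))

rendered = np.random.rand(120, 160) * 3.0
prior = rendered + np.random.randn(120, 160) * 0.02
prior[10:20, 10:20] += 2.0                       # a patch of bad prior depth
mask = prior > 0
print(f"robust depth loss: {robust_depth_loss(rendered, prior, mask):.4f}")
```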
|
|
10:25-10:30, Paper WeBT14.7 | |
A Certifiable Algorithm for Simultaneous Shape Estimation and Object Tracking |
|
Shaikewitz, Lorenzo | Massachusetts Institute of Technology |
Ubellacker, Samuel | Massachusetts Institute of Technology |
Carlone, Luca | Massachusetts Institute of Technology |
Keywords: RGB-D Perception, Visual Tracking, Optimization and Optimal Control
Abstract: Applications from manipulation to autonomous vehicles rely on robust and general object tracking to safely perform tasks in dynamic environments. We propose the first certifiably optimal category-level approach for simultaneous shape estimation and pose tracking of an object of known category (e.g. a car). Our approach uses 3D semantic keypoint measurements extracted from an RGB-D image sequence, and phrases the estimation as a fixed-lag smoothing problem. Temporal constraints enforce the object's rigidity (fixed shape) and smooth motion according to a constant-twist motion model. The solutions to this problem are the estimates of the object's state (poses, velocities) and shape (parameterized according to the active shape model) over the smoothing horizon. Our key contribution is to show that despite the non-convexity of the fixed-lag smoothing problem, we can solve it to certifiable optimality using a small-size semidefinite relaxation. We also present a fast outlier rejection scheme that filters out incorrect keypoint detections with shape and time compatibility tests, and wrap our certifiable solver in a graduated non-convexity scheme. We evaluate the proposed approach on synthetic and real data, showcasing its performance in a table-top manipulation scenario and a drone-based vehicle tracking application.
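The constant-twist motion model referenced above is standard SE(3) kinematics: each pose in the smoothing horizon is the previous pose composed with the exponential of a fixed body twist scaled by the time step. A minimal sketch (not the authors' solver) follows.

```python
# Small sketch of a constant-twist motion model: the pose at each step is propagated
# by the exponential of a fixed body twist times the time step.
import numpy as np
from scipy.linalg import expm

def hat(xi):
    """se(3) hat operator: 6-vector twist [v, w] -> 4x4 matrix."""
    v, w = xi[:3], xi[3:]
    W = np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])
    X = np.zeros((4, 4))
    X[:3, :3], X[:3, 3] = W, v
    return X

def propagate(T0, twist, dt, steps):
    """Poses under a constant body twist over `steps` intervals of length `dt`."""
    step = expm(hat(twist) * dt)
    poses, T = [T0], T0.copy()
    for _ in range(steps):
        T = T @ step
        poses.append(T.copy())
    return poses

T0 = np.eye(4)
twist = np.array([0.5, 0.0, 0.0, 0.0, 0.0, 0.2])   # 0.5 m/s forward, 0.2 rad/s yaw
trajectory = propagate(T0, twist, dt=0.1, steps=10)
print(trajectory[-1][:3, 3])                        # final position
```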
|
|
WeBT15 |
403 |
Surgical Robotics: Laparoscopy |
Regular Session |
|
09:55-10:00, Paper WeBT15.1 | |
Hypergraph-Transformer (HGT) for Interaction Event Prediction in Laparoscopic and Robotic Surgery |
|
Yin, Lianhao | MIT |
Ban, Yutong | Shanghai Jiao Tong University |
Eckhoff, Jennifer A | MGH |
Meireles, Ozanan | MGH |
Rus, Daniela | MIT |
Rosman, Guy | Massachusetts Institute of Technology |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics, Surgical Robotics: Laparoscopy
Abstract: Understanding and anticipating events and actions is critical for intraoperative assistance and decision-making during minimally invasive surgery. We propose a predictive neural network that is capable of understanding and predicting critical interaction aspects of surgical workflow from endoscopic, intracorporeal video data, while flexibly leveraging surgical knowledge graphs. The approach incorporates a hypergraph-transformer (HGT) structure that encodes expert knowledge into the network design and predicts the hidden embedding of the graph. We verify our approach on established surgical datasets and applications, including the prediction of action triplets and the achievement of the Critical View of Safety (CVS), a critical safety measure. Moreover, we address specific, safety-related forecasts of surgical processes, such as predicting the clipping of the cystic duct or artery without prior achievement of the CVS. Our results demonstrate improved prediction of interaction events with our approach compared to unstructured alternatives.
|
|
10:00-10:05, Paper WeBT15.2 | |
Robotic Flexible Magnetic Retractor for Dynamic Tissue Manipulation in Endoscopic Submucosal Dissection |
|
Chan, Wai Shing | The Chinese University of Hong Kong |
Sun, Yichong | The Chinese University of Hong Kong |
Li, Yehui | The Chinese University of Hong Kong |
Li, Jixiu | The Chinese University of Hong Kong |
Yip, Hon Chi | The Chinese University of Hong Kong |
Chiu, Philip, Wai-yan | Chinese University of Hong Kong |
Li, Zheng | The Chinese University of Hong Kong |
Keywords: Medical Robots and Systems, Surgical Robotics: Laparoscopy, Surgical Robotics: Steerable Catheters/Needles
Abstract: Endoscopic submucosal dissection (ESD) is a procedure targeted at early gastrointestinal cancer. Traction plays a crucial role in enhancing the efficiency of cutting lesions, thereby reducing procedural complexity and duration. Among traction devices, current non-magnetic ones complicate the workspace during directional tissue manipulation, while current magnetic traction devices cannot be prepared before the procedure and require withdrawing the endoscope midway to re-introduce the magnetic retractor to the lesion site. To address these shortcomings, this paper introduces a robotic flexible magnetic retractor designed for tissue manipulation during ESD. The flexible prototype can be seamlessly inserted through the instrument channel of an endoscope to the lesion site without the need for endoscope withdrawal. Moreover, the introduction of robotic magnetic actuation enhances the agile control of magnetic retractors while alleviating the surgeon's workload in magnetic-retractor-assisted ESD. The experimental results validate the functionality and efficacy of the prototype magnetic retractor in magnetic traction-assisted ESD procedures. The retractor demonstrated its ability to provide adequate traction and accomplish clinical tasks. This innovative approach holds promise for enhancing the efficiency and outcomes of ESD procedures, offering a compelling alternative to traditional traction methods.
|
|
10:05-10:10, Paper WeBT15.3 | |
Leveraging Surgical Activity Grammar for Primary Intention Prediction in Laparoscopy Procedures |
|
Zhang, Jie | Huazhong University of Science and Technology |
Zhou, Song | Huazhong University of Science and Technology |
Wang, Yiwei | Huazhong University of Science and Technology |
Wan, Chidan | Huazhong University of Science and Technology |
Zhao, Huan | Huazhong University of Science and Technology |
Cai, Xiong | Huazhong University of Science and Technology |
Ding, Han | Huazhong University of Science and Technology |
Keywords: Surgical Robotics: Laparoscopy, Surgical Robotics: Planning, Recognition
Abstract: Surgical procedures are inherently complex and dynamic, with intricate dependencies and various execution paths. Accurate identification of the intentions behind critical actions, referred to as Primary Intentions (PIs), is crucial to understanding and planning the procedure. This paper presents a novel framework that advances PI recognition in instructional videos by combining top-down grammatical structure with bottom-up visual cues. The grammatical structure is based on a rich corpus of surgical procedures, offering a hierarchical perspective on surgical activities. A grammar parser, utilizing the surgical activity grammar, processes visual data obtained from laparoscopic images through surgical action detectors, ensuring a more precise interpretation of the visual information. Experimental results on the benchmark dataset demonstrate that our method outperforms existing surgical activity detectors that rely solely on visual features. Our research provides a promising foundation for developing advanced robotic surgical systems with enhanced planning and automation capabilities.
|
|
10:10-10:15, Paper WeBT15.4 | |
SLAM Assisted 3D Tracking System for Laparoscopic Surgery |
|
Song, Jingwei | University of Michigan |
Zhang, Ray | University of Michigan |
Zhang, Wenwei | Wuhan United Imaging Surgical Co., Ltd |
Zhou, Hao | Shanghai United Imaging Healthcare Advanced Technology Research |
Ghaffari, Maani | University of Michigan |
Keywords: Surgical Robotics: Laparoscopy, Visual Tracking, SLAM
Abstract: A major limitation of minimally invasive surgery is the difficulty in accurately locating the internal anatomical structures of the target organ due to the lack of tactile feedback and transparency. Augmented reality (AR) offers a promising solution to overcome this challenge. Numerous studies have shown that combining learning-based and geometric methods can achieve accurate preoperative and intraoperative data registration. This work proposes a real-time monocular 3D tracking algorithm for post-registration tasks. The ORB-SLAM2 framework is adopted and modified for prior-based 3D tracking. The primitive 3D shape is used for fast initialization of the ORB-SLAM2 monocular mode. A pseudo-segmentation strategy is employed to separate the target organ from the background for tracking, and the 3D shape is incorporated as a geometric prior in its pose graph optimization. Experiments from in-vivo and ex-vivo tests demonstrate that the proposed 3D tracking system provides robust 3D tracking and effectively handles typical challenges such as fast motion, out-of-field-of-view scenarios, partial visibility, and "organ-background" relative motion.
|
|
10:15-10:20, Paper WeBT15.5 | |
SurgPose: Generalisable Surgical Instrument Pose Estimation Using Zero-Shot Learning and Stereo Vision |
|
Rai, Utsav | Imperial College London |
Xu, Haozheng | Imperial College London |
Giannarou, Stamatia | Imperial College London |
Keywords: Surgical Robotics: Laparoscopy, Localization, Visual Tracking
Abstract: Accurate pose estimation of surgical tools in Robot-assisted Minimally Invasive Surgery (RMIS) is essential for surgical navigation and robot control. While traditional marker-based methods offer accuracy, they face challenges with occlusions, reflections, and tool-specific designs. Similarly, supervised learning methods require extensive training on annotated datasets, limiting their adaptability to new tools. Despite their success in other domains, zero-shot pose estimation models remain unexplored in RMIS for pose estimation of surgical instruments, creating a gap in generalising to unseen surgical tools. This paper presents a novel 6 Degrees of Freedom (DoF) pose estimation pipeline for surgical instruments, leveraging state-of-the-art zero-shot RGB-D models such as FoundationPose and SAM-6D. We advance these models by incorporating vision-based depth estimation using the RAFT-Stereo method, yielding robust depth in reflective and textureless environments. Additionally, we enhance SAM-6D by replacing its instance segmentation module, the Segment Anything Model (SAM), with a fine-tuned Mask R-CNN, significantly boosting segmentation accuracy in occluded and complex conditions. Extensive validation reveals that our enhanced SAM-6D surpasses FoundationPose in zero-shot pose estimation of unseen surgical instruments, setting a new benchmark for zero-shot RGB-D pose estimation in RMIS. This work enhances the generalisability of pose estimation for unseen objects and pioneers the application of RGB-D zero-shot methods in RMIS.
|
|
10:20-10:25, Paper WeBT15.6 | |
Design and Effectiveness of Virtual Monitors and AR-Based Endoscope Control for Robotically Assisted Laparoscopic Surgery |
|
Budjakoski, Nikola | ImFusion GmbH |
Schneider, Dominik | German Aerospace Center (DLR) |
Song, Tianyu | Technical University of Munich |
Sommersperger, Michael | Technical University of Munich |
Weber, Bernhard | German Aerospace Center |
Navab, Nassir | TU Munich |
Klodmann, Julian | German Aerospace Center |
Keywords: Surgical Robotics: Laparoscopy, Virtual Reality and Interfaces
Abstract: Managing indirect access in laparoscopy as a minimally invasive procedure poses challenges to physicians. In particular, an endoscope must be navigated to achieve adequate visualization of the surgical anatomy, while coping with unergonomic poses, tremor, and fatigue. Furthermore, the alignment of visual perception and physical movement, dictated by the endoscope's position relative to the monitor, can lead to hand-eye coordination challenges. We propose unified deployment of a robotic endoscope holder together with an augmented reality display to counteract the aforementioned challenges in laparoscopy. Our augmented reality system provides an interactive, stereoscopic, virtual monitor displaying an endoscopic stream. In addition, our method design enables direct control of the robotic endoscope holder. Our user study demonstrates the potential of the proposed method to significantly improve hand-eye coordination, while insights from our usability study for robotic control indicate promising trends, including high usability and low cognitive demand.
|
|
10:25-10:30, Paper WeBT15.7 | |
MEDiC: Autonomous Surgical Robotic Assistance to Maximizing Exposure for Dissection and Cautery |
|
Liang, Xiao | University of California San Diego |
Wang, Chung-Pang | University of California, San Diego |
Shinde, Nikhil | University of California San Diego |
Liu, Fei | University of Tennessee Knoxville |
Richter, Florian | University of California, San Diego |
Yip, Michael C. | University of California, San Diego |
Keywords: Surgical Robotics: Laparoscopy, Surgical Robotics: Planning, Medical Robots and Systems
Abstract: Surgical automation has the capability to improve the consistency of patient outcomes and broaden access to advanced surgical care in underprivileged communities. Shared autonomy, where the robot automates routine subtasks while the surgeon retains partial teleoperative control, offers great potential to make an impact. In this paper we focus on one important skill within surgical shared autonomy: automating robotic assistance to maximize visual exposure and apply tissue tension for dissection and cautery. Ensuring consistent exposure to visualize the surgical site is crucial for both efficiency and patient safety. However, achieving this is highly challenging due to the complexities of manipulating the deformable volumetric tissues that are prevalent in surgery. To address these challenges we propose MEDiC, a framework for autonomous surgical robotic assistance that maximizes exposure for dissection and cautery. We integrate a differentiable physics model with perceptual feedback to achieve our two key objectives: 1) maximizing tissue exposure and applying tension for a specified dissection site through visual-servoing control, and 2) selecting optimal control positions for a dissection target based on deformable Jacobian analysis. We quantitatively assess our method through repeated real-robot experiments on a tissue phantom, and showcase its capabilities through dissection experiments using shared autonomy on real animal tissue.
|
|
WeBT16 |
404 |
Deformable Object Manipulation |
Regular Session |
Chair: Hoffmann, Matej | Czech Technical University in Prague, Faculty of Electrical Engineering |
|
09:55-10:00, Paper WeBT16.1 | |
DeformPAM: Data-Efficient Learning for Long-Horizon Deformable Object Manipulation Via Preference-Based Action Alignment |
|
Chen, Wendi | Shanghai Jiao Tong University |
Xue, Han | Shanghai Jiao Tong University |
Zhou, Fangyuan | Shanghai Jiao Tong University |
Fang, Yuan | Shanghai Jiaotong University |
Lu, Cewu | ShangHai Jiao Tong University |
Keywords: Learning from Demonstration, Imitation Learning, Bimanual Manipulation
Abstract: In recent years, imitation learning has made progress in the field of robotic manipulation. However, it still faces challenges when dealing with complex long-horizon deformable object tasks, such as high-dimensional state spaces, complex dynamics, and multimodal action distributions. Traditional imitation learning methods often require a large amount of data and encounter distributional shifts and accumulative errors in these tasks. To address these issues, we propose a data-efficient general learning framework (DeformPAM) based on preference learning and reward-guided action selection. DeformPAM decomposes long-horizon tasks into multiple action primitives, utilizes 3D point cloud inputs and diffusion models to model action distributions, and trains an implicit reward model using human preference data. During the inference phase, the reward model scores multiple candidate actions, selecting the optimal action for execution, thereby reducing the occurrence of anomalous actions and improving task completion quality. Experiments conducted on three challenging real-world long-horizon deformable object manipulation tasks demonstrate the effectiveness of this method. Results show that DeformPAM improves both task completion quality and efficiency compared to baseline methods even with limited data. Code and data will be available at deform-pam.robotflow.ai.
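The reward-guided action selection step described above reduces to sampling several candidate actions from the learned policy, scoring each with the implicit reward model, and executing the best one. The sketch below illustrates only that selection logic, with toy stand-ins for the policy and reward model; all names are placeholders, not the DeformPAM code.

```python
# Minimal sketch of reward-guided action selection: sample candidates, score them
# with a learned reward model, execute the highest-scoring one.
import numpy as np

def sample_candidate_actions(policy, observation, num_candidates=16):
    """Draw several candidate action primitives from a (stochastic) policy."""
    return [policy(observation) for _ in range(num_candidates)]

def select_action(policy, reward_model, observation, num_candidates=16):
    """Score candidates with the learned reward model and return the best one."""
    candidates = sample_candidate_actions(policy, observation, num_candidates)
    scores = [reward_model(observation, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

# Toy stand-ins: a noisy policy over 2-D pick points and a reward preferring the center.
rng = np.random.default_rng(0)
policy = lambda obs: rng.uniform(-1.0, 1.0, size=2)
reward_model = lambda obs, a: -float(np.linalg.norm(a))
best = select_action(policy, reward_model, observation=None)
print("selected action:", best)
```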
|
|
10:00-10:05, Paper WeBT16.2 | |
Autonomous Bimanual Manipulation of Deformable Objects Using Deep Reinforcement Learning Guided Adaptive Control |
|
Liu, Jiayi | Huazhong University of Science and Technology |
Yang, Sihang | Huazhong University of Science and Technology |
Wang, Yiwei | Huazhong University of Science and Technology |
Zhao, Huan | Huazhong University of Science and Technology |
Ding, Han | Huazhong University of Science and Technology |
Keywords: Medical Robots and Systems, Surgical Robotics: Laparoscopy, Deep Learning in Grasping and Manipulation
Abstract: Deformable object manipulation (DOM), a common subtask in various surgical procedures, remains a challenge in robot-assisted surgery (RAS) due to complex nonlinear deformation. This paper proposes a model-free framework, deep reinforcement learning guided adaptive control (RLAC), which combines learning-based and Jacobian-based methods so that they complement each other for optimized performance. We harness samples from a deep reinforcement learning (DRL) policy explored in simulation to obtain a reasonable estimate of the initial deformation Jacobian. In early control iterations, the actions suggested by the DRL agent are adopted until the estimated real-time Jacobian approximates the actual deformation model. Subsequently, the independent Jacobian-based adaptive control (AC), now with sufficient initial deformation awareness, takes over to achieve precise internal feature manipulation on deformable objects. Experimental results demonstrate that our method enables more efficient positioning and exhibits near-optimal positioning paths. RLAC, with robust sim-to-real performance, provides a feasible approach to complex autonomous DOM in the real world.
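The Jacobian-based adaptive control stage described above can be illustrated with a Broyden-style rank-one update that refines a deformation Jacobian estimate as feature motions are observed. In the sketch below the initial estimate is random noise added to a ground-truth matrix; in the paper it comes from DRL rollouts, and all names are assumptions rather than the RLAC implementation.

```python
# Illustrative sketch: online refinement of a deformation Jacobian with a
# Broyden-style rank-one update, used inside a simple Jacobian-based servo loop.
import numpy as np

def broyden_update(J, dx, dq, eps=1e-9):
    """Rank-one correction so that the updated Jacobian maps dq to the observed dx."""
    denom = float(dq @ dq) + eps
    return J + np.outer(dx - J @ dq, dq) / denom

def servo_step(J, feature_error, gain=0.5):
    """Jacobian-based control: joint/gripper increment reducing the feature error."""
    return -gain * np.linalg.pinv(J) @ feature_error

rng = np.random.default_rng(1)
J_true = rng.standard_normal((2, 3))            # unknown "real" deformation model
J = J_true + 0.5 * rng.standard_normal((2, 3))  # rough initial estimate
x = np.array([1.0, -0.5])                       # feature error to drive to zero
for _ in range(30):
    dq = servo_step(J, x)
    dx = J_true @ dq                            # observed feature change (simulated)
    J = broyden_update(J, dx, dq)
    x = x + dx
print("remaining feature error norm:", np.linalg.norm(x))
```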
|
|
10:05-10:10, Paper WeBT16.3 | |
Embedded IPC: Fast and Intersection-Free Simulation in Reduced Subspace for Robot Manipulation |
|
Du, Wenxin | University of California, Los Angeles |
Yu, Chang | University of California, Los Angeles |
Ma, Siyu | University of California, Los Angeles |
Jiang, Ying | University of California, Los Angeles |
Zong, Zeshun | University of California, Los Angeles |
Yang, Yin | University of Utah |
Masterjohn, Joseph | Toyota Research Institute |
Castro, Alejandro | Toyota Research Institute |
Han, Xuchen | Toyota Research Institute |
Jiang, Chenfanfu | University of California, Los Angeles |
Keywords: Simulation and Animation, Contact Modeling
Abstract: Physics-based simulation is essential for developing and evaluating robot manipulation policies, particularly in scenarios involving deformable objects and complex contact interactions. However, existing simulators often struggle to balance computational efficiency with numerical accuracy, especially when modeling deformable materials with frictional contact constraints. We introduce an efficient subspace representation for the Incremental Potential Contact (IPC) method, leveraging model reduction to decrease the number of degrees of freedom. Our approach decouples simulation complexity from the resolution of the input model by representing elasticity in a low-resolution subspace while maintaining collision constraints on an embedded high-resolution surface. Our barrier formulation ensures intersection-free trajectories and configurations regardless of material stiffness, time step size, or contact severity. We validate our simulator through quantitative experiments with a soft bubble gripper grasping and qualitative demonstrations of placing a plate on a dish rack. The results demonstrate our simulator's efficiency, physical accuracy, computational stability, and robust handling of frictional contact, making it well-suited for generating demonstration data and evaluating downstream robot training applications.
|
|
10:10-10:15, Paper WeBT16.4 | |
A Highly Robust Contact Sensor for Precise Contact Detection of Fabric |
|
Ling, Zhengrong | The Hong Kong University of Science and Technology |
Hong, Lanxuan | Hkust |
Yang, Xiong | Hong Kong University of Science and Technology |
Tang, Yifeng | City University of Hong Kong |
Guo, Dong | City University of Hong Kong |
Shen, Yajing | The Hong Kong University of Science and Technology |
Keywords: Industrial Robots, Perception for Grasping and Manipulation, Contact Modeling
Abstract: Automation in the apparel and textile industry has long been pursued. However, accurately locating the surface of a fabric remains a challenge, limiting automation in sorting, packaging, and other processes. When humans locate clothing, they rely on contact feedback to determine the exact position of the clothing surface. As existing contact detection solutions are significantly affected by environmental factors, it is essential to develop a sensor with robust contact detection capabilities. In this work, we introduce a contact sensor with high robustness and high force resolution. The sensor detects contact by measuring the deformation of an elastomer using a distance-measuring module. Based on the deformation characteristics of the elastomer, we designed a detection algorithm that not only reduces data noise but also extracts features such as trends and elastomer states, enabling reliable contact detection. Through experiments, we validated that this contact sensor can detect contact forces as low as 0.017 N and is robust to external interference and sensor movement. We also verified that the sensor can process data within 7.5 ms and return a contact decision with 95% accuracy. Additionally, we assessed its effectiveness in real fabric contact scenarios.
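A trend-based contact test of the kind described above can be sketched by smoothing the distance signal and flagging contact once the short-term decrease exceeds a noise threshold. The window length, threshold, and synthetic signal below are illustrative assumptions, not the paper's parameters.

```python
# Rough sketch: contact detection from an elastomer deformation (distance) signal.
# Smooth the reading with a moving average, then flag contact when the short-term
# downward trend exceeds a noise threshold.
import numpy as np

def detect_contact(distance_mm, window=8, threshold_mm=0.02):
    """Return the first sample index where the smoothed deformation trend indicates contact."""
    d = np.convolve(distance_mm, np.ones(window) / window, mode="valid")  # moving average
    trend = d[:-window] - d[window:]          # decrease in distance over one window
    hits = np.where(trend > threshold_mm)[0]
    return int(hits[0] + window) if len(hits) else None

t = np.arange(400)
signal = 5.0 + np.random.normal(0, 0.005, t.size)
signal[250:] -= 0.01 * (t[250:] - 250)        # slow compression after contact at sample 250
print("contact detected at sample:", detect_contact(signal))
```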
|
|
10:15-10:20, Paper WeBT16.5 | |
Design, Modelling, and Experimental Verification of Passively Adaptable Roller Gripper for Separating Stacked Fabric |
|
Unde, Jayant | Nagoya University |
Colan, Jacinto | Nagoya University |
Hasegawa, Yasuhisa | Nagoya University |
Keywords: Grippers and Other End-Effectors, Grasping, Contact Modeling
Abstract: This study presents a novel approach to fabric manipulation through the development and optimization of a single-actuator-driven roller gripper. Focused on addressing the challenges inherent in handling fabrics with diverse thicknesses and materials, our gripper employs a passive adaptable mechanism driven by springs, enabling effective manipulation of fabrics ranging from 0.1mm to 2.25mm in thickness. We analyze gripper-fabric interaction forces to identify the parameters that influence successful grasping. We then optimize the gripper’s normal forces and the roller’s tangential force using the proposed model. Systematic evaluations demonstrated the gripper’s capability to separate individual layers from fabric stacks, achieving a 94.9% success rate across multiple fabric types. Overall, this research offers a compact, cost-effective solution with broad applicability in diverse industrial automation contexts, providing valuable insights for advancing robotic fabric handling systems. The gripper’s design is open-access and available for rapid development and customization at https://github.com/JayantUnde/Gripper.
|
|
10:20-10:25, Paper WeBT16.6 | |
Closed-Loop Shape Control of Deformable Linear Objects Based on Cosserat Model |
|
Artinian, Azad | ISIR - Sorbonne Université |
Ben Amar, Faiz | Université Pierre Et Marie Curie, Paris 6 |
Perdereau, Véronique | Sorbonne University |
Keywords: Dual Arm Manipulation, Visual Servoing, Modeling, Control, and Learning for Soft Robots
Abstract: Robotic shape control of deformable linear objects has garnered increasing interest within the robotics community. Despite recent progress, the majority of shape control approaches fall into two main groups: open-loop control, which relies on physically realistic models to represent the object, and closed-loop control, which employs less precise models alongside visual data to compute commands. In this work, we present a novel 3D shape control approach that incorporates the physically realistic Cosserat model into a closed-loop control framework, using vision feedback to rectify errors in real time. This approach capitalizes on the advantages of both groups: the realism and precision provided by physics-based models, and the rapid computation, real-time correction of model errors, and robustness to elastic parameter estimation inherent in vision-based approaches. This is achieved by computing a deformation Jacobian derived from both the Cosserat model and visual data. To demonstrate the effectiveness of the method, we conduct a series of shape control experiments in which robots are tasked with deforming linear objects towards a desired shape.
|
|
10:25-10:30, Paper WeBT16.7 | |
Single-Grasp Deformable Object Discrimination: The Effect of Gripper Morphology, Sensing Modalities, and Action Parameters |
|
Pliska, Michal | Czech Technical University in Prague, Faculty of Electrical Engineering |
Patni, Shubhan | Ceske Vysoke Uceni Technicke V Praze, FEL |
Mareš, Michal | Faculty of Electrical Engineering, Czech Technical University in Prague |
Stoudek, Pavel | Technology Innovation Institute (TII), Abu Dhabi |
Straka, Zdenek | Czech Technical University in Prague, Faculty of Electrical Engineering |
Stepanova, Karla | Czech Technical University |
Hoffmann, Matej | Czech Technical University in Prague, Faculty of Electrical Engineering |
Keywords: Grippers and Other End-Effectors, Force and Tactile Sensing, Recognition, Multifingered Hands
Abstract: In haptic object discrimination, the effect of gripper embodiment, action parameters, and sensory channels has not been systematically studied. We used two anthropomorphic hands and two 2-finger grippers to grasp two sets of deformable objects. On the object classification task, we found: (i) among classifiers, SVM on sensory features and LSTM on raw time series performed best across all grippers; (ii) faster compression speeds degraded performance; (iii) generalization to different grasping configurations was limited; transfer to different compression speeds worked well for the Barrett Hand only. Visualization of the feature spaces using PCA showed that the gripper morphology and the action parameters were the main source of variance, rendering generalization across embodiment or grasp configurations very hard. On the highly challenging dataset consisting of polyurethane foams alone, only the Barrett Hand achieved excellent performance. Tactile sensors can thus provide a key advantage even if recognition is based on stiffness rather than shape. The dataset with 24000 measurements is publicly available.
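The feature-plus-SVM route mentioned above can be sketched with scikit-learn: compute a few squeeze features from the force and position time series of a single grasp and fit an RBF-kernel SVM. The feature set and the synthetic two-class data are assumptions for illustration only, not the paper's pipeline.

```python
# Minimal sketch: SVM classification of deformable objects from hand-crafted
# single-grasp features (peak force, final force, and a stiffness proxy).
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def haptic_features(force_series, position_series):
    """A few simple squeeze features computed from one compression."""
    stiffness = (force_series[-1] - force_series[0]) / (position_series[0] - position_series[-1] + 1e-9)
    return [np.max(force_series), force_series[-1], stiffness]

rng = np.random.default_rng(0)
X, y = [], []
for label, k in [(0, 50.0), (1, 200.0)]:           # two synthetic foam stiffness classes
    for _ in range(40):
        pos = np.linspace(0.04, 0.02, 50)           # gripper closing by 2 cm
        force = k * (0.04 - pos) + rng.normal(0, 0.1, 50)
        X.append(haptic_features(force, pos))
        y.append(label)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(np.array(X), np.array(y))
print("training accuracy:", clf.score(np.array(X), np.array(y)))
```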
|
|
WeBT17 |
405 |
Soft Actuators 2 |
Regular Session |
Chair: Markvicka, Eric | University of Nebraska-Lincoln |
Co-Chair: Papadopoulos, Evangelos | National Technical University of Athens |
|
09:55-10:00, Paper WeBT17.1 | |
Introducing Mag-Nets: Rapidly Bending Electromagnetic Actuators for Self-Contained Soft Robots |
|
Bolanakis, Georgios | National Technical University of Athens |
Papadopoulos, Evangelos | National Technical University of Athens |
Keywords: Soft Sensors and Actuators, Modeling, Control, and Learning for Soft Robots, Soft Robot Materials and Design
Abstract: Present electromagnetic soft actuators rely on external magnetic fields or power supplies, while the very few that operate autonomously produce weak actuating forces, limiting their practicality. This work introduces a novel current-controlled electromagnetic actuator that employs copper coils and permanent magnets to produce substantial driving forces. The actuator can serve as a building block for independently controlled actuating networks to develop sophisticated self-contained soft robots and grippers. The design, inspired by fast pneu-net (fPN) actuators, ensures minimal bending resistance from the silicone body and thus allows high-speed bending motions. Two applications of the prototype actuator are studied: a two-fingered soft gripper realizing bending speeds of up to 1491°/s and a maximum grasping force of 1.19 N, and an entirely self-contained crawling soft robot utilizing friction anisotropy to generate forward locomotion. A lumped-element model is developed and validated experimentally to describe the dynamics of the gripper's soft finger. Pick-and-place tasks on various targets and tests on the crawling robot demonstrate the overall effectiveness of the developed actuator. The uniqueness of Mag-Nets, lying in their control simplicity, enhanced capability, and cost-effectiveness, sets the foundations for a new design approach for soft robots and grippers.
|
|
10:00-10:05, Paper WeBT17.2 | |
Miniature Dielectric Elastomer Actuator Probe Inspecting Confined Spaces Embedding a CMOS Sensor |
|
Sandhu, Sahib | University of Connecticut |
Li, Ang (Leo) | University of Toronto |
Tugui, Codrin | University of Connecticut |
Duduta, Mihai | University of Connecticut |
Keywords: Soft Robot Applications, Soft Robot Materials and Design, Soft Sensors and Actuators
Abstract: Navigating and inspecting confined space is crucial for the aerospace and healthcare industries. Exploring smaller and narrower spaces allows for problems to be identified earlier, preventing negative outcomes for patients and equipment. The challenge is to scale down the navigation probe while preserving degrees of freedom (DOF) and functionality. Dielectric elastomer actuators (DEAs) are promising probe candidates because they are solid-state, electrical-driven, and can be scaled down favorably. This work demonstrates a modular 2-DOF DEA miniature probe with an embedded CMOS sensor for visual data acquisition. The modularity achieved by a novel hinge system enables switching between single and dual DEA probes based on 2D or 3D pathway structures. The probes can be controlled using a pocket-sized circuit with two knobs to turn. We present the operating mechanism, device assembly, fabrication, and characterization of DEA bending actuators with widths below 2mm. In the end, we demonstrate the ability of devices to navigate through various complex and confined pathways.
|
|
10:05-10:10, Paper WeBT17.3 | |
Portable, High-Frequency, and High-Voltage Control Circuits for Untethered Miniature Robots Driven by Dielectric Elastomer Actuators |
|
Shao, Qi | Tsinghua University |
Liu, Xin-Jun | Tsinghua University |
Zhao, Huichan | Tsinghua University |
Keywords: Soft Robot Applications, Soft Sensors and Actuators, Soft Robot Materials and Design
Abstract: In this work, we propose a high-voltage, high-frequency control circuit for untethered applications of dielectric elastomer actuators (DEAs). The circuit board leverages low-voltage resistive components connected in series to control voltages of up to 1.8 kV within a compact size, suitable for frequencies ranging from 0 to 1 kHz. A single-channel control board weighs only 2.5 g. We tested the performance of the control circuit under different load conditions and power supplies. Based on this control circuit, along with a commercial miniature high-voltage power converter, we construct an untethered crawling robot driven by a cylindrical DEA. The 42-g untethered robot successfully achieved crawling locomotion on a bench and within a pipeline at a driving frequency of 15 Hz, while simultaneously transmitting real-time video data via an onboard camera and antenna. Our work provides a practical way to use low-voltage control electronics to achieve untethered driving of DEAs, enabling portable and wearable devices.
|
|
10:10-10:15, Paper WeBT17.4 | |
Stretchable Electrohydraulic Artificial Muscle for Full Motion Ranges in Musculoskeletal Antagonistic Joints |
|
Kazemipour, Amirhossein | ETH Zürich |
Hinchet, Ronan | ETH Zurich |
Katzschmann, Robert Kevin | ETH Zurich |
Keywords: Soft Sensors and Actuators, Soft Robot Materials and Design, Compliant Joints and Mechanisms
Abstract: Artificial muscles play a crucial role in musculoskeletal robotics and prosthetics to approximate the force-generating functionality of biological muscle. However, current artificial muscle systems are typically limited to either contraction or extension, not both. This limitation hinders the development of fully functional artificial musculoskeletal systems. We address this challenge by introducing an artificial antagonistic muscle system capable of both contraction and extension. Our design integrates non-stretchable electrohydraulic soft actuators (HASELs) with electrostatic clutches within an antagonistic musculoskeletal framework. This configuration enables an antagonistic joint to achieve a full range of motion without displacement loss due to tendon slack. We implement a synchronization method to coordinate muscle and clutch units, ensuring smooth motion profiles and speeds. This approach facilitates seamless transitions between antagonistic muscles at operational frequencies of up to 3.2 Hz. While our prototype utilizes electrohydraulic actuators, this muscle-clutch concept is adaptable to other non-stretchable artificial muscles, such as McKibben actuators, expanding their capability for extension and full range of motion in antagonistic setups. Our design represents a significant advancement in the development of fundamental components for more functional and efficient artificial musculoskeletal systems, bringing their capabilities closer to those of their biological counterparts.
|
|
10:15-10:20, Paper WeBT17.5 | |
Beyond Traversing in a Thin Pipe: Self-Sensing Odometry of a Pipeline Robot Driven by High-Frequency Dielectric Elastomer Actuators |
|
Cheng, Ran | Tsinghua University |
Shao, Qi | Tsinghua University |
Liu, Xin-Jun | Tsinghua University |
Zhao, Huichan | Tsinghua University |
Keywords: Soft Robot Applications, Soft Sensors and Actuators, Soft Robot Materials and Design
Abstract: In this paper, we propose an earthworm-inspired miniature pipeline robot capable of self-sensing odometry. The robot features a dielectric elastomer actuator (DEA) as its elongation body and two specially designed passive anchors to achieve unidirectional motion without slipping. Odometry is achieved through the self-sensing scheme of DEAs by summing all step sizes over a period. The careful implementation of the self-sensing method resulted in a small sensing resolution of 0.05 mm at a high actuation frequency of 20 Hz for a cylindrical DEA. Finally, the robot achieved self-sensing odometry in a pipe, showing good consistency with the ground truth. This work paves a new way for miniature in-pipe robots to sense their own state without additional sensors, saving space and power.
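A minimal sketch of the step-summation odometry described above; the linear capacitance-to-elongation mapping, its constants, and the pre-segmented actuation cycles are hypothetical placeholders, not the paper's calibration or self-sensing scheme.

```python
import numpy as np

def capacitance_to_elongation(c, c0=1.0, k=0.5):
    # Hypothetical linear mapping from measured DEA capacitance to elongation (mm);
    # a real self-sensing scheme would calibrate this relation for the actuator.
    return k * (np.asarray(c, dtype=float) - c0)

def self_sensing_odometry(capacitance_trace, cycle_indices):
    """Accumulate per-cycle step sizes into a travelled-distance estimate.

    capacitance_trace: 1-D array of self-sensed capacitance samples.
    cycle_indices: list of (start, end) sample indices, one per actuation cycle.
    """
    distance = 0.0
    for start, end in cycle_indices:
        elong = capacitance_to_elongation(capacitance_trace[start:end])
        # Step size of one earthworm-like cycle: elongation excursion in that cycle.
        distance += float(elong.max() - elong.min())
    return distance
```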
|
|
10:20-10:25, Paper WeBT17.6 | |
Intelligent Self-Healing Artificial Muscle: Mechanisms for Damage Detection and Autonomous Repair of Puncture Damage in Soft Robotics |
|
Krings, Ethan | University of Nebraska-Lincoln |
McManigal, Patrick | University of Nebraska-Lincoln |
Markvicka, Eric | University of Nebraska-Lincoln |
Keywords: Soft Robot Materials and Design, Soft Sensors and Actuators, Soft Robot Applications
Abstract: Soft robots are characterized by their high deformability, mechanical robustness, and inherent resistance to damage. These unique properties present exciting new opportunities to enhance both emerging and existing fields such as healthcare, manufacturing, and exploration. However, to function effectively in unstructured environments, these technologies must be able to withstand the same real-world conditions that human skin and other soft biological materials are typically subjected to. Here, we present a novel soft material architecture designed for active detection of material damage and autonomous repair in soft robotic actuators. By integrating liquid metal (LM) microdroplets within a silicone elastomer, the system can detect and localize damage through the formation of conductive pathways that arise from extreme pressure or puncture events. These newly formed conductive networks function as in situ Joule heating elements, facilitating the reprocessing and healing of the material. The architecture allows for the reconfiguration of the newly formed electrical network using high current densities, employing electromigration and thermal mechanisms to restore functionality without manual intervention. This innovative approach not only enhances the resilience and performance of soft materials but also supports a wide range of applications in soft robotics and wearable technologies, where adaptive and autonomous systems are crucial for operation in dynamic and unpredictable environments.
|
|
10:25-10:30, Paper WeBT17.7 | |
High-Force Electroadhesion Based on Unique Liquid-Solid Dielectrics for UAV Perching |
|
Luo, Junjie | The Chinese University of Hong Kong |
Li, Jisen | Shenzhen Institute of Artificial Intelligence and Robotics for S |
Wang, Hongqiang | Southern University of Science and Technology |
Zhu, Jian | Chinese University of Hong Kong, Shenzhen |
Keywords: Soft Sensors and Actuators, Soft Robot Applications, Soft Robot Materials and Design
Abstract: Electroadhesion (EA), as an electrostatically driven, controllable adhesion technology, has unique attributes such as low noise, robust adaptability, and energy efficiency. However, its adhesion pressure is still low (0.1–10 kPa), which may significantly limit its applications. This paper presents an innovative electroadhesion pad embedded with liquid and solid dielectrics. The experiments demonstrate that this liquid-solid electroadhesion pad (LSEAP) is capable of much larger adhesion pressure compared to the traditional solid electroadhesion pad (SEAP). On one hand, the LSEAP can increase the dielectric contact with the substrate. On the other hand, the actuator can increase its dielectric strength. We also explore the application of this actuator to the perching of a commercial Unmanned Aerial Vehicle (UAV), in order to extend the UAV's operational endurance. Notably, the untethered LSEAP system, with an adhesion area as small as 4 cm² and a self-weight as light as 8.7 g, can support a 249.7-g UAV for stable adhesion on various surfaces. The adhesion pressure generated by our LSEAP can reach 32.2 kPa, significantly larger than values reported in the literature. The weight ratio of the UAV to the LSEAP system is 14.6, more than double that of previous studies. The integration of this EA system markedly prolongs the operational duration of UAVs, rendering them suitable for sustainable surveillance and reconnaissance missions. This LSEAP also marks a pivotal advancement towards adhesion-based applications such as grippers and wall-climbing robots.
|
|
WeBT18 |
406 |
Intelligent Transportation Systems and AI-Based Methods |
Regular Session |
Co-Chair: Rosman, Guy | Massachusetts Institute of Technology |
|
09:55-10:00, Paper WeBT18.1 | |
Multi-Scale Convolutional Networks with Class-Normalized Logit Clipping for Robust Sea State Estimation from Noisy Ship Motion Data |
|
Qin, Xin | Tianjin University of Technology |
Liu, Mengna | Tianjin University of Technology |
Cheng, Xu | Smart Innovation Norway |
Liu, Xiufeng | Technical University of Denmark |
Shi, Fan | Tianjin University of Technology |
Zhang, Jianhua | Tianjin University of Technology |
Chen, Shengyong | Tianjin University of Technology |
Keywords: Intelligent Transportation Systems, Deep Learning Methods, Big Data in Robotics and Automation
Abstract: Autonomous ships utilize automation systems to achieve unmanned navigation, driving innovation in maritime transportation. However, sea conditions are influenced by dynamic factors such as wave height, wind speed, and ocean currents, making them challenging to assess accurately. Traditional classification models often assume accurate labels, but noisy labels are prevalent in real-world applications. Existing methods, such as noise-sample filtering or loss-function adjustment, have limited applicability and poor generalization when dealing with complex sea condition data. To address this issue, this study proposes an end-to-end neural network model. The model's feature extraction module uses deep representation learning to capture latent patterns in the data, and a loss function is designed to mitigate the impact of outliers. The integration of these components allows the model to perform accurate classification even in the presence of noisy labels. Extensive experiments on public and sea condition datasets validate the effectiveness of this approach, demonstrating that the model exhibits strong generalization capabilities and holds great promise for practical applications.
|
|
10:00-10:05, Paper WeBT18.2 | |
Directed-CP: Directed Collaborative Perception for Connected and Autonomous Vehicles Via Proactive Attention |
|
Tao, Yihang | City University of Hong Kong |
Hu, Senkang | City University of Hong Kong |
Fang, Zhengru | City University of Hong Kong |
Fang, Yuguang | City University of Hong Kong |
Keywords: Intelligent Transportation Systems, AI-Based Methods, Cooperating Robots
Abstract: Collaborative perception (CP) leverages visual data from connected and autonomous vehicles (CAV) to expand an ego vehicle’s field of view (FoV). Despite recent progress, current CP methods expand the ego vehicle’s 360-degree perceptual range almost uniformly, but face two key challenges. Firstly, in areas with uneven traffic distribution, focusing on directions with little traffic offers limited benefits. Secondly, under limited communication budgets, allocating excessive bandwidth to less critical directions lowers the perception accuracy in more vital areas. To address these issues, we propose Directed-CP, a proactive and direction-aware CP system aiming at improving CP in specific directions. Our key idea is to enable an ego vehicle to proactively signal its directions of interest and readjust its attention to enhance local directional CP performance. To achieve this, we first propose an RSU-aided direction masking mechanism that assists an ego vehicle in identifying vital directions. Additionally, we design a direction-aware selective attention module to wisely aggregate pertinent features based on the ego vehicle’s directional priorities, communication budget, and the positional data of CAVs. Moreover, we introduce a direction-weighted detection loss (DWLoss) to capture the divergence between directional CP outcomes and the ground truth, facilitating effective model training. Extensive experiments on the V2X-Sim 2.0 dataset demonstrate that our approach achieves 19.8% higher local perception accuracy in interested directions and 2.5% higher overall perception accuracy than the state-of-the-art methods in collaborative 3D object detection tasks.
|
|
10:05-10:10, Paper WeBT18.3 | |
Motion Forecasting Via Model-Based Risk Minimization |
|
Distelzweig, Aron | Albert-Ludwigs-Universität Freiburg |
Kosman, Eitan | Bosch |
Andreas, Look | Bosch |
Janjoš, Faris | Robert Bosch GmbH |
Manivannan, Denesh Kumar | TU Delft |
Valada, Abhinav | University of Freiburg |
Keywords: Intelligent Transportation Systems, AI-Based Methods, Behavior-Based Systems
Abstract: Forecasting the future trajectories of surrounding agents is crucial for autonomous vehicles to ensure safe, efficient, and comfortable route planning. While model ensembling has improved prediction accuracy in various fields, its application in trajectory prediction is limited due to the multi-modal nature of predictions. In this paper, we propose a novel sampling method applicable to trajectory prediction based on the predictions of multiple models. We first show that conventional sampling based on predicted probabilities can degrade performance due to missing alignment between models. To address this problem, we introduce a new method that generates optimal trajectories from a set of neural networks, framing it as a risk minimization problem with a variable loss function. By using state-of-the-art models as base learners, our approach constructs diverse and effective ensembles for optimal trajectory sampling. Extensive experiments on the nuScenes prediction dataset demonstrate that our method surpasses current state-of-the-art techniques, achieving top ranks on the leaderboard. We also provide a comprehensive empirical study on ensembling strategies, offering insights into their effectiveness. Our findings highlight the potential of advanced ensembling techniques in trajectory prediction, significantly improving predictive performance and paving the way for more reliable predicted trajectories.
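A minimal sketch of risk-minimizing trajectory selection from an ensemble, assuming candidate modes and probabilities pooled from all members; the distance surrogate and the greedy selection below are illustrative simplifications, not the paper's exact formulation.

```python
import numpy as np

def risk_minimizing_selection(candidates, probs, k=6):
    """Greedily pick k trajectories that minimize expected distance to the
    ensemble's pooled predictions (a simple surrogate for the risk objective).

    candidates: (N, T, 2) trajectory modes pooled from all ensemble members.
    probs: (N,) corresponding mode probabilities, normalized to sum to 1.
    """
    selected = []
    for _ in range(k):
        risks = []
        for traj in candidates:
            # Expected loss of committing to `traj` when the pooled modes are
            # treated as weighted samples of the true future.
            dists = np.mean(np.linalg.norm(candidates - traj, axis=-1), axis=-1)
            risks.append(np.sum(probs * dists))
        best = int(np.argmin(risks))
        selected.append(candidates[best])
        # Drop the chosen mode so subsequent picks cover other regions.
        candidates = np.delete(candidates, best, axis=0)
        probs = np.delete(probs, best)
        probs = probs / probs.sum()
    return np.stack(selected)
```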
|
|
10:10-10:15, Paper WeBT18.4 | |
Computational Teaching for Driving Via Multi-Task Imitation Learning |
|
Edakkattil Gopinath, Deepak | Toyota Research Institute |
Cui, Xiongyi | Toyota Research Institute |
DeCastro, Jonathan | Cornell University |
Sumner, Emily | Toyota Research Institute |
Costa, Jean | Toyota Research Institute |
Yasuda, Hiroshi | Toyota Research Institute |
Morgan, Allison | Toyota Research Institute |
Dees, Laporsha | Toyota Research Institute |
Chau, Sheryl | Toyota Research Institute |
Leonard, John | MIT |
Chen, Tiffany | Toyota Research Institute |
Rosman, Guy | Massachusetts Institute of Technology |
Balachandran, Avinash | Toyota Research Institute |
Keywords: Human Performance Augmentation, Imitation Learning, Intelligent Transportation Systems
Abstract: Learning motor skills for sports or performance driving is often done with professional instruction from expert human teachers, whose availability is limited. Our goal is to enable automated teaching via a learned model that interacts with the student similarly to a human teacher. However, training such automated teaching systems is limited by the availability of high-quality annotated datasets of expert teacher and student interactions, as they are difficult to collect at scale. To address this data scarcity problem, we propose an approach for training a coaching system for complex motor tasks such as high-performance driving via a Multi-Task Imitation Learning (MTIL) paradigm. MTIL allows our model to learn robust representations by utilizing self-supervised training signals from more readily available non-interactive datasets of humans performing the task of interest. We validate our approach with (1) a semi-synthetic dataset created from real human driving trajectories, (2) a professional track driving instruction dataset, (3) a track racing driving simulator human-subject study, and (4) a system demonstration on an instrumented car at a race track. Our experiments show that the right set of auxiliary machine learning tasks improves prediction of teaching instructions. Moreover, in the human subjects study, students exposed to the instructions from our teaching system improve their ability to stay within track limits, and show favorable perception of the model’s interaction with them, in terms of usefulness and satisfaction.
|
|
10:15-10:20, Paper WeBT18.5 | |
A Comprehensive LLM-Powered Framework for Driving Intelligence Evaluation |
|
You, Shanhe | Institute for AI Industry Research, Tsinghua University |
Luo, Xuewen | Monash University |
Liang, Xinhe | National University of Singapore |
Yu, Jiashu | Tsinghua University |
Zheng, Chen | Institute for AI Industry Research, Tsinghua University |
Gong, Jiangtao | Tsinghua University |
Keywords: Human-Centered Automation, Intelligent Transportation Systems, Performance Evaluation and Benchmarking
Abstract: Evaluation methods for autonomous driving are crucial for algorithm optimization. However, due to the complexity of driving intelligence, there is currently no comprehensive evaluation method for the level of autonomous driving intelligence. In this paper, we propose an evaluation framework for driving behavior intelligence in complex traffic environments, aiming to fill this gap. We constructed a natural language evaluation dataset of human professional drivers and passengers through naturalistic driving experiments and post-driving behavior evaluation interviews. Based on this dataset, we developed an LLM-powered driving evaluation framework. The effectiveness of this framework was validated through simulated experiments in the CARLA urban traffic simulator and further corroborated by human assessment. Our research provides valuable insights for evaluating and designing more intelligent, human-like autonomous driving agents. The implementation details of the framework and detailed information about the dataset can be found on GitHub.
|
|
10:20-10:25, Paper WeBT18.6 | |
LoRD: Adapting Differentiable Driving Policies to Distribution Shifts |
|
Diehl, Christopher | TU Dortmund University |
Karkus, Peter | NVIDIA |
Veer, Sushant | NVIDIA |
Pavone, Marco | Stanford University |
Bertram, Torsten | Technische Universität Dortmund |
Keywords: Intelligent Transportation Systems, Integrated Planning and Learning, Transfer Learning
Abstract: Distribution shifts between operational domains can severely affect the performance of learned models in self-driving vehicles (SDVs). While this is a well-established problem, prior work has mostly explored naive solutions such as fine-tuning, focusing on the motion prediction task. In this work, we explore novel adaptation strategies for differentiable autonomy stacks (structured policy) consisting of prediction, planning, and control, perform evaluation in closed-loop, and investigate the often-overlooked issue of catastrophic forgetting. Specifically, we introduce two simple yet effective techniques: a low-rank residual decoder (LoRD) and multi-task fine-tuning. Through experiments across three models conducted on two real-world autonomous driving datasets (nuPlan, exiD), we demonstrate the effectiveness of our methods and highlight a significant performance gap between open-loop and closed-loop evaluation in prior approaches. Our approach reduces forgetting by up to 23.33% and improves the closed-loop OOD driving score by 9.93% in comparison to standard fine-tuning.
|
|
10:25-10:30, Paper WeBT18.7 | |
BehAV: Behavioral Rule Guided Autonomy Using VLMs for Robot Navigation in Outdoor Scenes |
|
Kulathun Mudiyanselage, Kasun Weerakoon | University of Maryland, College Park |
Elnoor, Mohamed | University of Maryland |
Seneviratne, Gershom Devake | University of Maryland, College Park |
Rajagopal, Vignesh | University of Maryland, College Park |
Arul, Senthil Hariharan | University of Maryland, College Park |
Liang, Jing | University of Maryland |
M Jaffar, Mohamed Khalid | University of Maryland, College Park |
Manocha, Dinesh | University of Maryland |
Keywords: Perception-Action Coupling, AI-Based Methods, Motion and Path Planning
Abstract: We present BehAV, a novel approach for autonomous robot navigation in outdoor scenes guided by human instructions and leveraging Vision Language Models (VLMs). Our method interprets human commands using a Large Language Model (LLM) and categorizes the instructions into navigation and behavioral guidelines. Navigation guidelines consist of directional commands (e.g., "move forward until") and associated landmarks (e.g., "the building with blue windows"), while behavioral guidelines encompass regulatory actions (e.g., "stay on") and their corresponding objects (e.g., "pavements"). We use VLMs for their zero-shot scene understanding capabilities to estimate landmark locations from RGB images for robot navigation. Further, we introduce a novel scene representation that utilizes VLMs to ground behavioral rules into a behavioral cost map. This cost map encodes the presence of behavioral objects within the scene and assigns costs based on their regulatory actions. The behavioral cost map is integrated with a LiDAR-based occupancy map for navigation. To navigate outdoor scenes while adhering to the instructed behaviors, we present an unconstrained Model Predictive Control (MPC)-based planner that prioritizes both reaching landmarks and following behavioral guidelines. We evaluate the performance of BehAV on a quadruped robot across diverse real-world scenarios, demonstrating a 22.49% improvement in alignment with human-teleoperated actions, as measured by Fréchet distance, and achieving a 40% higher navigation success rate compared to state-of-the-art methods.
|
|
WeBT19 |
407 |
State Estimation |
Regular Session |
Co-Chair: Xiong, Xiaobin | University of Wisconsin Madison |
|
09:55-10:00, Paper WeBT19.1 | |
An Adaptive Graduated Nonconvexity Loss Function for Robust Nonlinear Least Squares Solutions |
|
Jung, Kyungmin | McGill University |
Hitchcox, Thomas | McGill University |
Forbes, James Richard | McGill University |
Keywords: Graduated nonconvexity, Robust/Adaptive Control of Robotic Systems, SLAM, Learning and Adaptive Systems
Abstract: Many problems in robotics, such as estimating the state from noisy sensor data or aligning two point clouds, can be posed and solved as least-squares problems. Unfortunately, vanilla nonminimal solvers for least-squares problems are notoriously sensitive to outliers and initialization errors. The conventional approach to outlier rejection is to use a robust loss function, which is typically selected and tuned a priori. A newly developed approach to handle large initialization errors is graduated nonconvexity (GNC), which is defined for a particular choice of a robust loss function. The main contribution of this paper is to combine these two approaches by using an adaptive kernel within a GNC optimization scheme. This produces least-squares problems that are robust to both outliers and initialization errors, without the need for model selection and tuning. Simulations and experiments demonstrate that the proposed method is more robust compared to non-GNC counterparts and performs on par with other GNC-tailored loss functions. Example code can be found at https://github.com/decargroup/gnc-adapt.
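For context, a minimal sketch of graduated nonconvexity with Geman-McClure weights on a linear least-squares problem; the paper's adaptive-kernel selection is not reproduced, and sigma, the initial mu, and the annealing schedule below are placeholder choices.

```python
import numpy as np

def gnc_geman_mcclure_irls(A, b, sigma=1.0, mu0=1e3, mu_factor=1.4, iters=50):
    """Graduated nonconvexity with Geman-McClure weights for least squares A x ~ b.

    The surrogate cost starts nearly convex (large mu) and is gradually tightened
    toward the original nonconvex robust cost as mu -> 1.
    """
    x = np.linalg.lstsq(A, b, rcond=None)[0]  # vanilla least-squares initialization
    mu = mu0
    for _ in range(iters):
        r = A @ x - b
        # GNC Geman-McClure weights: w = (mu * sigma^2 / (r^2 + mu * sigma^2))^2
        w = (mu * sigma**2 / (r**2 + mu * sigma**2)) ** 2
        W = np.diag(w)
        x = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)  # weighted least-squares step
        mu = max(1.0, mu / mu_factor)  # anneal toward the true robust cost
    return x
```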
|
|
10:00-10:05, Paper WeBT19.2 | |
Learning Direct Solutions in Moving Horizon Estimation with Deep Learning Methods |
|
Lionti, Fabien | INRIA |
Gutowski, Nicolas | University of Angers, LERIA |
Aubin, Sébastien | DGA |
Martinet, Philippe | INRIA |
Keywords: Deep Learning Methods, Optimization and Optimal Control
Abstract: State estimation for dynamical systems is crucial for various applications, including control and monitoring. Moving Horizon Estimation (MHE) is an optimization-based state estimation algorithm that leverages a known dynamical model integrated over a moving horizon. The MHE optimization criterion corresponds to identifying the initial state that best aligns the integrated trajectory with the system observations. In the MHE setting, state estimation performance increases with the length of the moving horizon, but the optimization can become computationally intensive, which limits its applicability to fast-varying dynamical systems or to hardware with restricted computational power. Deep Learning (DL) methods can learn solutions to complex optimization problems without incurring any additional online computational cost beyond the inference of the considered architecture. In the context of state estimation, we propose to study different types of DL architectures in order to provide full state estimation from partial and noisy system observations. The proposed method is based on an end-to-end differentiable formulation of the MHE optimization problem, enabling the offline training of a DL model to provide a state estimate that minimizes the MHE optimization criterion. Once training is completed, state estimates are generated through an explicit relationship learned by the DL model. The proposed method is compared to the online MHE formulation in various case studies, including scenarios with partially observed states and model discrepancies in the context of lateral vehicle dynamics. The results highlight improved state estimation performance, both in computational time and accuracy, with respect to the online MHE algorithm.
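A minimal sketch of the idea of learning a direct MHE solution, assuming a differentiable dynamics model f and observation model h (both placeholders) and a simple feed-forward network; the paper's architectures and training details are not reproduced.

```python
import torch
import torch.nn as nn

class MHESurrogate(nn.Module):
    """Maps a window of partial, noisy observations to the full initial state."""
    def __init__(self, obs_dim, horizon, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim * horizon, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, y_window):          # y_window: (batch, horizon, obs_dim)
        return self.net(y_window.flatten(start_dim=1))

def mhe_training_loss(model, f, h, y_window):
    """Differentiable MHE-style criterion: the estimated initial state should yield
    a simulated trajectory whose predicted observations match the window."""
    batch, horizon, _ = y_window.shape
    x = model(y_window)
    residuals = []
    for t in range(horizon):
        residuals.append(h(x) - y_window[:, t])  # observation mismatch at step t
        x = f(x)                                  # differentiable dynamics step
    return torch.stack(residuals, dim=1).pow(2).mean()
```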
|
|
10:05-10:10, Paper WeBT19.3 | |
A Data-Driven Contact Estimation Method for Wheeled-Biped Robots |
|
Gökbakan, Umit Bora | Inria |
Dümbgen, Frederike | ENS, PSL University |
Caron, Stephane | Inria |
Keywords: Contact Modeling, Legged Robots, Probabilistic Inference
Abstract: Contact estimation is a key ability for limbed robots, where making and breaking contacts has a direct impact on state estimation and balance control. Existing approaches typically rely on gait-cycle priors or designated contact sensors. We design a contact estimator that is suitable for the emerging wheeled-biped robot types that do not have these features. To this end, we propose a Bayes filter in which update steps are learned from real-robot torque measurements while prediction steps rely on inertial measurements. We evaluate this approach in extensive real-robot and simulation experiments. Our method achieves better performance while being considerably more sample efficient than a comparable deep-learning baseline.
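A minimal sketch of a binary-contact Bayes filter in this spirit; the learned measurement-likelihood model and the IMU-based transition model are placeholder callables, not the paper's trained components.

```python
def bayes_filter_contact_step(p_contact, torque_feat, imu_feat,
                              likelihood_model, transition_model):
    """One step of a binary-contact Bayes filter.

    p_contact: prior probability of contact from the previous step.
    likelihood_model(torque_feat) -> (p(z | contact), p(z | no contact)),
        e.g. a classifier trained on real-robot torque measurements.
    transition_model(imu_feat) -> (p(stay in contact), p(gain contact)),
        e.g. derived from inertial measurements.
    """
    p_stay, p_gain = transition_model(imu_feat)
    # Prediction: propagate the belief through the contact transition model.
    prior = p_contact * p_stay + (1.0 - p_contact) * p_gain
    # Update: reweight by the learned measurement likelihoods and normalize.
    l_c, l_nc = likelihood_model(torque_feat)
    posterior = l_c * prior / (l_c * prior + l_nc * (1.0 - prior) + 1e-12)
    return posterior
```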
|
|
10:10-10:15, Paper WeBT19.4 | |
Simultaneous Ground Reaction Force and State Estimation Via Constrained Moving Horizon Estimation |
|
Kang, Jiarong | University of Wisconsin Madison |
Xiong, Xiaobin | University of Wisconsin Madison |
Keywords: Sensor Fusion, Legged Robots, Humanoid and Bipedal Locomotion
Abstract: Accurate ground reaction force (GRF) estimation can significantly improve the adaptability of legged robots in various real-world applications. For instance, with estimated GRF and contact kinematics, locomotion control and planning can assist the robot in overcoming uncertain terrain. The canonical momentum-based methods, formulated as nonlinear observers, do not fully address the noisy measurements and the dependence between floating-base states and the generalized momentum dynamics. In this paper, we present a simultaneous ground reaction force and state estimation framework for legged robots, which systematically addresses the sensor noise and the coupling between states and dynamics. With the floating-base orientation estimated separately, a decentralized Moving Horizon Estimation (MHE) method is implemented to fuse the robot dynamics, proprioceptive sensors, exteroceptive sensors, and deterministic contact complementarity constraints in a convex windowed optimization. The proposed method is shown to be capable of providing accurate GRF and state estimation on several legged robots, including the custom-designed humanoid robot Bucky, the open-source educational planar bipedal robot STRIDE, and the quadrupedal robot Unitree Go1, at a frequency of 200 Hz with a past time window of 0.04 s.
|
|
10:15-10:20, Paper WeBT19.5 | |
FracGM: A Fast Fractional Programming Technique for Geman-McClure Robust Estimator |
|
Chen, Bang-Shien | National Taiwan Normal University |
Lin, Yu-Kai | MediaTek Inc |
Chen, Jian-Yu | National Central University |
Huang, Chih-Wei | National Central University |
Chern, Jann-Long | National Taiwan Normal University |
Sun, Ching-Cherng | National Central University |
Keywords: Optimization and Optimal Control, Mapping
Abstract: Robust estimation is essential in computer vision, robotics, and navigation, aiming to minimize the impact of outlier measurements for improved accuracy. We present a fast algorithm for Geman-McClure robust estimation, FracGM, leveraging fractional programming techniques. This solver reformulates the original non-convex fractional problem into a convex dual problem and a linear equation system, solving them iteratively in an alternating optimization pattern. Compared to graduated non-convexity approaches, this strategy exhibits a faster convergence rate and better outlier rejection capability. In addition, the global optimality of the proposed solver can be guaranteed under given conditions. We demonstrate the proposed FracGM solver on Wahba's rotation problem and 3-D point-cloud registration, along with relaxation pre-processing and projection post-processing. Compared to state-of-the-art algorithms, as the outlier rate increases from 20% to 80%, FracGM shows 53% and 88% smaller increases in rotation and translation errors, respectively. In real-world scenarios, FracGM achieves better results in 13 out of 18 outcomes, while achieving a 19.43% improvement in computation time.
|
|
10:20-10:25, Paper WeBT19.6 | |
Equivariant IMU Preintegration with Biases: A Galilean Group Approach |
|
Delama, Giulio | University of Klagenfurt |
Fornasier, Alessandro | University of Klagenfurt |
Mahony, Robert | Australian National University |
Weiss, Stephan | Universität Klagenfurt |
Keywords: Localization, Sensor Fusion, Visual-Inertial SLAM
Abstract: This letter proposes a new approach for Inertial Measurement Unit (IMU) preintegration, a fundamental building block that can be leveraged in different optimization-based Inertial Navigation System (INS) localization solutions. Inspired by recent advances in equivariant theory applied to biased INSs, we derive a discrete-time formulation of the IMU preintegration on Gal(3) ⋉ gal(3), the left-trivialization of the tangent group of the Galilean group Gal(3). We define a novel preintegration error that geometrically couples the navigation states and the bias leading to lower linearization error. Our method improves in consistency compared to existing preintegration approaches which treat IMU biases as a separate state-space. Extensive validation against state-of-the-art methods, both in simulation and with real-world IMU data, implementation in the Lie++ library, and open-source code are provided.
|
|
10:25-10:30, Paper WeBT19.7 | |
State Estimation for Continuum Multi-Robot Systems on SE(3) |
|
Lilge, Sven | University of Toronto |
Barfoot, Timothy | University of Toronto |
Burgner-Kahrs, Jessica | University of Toronto |
Keywords: Flexible Robots, State Estimation, Sensor Fusion, Parallel Robots
Abstract: In contrast to conventional robots, accurately modeling the kinematics and statics of continuum robots is challenging due to partially unknown material properties, parasitic effects, or unknown forces acting on the continuous body. Consequently, state estimation approaches that utilize additional sensor information to predict the shape of continuum robots have garnered significant interest. This paper presents a novel approach to state estimation for systems with multiple coupled continuum robots, which allows estimating the shape and strain variables of multiple continuum robots in an arbitrary coupled topology. Simulations and experiments demonstrate the capabilities and versatility of the proposed method, achieving accurate and continuous estimates of the system state, with average end-effector errors of 3.3 mm and 5.02° depending on the sensor setup. It is further shown that the approach offers fast computation times of below 10 ms, enabling its use in quasi-static real-time scenarios with average update rates of 100-200 Hz. An open-source C++ implementation of the proposed state estimation method is made publicly available to the community.
|
|
WeBT20 |
408 |
Agricultural Automation 1 |
Regular Session |
Chair: Jiang, Yu | Cornell University |
Co-Chair: Carpin, Stefano | University of California, Merced |
|
09:55-10:00, Paper WeBT20.1 | |
IMU Augment Tightly Coupled Lidar-Visual-Inertial Odometry for Agricultural Environments |
|
Hoang, Quoc Hung | Chungbuk National University |
Kim, Gon-Woo | Chungbuk National University |
Keywords: Agricultural Automation, SLAM, Robotics and Automation in Agriculture and Forestry
Abstract: This paper presents a new tightly coupled LiDAR-visual odometry scheme for autonomous agricultural machinery operating in structureless environments and in the presence of fluctuating uncertainties. A robust adaptive filter is proposed to significantly mitigate the effects of unknown disturbances and noise. Meanwhile, the IMU orientation is effectively estimated with an error-state Kalman filter (ESKF). The IMU attitude estimate is integrated to significantly improve the accuracy of both LiDAR and visual odometry. As a result, the proposed approach achieves excellent output performance, smooth trajectories, and robustness against uncertainties. Finally, the effectiveness of the proposed LiDAR-visual odometry is confirmed through real-time experiments in different scenarios.
|
|
10:00-10:05, Paper WeBT20.2 | |
Joint 3D Point Cloud Segmentation Using Real-Sim Loop: From Panels to Trees and Branches |
|
Qiu, Tian | Cornell University |
Du, Ruiming | Cornell University |
Spine, Nikolai | Cornell University |
Cheng, Lailiang | Cornell University |
Jiang, Yu | Cornell University |
Keywords: Robotics and Automation in Agriculture and Forestry, Field Robots, Data Sets for Robotic Vision
Abstract: Modern orchards are planted in structured rows with distinct panel divisions to improve management. Accurate and efficient joint segmentation of point cloud from Panel to Tree and Branch (P2TB) is essential for robotic operations. However, most current segmentation methods focus on single-instance segmentation and depend on a sequence of deep networks to perform joint tasks. This strategy hinders the use of hierarchical information embedded in the data, leading to both error accumulation and increased costs for annotation and computation, which limits its scalability for real-world applications. In this study, we proposed a novel approach that incorporated a Real2Sim L-TreeGen for training data generation and a joint model (J-P2TB) designed for the P2TB task. The J-P2TB model, trained on the generated simulation dataset, was used for joint segmentation of real-world panel point clouds via zero-shot learning. Compared to representative methods, our model outperformed them in most segmentation metrics while using 40% fewer learnable parameters. This Sim2Real result highlighted the efficacy of L-TreeGen in model training and the performance of J-P2TB for joint segmentation, demonstrating its strong accuracy, efficiency, and generalizability for real-world applications. These improvements would not only greatly benefit the development of robots for automated orchard operations but also advance digital twin technology, enabling the facilitation of field robotics across various domains.
|
|
10:05-10:10, Paper WeBT20.3 | |
Energy Efficient Planning for Repetitive Heterogeneous Tasks in Precision Agriculture |
|
Xie, Shuangyu | Texas A&M University |
Goldberg, Ken | UC Berkeley |
Song, Dezhen | Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) |
Keywords: Task Planning, Agricultural Automation, Robotics and Automation in Agriculture and Forestry
Abstract: Robotic weed removal in precision agriculture introduces a repetitive heterogeneous task planning (RHTP) challenge for a mobile manipulator. RHTP has two unique characteristics: 1) an observe-first-and-manipulate-later (OFML) temporal constraint that forces a unique ordering of two different tasks for each target and 2) energy savings from efficient task collocation to minimize unnecessary movements. RHTP can be framed as a stochastic renewal process. According to the Renewal Reward Theorem, the expected energy usage per task cycle determines the long-run average. Traditional task and motion planning focuses on feasibility rather than optimality due to the unknown object and obstacle positions prior to execution. However, the known target/obstacle distribution in precision agriculture allows minimizing the expected energy usage. For each instance in this renewal process, we first compute a task-space partition, a novel data structure that captures all possibilities of task multiplexing and their probabilities together with robot reachability. Then we propose a region-based set-coverage problem to formulate the RHTP as a mixed-integer nonlinear program. We implemented and solved RHTP using a Branch-and-Bound solver. Compared to a baseline in simulations based on real field data, the results suggest a significant improvement in path length, number of robot stops, overall energy usage, and number of replans.
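As a quick illustration of the Renewal Reward Theorem invoked above, a Monte-Carlo check with hypothetical per-cycle energy and duration distributions; the numbers are arbitrary, not field data.

```python
import numpy as np

def renewal_reward_check(n_cycles=100_000, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical per-cycle energy and duration distributions (arbitrary numbers).
    energy = rng.uniform(2.0, 6.0, n_cycles)     # Joules spent per weed-removal cycle
    duration = rng.uniform(5.0, 15.0, n_cycles)  # seconds per cycle
    # Long-run average energy usage per unit time over many cycles ...
    long_run_rate = energy.sum() / duration.sum()
    # ... converges to E[energy per cycle] / E[cycle length] (Renewal Reward Theorem).
    theorem_rate = energy.mean() / duration.mean()
    return long_run_rate, theorem_rate
```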
|
|
10:10-10:15, Paper WeBT20.4 | |
Leveraging LLMs for Mission Planning in Precision Agriculture |
|
Zuzuarregui, Marcos | University of California, Merced |
Carpin, Stefano | University of California, Merced |
Keywords: Software Tools for Robot Programming, Robotics and Automation in Agriculture and Forestry, Agricultural Automation
Abstract: Robotics and artificial intelligence hold significant potential for advancing precision agriculture. While robotic systems have been successfully deployed for various tasks, adapting them to perform diverse missions remains challenging, particularly because end users often lack technical expertise. In this paper, we present an end-to-end system that leverages large language models (LLMs), specifically ChatGPT, to enable users to assign complex data collection tasks to autonomous robots using natural language instructions. To enhance reusability, mission plans are encoded using an existing IEEE task specification standard, and are executed on robots via ROS2 nodes that bridge high-level mission descriptions with existing ROS libraries. Through extensive experiments, we highlight the strengths and limitations of LLMs in this context, particularly regarding spatial reasoning and solving complex routing challenges, and show how our proposed implementation overcomes them.
|
|
10:15-10:20, Paper WeBT20.5 | |
Hierarchical Tri-Manual Planning for Vision-Assisted Fruit Harvesting with Quadrupedal Robots |
|
Liu, Zhichao | University of California, Riverside |
Zhou, Jingzong | University of California, Riverside |
Karydis, Konstantinos | University of California, Riverside |
Keywords: Robotics and Automation in Agriculture and Forestry, Field Robots, Bimanual Manipulation
Abstract: This paper addresses the challenge of developing a multi-arm quadrupedal robot capable of efficiently harvesting fruit in complex, natural environments. To overcome the inherent limitations of traditional bimanual manipulation, we introduce LocoHarv-3, the first three-arm quadrupedal robot, which builds on top of the Spot quadruped, and propose a novel hierarchical tri-manual planning approach for automated fruit harvesting with collision-free trajectories between the built-in end-effector of Spot and our custom-made bimanual manipulator. Our comprehensive semi-autonomous framework integrates teleoperation, supported by LiDAR-based odometry and mapping, with learning-based visual perception for accurate fruit detection and pose estimation. Validation is conducted through a series of controlled indoor experiments using motion capture and extensive field tests in natural settings. Results demonstrate a 90% success rate in in-lab settings with a single attempt, and field trials further verify the system's robustness and efficiency in more challenging real-world environments.
|
|
10:20-10:25, Paper WeBT20.6 | |
Capacitated Agriculture Fleet Vehicle Routing with Implements and Limited Autonomy: A Model and a Two-Phase Solution Approach |
|
Lopez-Sanchez, Aitor | Universidad Rey Juan Carlos |
Lujak, Marin | University Rey Juan Carlos |
Semet, Frederic | Centrale Lille |
Billhardt, Holger | Universidad Rey Juan Carlos |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Agent-Based Systems, Robotics and Automation in Agriculture and Forestry
Abstract: In this paper, we study the vehicle routing problem (VRP) for a fleet of cooperative autonomous agricultural robots (agribots) equipped with detachable implements, with the goal of efficiently and sustainably completing agricultural tasks in precision crop farming. The state of the art in agribot fleet routing with detachable implements is lacking. Consequently, we propose the Capacitated Agriculture Fleet Vehicle Routing Problem with Implements and Limited Autonomy (CAFVRPILA), designed to optimize the agribot fleet's routes across a set of given agricultural tasks while considering implement capacities, agribot-implement compatibilities, and the agribots' limited battery autonomy. A heuristic two-phase decomposition approach is proposed for this problem. Simulation experiments show that minimizing travel distances and costs with CAFVRPILA enhances sustainable farming while maximizing productivity and resource use. The results also demonstrate that synchronizing multiple operations improves efficiency, particularly in larger fleets.
|
|
10:25-10:30, Paper WeBT20.7 | |
Towards Closing the Loop in Robotic Pollination for Indoor Farming Via Autonomous Microscopic Inspection |
|
Kong, Chuizheng | Georgia Institute of Technology |
Qiu, Alex | Georgia Institute of Technology |
Wibowo, Idris | Georgia Institute of Technology |
Ren, Marvin | Georgia Institute of Technology |
Dhori, Aishik | Georgia Institute of Technology |
Ling, Kai-Shu | United States Department of Agriculture - Agricultural Research |
Hu, Ai-Ping | Georgia Tech Research Institute |
Kousik, Shreyas | Georgia Institute of Technology |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, Automation at Micro-Nano Scales
Abstract: Effective pollination is a key challenge for indoor farming, since bees struggle to navigate without the sun. While a variety of robotic system solutions have been proposed, it remains difficult to autonomously check that a flower has been sufficiently pollinated to produce high-quality fruit, which is especially critical for self-pollinating crops such as strawberries. To this end, this work proposes a novel robotic system for indoor farming. The proposed hardware combines a 7-degree-of-freedom (DOF) manipulator arm with a custom end-effector, comprised of an endoscope camera, a 2-DOF microscope subsystem, and a custom vibrating pollination tool; this is paired with algorithms to detect and estimate the pose of strawberry flowers, navigate to each flower, pollinate using the tool, and inspect with the microscope. The key novelty is vibrating the flower from below while simultaneously inspecting with a microscope from above. Each subsystem is validated via extensive experiments.
|
|
WeBT21 |
410 |
Optimization and Optimal Control |
Regular Session |
Co-Chair: Mastalli, Carlos | Heriot-Watt University |
|
09:55-10:00, Paper WeBT21.1 | |
Embedded Robust Model Predictive Path Integral Control Using Sensitivity Tubes and GPU Acceleration |
|
Falk Nyboe, Frederik | University of Southern Denmark |
Afifi, Amr | University of Twente |
Robuffo Giordano, Paolo | Irisa Cnrs Umr6074 |
Ebeid, Emad | University of Southern Denmark |
Franchi, Antonio | University of Twente / Sapienza University of Rome |
Keywords: Optimization and Optimal Control, Aerial Systems: Mechanics and Control, Embedded Systems for Robotic and Automation
Abstract: This paper proposes a method to robustify model predictive path integral (MPPI) control by directly taking into account the effects of parameter uncertainty into the controller formulation. Leveraging the recent notion of closed-loop state sensitivity, the proposed MPPI can consider the state sensitivity against parameter mismatch as a part of the system state, and consequently exploit this additional information to address the challenge of model mismatch in sampling-based model predictive control. Using an obstacle avoidance scenario, we demonstrate the use of our approach to control an aerial robot. We present an embedded implementation of our method, utilizing parallelization of computations on a GPU. Finally, we show the increased robustness of our approach over a standard MPPI controller through hardware-in-the-loop simulations and validate its embedded real-time properties.
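For context, a minimal sketch of a vanilla (non-robustified) MPPI update; the dynamics and cost callables are placeholders, and the paper's sensitivity-tube augmentation and GPU parallelization are not shown.

```python
import numpy as np

def mppi_step(x0, U, dynamics, cost, n_samples=256, sigma=0.3, lam=1.0, rng=None):
    """One MPPI update of the nominal control sequence U of shape (horizon, m).

    Samples perturbed control sequences, rolls them out through the dynamics,
    and re-weights the perturbations with a softmax over trajectory costs.
    """
    rng = np.random.default_rng() if rng is None else rng
    horizon, m = U.shape
    noise = rng.normal(0.0, sigma, size=(n_samples, horizon, m))
    costs = np.zeros(n_samples)
    for k in range(n_samples):
        x = x0
        for t in range(horizon):
            u = U[t] + noise[k, t]
            costs[k] += cost(x, u)
            x = dynamics(x, u)
    # Softmax weights over sampled trajectories (lower cost -> higher weight).
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    # Weighted average of the sampled perturbations updates the nominal plan.
    return U + np.tensordot(w, noise, axes=1)
```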
|
|
10:00-10:05, Paper WeBT21.2 | |
Guided Bayesian Optimization: Data-Efficient Controller Tuning with Digital Twin (I) |
|
Nobar, Mahdi | ETH Zurich |
Keller, Jürg | FHNW |
Rupenyan, Alisa | Zurich University of Applied Sciences |
Khosravi, Mohammad | TU Delft |
Lygeros, John | ETH Zurich |
Keywords: Optimization and Optimal Control, Calibration and Identification, Incremental Learning
Abstract: This article presents the guided Bayesian optimization (BO) algorithm as an efficient data-driven method for iteratively tuning closed-loop controller parameters using a digital twin of the system. The digital twin is built using closed-loop data acquired during standard BO iterations, and activated when the uncertainty in the Gaussian Process model of the optimization objective on the real system is high. We define a controller tuning framework independent of the controller or the plant structure. Our proposed methodology is model-free, making it suitable for nonlinear and unmodelled plants with measurement noise. The objective function consists of performance metrics modeled by Gaussian processes. We utilize the available information in the closed-loop system to progressively maintain a digital twin that guides the optimizer, improving the data efficiency of our method. Switching the digital twin on and off is triggered by our data-driven criteria related to the digital twin's uncertainty estimations in the BO tuning framework. Effectively, it replaces much of the exploration of the real system with exploration performed on the digital twin. We analyze the properties of our method in simulation and demonstrate its performance on two real closed-loop systems with different plant and controller structures. The experimental results show that our method requires fewer experiments on the physical plant than Bayesian optimization to find the optimal controller parameters.
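A minimal sketch of the switching idea, assuming a scikit-learn-style Gaussian-process surrogate and a simple uncertainty threshold; the paper's data-driven switching criterion is more elaborate than this placeholder rule.

```python
import numpy as np

def choose_evaluation_target(gp, theta_candidate, uncertainty_threshold):
    """Decide whether the next controller-parameter evaluation should run on the
    real system or on the digital twin, based on GP predictive uncertainty.

    gp: fitted surrogate exposing predict(X, return_std=True), e.g.
        sklearn.gaussian_process.GaussianProcessRegressor.
    """
    _, std = gp.predict(np.atleast_2d(theta_candidate), return_std=True)
    # High uncertainty about the real-system objective -> explore on the twin first.
    return "digital_twin" if std[0] > uncertainty_threshold else "real_plant"
```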
|
|
10:05-10:10, Paper WeBT21.3 | |
Enhancing Robotic System Robustness Via Lyapunov Exponent-Based Optimization |
|
Fadini, Gabriele | ETHZ |
Coros, Stelian | ETH Zurich |
Keywords: Optimization and Optimal Control, Dynamics, Legged Robots
Abstract: We present a novel differentiable approach to quantifying and optimizing stability in robotic systems, addressing an open challenge in the field of robot analysis, control, design, and optimization. Our method leverages differentiable simulation over extended time horizons to estimate a robustness metric based on the Lyapunov exponents. The proposed metric offers several properties, including a natural extension to limit cycles (commonly encountered in robotics tasks and locomotion) and independence from the trajectory path for states converging to the attractor. We showcase, with an ad-hoc JAX gradient-based optimization framework, remarkable flexibility in tackling the robustness challenge. Our approach is tested through diverse scenarios of varying complexity, encompassing high-degree-of-freedom systems and contact-rich environments. The positive outcomes across these cases highlight the potential of our method in quantifying and possibly enhancing system robustness.
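For context, a minimal Benettin-style estimate of the largest Lyapunov exponent from a black-box simulator step; the paper's differentiable-simulation formulation and JAX-based optimization are not reproduced, and the step function, step size, and perturbation scale below are placeholders.

```python
import numpy as np

def largest_lyapunov_exponent(step, x0, n_steps=2000, dt=0.01, eps=1e-6, seed=0):
    """Benettin-style estimate: track a tiny perturbation, renormalize it each
    step, and average the log growth rates.

    step(x) -> next state after time dt (any black-box simulator step).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d0 = rng.normal(size=x.shape)
    x_pert = x + eps * d0 / np.linalg.norm(d0)
    log_growth = 0.0
    for _ in range(n_steps):
        x, x_pert = step(x), step(x_pert)
        d = np.linalg.norm(x_pert - x)
        log_growth += np.log(d / eps)
        # Renormalize the perturbation back to size eps along the current direction.
        x_pert = x + eps * (x_pert - x) / d
    return log_growth / (n_steps * dt)
```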
|
|
10:10-10:15, Paper WeBT21.4 | |
Endpoint-Explicit Differential Dynamic Programming Via Exact Resolution |
|
Parilli, Maria | Universidad Simón Bolívar |
Martinez, Sergi | Heriot-Watt |
Mastalli, Carlos | Heriot-Watt University |
Keywords: Optimization and Optimal Control, Multi-Contact Whole-Body Motion Planning and Control, Formal Methods in Robotics and Automation
Abstract: We introduce a novel method for handling endpoint constraints in constrained differential dynamic programming (DDP). Unlike existing approaches, our method guarantees quadratic convergence and is exact, effectively managing rank deficiencies in both endpoint and stagewise equality constraints. It is applicable to both forward and inverse dynamics formulations, making it particularly well-suited for model predictive control (MPC) applications and for accelerating optimal control (OC) solvers. We demonstrate the efficacy of our approach across a broad range of robotics problems and provide a user-friendly open-source implementation within CROCODDYL.
|
|
10:15-10:20, Paper WeBT21.5 | |
Second-Order Stein Variational Dynamic Optimization |
|
Aoyama, Yuichiro | Georgia Institute of Technology |
Lehmann, Peter | Georgia Institute of Technology |
Theodorou, Evangelos | Georgia Institute of Technology |
Keywords: Optimization and Optimal Control, Constrained Motion Planning, Motion and Path Planning
Abstract: We present a novel second-order trajectory optimization algorithm based on Stein Variational Newton's Method and Maximum Entropy Differential Dynamic Programming. The proposed algorithm, called Stein Variational Differential Dynamic Programming, is a kernel-based extension of Maximum Entropy Differential Dynamic Programming that combines the best of the two worlds of sampling-based and gradient-based optimization. The resulting algorithm avoids known drawbacks of gradient-based dynamic optimization in terms of getting stuck at local minima, while it overcomes limitations of sampling-based stochastic optimization in terms of introducing undesirable stochasticity when applied in online fashion. To test the efficacy of the proposed algorithm, experiments are conducted in Model Predictive Control mode. The experiments include comparisons with unimodal and multimodal Maximum Entropy Differential Dynamic Programming as well as Model Predictive Path Integral Control and its multimodal and Stein Variational extensions. The results demonstrate the superior performance of the proposed algorithms and confirm the hypothesis that there is a middle ground between sampling- and gradient-based optimization that is indeed beneficial for dynamic optimization.
|
|
10:20-10:25, Paper WeBT21.6 | |
Application of Koopman Direct Encoding-Based Model Predictive Control to Nonlinear Electromechanical Systems |
|
Park, Sungbin | Korea Advanced Institute of Science and Technology |
Kim, Won Dong | Korea Advanced Institute of Science & Technology (KAIST) |
Jeon, Sangha | Korea Advanced Institute of Science and Technology(KAIST) |
Kim, Jung | KAIST |
Keywords: Optimization and Optimal Control, Dynamics, Contact Modeling
Abstract: The Koopman operator framework has shown promising results in enabling the analysis of nonlinear dynamics into an infinite-dimensional linear representation. Koopman direct encoding (KDE) is a model-based approach that utilizes inner products and compositions in a Hilbert space to compute the Koopman operator. However, it has primarily been applied to autonomous systems and simulation environments. Here, we extend the application of KDE to nonautonomous systems and real-world environments by introducing Koopman direct encoding-based model predictive control (KDE-MPC). It was validated on nonlinear electromechanical systems with segmented dynamic conditions, such as contact-noncontact transitions, which pose challenges for modeling and control. Simulation results demonstrate a more stable and smoother position profile compared to proportional-integral-derivative control, particularly at discontinuous boundaries. KDE-MPC was also applied to real-world systems, achieving similar position tracking performance to simulation results. We anticipate that KDE-MPC will offer a viable solution for complex robotic control challenges.
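For contrast with the model-based direct encoding used above, a generic data-driven EDMD approximation of the Koopman operator (not the paper's KDE method, which computes the operator from inner products of basis functions); the lifting function is a placeholder.

```python
import numpy as np

def edmd_koopman(X, Y, lift):
    """Least-squares (EDMD) approximation of the Koopman operator from data.

    X, Y: (n_samples, state_dim) snapshot pairs with Y[i] the successor of X[i].
    lift: function mapping a batch of states to observables, (n, d) -> (n, n_obs).
    """
    Phi_x, Phi_y = lift(X), lift(Y)
    # Solve Phi_x @ K ≈ Phi_y in the least-squares sense.
    K, *_ = np.linalg.lstsq(Phi_x, Phi_y, rcond=None)
    return K

def rollout_lifted(K, lift, x0, steps):
    # Propagate the lifted state linearly with the identified Koopman matrix.
    z = lift(np.atleast_2d(np.asarray(x0, dtype=float)))
    traj = [z]
    for _ in range(steps):
        traj.append(traj[-1] @ K)
    return np.vstack(traj)
```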
|
|
10:25-10:30, Paper WeBT21.7 | |
Effective Search for Control Hierarchies within the Policy Decomposition Framework |
|
Khadke, Ashwin | The AI Institute |
Geyer, Hartmut | Carnegie Mellon University |
Keywords: Optimization and Optimal Control, Evolutionary Robotics, Reinforcement Learning
Abstract: Policy decomposition is a novel framework for approximating optimal control policies of complex dynamical systems with a hierarchy of policies derived from smaller but tractable subsystems. It stands out amongst the class of hierarchical control methods by estimating a priori how well the closed-loop behavior of different control hierarchies matches the optimal policy. However, the number of possible hierarchies grows prohibitively with the number of inputs and the dimension of the state-space of the system making it unrealistic to estimate the closed-loop performance for all hierarchies. Here, we present the development of two search methods based on Genetic Algorithm and Monte-Carlo Tree Search to tackle this combinatorial challenge, and demonstrate that it is indeed surmountable. We showcase the efficacy of our search methods and the generality of the framework by applying it towards finding hierarchies for control of three distinct robotic systems: a simplified biped, a planar manipulator, and a quadcopter. The discovered hierarchies, in comparison to heuristically designed ones, provide improved closed-loop performance or can be computed in minimal time with marginally worse control performance, and also exceed the control performance of policies obtained with popular deep reinforcement learning methods.
|
|
WeBT22 |
411 |
Learning Based Planning for Manipulation 2 |
Regular Session |
Chair: Choi, Changhyun | University of Minnesota, Twin Cities |
|
09:55-10:00, Paper WeBT22.1 | |
Movement Primitive Diffusion: Learning Gentle Robotic Manipulation of Deformable Objects |
|
Scheikl, Paul Maria | Johns Hopkins University |
Schreiber, Nicolas | Karlsruhe Institute of Technology (KIT) |
Haas, Christoph | Karlsruhe Institute of Technology (KIT) |
Freymuth, Niklas | Karlsruhe Institute of Technology |
Neumann, Gerhard | Karlsruhe Institute of Technology |
Lioutikov, Rudolf | Karlsruhe Institute of Technology |
Mathis-Ullrich, Franziska | Friedrich-Alexander-University Erlangen-Nurnberg (FAU) |
Keywords: Surgical Robotics: Laparoscopy, Imitation Learning
Abstract: Policy learning in robot-assisted surgery (RAS) lacks data-efficient and versatile methods that exhibit the desired motion quality for delicate surgical interventions. To this end, we introduce Movement Primitive Diffusion (MPD), a novel method for imitation learning (IL) in RAS that focuses on gentle manipulation of deformable objects. The approach combines the versatility of diffusion-based imitation learning (DIL) with the high-quality motion generation capabilities of Probabilistic Dynamic Movement Primitives (ProDMPs). This combination enables MPD to achieve gentle manipulation of deformable objects, while maintaining data efficiency critical for RAS applications where demonstration data is scarce. We evaluate MPD across various simulated and real world robotic tasks on both state and image observations. MPD outperforms state-of-the-art DIL methods in success rate, motion quality, and data efficiency.
|
|
10:00-10:05, Paper WeBT22.2 | |
Sim-Grasp: Learning 6-DOF Grasp Policies for Cluttered Environments Using a Synthetic Benchmark |
|
Li, Juncheng | Purdue University |
Cappelleri, David | Purdue University |
Keywords: Mobile Manipulation, Deep Learning in Grasping and Manipulation, Grasping
Abstract: In this paper, we present Sim-Grasp, a robust 6-DOF two-finger grasping system that integrates advanced language models for enhanced object manipulation in cluttered environments. We introduce the Sim-Grasp-Dataset, which includes 1,550 objects across 500 scenarios with 7.9 million annotated labels, and develop Sim-GraspNet to generate grasp poses from point clouds. The Sim-Grasp policies achieve grasping success rates of 97.14% for single objects and 87.43% and 83.33% for mixed clutter scenarios of Levels 1-2 and Levels 3-4 objects, respectively. By incorporating language models for target identification through text and box prompts, Sim-Grasp enables both object-agnostic and target picking, pushing the boundaries of intelligent robotic systems.
|
|
10:05-10:10, Paper WeBT22.3 | |
Controlled Robot Language with Frame Semantics (FrameCRL) for Autonomous Context-Aware High-Level Planning |
|
Tran, Dang | University of Alabama |
Yan, Fujian | Wichita State University |
Zhang, Qiang | The University of Alabama |
Zhang, Yinlong | Shenyang Institute of Automation, Chinese Academy of Sciences |
He, Hongsheng | The University of Alabama |
Keywords: AI-Based Methods, Human-Robot Collaboration, Dual Arm Manipulation
Abstract: This paper proposes a configurable and scalable framework based on Controlled Robot Language with Frame Semantics (FrameCRL) for plan generation. Given natural language instructions, FrameCRL constructs an equivalent formal semantic formulation in the form of discourse representation structures (DRS). Imperative verbs are extracted from the semantic structures as keys to anchor relevant semantic frames from FrameNet, and the selected semantic frames are used to construct goal statements in planning language. Non-imperative statements are further analyzed to generate object specifications and the initial state of the planning problem. These generated statements are then merged into a single planning script, which can be solved directly by the integrated planner. The performance of FrameCRL was evaluated on various natural language corpora and compared with large language model (LLM)-based methods in plan generation. The results demonstrated that FrameCRL outperforms these methods in generating high-quality plans and can handle large-context scenarios. FrameCRL was also tested on pick-and-place tasks with a dual-arm robot, where it demonstrated robust linguistic understanding.
|
|
10:10-10:15, Paper WeBT22.4 | |
Effective Tuning Strategies for Generalist Robot Manipulation Policies |
|
Zhang, Wenbo | University of Adelaide |
Li, Yang | Commonwealth Scientific and Industrial Research Organisation |
Qiao, Yanyuan | The University of Adelaide |
Huang, Siyuan | Shanghai Jiao Tong University |
Liu, Jiajun | CSIRO |
Dayoub, Feras | The University of Adelaide |
Ma, Xiao | Dyson |
Liu, Lingqiao | University of Adelaide |
Keywords: Deep Learning in Grasping and Manipulation, Transfer Learning
Abstract: Generalist robot manipulation policies (GMPs) have the potential to generalize across a wide range of tasks, environments, and devices. However, existing policies continue to struggle with out-of-distribution scenarios, given that action data remains notoriously hard to collect. While fine-tuning offers a practical way to quickly adapt a GMP to novel domains and tasks with limited samples, we observe that the performance of the resulting GMP differs significantly depending on the design choices of the fine-tuning strategy. In this work, we first conduct an in-depth empirical study to investigate the effect of key factors in GMP fine-tuning strategies, covering the action space, policy head, and the choice of tunable parameters, where over 2,500 rollouts are evaluated for a single configuration. We systematically discuss and summarize our findings and identify the key design choices, which we believe provide a practical guideline for GMP fine-tuning. We observe that in a low-data regime, with carefully chosen fine-tuning strategies, a GMP significantly outperforms state-of-the-art imitation learning algorithms. The results presented in this work establish a new baseline for future studies on fine-tuned GMPs.
|
|
10:15-10:20, Paper WeBT22.5 | |
RM-Planner: Integrating Reinforcement Learning with Whole-Body Model Predictive Control for Mobile Manipulation |
|
Zhuang, Zixuan | Sun Yat-Sen University |
Zheng, Le | Sun Yat-Sen University |
Li, Wanyue | The University of Hong Kong |
Liu, Renming | Sun Yat-Sen University |
Lu, Peng | The University of Hong Kong |
Cheng, Hui | Sun Yat-Sen University |
Keywords: Mobile Manipulation, AI-Enabled Robotics, Service Robotics
Abstract: Mobile manipulation is a crucial problem in various real-world applications. However, existing methods suffer from low training efficiency and sparse rewards, and require complex coordination strategies between the mobile base and the arm. In this paper, we propose RM-Planner, a planning method for mobile manipulation tasks in unknown complex environments. Adopting a two-layer hierarchical framework, we use a whole-body Model Predictive Control (MPC)-based low-level planner to track subgoals and generate aggressive but safe joint commands throughout the entire manipulation process, while a Reinforcement Learning (RL)-based high-level policy directly uses 3D point cloud representations of the environment to guide the robot toward optimal manipulation postures based on current observations and specific task objectives. We conduct extensive simulations and real-world experiments, in which RM-Planner significantly outperforms state-of-the-art methods. Our code will be released at https://github.com/SYSU-RoboticsLab/RM-Planner.git.
|
|
10:20-10:25, Paper WeBT22.6 | |
Routing Manipulation of Deformable Linear Object Using Reinforcement Learning and Diffusion Policy |
|
Li, Mingen | University of Minnesota Twin Cities |
Yu, Houjian | University of Minnesota, Twin Cities |
Choi, Changhyun | University of Minnesota, Twin Cities |
Keywords: Deep Learning in Grasping and Manipulation, Reinforcement Learning, Imitation Learning
Abstract: Tasks involving deformable linear objects (DLOs) are prevalent in daily life but pose significant challenges due to their infinite degrees of freedom and underactuated nature. Frequent contact between DLOs and surrounding objects with unknown physical parameters, such as friction, further complicates their manipulation. Performing tasks like routing ropes through a hole requires gentle yet robust manipulation, making it particularly challenging. Previous research has not adequately addressed general DLO manipulation tasks that involve intensive contact, especially in environments with rough surfaces. This paper presents a robust and delicate manipulation learning approach for the DLO routing task, leveraging reinforcement learning (RL) and diffusion policy. First, reinforcement learning agents are trained separately for rope insertion and pulling. During training, the agents are encouraged to minimize rope tension throughout task execution in environments with randomized friction to achieve delicate motion. Next, the rollouts from these agents are collected as expert demonstrations to train a diffusion policy. Our approach generates delicate motions to prevent the rope from being damaged or getting stuck on rough surfaces while remaining robust against environmental disturbances. Please refer to our project page: https://lmeee.github.io/DLOPull/
|
|
10:25-10:30, Paper WeBT22.7 | |
TransDiff: Diffusion-Based Method for Manipulating Transparent Objects Using a Single RGB-D Image |
|
Wang, Haoxiao | Tianjin University of Technology |
Zhou, Kaichen | University of Oxford |
Gu, Binrui | Peking University |
Feng, ZhiYuan | Tsinghua University |
Wang, Weijie | Zhejiang University |
Sun, Peilin | Zhejiang University |
Xiao, Yicheng | Southeast University |
Zhang, Jianhua | Tianjin University of Technology |
Dong, Hao | Peking University |
Keywords: AI-Based Methods, Grippers and Other End-Effectors
Abstract: Manipulating transparent objects presents significant challenges due to the complexities introduced by their reflection and refraction properties, which considerably hinder the accurate estimation of their 3D shapes. To address these challenges, we, for the first time, propose a single-view RGB-D-based depth completion framework, TransDiff, that leverages Denoising Diffusion Probabilistic Models (DDPM) to achieve material-agnostic object grasping in desktop scenarios. Specifically, we leverage features extracted from RGB images, including semantic segmentation, edge maps, and normal maps, to condition the depth map generation process. Our method learns an iterative denoising process that transforms a random depth distribution into a depth map, guided by initially refined depth information, ensuring more accurate depth estimation in scenarios involving transparent objects. Additionally, we propose a novel training method to better align the noisy depth and RGB image features, which are used as conditions to refine the depth estimation step by step. Finally, we use an improved inference process to accelerate the denoising procedure. Through comprehensive experimental validation, we demonstrate that our method significantly outperforms the baselines on both synthetic and real-world benchmarks with acceptable inference time. A demo of our method can be found at: https://transdiff.github.io/
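For readers unfamiliar with how a conditional denoising loop iteratively refines a depth map, the sketch below shows a generic DDPM-style reverse process in NumPy. It is an editorial illustration only: the random-weight stub stands in for the paper's noise predictor conditioned on RGB-derived features, and the schedule and shapes are arbitrary.

```python
import numpy as np

# Generic DDPM-style reverse (denoising) loop over a depth map. The real
# system conditions its noise predictor on RGB-derived features
# (segmentation, edges, normals); here a simple stub stands in.
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

rng = np.random.default_rng(0)
H = W = 32

def predict_noise(depth, t, condition):
    """Stub for the conditional noise-prediction network (not a trained model)."""
    return 0.1 * condition + 0.01 * depth

condition = rng.standard_normal((H, W))      # e.g. fused RGB features
depth = rng.standard_normal((H, W))          # start from pure noise

for t in reversed(range(T)):
    eps = predict_noise(depth, t, condition)
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (depth - coef * eps) / np.sqrt(alphas[t])
    noise = rng.standard_normal((H, W)) if t > 0 else 0.0
    depth = mean + np.sqrt(betas[t]) * noise   # one reverse diffusion step

print("denoised depth stats:", depth.mean(), depth.std())
```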
|
|
WeBT23 |
412 |
Autonomous Vehicle Perception 4 |
Regular Session |
Chair: Valada, Abhinav | University of Freiburg |
Co-Chair: Ding, Wenchao | Fudan University |
|
09:55-10:00, Paper WeBT23.1 | |
Efficient Submap-Based Autonomous MAV Exploration Using Visual-Inertial SLAM Configurable for LiDARs or Depth Cameras |
|
Papatheodorou, Sotiris | Imperial College London |
Boche, Simon | Technical University of Munich |
Barbas Laina, Sebastián | TU Munich |
Leutenegger, Stefan | Technical University of Munich |
Keywords: Aerial Systems: Perception and Autonomy, Reactive and Sensor-Based Planning
Abstract: Autonomous exploration of unknown space is an essential component for the deployment of mobile robots in the real world. Safe navigation is crucial for all robotics applications and requires accurate and consistent maps of the robot's surroundings. To achieve full autonomy and allow deployment in a wide variety of environments, the robot must rely on on-board state estimation which is prone to drift over time. We propose a Micro Aerial Vehicle (MAV) exploration framework based on local submaps to allow retaining global consistency by applying loop-closure corrections to the relative submap poses. To enable large-scale exploration we efficiently compute global, environment-wide frontiers from the local submap frontiers and use a sampling-based next-best-view exploration planner. Our method seamlessly supports using either a LiDAR sensor or a depth camera, making it suitable for different kinds of MAV platforms. We perform comparative evaluations in simulation against a state-of-the-art submap-based exploration framework to showcase the efficiency and reconstruction quality of our approach. Finally, we demonstrate the applicability of our method to real-world MAVs, one equipped with a LiDAR and the other with a depth camera.
|
|
10:00-10:05, Paper WeBT23.2 | |
Parking-SG: Open-Vocabulary Hierarchical 3D Scene Graph Representation for Open Parking Environments |
|
Zhang, Yaowen | Beijing Institute of Technology |
Ruan, Yi | Beijing Institute of Technology |
Pan, Miaoxin | Beijing Institute of Technology |
Yang, Yi | Beijing Institute of Technology |
Fu, Mengyin | Beijing Institute of Technology |
Keywords: Automation Technologies for Smart Cities, Mapping, Semantic Scene Understanding
Abstract: Automatic Valet Parking (AVP) has garnered significant attention from industry and academia due to its potential to enhance traffic efficiency, parking safety, and user experience. While AVP technologies have been successfully applied in standard parking scenarios with clear markings, real-world parking environments are far more diverse and complex, posing challenges for current systems. To address these limitations, we present Parking-SG, an open-vocabulary hierarchical 3D scene graph representation, facilitating the application of AVP in open and complex environments. Our approach builds an object-based, open-vocabulary map that integrates both ground-level and ground-above objects for comprehensive environmental understanding. Leveraging common sense reasoning and object behavior relationships, various standard or non-standard parking spaces are inferred in open environments. Additionally, we extract and analyze path topology to construct a hierarchical map representation, supporting complex AVP tasks. Parking-SG is validated in both simulated and real-world environments, demonstrating its ability to generate rich environmental representations, accurately and flexibly infer parking spaces, and effectively perform complex AVP tasks.
|
|
10:05-10:10, Paper WeBT23.3 | |
3D Lane Detection Based on Projection-Consistent Reference Points and Intra & Inter-Lane Context |
|
Bing, Yiqiu | Capital Normal University |
Niu, Huilin | Capital Normal University |
Zhang, Hong | Sensetime |
Jiang, Na | Capital Normal University |
Zhou, Zhong | BeiHang University |
Geng, Qichuan | Capital Normal University |
Keywords: Computer Vision for Transportation, Deep Learning for Visual Perception
Abstract: 3D lane detection aims to identify lane categories and trends in 3D space, which is a vital and challenging task in autonomous driving. Existing methods introduce various priors to guide 3D lane prediction, which generally consist of a series of reference points for context aggregation. However, due to the misalignment between these reference points and the lanes, it is difficult to obtain complete and discriminative context for complex instances. In this paper, we introduce 3D priors that adapt to lane appearances and serve as references to aggregate the lane context. Specifically, we propose a projection-consistent reference generation strategy to keep the projected 3D reference points geometrically consistent with the corresponding lanes in images. In addition, a segmentation-lifting denoising strategy is designed to improve the ability of the model to map the lane segmentation into 3D space. To leverage more lane-related information, we propose a decoupled lane-context aggregation module that considers the perspectives of individual geometries and integrated layout, namely intra-lane and inter-lane context. Extensive experiments on the OpenLane dataset show that our approach outperforms previous methods and achieves state-of-the-art performance. The code will be made publicly available.
|
|
10:10-10:15, Paper WeBT23.4 | |
Unveiling the Black Box: Independent Functional Module Evaluation for Bird's-Eye-View Perception Model |
|
Zhang, Ludan | Nankai University |
Ding, Xiaokang | School of Electronic and Information Engineering, Beijing Univer |
Dai, Yuqi | Tsinghua University |
He, Lei | Tsinghua University |
Li, Keqiang | Tsinghua University |
Keywords: Computer Vision for Automation, Deep Learning Methods, Representation Learning
Abstract: End-to-end models are emerging as the mainstream in autonomous driving perception. However, the inability to meticulously deconstruct their internal mechanisms reduces development efficiency and impedes the establishment of trust. To address this issue, we present the Independent Functional Module Evaluation for Bird's-Eye-View Perception Model (BEV-IFME), a novel framework that juxtaposes a module's feature maps against the ground truth within a unified semantic representation space to quantify their similarity, thereby assessing the training maturity of individual functional modules. The core of the framework lies in the process of feature map encoding and representation aligning, facilitated by our proposed two-stage Alignment AutoEncoder, which ensures the preservation of salient information and the consistency of feature structure. The metric for evaluating the training maturity of functional modules, the Similarity Score, demonstrates a robust positive correlation with BEV metrics, with an average correlation coefficient of 0.9387, attesting to the framework's reliability for assessment purposes.
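To make the Similarity Score idea concrete, here is a minimal editorial sketch: encode a module's feature map and a ground-truth-derived map into a shared space and compare the embeddings with cosine similarity. The two-layer random-weight encoder and pooling are placeholders for the paper's two-stage Alignment AutoEncoder.

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(feature_map, W1, W2):
    """Toy encoder standing in for the Alignment AutoEncoder: pool, then project."""
    pooled = feature_map.mean(axis=(1, 2))          # global average pool -> (C,)
    return np.tanh(W2 @ np.tanh(W1 @ pooled))       # embed into a shared space

def similarity_score(z_module, z_gt):
    """Cosine similarity between module and ground-truth embeddings."""
    return float(z_module @ z_gt /
                 (np.linalg.norm(z_module) * np.linalg.norm(z_gt) + 1e-8))

C, H, W = 64, 16, 16
W1 = rng.standard_normal((32, C)) / np.sqrt(C)
W2 = rng.standard_normal((16, 32)) / np.sqrt(32)

module_features = rng.standard_normal((C, H, W))    # a BEV module's output
gt_features = module_features + 0.1 * rng.standard_normal((C, H, W))

score = similarity_score(encode(module_features, W1, W2),
                         encode(gt_features, W1, W2))
print(f"similarity score: {score:.3f}")   # close to 1.0 for well-aligned features
```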
|
|
10:15-10:20, Paper WeBT23.5 | |
Panoptic-Depth Forecasting |
|
Hurtado, Juana Valeria | University of Freiburg |
Mohan, Riya | Freiburg University |
Valada, Abhinav | University of Freiburg |
Keywords: Deep Learning for Visual Perception, RGB-D Perception, Visual Learning
Abstract: Forecasting the semantics and 3D structure of scenes is essential for robots to navigate and plan actions safely. Recent methods have explored semantic and panoptic scene forecasting; however, they do not consider the geometry of the scene. In this work, we propose the panoptic-depth forecasting task for jointly predicting future panoptic segmentation and depth maps from monocular camera images. To facilitate this work, we extend the popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR point clouds and leveraging sequential labeled data. We also introduce a suitable evaluation metric that quantifies both the panoptic quality and depth estimation accuracy of future frames in a coherent manner. Furthermore, we present two baselines and propose a novel architecture that learns rich spatio-temporal representations by incorporating a transformer-based encoder, a forecasting module, and task-specific decoders to predict future panoptic-depth outputs. Extensive evaluations demonstrate the effectiveness of the proposed architecture across two datasets and three forecasting tasks, consistently addressing the primary challenges. We make the code publicly available at https://pdcast.cs.uni-freiburg.de
|
|
10:20-10:25, Paper WeBT23.6 | |
Coarse-To-Fine Cross-Modality Generation for Enhancing Vehicle Re-Identification with High-Fidelity Synthetic Data |
|
Jin, Leyang | National University of Singapore |
Ji, Wei | Nanjing University |
Chua, Tatseng | National University of Singapore |
Zheng, Zhedong | University of Macau |
Keywords: Computer Vision for Transportation, Intelligent Transportation Systems
Abstract: Due to the critical issues of privacy and partial occlusion, license plate information is not always available in vehicle recognition systems. Consequently, researchers have increasingly turned towards vehicle re-identification (reID) techniques to bridge the gap between cross-view camera systems. Despite the growing interest, one major challenge persists: the scarcity of authentic, large-scale training datasets. To address this challenge, this paper introduces a coarse-to-fine generation pipeline designed to synthesize high-fidelity vehicle data, thereby facilitating subsequent vehicle representation learning. Specifically, the proposed approach consists of three stages: Prompt Processing, Diffusion Fine-tuning, and Semantic Filtering. First, we collect detailed prompts from vehicle websites and companies with fine-grained vehicle prototype attributes. Next, we leverage the prior knowledge of these automotive prototypes to fine-tune diffusion models. Finally, to ensure the quality of the synthesized data, we employ pre-trained vision-language models to filter out substandard images. Building upon the high-quality data generated by this pipeline, we validate the effectiveness using vanilla models. Extensive experimental evaluations demonstrate that our approach achieves competitive accuracy on public benchmarks such as VeRi-776, VehicleID and CityFlowV2, and is compatible with various model architectures.
|
|
10:25-10:30, Paper WeBT23.7 | |
HGS-Mapping: Online Dense Mapping Using Hybrid Gaussian Representation in Urban Scenes |
|
Wu, Ke | Fudan University |
Zhang, Kaizhao | Fudan University |
Zhang, Zhiwei | Fudan University |
Tie, Muer | Fudan University |
Yuan, Shanshuai | Fudan University |
Zhao, Jieru | Shanghai Jiao Tong University |
Gan, Zhongxue | Fudan University |
Ding, Wenchao | Fudan University |
Keywords: Mapping, RGB-D Perception, Sensor Fusion
Abstract: Online dense mapping of urban scenes forms a fundamental cornerstone for scene understanding and navigation of autonomous vehicles. Recent advancements in dense mapping methods are mainly based on NeRF, whose rendering speed is too slow to meet online requirements. 3D Gaussian Splatting (3DGS), with its rendering speed hundreds of times faster than NeRF, holds greater potential for online dense mapping. However, integrating 3DGS into a street-view dense mapping framework still faces two challenges: incomplete reconstruction due to the absence of geometric information beyond the LiDAR coverage area, and extensive computation for reconstruction in large urban scenes. To this end, we propose HGS-Mapping, an online dense mapping framework for unbounded large-scale scenes. To attain complete construction, our framework introduces a Hybrid Gaussian Representation, which models different parts of the entire scene using Gaussians with distinct properties. Furthermore, we employ a hybrid Gaussian initialization mechanism and an adaptive update method to achieve high-fidelity and rapid reconstruction. To the best of our knowledge, we are the first to integrate Gaussian representation into online dense mapping of urban scenes. Our approach achieves SOTA reconstruction accuracy while employing only 66% of the number of Gaussians, leading to 20% faster reconstruction speed.
|
|
WeCT2 |
301 |
Interactive Robot Learning |
Regular Session |
Chair: Losey, Dylan | Virginia Tech |
Co-Chair: Zhou, Bolei | University of California, Los Angeles |
|
11:15-11:20, Paper WeCT2.1 | |
Personalizing Interfaces to Humans with User-Friendly Priors |
|
Christie, Benjamin | Virginia Tech |
Nemlekar, Heramb | Virginia Tech |
Losey, Dylan | Virginia Tech |
Keywords: Human-Robot Collaboration, Probabilistic Inference, Virtual Reality and Interfaces
Abstract: Robots often need to convey information to human users. For example, robots can leverage visual, auditory, and haptic interfaces to display their intent or express their internal state. In some scenarios there are socially agreed upon conventions for what these signals mean: e.g., a red light indicates an autonomous car is slowing down. But as robots develop new capabilities and seek to convey more complex data, the meaning behind their signals is not always mutually understood: one user might think a flashing light indicates the autonomous car is an aggressive driver, while another user might think the same signal means the autonomous car is defensive. In this paper we enable robots to adapt their interfaces to the current user so that the human's personalized interpretation is aligned with the robot's meaning. We start with an information theoretic end-to-end approach, which automatically tunes the interface policy to optimize the correlation between human and robot. But to ensure that this learning policy is intuitive --- and to accelerate how quickly the interface adapts to the human --- we recognize that humans have priors over how interfaces should function. For instance, humans expect interface signals to be proportional and convex. Our approach biases the robot's interface towards these priors, resulting in signals that are adapted to the current user while still following social expectations. Our simulations and user study results across 15 participants suggest that these priors improve robot-to-human communication. See videos here: https://youtu.be/Re3OLg57hp8.
|
|
11:20-11:25, Paper WeCT2.2 | |
Personalization in Human-Robot Interaction through Preference-Based Action Representation Learning |
|
Wang, Ruiqi | Purdue University |
Zhao, Dezhong | Beijing University of Chemical Technology |
Suh, Dayoon | Purdue University |
Yuan, Ziqin | Purdue University |
Chen, Guohua | Beijing University of Chemical Technology |
Min, Byung-Cheol | Purdue University |
Keywords: Human-Centered Robotics, Representation Learning, Human Factors and Human-in-the-Loop
Abstract: Preference-based reinforcement learning (PbRL) has shown significant promise for personalization in human-robot interaction (HRI) by explicitly integrating human preferences into the robot learning process. However, existing practices often require training a personalized robot policy from scratch, resulting in inefficient use of human feedback. In this paper, we propose preference-based action representation learning (PbARL), an efficient fine-tuning method that decouples common task structure from preference by leveraging pre-trained robot policies. Instead of directly fine-tuning the pre-trained policy with human preference, PbARL uses it as a reference for an action representation learning task that maximizes the mutual information between the pre-trained source domain and the target user preference-aligned domain. This approach allows the robot to personalize its behaviors while preserving original task performance and eliminates the need for extensive prior information from the source domain, thereby enhancing efficiency and practicality in real-world HRI scenarios. Empirical results on the Assistive Gym benchmark and a real-world user study (N=8) demonstrate the benefits of our method compared to state-of-the-art approaches. Website at https://sites.google.com/view/pbarl.
|
|
11:25-11:30, Paper WeCT2.3 | |
Interface Matters: Comparing First and Third-Person Perspective Interfaces for Bi-Manual Robot Behavioural Cloning |
|
Luo, Haining | Imperial College London |
Chacon Quesada, Rodrigo | Imperial College London |
Casado, Fernando E. | Imperial College London |
Lingg, Nico | Imperial College London |
Demiris, Yiannis | Imperial College London |
Keywords: Virtual Reality and Interfaces, Bimanual Manipulation, Learning from Demonstration
Abstract: Despite the growing interest in Behavioural Cloning for robots, little existing research has explicitly explored the impact of user interfaces on the effectiveness of expert demonstrations. We investigate the importance of user interface design in Behavioural Cloning, highlighting the critical role that interfaces play in conveying human demonstrations and robot capabilities. This study compares the effectiveness of first- and third-person perspective interfaces for robot shoe-lacing, a highly dexterous, bi-manual manipulation task that involves deformable objects and requires high precision. Our study highlights the importance of considering the impact of interface design on expert demonstration quality in Behavioural Cloning applications. By providing a first-person perspective, we observed significant differences in demonstration execution time and consistency compared to the third-person perspective. These findings suggest that the choice of interface can influence the quality of expert demonstrations, which in turn affects the performance of learning algorithms.
|
|
11:30-11:35, Paper WeCT2.4 | |
Robot Policy Transfer with Online Demonstrations: An Active Reinforcement Learning Approach |
|
Hou, Muhan | Vrije University Amsterdam |
Hindriks, Koen | Vrije Universiteit Amsterdam |
Eiben, A.E. | VU Amsterdam |
Baraka, Kim | Vrije Universiteit Amsterdam |
Keywords: Human Factors and Human-in-the-Loop, Learning from Demonstration, Transfer Learning
Abstract: Transfer Learning (TL) is a powerful tool that enables robots to transfer learned policies across different environments, tasks, or embodiments. To further facilitate this process, efforts have been made to combine it with Learning from Demonstrations (LfD) for more flexible and efficient policy transfer. However, these approaches are almost exclusively limited to offline demonstrations collected before policy transfer starts, which may suffer from the intrinsic issue of covariate shift introduced by LfD and harm the performance of policy transfer. Meanwhile, extensive work in the learning-from-scratch setting has shown that online demonstrations can effectively alleviate covariate shift and lead to better policy performance with improved sample efficiency. This work combines these insights to introduce online demonstrations into a policy transfer setting. We present Policy Transfer with Online Demonstrations, an active LfD algorithm for policy transfer that can optimize the timing and content of queries for online episodic expert demonstrations under a limited demonstration budget. We evaluate our method in eight robotic scenarios, involving policy transfer across diverse environment characteristics, task objectives, and robotic embodiments, with the aim of transferring a trained policy from a source task to a related but different target task. The results show that our method significantly outperforms all baselines in terms of average success rate and sample efficiency, compared to two canonical LfD methods with offline demonstrations and one active LfD method with online demonstrations. Additionally, we conduct preliminary sim-to-real tests of the transferred policy on three transfer scenarios in a real-world environment, demonstrating the policy's effectiveness on a real robot manipulator.
|
|
11:35-11:40, Paper WeCT2.5 | |
User-Aware Collaborative Learning in Human-Robot Interactions |
|
Gucsi, Bálint | University of Southampton |
Tuyen, Nguyen Tan Viet | University of Southampton |
Chu, Bing | University of Southampton |
Tarapore, Danesh | University of Southampton |
Tran-Thanh, Long | University of Warwick |
Keywords: Social HRI, Human-Robot Teaming, Learning from Experience
Abstract: Our work investigates how social robots can efficiently collaborate with human users in a user-aware manner, minimising the frustration generated in human colleagues and thus enhancing their experience. As part of this, we develop a user-aware framework for human-robot collaborative learning. We model users' frustration during human-robot interactions based on recent interactions, inspired by psychological principles, and develop different frustration-aware interactive preference learning and decision-making models using multi-armed bandit and knapsack methods. Evaluating our approach, 1) we conducted simulated experiments on realistic human-behaviour datasets and 2) a user study in which participants worked with a TIAGo Steel humanoid robot on a collaboration task using frustration-aware and non-frustration-aware (Upper Confidence Bound and instruction-based) models. We demonstrate that when collaborating with the frustration-aware robot, users completed the collaboration task 9.04% faster and with 20.54% fewer verbal interactions, with user questionnaire responses reporting less frustration experienced compared to the baseline approaches. Additionally, we create a multimodal dataset containing over 6 hours of human-robot interactions displaying various explicit and implicit user responses.
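The abstract mentions multi-armed bandit decision making combined with a frustration model. A minimal editorial sketch follows, assuming a simple additive frustration penalty on the UCB index; the penalty form, constants, and simulated user are illustrative, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(2)

n_arms = 3                      # candidate robot actions
true_reward = np.array([0.4, 0.6, 0.5])   # unknown success probabilities
frustration = np.zeros(n_arms)  # running estimate of frustration each action causes

counts = np.zeros(n_arms)
values = np.zeros(n_arms)
beta = 0.5                      # weight of the frustration penalty (illustrative)

for t in range(1, 201):
    ucb = values + np.sqrt(2 * np.log(t) / np.maximum(counts, 1e-9))
    ucb[counts == 0] = np.inf                   # try every arm at least once
    arm = int(np.argmax(ucb - beta * frustration))

    reward = float(rng.random() < true_reward[arm])
    caused_frustration = float(rng.random() < 0.3 * (1 - true_reward[arm]))

    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
    frustration[arm] += (caused_frustration - frustration[arm]) / counts[arm]

print("selection counts per action:", counts)
```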
|
|
11:40-11:45, Paper WeCT2.6 | |
Data-Efficient Learning from Human Interventions for Mobile Robots |
|
Peng, Zhenghao | University of California, Los Angeles |
Liu, Zhizheng | SenseTime |
Zhou, Bolei | University of California, Los Angeles |
Keywords: Human Factors and Human-in-the-Loop, Reinforcement Learning, Learning from Demonstration
Abstract: Mobile robots are essential in applications such as autonomous delivery and hospitality services. Applying learning-based methods to address mobile robot tasks has gained popularity due to its robustness and generalizability. Traditional methods such as Imitation Learning (IL) and Reinforcement Learning (RL) offer adaptability but require large datasets, carefully crafted reward functions, and face sim-to-real gaps, making them challenging for efficient and safe real-world deployment. We propose an online human-in-the-loop learning method PVP4Real that combines IL and RL to address these issues. PVP4Real enables efficient real-time policy learning from online human intervention and demonstration, without reward or any pretraining, significantly improving data efficiency and training safety. We validate our method by training two different robots---a legged quadruped, and a wheeled delivery robot---in two mobile robot tasks, one of which even uses raw RGBD image as observation. The training finishes within 15 minutes. Our experiments show the promising future of human-in-the-loop learning in addressing the data efficiency issue in real-world robotic tasks. More information is available at: https://metadriverse.github.io/pvp4real/
|
|
WeCT3 |
303 |
Mechanism Design 3 |
Regular Session |
Chair: Tadakuma, Kenjiro | Osaka University |
Co-Chair: Sibai, Hussein | Washington University in St. Louis |
|
11:15-11:20, Paper WeCT3.1 | |
A Morphing Quadrotor-Blimp with Balloon Failure Resilience for Mobile Ecological Sensing |
|
Sharma, Suryansh | Delft University of Technology |
Verhoeff, Mike | TU Delft |
Joosen, Floor Elisabeth | Delft University of Technology |
Venkatesha Prasad, RangaRao | Delft University of Technology |
Hamaza, Salua | TU Delft |
Keywords: Failure Detection and Recovery, Sensor Fusion, Aerial Systems: Mechanics and Control
Abstract: The increasing popularity of helium-assisted blimps for extended monitoring or data collection applications is hindered by a critical limitation -- single-point failure when the balloon malfunctions or bursts. To address this, we introduce Janus, a hybrid blimp-drone platform equipped with integrated balloon failure detection and recovery capability. Janus employs a triggered mechanism that seamlessly transitions the platform from a blimp to a standard quad-rotor drone. Utilizing multiple sensors and fusing their readings, we have developed a robust balloon failure detection system. Janus demonstrates omnidirectional mobility in blimp mode and transitions promptly into quadrotor mode upon receiving the signal. Our results affirm the successful recovery of the system from balloon failure, with a rapid response time of 66ms to balloon failure detection. The drone morphs into a quadrotor and achieves recovery within 0.362 seconds in 90% of cases. By amalgamating the enduring flight capabilities of blimps with the agility of quad-rotors within a morphing platform like Janus, we cater to applications demanding both prolonged flight duration and enhanced agility.
|
|
11:20-11:25, Paper WeCT3.2 | |
A Novel Passive Parallel Elastic Actuation Principle for Load Compensation in Legged Robots |
|
Zhang, Yifang | Istituto Italiano Di Tecnologia |
Jiang, Jingcheng | Istituto Italiano Di Tecnologia |
Tsagarakis, Nikos | Istituto Italiano Di Tecnologia |
Keywords: Mechanism Design, Actuation and Joint Mechanisms
Abstract: This work introduces a novel parallel elastic actuation principle designed to provide torque compensation for legged robots. Unlike existing solutions, the proposed concept leverages a nitrogen (N2) gas spring combined with a cam roller module to generate a highly customizable torque compensation profile for the target leg joint. An optimization-based design approach is employed to derive the specifications of the gas spring and optimize the cam module to produce a compensation torque profile closest to the desired one. The proposed load compensation concept and related mechanism are experimentally evaluated and practically integrated into the knee joint of a two-DoF monopedal robot actuated by cycloid actuators. The experimental results demonstrate that the proposed principle can effectively generate the required compensation torque profile and achieve significant benefits for the prototyped monopedal robot system, reducing the additional energy consumption caused by the payload by 71.92%. The entire system is compact, easy to integrate, and highly customizable, enabling the creation of nonlinear torque compensation profiles as needed. This work provides a promising solution to load compensation in legged robots.
|
|
11:25-11:30, Paper WeCT3.3 | |
Mathematical Modeling and Rolling Motion Generation of Planar Seven-Link Robot That Forms Passive Closed and Active Open Chains |
|
Asano, Fumihiko | Japan Advanced Institute of Science and Technology |
Sedoguchi, Taiki | Japan Advanced Institute of Science and Technology |
Tokuda, Isao T. | Ritsumeikan University |
Keywords: Underactuated Robots, Mechanism Design, Motion Control
Abstract: This paper investigates the mathematical modeling and basic motion properties of planar seven-link robots that form passive closed and active open chains. The passive closed model is formed by connecting seven rigid frames via seven viscoelastic joints, and the active open model is formed by connecting them via actuated joints. The former is a convex heptagonal model and can exhibit passive-dynamic rolling on a gentle downhill, whereas the latter virtually forms a forward-leaning octagonal shape by controlling the six relative joint angles. In the first half of this paper, we describe the model assumptions, develop the mathematical equations of motion and collision of the passive closed model, and numerically analyze the motion characteristics by changing the slope angle while checking the conditions necessary for stable motion generation. In the second half, we outline the active open model, develop the PD control system, and numerically analyze the motion characteristics by changing the target angle parameter that controls the degree of forward lean of the virtual octagon.
|
|
11:30-11:35, Paper WeCT3.4 | |
LEVA: A High-Mobility Logistic Vehicle with Legged Suspension |
|
Arnold, Marco | ETH Zürich |
Hildebrandt, Lukas | ETH Zürich |
Janssen, Kaspar | ETH Zürich |
Ongan, Efe | Ethz - Rsl |
Bürge, Pascal | ZHAW / Zurich University of Applied Sciences |
Gábriel, Ádám Gyula | ETH Zürich |
Kennedy, James | ETH Zürich |
Lolla, Rishi | ETH Zurich |
Oppliger, Quanisha | ZHAW Zurich University of Applied Sciences |
Schaaf, Micha | ZHAW Zurich University of Applied Sciences |
Church, Joseph | ETH RSL |
Fritsche, Michael Xaver | ETH Zurich |
Klemm, Victor | ETH Zurich |
Tuna, Turcan | ETH Zurich, Robotic Systems Lab |
Valsecchi, Giorgio | Robotic System Lab, ETH |
Weibel, Cedric | ETH Zuerich |
Hutter, Marco | ETH Zurich |
Wüthrich, Michael | ZHAW Zurich University of Applied Sciences |
Keywords: Field Robots, Mechanism Design, Legged Robots
Abstract: The autonomous transportation of materials over challenging terrain remains an unsolved problem with major economic implications. This paper introduces LEVA, a high-payload, high-mobility robot designed for autonomous logistics across varied terrains, including those typical in agriculture, construction, and search and rescue operations. LEVA uniquely integrates an advanced legged suspension system using parallel kinematics. It is capable of traversing stairs using a reinforcement learning (RL) controller, has steerable wheels, and includes a specialized box pickup mechanism that enables autonomous payload loading as well as precise and reliable cargo transportation of up to 85 kg across uneven surfaces, steps, and inclines, while maintaining a cost of transport (CoT) as low as 0.15. Through extensive experimental validation, LEVA demonstrates its off-road capabilities and reliability in payload loading and transport.
|
|
11:35-11:40, Paper WeCT3.5 | |
Safe Decentralized Multi-Agent Control Using Black-Box Predictors, Conformal Decision Policies, and Control Barrier Functions |
|
Huriot, Sacha | Washington University in St. Louis |
Sibai, Hussein | Washington University in St. Louis |
Keywords: Robust/Adaptive Control, Robot Safety, Machine Learning for Robot Control
Abstract: We address the challenge of safe control in decentralized multi-agent robotic settings, where agents use uncertain black-box models to predict other agents' trajectories. We use the recently proposed conformal decision theory to adapt the restrictiveness of control barrier function-based safety constraints based on observed prediction errors. We use these constraints to synthesize controllers that balance the objectives of safety and task accomplishment despite the prediction errors. We provide an upper bound on the average over time of the value of a monotonic function of the difference between the safety constraint based on the predicted trajectories and the constraint based on the ground-truth ones. We validate our theory through experimental results showing the performance of our controllers when navigating a robot in the multi-agent scenes of the Stanford Drone Dataset.
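A minimal editorial sketch of the conformal-style adaptation described above: a restrictiveness parameter is updated online from observed constraint violations and used to tighten a distance-based safety margin. The update rule, the braking rule, and all constants are simplified illustrations, not the authors' formulation.

```python
import numpy as np

rng = np.random.default_rng(3)

eta = 0.05       # adaptation rate of the conformal controller
epsilon = 0.1    # acceptable violation (miscoverage) rate
lam = 0.0        # restrictiveness: extra margin added to the safety constraint
d_safe = 1.0     # required true clearance to the other agent

violations = 0
n_steps = 500
for step in range(n_steps):
    predicted_gap = 1.2 + rng.normal(scale=0.05)   # black-box predicted clearance
    # Simplified control rule: if the tightened constraint would be violated,
    # the robot brakes, which buys extra clearance before the gap is realized.
    brake = predicted_gap < d_safe + max(lam, 0.0)
    true_gap = predicted_gap + (0.3 if brake else 0.0) + rng.normal(scale=0.2)

    violated = true_gap < d_safe
    violations += int(violated)
    # Conformal update: inflate lam after a violation, relax it slowly otherwise.
    lam += eta * (float(violated) - epsilon)

print(f"lam = {lam:.3f}, empirical violation rate = {violations / n_steps:.3f}")
```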
|
|
11:40-11:45, Paper WeCT3.6 | |
Poloidal Drive: Direct-Drive Transmission Mechanism for Active Omni-Wheels with Spoke Interference Avoidance |
|
Sano, Shunsuke | Osaka University |
Tadakuma, Kenjiro | Osaka University |
Kayawake, Ryotaro | Tohoku University |
Watanabe, Masahiro | Osaka University |
Abe, Kazuki | Osaka University |
Kemmotsu, Yuto | Tohoku University |
Tadokoro, Satoshi | Tohoku University |
Keywords: Mechanism Design, Wheeled Robots
Abstract: Wheels require extra space for steering. Omnidirectional wheels are ideal for confined spaces as they can move in all directions: forward/backward and left/right. Conventional omnidirectional wheels with passive rollers achieve this movement by combining multiple wheels. However, if even one wheel loses contact with the ground, the vehicle becomes inoperable. To overcome this limitation, omnidirectional wheels with actively driven rollers have been proposed. These designs, however, require additional components, which increase weight. This is because multi-step intermediate transmission mechanisms are needed to convert spindle rotation into roller rotation. Eliminating the intermediate transmission mechanism reduces the number of components and provides more space to enhance wheel strength. This study proposed a mechanism without intermediate transmission, clarified its design framework, and experimentally demonstrated its feasibility as an active omnidirectional wheel. The proposed design framework defines conditions to maximize both power transmission efficiency and strength. Experimental results showed that the transmission efficiency of the proposed mechanism is comparable to that of conventional mechanisms.
|
|
WeCT4 |
304 |
Sensor Fusion 2 |
Regular Session |
Co-Chair: Song, Dezhen | Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Texas A&M University (TAMU) |
|
11:15-11:20, Paper WeCT4.1 | |
An End-To-End Learning-Based Multi-Sensor Fusion for Autonomous Vehicle Localization |
|
Lin, Changhong | DiDi Autonomous Driving |
Lin, Jiarong | The University of Hong Kong |
Sui, Zhiqiang | University of Michigan |
Qu, Xiaozhi | Didichuxing |
Wang, Rui | DiDi Autonomous Driving |
Sheng, Kehua | DIdi Inc |
Zhang, Bo | DIdi Inc |
Keywords: Localization, Sensor Fusion
Abstract: Multi-sensor fusion is essential for autonomous vehicle localization, as it is capable of integrating data from various sources for enhanced accuracy and reliability. The accuracy of the integrated location and orientation depends on the precision of the uncertainty modeling. Traditional methods of uncertainty modeling typically assume a Gaussian distribution and involve manual heuristic parameter tuning. However, these methods struggle to scale effectively and address long-tail scenarios. To address these challenges, we propose a learning-based method that encodes sensor information using higher-order neural network features, thereby eliminating the need for explicit uncertainty estimation. The method also removes the need for manual parameter fine-tuning by employing an end-to-end neural network specifically designed for multi-sensor fusion. In our experiments, we demonstrate the effectiveness of our approach in real-world autonomous driving scenarios. Results show that the proposed method outperforms existing multi-sensor fusion methods in terms of both accuracy and robustness. A video of the results can be viewed at https://youtu.be/q4iuobMbjME.
|
|
11:20-11:25, Paper WeCT4.2 | |
Unleashing HyDRa: Hybrid Fusion, Depth Consistency and Radar for Unified 3D Perception |
|
Wolters, Philipp | Technical University of Munich |
Gilg, Johannes | Technical University of Munich |
Teepe, Torben | Technical University of Munich |
Herzog, Fabian | Technical University of Munich |
Laouichi, Anouar | Technical University of Munich |
Hofmann, Martin | Fusionride Technology (Germany) GmbH |
Rigoll, Gerhard | Technische Universität München |
Keywords: Sensor Fusion, Semantic Scene Understanding, Object Detection, Segmentation and Categorization
Abstract: Low-cost, vision-centric 3D perception systems for autonomous driving have made significant progress in recent years, narrowing the gap to expensive LiDAR-based methods. The primary challenge in becoming a fully reliable alternative lies in robust depth prediction capabilities, as camera-based systems struggle with long detection ranges and adverse lighting and weather conditions. In this work, we introduce HyDRa, a novel camera-radar fusion architecture for diverse 3D perception tasks. Building upon the principles of dense Bird's-Eye-View (BEV)-based architectures, HyDRa introduces a hybrid fusion approach to combine the strengths of complementary camera and radar features in two distinct representation spaces. Our Height Association Transformer module leverages radar features already in the perspective view to produce more robust and accurate depth predictions. In the BEV, we refine the initial sparse representation by a Radar-weighted Depth Consistency. HyDRa achieves a new state-of-the-art for camera-radar fusion of 64.2 NDS (+1.8) and 58.4 AMOTA (+1.5) on the public nuScenes dataset. Moreover, our new semantically rich and spatially accurate BEV features can be directly converted into a powerful occupancy representation, beating all previous camera-based methods on the Occ3D benchmark by an impressive 3.7 mIoU. Code and models are available at https://github.com/phi-wol/hydra.
|
|
11:25-11:30, Paper WeCT4.3 | |
VIP-Dock: Vision, Inertia, and Pressure Sensor Fusion for Underwater Docking with Optical Beacon Guidance |
|
Zhang, Suohang | Zhejiang University |
Qian, Shipang | Zhejiang University |
Wang, Lu | Zhejiang University |
Fei, Xinyu | Zhejiang University |
Chen, Yanhu | Zhejiang University |
Keywords: Sensor Fusion, Marine Robotics, Sensor-based Control
Abstract: Underwater docking enhances the operational capabilities of Autonomous Underwater Vehicles (AUVs) by facilitating energy and data transfer. Optical beacons serve as the primary guidance method for AUVs to localize and track docking stations. This paper presents VIP-Dock, a novel optical beacon tracking algorithm for robust underwater docking of AUVs. VIP-Dock addresses the challenge of maintaining accurate beacon tracking under visual interference by integrating visual, inertial, and pressure perception. Employing an unscented Kalman filter framework, the VIP-Dock algorithm provides continuous optimal estimation of beacon positions. Experimental results demonstrated VIP-Dock's real-time tracking performance in actual docking scenarios and its ability to maintain accuracy during visual input failure. Implementation in a digital twin system for an underwater vertical shuttle showed significant improvement, increasing docking success rates from 62% to 84% across 100 trials under simulated current disturbances.
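The paper fuses visual, inertial, and pressure data in an unscented Kalman filter. As a simplified editorial stand-in, the sketch below fuses an intermittent, noisy visual beacon measurement with a frequent pressure-derived depth measurement in a plain linear Kalman filter; it conveys the predict/update structure without the sigma-point machinery, and all models and numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(4)

# State: [x, z] beacon position relative to the AUV (planar toy example).
x = np.array([5.0, 2.0])
P = np.eye(2) * 1.0
F = np.eye(2)                         # beacon assumed static between steps
Q = np.eye(2) * 0.01                  # process noise (drift of the relative pose)

H_cam = np.eye(2)                     # camera observes both coordinates
R_cam = np.eye(2) * 0.25
H_prs = np.array([[0.0, 1.0]])        # pressure sensor constrains depth only
R_prs = np.array([[0.01]])

def kf_update(x, P, z, H, R):
    """Standard Kalman measurement update."""
    y = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return x + K @ y, (np.eye(len(x)) - K @ H) @ P

for k in range(50):
    x, P = F @ x, F @ P @ F.T + Q                       # predict
    z_prs = np.array([2.0 + rng.normal(scale=0.05)])    # pressure-derived depth
    x, P = kf_update(x, P, z_prs, H_prs, R_prs)
    if k % 5 == 0:                                      # vision only sometimes valid
        z_cam = np.array([5.0, 2.0]) + rng.normal(scale=0.5, size=2)
        x, P = kf_update(x, P, z_cam, H_cam, R_cam)

print("fused beacon estimate:", np.round(x, 2))
```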
|
|
11:30-11:35, Paper WeCT4.4 | |
Heterogeneous Sensor Fusion and Active Perception for Transparent Object Reconstruction with a PDM^2 Sensor and a Camera |
|
Guo, Fengzhi | Texas A&M University |
Xie, Shuangyu | Texas A&M University |
Wang, Di | Texas A&M University |
Fang, Cheng | Texas A&M University |
Zou, Jun | Texas A&M University |
Song, Dezhen | Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) |
Keywords: Sensor Fusion, Perception for Grasping and Manipulation
Abstract: Transparent household objects present a challenge for domestic service robots, since neither regular cameras nor RGB-D cameras can provide accurate points for shape reconstruction. The new type of pretouch dual-modality distance and material sensor (PDM^2) can provide reliable and accurate depth readings, but it is a point sensor and scanning the object exclusively with the sensor is too inefficient. Hence, we present a sensor fusion approach by combining a regular camera with the PDM^2 sensor. The approach is based on a data fusion algorithm for shape reconstruction and an active perception algorithm for scan planning for the PDM^2 sensor. The data fusion algorithm is a distributed Gaussian process (GP)-based shape reconstruction method that allows for incremental local update to reduce computational time. The active perception algorithm is an optimization-based approach by increasing the information gain (IG) and prioritizing the boundary points under a preset travel distance constraint. We have implemented and tested the algorithms with six different transparent household items. The results show satisfactory shape reconstruction results in all test cases with an average increase in intersection over union (IoU) from 0.73 to 0.96.
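A minimal sketch of the two ingredients named in the abstract, using scikit-learn's GaussianProcessRegressor: fit a GP to sparse point measurements and pick the next scan location where the predictive standard deviation (a proxy for information gain) is largest. The 1-D profile, kernel choice, and absence of the travel-distance constraint are editorial simplifications of the paper's distributed GP and planner.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(5)

def surface(x):
    """Hypothetical 1-D height profile of a transparent object."""
    return 0.3 * np.sin(3 * x) + 0.05 * x

# Sparse pretouch-style point measurements along the object.
X_train = rng.uniform(0, 2, size=(6, 1))
y_train = surface(X_train[:, 0]) + rng.normal(scale=0.005, size=6)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3) + WhiteKernel(1e-4),
                              normalize_y=True)
gp.fit(X_train, y_train)

# Candidate scan locations; predictive std acts as the information-gain proxy.
X_cand = np.linspace(0, 2, 200).reshape(-1, 1)
mean, std = gp.predict(X_cand, return_std=True)
next_scan = X_cand[np.argmax(std), 0]

print(f"next scan location with highest uncertainty: x = {next_scan:.2f}")
```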
|
|
11:35-11:40, Paper WeCT4.5 | |
DA-Fusion: Deformable Attention-Based RGB-D Fusion Transformer for Unseen Object Instance Segmentation |
|
Park, Yesol | Seoul National University |
Yoon, Hye Jung | Seoul National University |
Kim, Juno | Seoul National University |
Zhang, Byoung-Tak | Seoul National University |
Keywords: Logistics, Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception
Abstract: In logistics automation, accurately segmenting unseen objects is essential for tasks such as bin picking, shelf picking, and warehouse sorting, which involve complex and cluttered environments. Traditional RGB-based methods tend to over-segment objects due to their reliance on texture, while depth-based methods often under-segment by focusing primarily on geometric features. To address these limitations, we propose DA-Fusion, a deformable attention-based RGB-D fusion Transformer designed for unseen object instance segmentation. DA-Fusion effectively combines the strengths of both RGB and depth data, enhancing segmentation accuracy in cluttered and multi-layered object environments. We also introduce the Object Clutter Bin Dataset (OCBD), a benchmark dataset specifically tailored for evaluating bin-picking scenarios in top-down views. Extensive evaluations demonstrate that DA-Fusion outperforms state-of-the-art methods across diverse environments, making it particularly suited for real-world logistics tasks.
|
|
11:40-11:45, Paper WeCT4.6 | |
PAIR360: A Paired Dataset of High-Resolution 360˚ Panoramic Images and LiDAR Scans |
|
Kim, Geunu | Kyung Hee University |
Kim, Daeho | Kyung Hee University |
Jang, Jaeyun | Kyung Hee University |
Hwang, Hyoseok | Kyung Hee University |
Keywords: Data Sets for SLAM, Sensor Fusion, Omnidirectional Vision
Abstract: The 360˚ camera is a compact omnidirectional perception system for capturing panoramic images with the same field of view as LiDAR. This boosts its versatility for use in autonomous driving and robotics. However, most existing datasets of 360˚ panoramic images primarily focus on indoor or virtual environments, or they offer only low-resolution outdoor images and LiDAR configurations. In this letter, we present PAIR360, a multi-modal dataset encompassing high-resolution 360˚ camera images and 3D LiDAR scans, aimed at stimulating research in computer vision. To this end, we collected a comprehensive dataset at Kyung Hee University Global Campus, capturing 52 sequences from 7 different areas under diverse atmospheric conditions, including sunny, cloudy, and sunrise. The dataset features 8K resolution panoramic imagery, six fisheye images, point clouds, GPS, and IMU data, all synchronized using LiDAR timestamps and calibrated across visual sensors. We also provide additional data, such as depth maps, segmentation, and 3D maps, to demonstrate the feasibility of our dataset and its application to various computer vision tasks. The dataset is available for download at: https://airlabkhu.github.io/PAIR-360-Dataset/
|
|
WeCT5 |
305 |
Aerial Manipulation 2 |
Regular Session |
Chair: Katzschmann, Robert Kevin | ETH Zurich |
Co-Chair: Panetsos, Fotis | New York University Abu Dhabi |
|
11:15-11:20, Paper WeCT5.1 | |
NDOB-Based Control of a UAV with Delta-Arm Considering Manipulator Dynamics |
|
Chen, Hongming | Sun Yat-Sen University |
Ye, Biyu | Sun Yat-Sen University |
Liang, Xianqi | Sun Yat-Sen University |
Deng, Weiliang | Sun Yat-Sen University |
Lyu, Ximin | Sun Yat-Sen University |
Keywords: Aerial Systems: Mechanics and Control
Abstract: Aerial Manipulators (AMs) provide a versatile platform for various applications, including 3D printing, architecture, and aerial grasping missions. However, their operational speed is often sacrificed to uphold precision. Existing control strategies for AMs often regard the manipulator as a disturbance and employ robust control methods to mitigate its influence. This research focuses on elevating the precision of the end effector and enhancing the agility of aerial manipulator movements. We present a composite control scheme to address these challenges. Initially, a Nonlinear Disturbance Observer (NDOB) is utilized to compensate for internal coupling effects and external disturbances. Subsequently, manipulator dynamics are processed through a high pass filter to facilitate agile movements. By integrating the proposed control method into a fully autonomous delta-arm-based AM system, we substantiate the controller's efficacy through extensive real-world experiments. The outcomes illustrate that the end-effector can achieve accuracy at the millimeter level.
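To illustrate the disturbance-observer idea at the heart of such a controller, here is a simplified first-order disturbance observer on a 1-D double integrator with an unknown constant force; the gains, dynamics, and filter structure are editorial illustrations and much simpler than the NDOB used on the full aerial manipulator.

```python
import numpy as np

# Simplified 1-D double integrator with an unknown constant disturbance d,
# compensated by a first-order disturbance observer with d_hat = z + L * vel.
m, dt, L = 1.5, 0.002, 20.0       # mass, time step, observer gain (illustrative)
d_true = 0.8                      # unknown disturbance force

pos, vel = 0.0, 0.0
z = 0.0                           # observer internal state

kp, kd = 25.0, 10.0               # PD tracking gains
pos_ref = 1.0

for k in range(5000):             # simulate 10 s
    d_hat = z + L * vel
    # PD tracking control with disturbance compensation.
    u = m * (kp * (pos_ref - pos) - kd * vel) - d_hat

    acc = (u + d_true) / m
    # Observer update for vel_dot = (u + d)/m with auxiliary term p(vel) = L*vel.
    z += dt * (-(L / m) * (z + L * vel) - (L / m) * u)
    vel += dt * acc
    pos += dt * vel

print(f"final position: {pos:.3f}, disturbance estimate: {z + L * vel:.3f}")
```

With the observer running, the estimate converges exponentially to the true disturbance, so the PD loop sees an almost disturbance-free plant.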
|
|
11:20-11:25, Paper WeCT5.2 | |
Flapping-Wing Flying Robot with Integrated Dual-Arm Scissors-Type Flora Sampling System |
|
Gordillo Durán, Rodrigo | Universidad De Sevilla |
Tapia, Raul | University of Seville |
Rafee Nekoo, Saeed | GRVC Robotics Lab, Universidad De Sevilla |
Martinez-de Dios, J.R. | University of Seville |
Ollero, Anibal | AICIA. G41099946 |
Keywords: Aerial Systems: Applications, Mechanism Design, Computer Vision for Automation
Abstract: Flapping-wing robotic birds, inspired by nature, offer an alternative means of generating thrust and lift to the conventional high-speed rotary propellers of unmanned aerial platforms. Recent advances in flapping technology have led to prototypes of leg-claw mechanisms for perching and, occasionally, very lightweight arms for sampling or aerial manipulation of tiny objects. A dual-arm manipulator on top of a robotic bird might not be bio-inspired or safe in case of a collision with the environment or during human-robot interaction. In this work, the previously designed dual-arm scissors-type manipulator has been improved in terms of workspace, mechanism, vision system, and blade placement to provide a more natural way of sampling. The new dual-arm system, weighing 100.2 g, is redesigned inside a beak to protect it against possible collisions and to secure the cutting blades within a protective shield. During flight, the dual-arm system is inside the cover and invisible; before manipulation, the lower beak opens and deploys the arm in a proper position for sampling. This new safety cover (beak), along with the new blade mechanism, enhances the cutting power and the safety of the operation. The experimental results show the successful cutting of a series of plant samples.
|
|
11:25-11:30, Paper WeCT5.3 | |
Reliable Aerial Manipulation: Combining Visual Tracking with Range Sensing for Robust Grasping |
|
Blöchlinger, Marc | ETHZ |
Toshimitsu, Yasunori | ETH Zurich |
Katzschmann, Robert Kevin | ETH Zurich |
Keywords: Aerial Systems: Perception and Autonomy, Aerial Systems: Mechanics and Control, Mobile Manipulation
Abstract: Reliable object localization is a critical challenge in drone-based aerial manipulation, particularly when objects are outside the camera's field of view. This paper presents a new approach to enhance drone reliability in aerial grasping tasks by integrating a 1D time-of-flight range sensor with a vision-based localization system. The range sensor, positioned beneath the drone, generates a detailed point cloud of the ground beneath the drone, allowing for precise object localization even when the drone hovers directly above the target. By combining visual tracking with real-time distance measurements, our system achieves a 96% grasp success rate across 128 trials with diverse objects, representing a significant improvement over previous approaches. This method enables zero-shot grasping without prior knowledge of the objects, increasing versatility and robustness in complex, unstructured environments. The open-source software and hardware design of the platform provides a foundation for further research and development in the field of autonomous aerial manipulation.
|
|
11:30-11:35, Paper WeCT5.4 | |
Safety-Critical Control for Aerial Physical Interaction in Uncertain Environment |
|
Byun, Jeonghyun | Seoul National University |
Kim, Yeonjoon | Seoul National University |
Lee, Dongjae | Seoul National University |
Kim, H. Jin | Seoul National University |
Keywords: Aerial Systems: Mechanics and Control, Robot Safety, Robust/Adaptive Control
Abstract: Aerial manipulation for safe physical interaction with their environments is gaining significant momentum in robotics research. In this paper, we present a disturbance-observer-based safety-critical control for a fully actuated aerial manipulator interacting with both static and dynamic structures. Our approach centers on a safety filter that dynamically adjusts the desired trajectory of the vehicle's pose, accounting for the aerial manipulator's dynamics, the disturbance observer's structure, and motor thrust limits. We provide rigorous proof that the proposed safety filter ensures the forward invariance of the safety set—representing motor thrust limits—even in the presence of disturbance estimation errors. To demonstrate the superiority of our method over existing control strategies for aerial physical interaction, we perform comparative experiments involving complex tasks, such as pushing against a static structure and pulling a plug firmly attached to an electric socket. Furthermore, to highlight its repeatability in scenarios with sudden dynamic changes, we perform repeated tests of pushing a movable cart and extracting a plug from a socket. These experiments confirm that our method not only outperforms existing methods but also excels in handling tasks with rapid dynamic variations.
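As a rough illustration of what respecting motor thrust limits inside a safety filter can look like, the snippet below solves a small quadratic program with cvxpy: stay as close as possible to a desired wrench while obeying per-motor thrust bounds through a fixed allocation matrix. The allocation matrix, bounds, and the absence of the disturbance-observer coupling and trajectory adjustment are simplifications and not the paper's formulation.

```python
import numpy as np
import cvxpy as cp

# Toy vehicle with 4 motors; a fixed allocation matrix maps per-motor thrusts
# to a 2-DoF wrench [Fz, My]. All numbers are illustrative only.
A = np.array([[1.0,  1.0, 1.0,  1.0],     # total vertical thrust Fz
              [0.2, -0.2, 0.2, -0.2]])    # pitching moment My from differential thrust
f_min, f_max = 0.0, 6.0                   # per-motor thrust limits [N]

# Desired wrench from a nominal tracking law; Fz = 26 N is not achievable with
# four 6 N motors, so the filter must return the closest feasible command.
w_des = np.array([26.0, 1.5])

f = cp.Variable(4)
objective = cp.Minimize(cp.sum_squares(A @ f - w_des))
constraints = [f >= f_min, f <= f_max]
cp.Problem(objective, constraints).solve()

print("filtered motor thrusts:", np.round(f.value, 2))
print("achieved wrench:       ", np.round(A @ f.value, 2))
```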
|
|
11:35-11:40, Paper WeCT5.5 | |
SPIBOT: A Drone-Tethered Mobile Gripper for Robust Aerial Object Retrieval in Dynamic Environments |
|
Kang, Gyuree | Korea Advanced Institute of Science and Technology (KAIST) |
Guenes, Ozan | Korea Advanced Institute of Science and Technology |
Lee, Seungwook | Korea Advanced Institute of Science and Technology |
Azhari, Maulana Bisyir | Korea Advanced Institute of Science and Technology |
Shim, David Hyunchul | KAIST |
Keywords: Aerial Systems: Applications, Field Robots, Marine Robotics
Abstract: In real-world field operations, aerial grasping systems face significant challenges in dynamic environments due to strong winds, shifting surfaces, and the need to handle heavy loads. Particularly when dealing with heavy objects, the powerful propellers of the drone can inadvertently blow the target object away as it approaches, making the task even more difficult. To address these challenges, we introduce SPIBOT, a novel drone-tethered mobile gripper system designed for robust and stable autonomous target retrieval. SPIBOT operates via a tether, much like a spider, allowing the drone to maintain a safe distance from the target. To ensure both stable mobility and secure grasping capabilities, SPIBOT is equipped with six legs and sensors to estimate the robot's and mission's states. It is designed with a reduced volume and weight compared to other hexapod robots, allowing it to be easily stowed under the drone and reeled in as needed. Designed for the 2024 MBZIRC Maritime Grand Challenge, SPIBOT is built to retrieve a 1kg target object in the highly dynamic conditions of the moving deck of a ship. This system integrates a real-time action selection algorithm that dynamically adjusts the robot's actions based on proximity to the mission goal and environmental conditions, enabling rapid and robust mission execution. Experimental results across various terrains, including a pontoon on a lake, a grass field, and rubber mats on coastal sand, demonstrate SPIBOT's ability to efficiently and reliably retrieve targets. SPIBOT swiftly converges on the target and completes its mission, even when dealing with irregular initial states and noisy information introduced by the drone.
|
|
11:40-11:45, Paper WeCT5.6 | |
GP-Based NMPC for Aerial Transportation of Suspended Loads |
|
Panetsos, Fotis | New York University Abu Dhabi |
Karras, George | University of Thessaly |
Kyriakopoulos, Kostas | New York University - Abu Dhabi |
Keywords: Aerial Systems: Applications, Field Robots, Motion Control
Abstract: In this work, we leverage Gaussian Processes (GPs) and present a learning-based control scheme for the transportation of cable-suspended loads with multirotor Unmanned Aerial Vehicles (UAVs). Our ultimate goal is to approximate the model discrepancies that exist between the actual and nominal system dynamics. To this end, weighted and sparse Gaussian Process (GP) regression is used to approximate the model errors online, guaranteeing real-time performance while remaining adaptable to the conditions of the outdoor environment where the UAV is deployed. The learned model errors are fed into a nonlinear Model Predictive Controller (NMPC), formulated for the corrected system dynamics, which drives the UAV to reference positions while simultaneously minimizing the cable's angular motion, regardless of outdoor conditions and external disturbances, primarily stemming from unknown wind. The proposed scheme is validated through simulations and real-world experiments with an octorotor, demonstrating an 80% reduction in the steady-state position error under 4 Beaufort wind conditions compared to the nominal NMPC.
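For readers unfamiliar with the residual-learning idea in this abstract, the sketch below fits a GP to the discrepancy between measured and nominal accelerations and adds the GP prediction back into the model an NMPC would use. It is a minimal offline illustration, not the authors' weighted, sparse, online formulation; the nominal model and feature choice here are placeholders.

```python
# Minimal sketch (not the authors' implementation): learn an additive model-error
# term d(x, u) = a_measured - a_nominal(x, u) with a GP, then add its prediction
# back into the dynamics handed to an NMPC. `nominal_accel` is a placeholder for
# whatever nominal UAV-plus-load model is available.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def nominal_accel(v, u):
    # placeholder nominal dynamics: thrust along z minus gravity
    return np.array([0.0, 0.0, u[0] - 9.81])

# training data: flight features and the accelerations actually measured in flight
X_train = np.random.rand(200, 4)                          # e.g. [vx, vy, vz, thrust]
a_nom = np.stack([nominal_accel(x[:3], x[3:]) for x in X_train])
a_meas = a_nom + 0.3 * np.sin(X_train[:, :3])             # synthetic wind-like residual
residuals = a_meas - a_nom

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(1e-3))
gp.fit(X_train, residuals)

def corrected_accel(v, u):
    """Dynamics used by the predictive controller: nominal model plus learned residual."""
    d = gp.predict(np.concatenate([v[:3], u])[None, :])[0]
    return nominal_accel(v, u) + d
```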
|
|
WeCT6 |
307 |
Vision-Based Navigation 3 |
Regular Session |
Co-Chair: Chang, Yan | Nvidia |
|
11:15-11:20, Paper WeCT6.1 | |
Knowledge-Driven Visual Target Navigation: Dual Graph Navigation |
|
Li, Shiyao | Dalian University of Technology |
Meng, Ziyang | Dalian University of Technology |
Pei, JianSong | Dalian University of Technology |
Chen, Jiahao | Institute of Automation, Chinese Academy of Sciences |
Dong, BingCheng | Dalian University of Technology |
Li, Guangsheng | Dalian University of Technology |
Liu, Shenglan | Dalian University of Technology |
Wang, Feilong | Dalian University of Technology |
Keywords: Vision-Based Navigation, Semantic Scene Understanding, Robotics in Under-Resourced Settings
Abstract: In unknown environments, navigating a robot by a given image to a specific location or instance is critical and challenging. The existing end-to-end approaches require simultaneous implicit learning of multiple subtasks, and modular approaches depend on metric information. Both approaches face high computational demands, often leading to difficulties in real-time updates and limited generalization, making them challenging to implement on resource-constrained devices. To address these challenges, we propose Dual Graph Navigation (DGN), a knowledge-driven, lightweight image instance navigation framework. DGN builds an External Knowledge Graph (EKG) from small-scale datasets to capture prior object correlations, efficiently guiding target exploration. During exploration, DGN builds an Internal Knowledge Graph (IKG) using an instance-aware module, which records explored objects based on reachability relationships rather than precise metric information. The IKG dynamically updates the EKG, enhancing the robot's adaptability to the current environment. Together, they realize topological perception and reduce computational overhead. Furthermore, unlike approaches characterized by over-dependence between components, DGN employs a plug-and-play modular design that allows independent training and flexible replacement of functional modules, effectively enhancing generalization performance while reducing training and deployment costs. Experiments illustrate that DGN generalizes well in different simulation environments (AI2-THOR, Habitat), achieving state-of-the-art performance on the ProcTHOR-10K dataset. It is compatible with three distinct real-world robot platforms, including edge computing devices without CUDA support. Its decision-making is 3.8 to 5.5 times faster than that of baseline methods. Further details can be found on the project page: https://dogplanningloyo.github.io/DGN/
|
|
11:20-11:25, Paper WeCT6.2 | |
Learning to Predict the Future from Monocular Vision for Efficient Human-Aware Navigation |
|
Huang, Yushuang | Institute of Computing Technology, Chinese Academy of Sciences |
Jiang, Hao | Institute of Computing Technology, Chinese Academy of Sciences |
Liu, Zihan | Institute of Computing Technology of the Chinese Academy of Scie |
Ouyang, Wanli | The University of Sydney |
Wang, Zhaoqi | Institute of Computing Technology, the Chinese Academy of Scienc |
Keywords: Vision-Based Navigation, Deep Learning for Visual Perception, AI-Enabled Robotics
Abstract: Human-aware navigation (HAN) aims to build autonomous agents that robustly and naturally navigate in human-centered environments. Due to the complex and dynamic nature of this task, existing approaches typically rely on sophisticated pipelines that separately process perception and decision-making to solve it. In this work, we propose an Obstruction Distance Vector based End-to-End Model (ODVEEM), using monocular vision for navigation around humans. The Obstruction Distance Vector (ODV) is an intermediate representation in our model, leveraged to describe the Obstruction Distance to the first future collision in all possible directions in the horizontal field of view. As ODV cannot be calculated directly in the real world, we design a neural network for ODV estimation, formulating it as a classification problem with auxiliary proxy tasks, which play a key role in effectively predicting the implicit future motion of nearby humans. Taking advantage of ODV, ODVEEM supervised by human behavioral heuristics is employed to guide the agent to reach a goal efficiently and avoid potential collisions. Several challenging experiments show our method's substantial improvement over a number of baseline methods, attaining solid performance with zero-shot transfer to unseen simulated and real-world environments.
|
|
11:25-11:30, Paper WeCT6.3 | |
DP-Habitat: Bridging the Gap between Simulation and Reality for Visual Navigation in Dynamic Pedestrian Environments |
|
Qin, Liang | University of Science and Technology of China |
Wang, Min | Hefei Comprehensive National Science Center |
Wang, Haodong | University of Science and Technology of China |
Zhou, Wengang | University of Science and Technology of China |
Li, Houqiang | University of Science and Technology of China |
Keywords: Vision-Based Navigation
Abstract: Visual navigation in dynamic environments poses a considerable challenge, particularly in scenarios with diverse pedestrian behaviors. Traditional simulators primarily focus on static scenes, while existing dynamic pedestrian simulators often suffer from limitations such as monotonous pedestrian models, lack of interaction with the environment, and constrained scenarios. These deficiencies lead to notable discrepancies from real-world dynamic pedestrian environments. To bridge this gap, we introduce DP-Habitat, a dynamic pedestrian simulator developed on the Habitat platform. DP-Habitat efficiently simulates a wide range of complex and realistic human behaviors, with flexible interactions between pedestrian models and environments. It also supports rapid deployment of pedestrian models across various scenes, thereby more accurately replicating the complexities of real-world dynamic pedestrian settings. Additionally, we present Adaptive Object Navigation with Dynamic Mapping (AON-DM), a novel baseline method specifically designed for dynamic pedestrian settings. AON-DM integrates real-time pedestrian tracking and predictive modeling with a hybrid path planning strategy, markedly improving navigation efficiency and success rates. Our experimental results reveal that dynamic pedestrians significantly affect visual navigation performance within DP-Habitat, with AON-DM achieving superior effectiveness compared to existing methods under these challenging conditions. Furthermore, our approach maintains high performance in real-world scenarios, highlighting its practical applicability and robustness. The code and data are available at https://github.com/qinliangql/DP-Habitat.git
|
|
11:30-11:35, Paper WeCT6.4 | |
X-MOBILITY: End-To-End Generalizable Navigation Via World Modeling |
|
Liu, Wei | Nvidia |
Zhao, Huihua | Georgia Tech |
Li, Chenran | University of California, Berkeley |
Biswas, Joydeep | University of Texas at Austin |
Okal, Billy | University of Freiburg |
Goyal, Pulkit | Nvidia |
Chang, Yan | Nvidia |
Pouya, Soha | Stanford University |
Keywords: Vision-Based Navigation, Learning from Demonstration, Probabilistic Inference
Abstract: General-purpose navigation in challenging environments remains a significant problem in robotics, with current state-of-the-art approaches facing myriad limitations. Classical approaches struggle with cluttered settings and require extensive tuning, while learning-based methods face difficulties generalizing to out-of-distribution environments. This paper introduces X-Mobility, an end-to-end generalizable navigation model that overcomes existing challenges by leveraging three key ideas. First, X-Mobility employs an auto-regressive world modeling architecture with a latent state space to capture world dynamics. Second, a diverse set of multi-head decoders enables the model to learn a rich state representation that correlates strongly with effective navigation skills. Third, by decoupling world modeling from action policy, our architecture can train effectively on a variety of data sources, both with and without expert policies: off-policy data allows the model to learn world dynamics, while on-policy data with supervisory control enables optimal action policy learning. Through extensive experiments, we demonstrate that X-Mobility not only generalizes effectively but also surpasses current state-of-the-art navigation approaches. Additionally, X-Mobility also achieves zero-shot Sim2Real transferability and shows strong potential for cross-embodiment generalization. Project page: https://nvlabs.github.io/X-MOBILITY
|
|
11:35-11:40, Paper WeCT6.5 | |
Map-SemNav: Advancing Zero-Shot Continuous Vision-And-Language Navigation through Visual Semantics and Map Integration |
|
Wu, Shuai | Tianjin University |
Liu, Ruonan | Shanghai Jiao Tong University |
Xie, Zongxia | Tianjin University |
Pang, Zhibo | KTH Royal Institute of Technology |
Keywords: Vision-Based Navigation, Agent-Based Systems, Autonomous Agents
Abstract: This paper explores zero-shot Vision-and-Language Navigation (VLN), enabling agents to generalize navigation to unseen data classes. Most current approaches rely on large models, but these are not specifically tailored for VLN, lacking direct learning from navigation environments and slowing down agents due to their overwhelming size. To tackle this, we propose Map-Semantic Zero-shot Navigation (Map-SemNav), which does not rely on large models for navigation planning. Map-SemNav utilizes three key cues: direction, object, and scene, to acquire relational knowledge instead of memorizing specific classes, which enables generalization to unseen data. Direction is guided by a top-down semantic map, while object and scene information is decoupled from environment knowledge. Extensive experiments demonstrate that Map-SemNav outperforms state-of-the-art large model-based methods in zero-shot VLN tasks within continuous environments, while also offering higher efficiency due to its simplified architecture.
|
|
11:40-11:45, Paper WeCT6.6 | |
Safer Gap: Safe Navigation of Planar Nonholonomic Robots with a Gap-Based Local Planner |
|
Feng, Shiyu | Georgia Institute of Technology |
Abuaish, Ahmad | Georgia Institute of Technology |
Vela, Patricio | Georgia Institute of Technology |
Keywords: Vision-Based Navigation, Collision Avoidance, Reactive and Sensor-Based Planning
Abstract: This paper extends the gap-based navigation technique Potential Gap with safety guarantees at the local planning level for a kinematic planar nonholonomic robot model, leading to Safer Gap. It relies on a subset of navigable free space from the robot to a gap, denoted the keyhole region. The region is defined by the union of the largest collision-free disc centered on the robot and a collision-free trapezoidal region directed through the gap. Safer Gap first generates Bezier-based collision-free paths within the keyhole regions. The keyhole region of the top scoring path is encoded by a shallow neural network-based zeroing barrier function (ZBF) synthesized in real-time. Nonlinear Model Predictive Control (NMPC) with Keyhole ZBF constraints and output tracking of the Bezier path, synthesizes a safe kinematically feasible trajectory. The Potential Gap projection operator serves as a last action to enforce safety if the NMPC optimization fails to converge to a solution within the prescribed time. Simulation and experimental validation of Safer Gap confirm its collision-free navigation properties.
|
|
WeCT7 |
309 |
Marine Robotics 4 |
Regular Session |
|
11:15-11:20, Paper WeCT7.1 | |
Bathymetric Surveying with Imaging Sonar Using Neural Volume Rendering |
|
Xie, Yiping | Linköping University |
Troni, Giancarlo | Monterey Bay Aquarium Research Institute |
Bore, Nils | KTH Royal Institute of Technology |
Folkesson, John | KTH |
Keywords: Marine Robotics, Mapping, Deep Learning Methods
Abstract: This research addresses the challenge of estimating bathymetry from imaging sonars where the state-of-the-art works have primarily relied on either supervised learning with ground-truth labels or surface rendering based on the Lambertian assumption. In this letter, we propose a novel, self-supervised framework based on volume rendering for reconstructing bathymetry using forward-looking sonar (FLS) data collected during standard surveys. We represent the seafloor as a neural heightmap encapsulated with a parametric multi-resolution hash encoding scheme and model the sonar measurements with a differentiable renderer using sonar volumetric rendering employed with hierarchical sampling techniques. Additionally, we model the horizontal and vertical beam patterns and estimate them jointly with the bathymetry. We evaluate the proposed method quantitatively on simulation and field data collected by remotely operated vehicles (ROVs) during low-altitude surveys. Results show that the proposed method outperforms the current state-of-the-art approaches that use imaging sonars for seabed mapping. We also demonstrate that the proposed approach can potentially be used to increase the resolution of a low-resolution prior map with FLS data from low-altitude surveys.
|
|
11:20-11:25, Paper WeCT7.2 | |
Diver to Robot Communication Underwater |
|
Codd-Downey, Robert | York University |
Jenkin, Michael | York University |
Keywords: Marine Robotics, Gesture, Posture and Facial Expressions, Human-Robot Collaboration
Abstract: Gesture-based communication is a standard underwater communication strategy that is taught to divers as part of their regular diver training, and it would seem a natural mechanism to leverage for diver-to-robot communication underwater. Enabling an unmanned underwater vehicle (UUV) to understand such sequences would involve having the robot learn the large set of gestures that divers use and the way they are combined. As perfect transcription of gestures is unlikely, the communication process also requires an error-correcting framework to ensure that communication is clear and correct. Here we describe an interactive process that provides this infrastructure. A weakly supervised transfer learning approach is used to recognize standard SCUBA gestures in individual video frames and within a Sim2Real process to train an LSTM to recognize gesture sequences. This process is placed within a per-gesture and per-sequence interaction process to assist and confirm the recognition of individual gestures and to confirm entire gesture sequences. Individual aspects of this process and complete end-to-end operation are demonstrated using an unmanned underwater vehicle.
|
|
11:25-11:30, Paper WeCT7.3 | |
SIMP: Energy and Time-Efficient Real-Time 3D Motion Planning for Bio-Inspired AUVs |
|
Bjørlo, August Sletnes | NTNU |
Xanthidis, Marios | SINTEF Ocean |
Føre, Martin | NTNU |
Kelasidi, Eleni | NTNU |
Keywords: Marine Robotics, Collision Avoidance, Biologically-Inspired Robots
Abstract: Underwater navigation is an area of increasing research interest due to its fundamental complexity and industrial applications. However, due to convenience and the current state of theoretical understanding, the vast majority of underwater platforms use thrusters, while other forms of propulsion, such as undulatory locomotion, have received limited attention. This paper presents SIMP, the first real-time motion planning framework that produces energy- and time-efficient paths with empirical local optimality for articulated swimming robots in 3D. SIMP utilizes learned associations between parameterized, dynamically feasible undulatory gaits and their expected energy cost, velocity, and swept-out volume of the robot during execution, to formulate a simplified optimization problem that decides the path to be followed with the corresponding consecutive gaits and navigates the robot safely in complex 3D environments. The proposed pipeline is tested in numerical experiments with realistic dynamics for a 10-link underwater snake robot (USR) with anguilliform gaits, in simulated cluttered environments of significant challenge, displaying real-time replanning performance of more than 1 Hz.
|
|
11:30-11:35, Paper WeCT7.4 | |
End-To-End Underwater Multi-View Stereo for Dense Scene Reconstruction |
|
Yang, Guidong | The Chinese University of Hong Kong |
Wen, Junjie | The Chinese University of Hong Kong |
Zhao, Benyun | The Chinese University of Hong Kong |
Li, Qingxiang | The Chinese University of Hong Kong |
Huang, Yijun | The Chinese University of Hong Kong |
Lei, Lei | City University of Hong Kong |
Chen, Xi | The Chinese University of Hong Kong |
Lam, Alan Hiu-Fung | The Chinese University of Hong Kong, |
Chen, Ben M. | Chinese University of Hong Kong |
Keywords: Marine Robotics, Data Sets for Robotic Vision, Deep Learning for Visual Perception
Abstract: Recent advancements in learning-based multi-view stereo (MVS) have demonstrated significant improvements over their traditional counterparts, primarily due to the extensive availability of multi-view training images with ground-truth metric depths in the terrestrial in-air domain. However, underwater multi-view stereo (UwMVS) faces substantial challenges arising from the domain gap between in-air and underwater environments, leading to degraded performance when applying in-air MVS models to underwater scenarios. Furthermore, the progress of learning-based UwMVS methods has been hindered by the scarcity of underwater multi-view images with ground-truth depth maps and point clouds. In this paper, we address these challenges by introducing a physically-guided approach for synthesizing underwater multi-view images and present the first large-scale UwMVS dataset for end-to-end training and evaluation of learning-based UwMVS methods. Furthermore, we propose a novel UwMVS network that enhances geometric cue encoding to achieve more accurate and complete point cloud reconstruction. Extensive experiments on our dataset and real-world underwater scenes demonstrate that our dataset enables training models for dense underwater reconstruction and that our method achieves state-of-the-art performance in underwater reconstruction. Dataset, code and appendix are available at: https://cuhk-usr-group.github.io/UwMVS/
|
|
11:35-11:40, Paper WeCT7.5 | |
UR-MVO: Robust Monocular Visual Odometry for Underwater Scenarios |
|
Barhoum, Zein Alabedeen | ITMO University |
Maalla, Yazan | ITMO University |
Daher, Sulieman | ITMO University |
Topolnitskii, Alexander | ITMO University |
Mahmoud, Jaafar | ITMO University |
Kolyubin, Sergey | ITMO University |
Keywords: Marine Robotics, Localization, Object Detection, Segmentation and Categorization
Abstract: Visual odometry (VO) in underwater environments presents significant challenges due to poor visibility and dynamic scene changes, which render conventional (in-air) VO solutions unsuitable for underwater applications. We propose an underwater robust monocular visual odometry (UR-MVO) pipeline tailored for underwater scenarios, with feature extraction and matching based on the SuperPoint and SuperGlue models, respectively. We enhance the robustness of the feature extractor through field-specific fine-tuning of the SuperPoint model using few-shot unsupervised learning. This tuning was done on real images of underwater scenes to improve performance under harsh underwater imaging conditions. Moreover, we integrate semantic segmentation trained on underwater images into our pipeline to eliminate unreliable features belonging to dynamic objects and background. We evaluated the proposed solution on the Aqualoc dataset, demonstrating higher localization accuracy than other state-of-the-art direct and feature-based monocular VO methods such as DSO and SVO, and obtaining very competitive results compared to more resource-intensive monocular VSLAM approaches with loop closure, such as LDSO, UVS, and ORB-SLAM. The results show the high potential of our approach for further applications in underwater exploration and mapping using affordable sensory setups. We publish the code for the benefit of the community at https://github.com/be2rlab/UR-MVO
|
|
11:40-11:45, Paper WeCT7.6 | |
SeaSplat: Representing Underwater Scenes with 3D Gaussian Splatting and a Physically Grounded Image Formation Model |
|
Yang, Daniel | Massachusetts Institute of Technology |
Leonard, John | MIT |
Girdhar, Yogesh | Woods Hole Oceanographic Institution |
Keywords: Marine Robotics, Representation Learning, Deep Learning for Visual Perception
Abstract: We introduce SeaSplat, a method to enable real-time rendering of underwater scenes leveraging recent advances in 3D radiance fields. Underwater scenes are challenging visual environments, as rendering through a medium such as water introduces both range- and color-dependent effects on image capture. We constrain 3D Gaussian Splatting (3DGS), a recent advance in radiance fields enabling rapid training and real-time rendering of full 3D scenes, with a physically grounded underwater image formation model. Applying SeaSplat to real-world scenes from the SeaThru-NeRF dataset, collected by an underwater vehicle in the US Virgin Islands, and to simulation-degraded real-world scenes, we not only see increased quantitative performance when rendering novel viewpoints from the scene with the medium present, but are also able to recover the underlying true color of the scene and restore renders as if the intervening medium were absent. We show that the underwater image formation model helps learn scene structure, yielding better depth maps, and that our improvements maintain the significant computational advantages afforded by leveraging a 3D Gaussian representation.
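The "physically grounded image formation model" in this abstract refers to the standard per-channel attenuation-plus-backscatter description of underwater imaging. The sketch below applies such a model to a rendered color image and range map and then inverts it to recover medium-free color; the coefficients are made-up placeholders and the exact parameterization used in SeaSplat is defined in the paper.

```python
# Sketch of a per-channel underwater image formation model: direct signal is
# attenuated with range and a backscatter term is added. J is the medium-free
# color a radiance field would render, z the per-pixel range in meters.
import numpy as np

beta_D = np.array([0.40, 0.12, 0.08])   # direct-signal attenuation per channel (assumed)
beta_B = np.array([0.35, 0.10, 0.07])   # backscatter coefficient per channel (assumed)
B_inf  = np.array([0.05, 0.25, 0.35])   # veiling-light color (assumed)

def underwater_observe(J, z):
    """Attenuate the true color J (H,W,3) with range z (H,W) and add backscatter."""
    att = np.exp(-beta_D[None, None, :] * z[..., None])
    back = B_inf[None, None, :] * (1.0 - np.exp(-beta_B[None, None, :] * z[..., None]))
    return J * att + back

def restore_true_color(I, z):
    """Invert the model to recover J from an observed image and known range."""
    back = B_inf[None, None, :] * (1.0 - np.exp(-beta_B[None, None, :] * z[..., None]))
    return (I - back) * np.exp(beta_D[None, None, :] * z[..., None])

J = np.random.rand(4, 4, 3)                         # toy "true" scene colors
z = np.full((4, 4), 3.0)                            # 3 m range everywhere
I = underwater_observe(J, z)
print(np.allclose(restore_true_color(I, z), J))     # True: inversion is exact here
```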
|
|
WeCT8 |
311 |
Planning and Control for Legged Robots 2 |
Regular Session |
Co-Chair: Lin, Pei-Chun | National Taiwan University |
|
11:15-11:20, Paper WeCT8.1 | |
ProNav: Proprioceptive Traversability Estimation for Legged Robot Navigation in Outdoor Environments |
|
Elnoor, Mohamed | University of Maryland |
Sathyamoorthy, Adarsh Jagan | University of Maryland |
Kulathun Mudiyanselage, Kasun Weerakoon | University of Maryland, College Park |
Manocha, Dinesh | University of Maryland |
Keywords: Motion and Path Planning, Vision-Based Navigation, Perception-Action Coupling
Abstract: We propose a novel method, ProNav, which uses proprioceptive signals for traversability estimation in challenging outdoor terrains for autonomous legged robot navigation. Our approach uses sensor data from a legged robot’s joint encoders, force, and current sensors to measure the joint positions, forces, and current consumption, respectively, to accurately assess a terrain’s stability, resistance to the robot’s motion, risk of entrapment, and crash. Based on these factors, we compute the appropriate robot gait to maximize stability, which leads to reduced energy consumption. Our approach can also be used to predict imminent crashes in challenging terrains and execute behaviors to preemptively avoid them. We integrate ProNav with an exteroceptive-based method to navigate real-world environments with dense vegetation, high granularity, negative obstacles, etc. Our method shows an improvement of up to 40% in success rate and a reduction of up to 15.1% in energy consumption compared to exteroceptive-based methods.
|
|
11:20-11:25, Paper WeCT8.2 | |
MOVE: Multi-Skill Omnidirectional Legged Locomotion with Limited View in 3D Environments |
|
Li, Songbo | Zhejiang University |
Luo, Shixin | Zhejiang University |
Wu, Jun | Zhejiang University |
Zhu, Qiuguo | Zhejiang University |
Keywords: Legged Robots, Machine Learning for Robot Control, Deep Learning for Visual Perception
Abstract: Legged robots possess inherent advantages in traversing complex 3D terrains. However, previous work on low-cost quadruped robots with egocentric vision systems has been limited by a narrow front-facing view and exteroceptive noise, restricting omnidirectional mobility in such environments. While building a voxel map through a hierarchical structure can refine exteroception processing, it introduces significant computational overhead, noise, and delays. In this paper, we present MOVE, a one-stage end-to-end learning framework capable of multi-skill omnidirectional legged locomotion with limited view in 3D environments, just like what a real animal can do. When movement aligns with the robot's line of sight, exteroceptive perception enhances locomotion, enabling extreme climbing and leaping. When vision is obstructed or the direction of movement lies outside the robot's field of view, the robot relies on proprioception for tasks like crawling and climbing stairs. We integrate all these skills into a single neural network by introducing a pseudo-siamese network structure combining supervised and contrastive learning which helps the robot infer its surroundings beyond its field of view. Experiments in both simulations and real-world scenarios demonstrate the robustness of our method, broadening the operational environments for robotics with egocentric vision.
|
|
11:25-11:30, Paper WeCT8.3 | |
Generating Diverse Challenging Terrains for Legged Robots Using Quality-Diversity Algorithm |
|
Esquerre-Pourtère, Arthur | Seoul National University |
Kim, Minsoo | Graduate School of Convergence Science and Technology, Seoul Nat |
Park, Jaeheung | Seoul National University |
Keywords: Legged Robots, Evolutionary Robotics, Failure Detection and Recovery
Abstract: While legged robots have achieved significant advancements in recent years, ensuring the robustness of their controllers on unstructured terrains remains challenging. It requires generating diverse and challenging unstructured terrains to test the robot and discover its vulnerabilities. This topic remains underexplored in the literature. This paper presents a Quality-Diversity framework to generate diverse and challenging terrains that uncover weaknesses in legged robot controllers. Our method, applied to both simulated bipedal and quadruped robots, produces an archive of terrains optimized to challenge the controller in different ways. Quantitative and qualitative analyses show that the generated archive effectively contains terrains that the robots struggled to traverse, presenting different failure modes. Interesting results were observed, including failure cases that were not necessarily expected. Experiments show that the generated terrains can also be used to improve RL-based controllers.
|
|
11:30-11:35, Paper WeCT8.4 | |
Added Mass and Accuracy of the FF-SLIP Model for Legged Swimming |
|
Austin, Max | The University of Tokyo |
Ma, Linna | Florida State University |
Vasquez, Derek A. | Florida State University |
Van Stratum, Brian | Florida State University |
Clark, Jonathan | Florida State University |
Keywords: Legged Robots, Biologically-Inspired Robots, Biomimetics
Abstract: This paper presents the addition of two models for added mass to the Fluid-Field Spring-Loaded Inverted Pendulum (FF-SLIP) Model for legged swimming. The relative ability of these models to capture the increased fluid forces due to virtual mass displacement is evaluated using a two-legged swimming robot, Tadpole. We show that a simple addition to our reduced-order model can predict fluid-leg interaction forces while remaining computationally efficient.
|
|
11:35-11:40, Paper WeCT8.5 | |
A Virtual Gravity Controller for Efficient Underactuated Biped Robots |
|
Maligianni, Despoina | National Technical University of Athens |
Valouxis, Fotios | National Technical University of Athens |
Kantounias, Antonios | National Technical University of Athens |
Smyrli, Aikaterini | National Technical University of Athens, Athena Research Center |
Papadopoulos, Evangelos | National Technical University of Athens |
Keywords: Passive Walking, Underactuated Robots, Humanoid and Bipedal Locomotion
Abstract: This paper introduces a virtual gravity controller for underactuated biped robots. A bio-inspired model of passive bipedal walking is used as the basis for the controller's design. An analytical expression of the controller is obtained, allowing on-line implementations of the developed control scheme. Following a design modification tailored to the controller, the robot is able to reproduce its passive gait even on level-ground. The results are verified via independent high-fidelity physics simulations of the real robot's digital twin. The active robot demonstrates significant dynamic convergence to the passive model's dynamics, with only minor motorization efforts. The developed control scheme showcases robustness and energetic efficiency, and leads the way to a design-oriented approach in active biped locomotion.
|
|
11:40-11:45, Paper WeCT8.6 | |
Stair Climbing of a Transformable Robot Using Varying Leg-Wheel Contact Points |
|
Lai, Yen-Li | National Taiwan University |
Yu, Wei-Shun | National Taiwan University |
Lin, Pei-Chun | National Taiwan University |
Keywords: Legged Robots, Motion Control, Wheeled Robots
Abstract: Staircases are a challenging terrain frequently encountered in urban environments. While leg-wheel robots take advantage of having both legged and wheeled modes, their ability to negotiate stairs still requires careful planning. This paper presents a novel approach to developing a stair-climbing behavior for leg-wheel transformable robots. A comprehensive stair-climbing strategy is constructed by analyzing the workspace of the leg-wheel mechanism, considering the position of the robot’s center of mass, and accounting for foothold displacement owing to the possible leg-wheel forward rolling motion. This strategy enables the robot to safely navigate stairs using its leg-wheel's appropriate parts. Stability during transitions between steps is ensured, and an optimized swing trajectory is proposed to minimize slippage and impact. The approach is validated through simulations and further tested experimentally on staircases with treads of 27 cm and risers of 12 cm, as well as staircases with treads of 24 cm and risers of 14 cm. The experimental results demonstrate the effectiveness and robustness of the proposed method.
|
|
WeCT9 |
312 |
Geometric Foundations |
Regular Session |
Chair: Barfoot, Timothy | University of Toronto |
Co-Chair: Ge, Qiaode | Stony Brook University |
|
11:15-11:20, Paper WeCT9.1 | |
Marginalizing and Conditioning Gaussians Onto Linear Approximations of Smooth Manifolds with Applications in Robotics |
|
Guo, Zi Cong | University of Toronto |
Forbes, James Richard | McGill University |
Barfoot, Timothy | University of Toronto |
Keywords: Probability and Statistical Methods, SLAM, Probabilistic Inference
Abstract: We present closed-form expressions for marginalizing and conditioning Gaussians onto linear manifolds, and demonstrate how to apply these expressions to smooth nonlinear manifolds through linearization. Although marginalization and conditioning onto axis-aligned manifolds are well-established procedures, doing so onto non-axis-aligned manifolds is not as well understood. We demonstrate the utility of our expressions through three applications: 1) approximation of the projected normal distribution, where the quality of our linearized approximation increases as problem nonlinearity decreases; 2) covariance extraction in Koopman SLAM, where our covariances are shown to be consistent on a real-world dataset; and 3) covariance extraction in constrained GTSAM, where our covariances are shown to be consistent in simulation.
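As a point of reference for this abstract, the snippet below shows the textbook special case of conditioning a Gaussian onto a non-axis-aligned linear manifold {x : Ax = b}; the paper's contribution is the general closed forms and their linearized application to smooth manifolds, which this sketch does not reproduce.

```python
# Condition N(mu, Sigma) on the linear constraint A x = b (a non-axis-aligned
# linear manifold). Standard Gaussian conditioning formulas, shown for intuition only.
import numpy as np

def condition_on_linear_manifold(mu, Sigma, A, b):
    S = A @ Sigma @ A.T                         # innovation covariance
    K = Sigma @ A.T @ np.linalg.inv(S)          # gain
    mu_c = mu + K @ (b - A @ mu)                # conditional mean lies on A x = b
    Sigma_c = Sigma - K @ A @ Sigma             # conditional covariance (rank-deficient along A)
    return mu_c, Sigma_c

mu = np.array([1.0, 2.0, 0.5])
Sigma = np.diag([0.5, 0.2, 0.1])
A = np.array([[1.0, -1.0, 0.0]])                # constrain x0 - x1 = 0.3
b = np.array([0.3])
mu_c, Sigma_c = condition_on_linear_manifold(mu, Sigma, A, b)
print(A @ mu_c)                                 # ~[0.3]: the conditioned mean satisfies the constraint
```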
|
|
11:20-11:25, Paper WeCT9.2 | |
"Hierarchy of Needs" for Robots: Control Synthesis for Compositions of Hierarchical, Complex Objectives |
|
Lin, Ruoyu | University of California, Irvine |
Egerstedt, Magnus | University of California, Irvine |
Keywords: Hybrid Logical/Dynamical Planning and Verification, Robot Safety, Integrated Planning and Control
Abstract: Drawing inspiration from Maslow's "hierarchy of needs", this paper develops a real-time control synthesis framework for robots to address hierarchical, complex objectives, recognizing that their behaviors are inherently driven by underlying needs. Each need is encoded by the zero-superlevel set of a control barrier function (CBF), which can be time-varying, and all the needs at the same level in a hierarchy are composed into a single one through Boolean compositions of the corresponding CBFs. The effectiveness of the proposed framework is demonstrated through a hypothetical interstellar exploration mission using laboratory robots, and novel results on nonsmooth CBF and time-varying CBF are derived.
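One common way to realize the Boolean "AND" composition of barrier functions mentioned in this abstract is a smooth under-approximation of the minimum; the toy snippet below illustrates only that idea, not the paper's nonsmooth and time-varying treatment.

```python
# Smooth AND of two "needs" h1 >= 0 and h2 >= 0 via a log-sum-exp soft minimum,
# which under-approximates min(h1, h2) and approaches it as kappa grows.
import numpy as np

def smooth_and(h_values, kappa=10.0):
    """Soft minimum: always <= min(h_values)."""
    h = np.asarray(h_values, dtype=float)
    return -np.log(np.sum(np.exp(-kappa * h))) / kappa

h_stay_charged = 0.4      # e.g. battery margin (assumed toy value)
h_avoid_hazard = 0.1      # e.g. distance margin to a hazard (assumed toy value)
print(smooth_and([h_stay_charged, h_avoid_hazard]), min(h_stay_charged, h_avoid_hazard))
```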
|
|
11:25-11:30, Paper WeCT9.3 | |
RM4D: A Combined Reachability and Inverse Reachability Map for Common 6-/7-Axis Robot Arms by Dimensionality Reduction to 4D |
|
Rudorfer, Martin | Aston University |
Keywords: Kinematics, Mobile Manipulation, Industrial Robots
Abstract: Knowledge of a manipulator’s workspace is fundamental for a variety of tasks including robot design, grasp planning and robot base placement. Consequently, workspace representations are well studied in robotics. Two important representations are reachability maps and inverse reachability maps. The former predicts whether a given end-effector pose is reachable from where the robot currently is, and the latter suggests suitable base positions for a desired end-effector pose. Typically, the reachability map is built by discretizing the 6D space containing the robot’s workspace and determining, for each cell, whether it is reachable or not. The reachability map is subsequently inverted to build the inverse map. This is a cumbersome process which restricts the applications of such maps. In this work, we exploit commonalities of existing six- and seven-axis robot arms to reduce the dimension of the discretization from 6D to 4D. We propose Reachability Map 4D (RM4D), a map that only requires a single 4D data structure for both forward and inverse queries. This gives a much more compact map that can be constructed an order of magnitude faster than existing maps, with no inversion overhead and no loss in accuracy. Finally, we showcase the efficiency gains by applying RM4D to find suitable base positions in a scenario with 800 target grasps.
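To make the idea of a single forward/inverse structure concrete, here is a generic sketch of a binary reachability grid over a reduced 4D index. The specific four coordinates used by RM4D are defined in the paper; the reduction in `pose_to_key` below (yaw symmetry about the base axis plus an approach-tilt angle) is an illustrative assumption of this sketch, not the paper's construction.

```python
# Generic 4D reachability grid: mark poses reachable during forward-kinematics
# sampling, then answer forward queries (and, with the same grid, inverse-style
# queries about relative base placement) by hashing a pose to a 4D cell.
import numpy as np

class ReachMap4D:
    def __init__(self, bins=(40, 40, 36, 36),
                 lows=(0.0, -1.0, 0.0, 0.0), highs=(1.5, 1.5, np.pi, np.pi)):
        self.grid = np.zeros(bins, dtype=bool)
        self.lows, self.highs, self.bins = np.array(lows), np.array(highs), np.array(bins)

    def pose_to_key(self, p_ee, z_axis_ee):
        r = np.hypot(p_ee[0], p_ee[1])                    # radial distance from the base axis
        h = p_ee[2]                                       # end-effector height
        tilt = np.arccos(np.clip(z_axis_ee[2], -1, 1))    # approach tilt w.r.t. vertical
        # relative angle between radial direction and horizontal approach direction
        yaw_rel = np.arctan2(z_axis_ee[1], z_axis_ee[0]) - np.arctan2(p_ee[1], p_ee[0])
        feat = np.array([r, h, tilt, abs(np.arctan2(np.sin(yaw_rel), np.cos(yaw_rel)))])
        idx = ((feat - self.lows) / (self.highs - self.lows) * self.bins).astype(int)
        return tuple(np.clip(idx, 0, self.bins - 1))

    def mark_reachable(self, p_ee, z_axis_ee):
        self.grid[self.pose_to_key(p_ee, z_axis_ee)] = True

    def is_reachable(self, p_ee, z_axis_ee):
        return bool(self.grid[self.pose_to_key(p_ee, z_axis_ee)])
```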
|
|
11:30-11:35, Paper WeCT9.4 | |
An Average-Distance Minimizing Motion Sweep for Bounded Spatial Objects and Its Application in Bézier-Like Freeform Motion Generation |
|
Liu, Huan | Stony Brook University, SUNY |
Ge, Qiaode | Stony Brook University |
Keywords: Kinematics, Motion and Path Planning, Motion Control
Abstract: This paper uses the ellipsoidal parameters associated with volume moments of inertia of a bounded solid object to construct a motion sweep joining two poses of the solid object, in contrast to earlier works on motion interpolation in SE(3) without taking into account the shape of the moving object. The paper borrows the concept of shape-dependent object norms introduced by Kazerounian and Rastegar and refined by Chirikjian and Zhou to compute as a metric the average of the squared distances (or ASD) among all homologous points of the bounded body between two given poses and seeks to obtain an optimal interpolating motion that minimizes a combination of two ASD distances from each intermediate pose to the two given poses. It is found that the ASD minimizing motion sweep is a novel straight-line motion such that while the centroid of the object follows a straight line, the orientation of the object is constrained so that the ASD metric is minimized. Furthermore, the rotational component can be determined by polar decomposition of the linearly interpolated rotation matrices, scaled by the object's inertia parameters. As an illustration of one of its applications, this motion sweep is repeatedly applied using the de Casteljau algorithm to generate Bézier-like freeform motions, whose paths are in general dependent on the shape of the inertia ellipsoid.
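The polar-decomposition step mentioned at the end of this abstract can be illustrated in a few lines of NumPy: linearly blend two rotation matrices (here weighted by a diagonal inertia factor that merely stands in for the paper's exact inertia scaling) and project the blend back onto SO(3) via an SVD-based polar decomposition.

```python
# Inertia-weighted linear blend of two rotations, projected back to SO(3) by the
# polar factor of the blend. The diagonal weighting J is an assumed placeholder.
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def polar_rotation(M):
    """Closest rotation to M in the Frobenius sense (polar factor via SVD)."""
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:          # enforce a proper rotation
        U[:, -1] *= -1
        R = U @ Vt
    return R

R0, R1 = rot_z(0.0), rot_z(np.pi / 2)
J = np.diag([2.0, 1.0, 0.5])          # assumed inertia-ellipsoid weighting
for t in np.linspace(0, 1, 5):
    blend = ((1 - t) * R0 + t * R1) @ J
    R_t = polar_rotation(blend)
    print(t, np.round(R_t @ R_t.T, 6)[0, 0])   # each R_t is orthogonal
```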
|
|
11:35-11:40, Paper WeCT9.5 | |
Geometric Static Modeling Framework for Piecewise-Continuous Curved-Link Multi Point-Of-Contact Tensegrity Robots |
|
Ervin, Lauren | University of Alabama |
Vikas, Vishesh | University of Alabama |
Keywords: Kinematics, Space Robotics and Automation
Abstract: Tensegrities synergistically combine tensile (cable) and rigid (link) elements to achieve structural integrity, making them lightweight, packable, and impact resistant. Consequently, they have high potential for locomotion in unstructured environments. This research presents geometric modeling of a Tensegrity eXploratory Robot (TeXploR) comprised of two semi-circular, curved links held together by 12 prestressed cables and actuated with an internal mass shifting along each link. This design allows for efficient rolling with stability (e.g., tip-over on an incline). However, the unique design poses static and dynamic modeling challenges given the discontinuous nature of the semi-circular, curved links, two changing points of contact with the surface plane, and instantaneous movement of the masses along the links. The robot is modeled using a geometric approach where the holonomic constraints confirm the experimentally observed four-state hybrid system, proving TeXploR rolls along one link while pivoting about the end of the other. It also identifies the quasi-static state transition boundaries that enable a continuous change in the robot states via internal mass shifting. This is the first time in literature a non-spherical two-point contact system is kinematically and geometrically modeled. Furthermore, the static solutions are closed-form and do not require numerical exploration of the solution. The MATLAB® simulations are experimentally validated on a tetherless prototype with mean absolute error of 4.36° for the arc angles of the points of contact.
|
|
11:40-11:45, Paper WeCT9.6 | |
GISR: Geometric Initialization and Silhouette-Based Refinement for Single-View Robot Pose and Configuration Estimation |
|
Bilic, Ivan | University of Zagreb |
Maric, Filip | University of Toronto Institute for Aerospace Studies |
Bonsignorio, Fabio | FER, University of Zagreb |
Petrovic, Ivan | University of Zagreb |
Keywords: Deep Learning for Visual Perception, Visual Learning, AI-Enabled Robotics
Abstract: In autonomous robotics, measurement of the robot’s internal state and perception of its environment, including interaction with other agents such as collaborative robots, are essential. Estimating the pose of the robot arm from a single view has the potential to replace classical eye-to-hand calibration approaches and is particularly attractive for online estimation and dynamic environments. In addition to its pose, recovering the robot configuration provides a complete spatial understanding of the observed robot that can be used to anticipate the actions of other agents in advanced robotics use cases. Furthermore, this additional redundancy enables the planning and execution of recovery protocols in case of sensor failures or external disturbances. We introduce GISR - a deep configuration and robot-to-camera pose estimation method that prioritizes execution in real-time. GISR consists of two modules: (i) a geometric initialization module that efficiently computes an approximate robot pose and configuration, and (ii) a deep iterative silhouette-based refinement module that arrives at a final solution in just a few iterations. We evaluate GISR on publicly available data and show that it outperforms existing methods of the same class in terms of both speed and accuracy, and can compete with approaches that rely on ground-truth proprioception and recover only the pose. Our code will be available at https://github.com/iwhitey/GISR-robot.
|
|
WeCT10 |
313 |
Multi-Robot Path Planning 3 |
Regular Session |
Chair: Hollinger, Geoffrey | Oregon State University |
Co-Chair: Yu, Jingjin | Rutgers University |
|
11:15-11:20, Paper WeCT10.1 | |
Stop-N-Go: Search-Based Conflict Resolution for Motion Planning of Multiple Robotic Manipulators |
|
Han, Gidon | Sogang University |
Park, Jeongwoo | Sogang University |
Nam, Changjoo | Sogang University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Cooperating Robots
Abstract: We address the motion planning problem for multiple robotic manipulators in packed environments, where a shared workspace can result in goal positions being occupied or blocked by other robots unless those robots move away to free them. While planning in a coupled configuration space (C-space) is straightforward, it struggles to scale with the number of robots and often fails to find solutions. Decoupled planning is faster but frequently leads to conflicts between trajectories. We propose a conflict resolution approach that inserts pauses into individually planned trajectories using an A* search strategy to minimize the makespan, i.e., the total time until all robots complete their tasks. This method allows some robots to stop, enabling others to move without collisions, and maintains short distances in the C-space. It also effectively handles cases where goal positions are initially blocked by other robots. Experimental results show that our method successfully solves challenging instances where baseline methods fail to find feasible solutions.
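As a toy illustration of search-based pause insertion, the sketch below runs a best-first search over per-robot start delays for point robots and returns the first conflict-free assignment ordered by makespan. The actual method operates on manipulator trajectories in configuration space and can insert pauses anywhere along a trajectory, so treat this only as a minimal analogue.

```python
# Best-first search over start-delay assignments (a simplified stand-in for
# inserting pauses), minimizing makespan while resolving pairwise conflicts.
import heapq
import numpy as np

def first_conflict(trajs, delays, radius):
    """First (t, i, j) where two robots get closer than `radius` after applying
    per-robot start delays (a delayed robot waits at its first waypoint)."""
    horizon = max(len(tr) + d for tr, d in zip(trajs, delays))
    for t in range(horizon):
        pts = [np.asarray(tr[min(max(t - d, 0), len(tr) - 1)]) for tr, d in zip(trajs, delays)]
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                if np.linalg.norm(pts[i] - pts[j]) < radius:
                    return t, i, j
    return None

def stop_n_go(trajs, radius, max_delay=50):
    """Best-first search over start delays, ordered by makespan."""
    start = tuple(0 for _ in trajs)
    frontier = [(max(len(tr) for tr in trajs), start)]
    seen = {start}
    while frontier:
        makespan, delays = heapq.heappop(frontier)
        conflict = first_conflict(trajs, delays, radius)
        if conflict is None:
            return delays, makespan                   # minimal-makespan conflict-free schedule found
        _, i, j = conflict
        for r in (i, j):                              # branch: delay either robot in the conflict
            nd = list(delays); nd[r] += 1
            nd = tuple(nd)
            if nd not in seen and nd[r] <= max_delay:
                seen.add(nd)
                heapq.heappush(frontier, (max(len(tr) + d for tr, d in zip(trajs, nd)), nd))
    return None, None

# two point robots whose paths cross at the origin: one must wait for the other
traj_a = [(x, 0.0) for x in np.linspace(-2, 2, 9)]
traj_b = [(0.0, y) for y in np.linspace(-2, 2, 9)]
print(stop_n_go([traj_a, traj_b], radius=0.6))        # e.g. ((0, 2), 11)
```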
|
|
11:20-11:25, Paper WeCT10.2 | |
Constrained Nonlinear Kaczmarz Projection on Intersections of Manifolds for Coordinated Multi-Robot Mobile Manipulation |
|
Agrawal, Akshaya | Oregon State University |
Mayer, Parker | Oregon State University |
Kingston, Zachary | Purdue University |
Hollinger, Geoffrey | Oregon State University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Constrained Motion Planning, Cooperating Robots
Abstract: Cooperative manipulation tasks impose various structure-, task-, and robot-specific constraints on mobile manipulators. However, current methods struggle to model and solve these myriad constraints simultaneously. We propose a twofold solution: first, we model constraints as a family of manifolds amenable to simultaneous solving. Second, we introduce the constrained nonlinear Kaczmarz (cNKZ) projection technique to produce constraint-satisfying solutions. Experiments show that cNKZ dramatically outperforms baseline approaches, which cannot find solutions at all. We integrate cNKZ with a sampling-based motion planning algorithm to generate complex, coordinated motions for 3--6 mobile manipulators (18--36 DoF), with cNKZ solving up to 80 nonlinear constraints simultaneously and achieving up to a 92% success rate in cluttered environments. We also demonstrate our approach on hardware using three Turtlebot3 Waffle Pi robots with OpenMANIPULATOR-X arms.
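For intuition on the projection step, here is the basic nonlinear Kaczmarz iteration on two toy constraint manifolds (a sphere and a plane); the paper's cNKZ variant adds the constraint handling needed for the full family of cooperative-manipulation constraints, which this sketch omits.

```python
# Basic nonlinear Kaczmarz sweep: cycle through constraints f_i(x) = 0 and take
# the minimum-norm Newton step onto each local linearization.
import numpy as np

def nonlinear_kaczmarz(x0, constraints, sweeps=50, tol=1e-10):
    """constraints: list of (f, J) with f: R^n -> R^m and J its Jacobian."""
    x = np.asarray(x0, dtype=float)
    for _ in range(sweeps):
        for f, J in constraints:
            r = np.atleast_1d(f(x))
            Jx = np.atleast_2d(J(x))
            x = x - Jx.T @ np.linalg.solve(Jx @ Jx.T, r)
        if all(np.linalg.norm(f(x)) < tol for f, _ in constraints):
            break
    return x

# toy manifolds: a sphere of radius 2 and the plane x + y + z = 1
sphere = (lambda x: np.array([x @ x - 4.0]), lambda x: 2.0 * x[None, :])
plane  = (lambda x: np.array([x.sum() - 1.0]), lambda x: np.ones((1, 3)))
x_star = nonlinear_kaczmarz(np.array([1.0, 0.5, -0.2]), [sphere, plane])
print(x_star, np.round([x_star @ x_star - 4, x_star.sum() - 1], 8))   # both residuals ~0
```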
|
|
11:25-11:30, Paper WeCT10.3 | |
Targeted Parallelization of Conflict-Based Search for Multi-Robot Path Planning |
|
Guo, Teng | Rutgers University |
Yu, Jingjin | Rutgers University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Motion and Path Planning
Abstract: Multi-Robot Path Planning (MRPP) on graphs, also known as Multi-Agent Path Finding (MAPF), is a well-established NP-hard problem with critically important applications. In (near-)optimally solving MRPP, as serial computation approaches its efficiency limits, parallelization offers a promising route to extend that limit further. As a single solution is unlikely to be successful in addressing all settings, e.g., in handling small/hard or large/sparse MRPP instances, in this study we explore a targeted parallelization effort to boost the performance of conflict-based search for MRPP. Specifically, when instances are relatively small but robots are densely packed with strong interactions, we devise a decentralized parallel algorithm that concurrently explores multiple branches, which leads to markedly enhanced solution discovery. On the other hand, for large problems with sparse robot-robot interactions, we find prioritizing node expansion and conflict resolution to be more promising. Our multi-threaded approach to parallelizing bounded-suboptimal conflict-based search algorithms demonstrates significant improvements over baseline serial methods in success rate or runtime. Our work furthers the understanding of MRPP and charts a promising path for elevating solution quality and computational efficiency through parallel algorithmic strategies.
|
|
11:30-11:35, Paper WeCT10.4 | |
Heuristically Guided Compilation for Task Assignment and Path Finding |
|
Chen, Zheng | Zhejiang University |
Chen, Changlin | University of Science and Technology of China |
Yiran, Ni | Zhejiang University |
Wang, Junhao | Hefei University of Technology |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Collision Avoidance, Multi-Robot Systems
Abstract: We investigate the Combined Target-Assignment and Path-Finding (TAPF) problem, which computes both task assignments and collision-free paths for multiple agents: each agent is required to select a target from an underlying set, and reaching it yields a payoff. There is also a cost closely related to the time required for each agent to reach its goal. The objective is to maximize the minimum gain generated by the agents. We propose a Compilation-Based Approach with Heuristics (TA-CBWH) to approximate the optimal solution, behind which are two critical ideas: (i) for a specific task assignment, we formulate an integer linear program (ILP) and iterate it with large neighborhood search (LNS) to quickly improve the solution quality to near-optimal; (ii) across distinct task assignments, a switching mechanism is developed to determine the most promising iteration while progressively eliminating unnecessary task assignments. Comparative experiments demonstrate that TA-CBWH outperforms a wide range of existing approaches across various maps and different numbers of agents.
|
|
11:35-11:40, Paper WeCT10.5 | |
Deploying Ten Thousand Robots: Scalable Imitation Learning for Lifelong Multi-Agent Path Finding |
|
Jiang, He | Carnegie Mellon University |
Wang, Yutong | National University of Singapore |
Veerapaneni, Rishi | Carnegie Mellon University |
Duhan, Tanishq Harish | National University of Singapore |
Sartoretti, Guillaume Adrien | National University of Singapore (NUS) |
Li, Jiaoyang | Carnegie Mellon University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Imitation Learning, Integrated Planning and Learning
Abstract: Lifelong Multi-Agent Path Finding (LMAPF) repeatedly finds collision-free paths for multiple agents that are continually assigned new goals when they reach current ones. Recently, this field has embraced learning-based methods, which reactively generate single-step actions based on individual local observations. However, it is still challenging for them to match the performance of the best search-based algorithms, especially in large-scale settings. This work proposes an imitation-learning-based LMAPF solver that introduces a novel communication module as well as systematic single-step collision resolution and global guidance techniques. Our proposed solver, Scalable Imitation Learning for LMAPF (SILLM), inherits the fast reasoning speed of learning-based methods and the high solution quality of search-based methods with the help of modern GPUs. Across six large-scale maps with up to 10,000 agents and varying obstacle structures, SILLM surpasses the best learning- and search-based baselines, achieving average throughput improvements of 137.7% and 16.0%, respectively. Furthermore, SILLM also beats the winning solution of the 2023 League of Robot Runners, an international LMAPF competition. Finally, we validated SILLM with 10 real robots and 100 virtual robots in a mock warehouse environment.
|
|
11:40-11:45, Paper WeCT10.6 | |
Safety-Guaranteed Distributed Formation Control of Multi-Robot Systems Over Graphs with Rigid and Elastic Edges |
|
Pham, Hoang | Tampere University |
Ranasinghe, Nadun | Tampere University |
Le, Dong | Tampere University |
Atman, Made Widhi Surya | Turku University |
Gusrialdi, Azwirman | Tampere University |
Keywords: Multi-Robot Systems, Collision Avoidance, Distributed Robot Systems
Abstract: This paper considers the problem of formation control of multi-robot systems represented by a graph featuring both rigid and elastic edges, capturing specified range tolerance to the desired inter-robot distances. The objective is to navigate the robots safely through unknown environments with obstacles, utilizing onboard sensors like LiDAR while maintaining inter-robot distance constraints. To this end, a novel cooperative control algorithm is proposed, employing quadratic programming and leveraging control barrier functions to integrate multiple control objectives seamlessly. This approach ensures a unified strategy and provides a safety certificate. Experimental validation of the proposed cooperative control algorithm is conducted using a robotic testbed.
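A minimal single-constraint version of the control-barrier-function quadratic program used here has a closed-form solution, sketched below for one single-integrator robot keeping a minimum distance to a neighbor; the paper composes many such constraints (rigid and elastic edges, LiDAR-detected obstacles) into one QP, which this sketch does not attempt.

```python
# Closed-form single-constraint CBF filter:
#   u* = argmin ||u - u_des||^2  s.t.  grad_h . u >= -alpha * h
import numpy as np

def cbf_filter(u_des, p_self, p_neigh, d_min, alpha=1.0):
    diff = p_self - p_neigh
    h = diff @ diff - d_min**2            # barrier: h >= 0  <=>  distance >= d_min
    grad_h = 2.0 * diff                   # gradient w.r.t. this robot's position
    slack = grad_h @ u_des + alpha * h
    if slack >= 0:                        # nominal command already safe
        return u_des
    return u_des - slack * grad_h / (grad_h @ grad_h)

u_des = np.array([-1.0, 0.0])             # nominal command drives toward the neighbor
u_safe = cbf_filter(u_des, p_self=np.array([0.6, 0.0]), p_neigh=np.zeros(2), d_min=0.5)
print(u_safe)                              # approach speed reduced so that h >= 0 is preserved
```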
|
|
WeCT11 |
314 |
Safe Control 2 |
Regular Session |
Chair: Bajcsy, Andrea | Carnegie Mellon University |
Co-Chair: Hu, Bin | University of Houston |
|
11:15-11:20, Paper WeCT11.1 | |
Safety-Critical Control with Saliency Detection for Mobile Robots in Dynamic Multi-Obstacle Environments |
|
Zhang, Yu | Technical University of Munich |
Wen, Long | Technical University of Munich |
Hong, Lin | Harbin Institute of Technology |
Zhang, Liding | Technical University of Munich |
Guo, Qun | Technische Universität München |
Li, Shixin | Technical University of Munich |
Bing, Zhenshan | Technical University of Munich |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: Robot Safety, Robust/Adaptive Control, RGB-D Perception
Abstract: This paper proposes a novel dual-filter architecture utilizing RGB-D camera data and dynamic control barrier functions (D-CBFs) for real-time obstacle avoidance in unstructured environments. The proposed method efficiently handles static, suddenly appearing, and dynamic obstacles, maintaining consistent computational performance across diverse scenarios. To achieve this, two key challenges must be addressed. First, the substantial volume of pixel and depth map data requires robust, real-time processing for efficient D-CBF construction. Second, constructing D-CBFs for each obstacle in multi-obstacle scenarios increases optimization solver time. To address these challenges, we adapt the concept of salient object detection (SOD), proposing an enhanced FastSOD (E-FastSOD) method for rapid risk area identification. This approach rapidly filters out low-risk areas, while high-risk regions are mathematically represented utilizing the proposed enhanced minimal bounding circle (E-MBC) technique. We differentiate static and dynamic obstacles by comparing current and previous MBC states, employing Kalman filtering for obstacle state prediction. This setup enables efficient online D-CBF construction for each MBC, balancing computational speed with accurate obstacle representation. Subsequently, the second filter establishes buffer zones around the constructed D-CBFs, activating only those corresponding to zones the robot actually enters, rather than all D-CBFs, thereby improving real-time performance. We prove the system's safety and asymptotic stabilization under this architecture. Simulated and real-world experiments validate our method, demonstrating an equipped mobile robot's ability to accomplish tasks while ensuring safety across diverse, unknown scenarios.
|
|
11:20-11:25, Paper WeCT11.2 | |
Safe Coverage for Heterogeneous Systems with Limited Connectivity |
|
Taylor, Annalisa T. | Northwestern University |
Berrueta, Thomas | Northwestern University |
Pinosky, Allison | Northwestern University |
Murphey, Todd | Northwestern University |
Keywords: Distributed Robot Systems, Robot Safety, Networked Robots
Abstract: An ongoing challenge for emergency deployments is operating multi-robot teams of diverse agents under communication constraints---where inter-agent connectivity is rare. Thus, heterogeneous systems must autonomously adapt to changing conditions while maintaining safety. In this work, we develop an algorithm for heterogeneous, decentralized multi-robot systems to independently manage safety constraints with provable guarantees for safety and communication for a coverage task. We demonstrate this algorithm in scenarios where up to 100 agents must navigate a simulated cluttered environment with safety constraints that change as agents observe hazards. Further, we show that the performance of a system with a largely disconnected network is equivalent to a fully connected communication network, suggesting that treating connectivity as a constraint may be unnecessary with an appropriate control strategy.
|
|
11:25-11:30, Paper WeCT11.3 | |
Safe Control of Quadruped in Varying Dynamics Via Safety Index Adaptation |
|
Yun, SirkHoo, Kai | Carnegie Mellon University |
Chen, Rui | Carnegie Mellon University; University of Michigan; |
Dunaway, Chase | New Mexico Institute of Mining and Technology |
Dolan, John M. | Carnegie Mellon University |
Liu, Changliu | Carnegie Mellon University |
Keywords: Robot Safety, Robust/Adaptive Control, Legged Robots
Abstract: Varying dynamics pose a fundamental difficulty when deploying safe control laws in the real world. Safety Index Synthesis (SIS) deeply relies on the system dynamics and once the dynamics change, the previously synthesized safety index becomes invalid. In this work, we show the real-time efficacy of Safety Index Adaptation (SIA) in varying dynamics. SIA enables real-time adaptation to the changing dynamics so that the adapted safe control law can still guarantee 1) forward invariance within a safe region and 2) finite time convergence to that safe region. This work employs SIA on a package-carrying quadruped robot, where the payload weight changes in real-time. SIA updates the safety index when the dynamics change, e.g., a change in payload weight, so that the quadruped can avoid obstacles while achieving its performance objectives. Numerical study provides theoretical guarantees for SIA and a series of hardware experiments demonstrate the effectiveness of SIA in real-world deployment in avoiding obstacles under varying dynamics.
|
|
11:30-11:35, Paper WeCT11.4 | |
Updating Robot Safety Representations Online from Natural Language Feedback |
|
Santos, Leonardo | Universidade Federal De Minas Gerais |
Li, Zirui | University of Rochester |
Peters, Lasse | Delft University of Technology |
Bansal, Somil | Stanford University |
Bajcsy, Andrea | Carnegie Mellon University |
Keywords: Robot Safety, AI-Enabled Robotics, Vision-Based Navigation
Abstract: Robots must operate safely when deployed in novel and human-centered environments, like homes. Current safe control approaches typically assume that the safety constraints are known a priori, and thus, the robot can pre-compute a corresponding safety controller. While this may make sense for some safety constraints (e.g., avoiding collision with walls by analyzing a floor plan), other constraints are more complex (e.g., spills), inherently personal, context-dependent, and can only be identified at deployment time when the robot is interacting in a specific environment and with a specific person (e.g., fragile objects, expensive rugs). Here, language provides a flexible mechanism to communicate these evolving safety constraints to the robot. In this work, we use vision language models (VLMs) to interpret language feedback and the robot’s image observations to continuously update the robot’s representation of safety constraints. With these inferred constraints, we update a Hamilton-Jacobi reachability safety controller to efficiently update the robot controller to ensure ongoing safety. Through simulation and hardware experiments, we demonstrate the robot’s ability to infer and respect language-based safety constraints with the proposed approach.
|
|
11:35-11:40, Paper WeCT11.5 | |
Detecting Perception-Based Attacks Using Visual Odometry: Inconsistency Modeling and Checking on Robotic States |
|
Xu, Yuan | Nanyang Technological University |
Deng, Gelei | Nanyang Technological University |
Zhang, Tianwei | Nanyang Technological University |
Keywords: Robot Safety
Abstract: Perception systems in robotic vehicles are crucial for safe and efficient operation, providing key state estimates necessary for planning and control. However, these systems are increasingly vulnerable to perception-based attacks, such as odometry spoofing, position spoofing, obstacle hiding, and object misclassification, which can lead to catastrophic failures. In this paper, we propose a novel approach to detect perception-based attacks by modeling inconsistencies between the physical and estimated states of the robot. Our approach offers a unified methodology for detecting different types of attacks with high accuracy and minimal computational overhead. We validate our method through extensive simulations and real-world scenarios, achieving a 99.5% success rate in detecting attacks, while maintaining a low latency (within 100ms).
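For readers who want the flavor of an inconsistency check like the one described above, the following minimal sketch (not the authors' detector; the unicycle model, threshold, and function names are illustrative assumptions) propagates the last trusted state through a nominal motion model and flags a visual-odometry estimate that deviates beyond a residual threshold.

```python
import numpy as np

def propagate(state, cmd, dt):
    """Nominal unicycle motion model: state = [x, y, yaw], cmd = [v, omega]."""
    x, y, yaw = state
    v, omega = cmd
    return np.array([x + v * np.cos(yaw) * dt,
                     y + v * np.sin(yaw) * dt,
                     yaw + omega * dt])

def inconsistency_check(prev_state, cmd, vo_state, dt, threshold=0.15):
    """Flag a perception-based attack when the visual-odometry estimate is
    inconsistent with the physically plausible predicted state."""
    predicted = propagate(prev_state, cmd, dt)
    residual = np.linalg.norm(predicted[:2] - vo_state[:2])
    return bool(residual > threshold), float(residual)

# Toy usage: a spoofed position jump triggers the check.
prev = np.array([0.0, 0.0, 0.0])
cmd = np.array([1.0, 0.0])                    # 1 m/s straight ahead for 0.1 s
honest_vo = np.array([0.1, 0.0, 0.0])
spoofed_vo = np.array([0.6, 0.3, 0.0])
print(inconsistency_check(prev, cmd, honest_vo, dt=0.1))   # (False, 0.0)
print(inconsistency_check(prev, cmd, spoofed_vo, dt=0.1))  # (True, ~0.58)
```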
|
|
11:40-11:45, Paper WeCT11.6 | |
Distributed Perception Aware Safe Leader Follower System Via Control Barrier Methods |
|
Suganda, Richie Ryulie | University of Houston |
Tran, Tony | University of Houston |
Pan, Miao | University of Houston |
Fan, Lei | University of Houston |
Lin, Qin | University of Houston |
Hu, Bin | University of Houston |
Keywords: Robot Safety, Multi-Robot Systems, Vision-Based Navigation
Abstract: This paper addresses a distributed leader-follower formation control problem for a group of agents, each using a body-fixed camera with a limited field of view (FOV) for state estimation. The main challenge arises from the need to coordinate the agents’ movements with their cameras’ FOV to maintain visibility of the leader for accurate and reliable state estimation. To address this challenge, we propose a novel perception-aware distributed leader-follower safe control scheme that incorporates FOV limits as state constraints. A Control Barrier Function (CBF) based quadratic program is employed to ensure the forward invariance of a safety set defined by these constraints. Furthermore, new neural network based and double bounding boxes based estimators, combined with temporal filters, are developed to estimate system states directly from real-time image data, providing consistent performance across various environments. Comparison results in the Gazebo simulator demonstrate the effectiveness and robustness of the proposed framework in two distinct environments.
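As background for the CBF-based quadratic program mentioned above, a single affine CBF constraint admits a closed-form, minimally invasive correction of the nominal input. The sketch below illustrates this generic construction for control-affine dynamics; the dynamics, barrier function, and gain are toy assumptions, not the paper's leader-follower system.

```python
import numpy as np

def cbf_qp_single_constraint(u_nom, Lfh, Lgh, h, alpha=1.0):
    """Minimally modify u_nom so that  Lfh + Lgh @ u + alpha * h >= 0.

    For a single affine constraint, the QP  min ||u - u_nom||^2  s.t.  a @ u >= b
    has the closed-form solution  u = u_nom + max(0, b - a @ u_nom) * a / (a @ a).
    """
    u_nom = np.asarray(u_nom, dtype=float)
    a = np.asarray(Lgh, dtype=float)
    b = -(Lfh + alpha * h)
    slack = b - a @ u_nom
    if slack <= 0.0:                 # nominal input already satisfies the constraint
        return u_nom
    return u_nom + slack * a / (a @ a)

# Toy usage: keep a scalar state x >= 0 with integrator dynamics xdot = u,
# i.e. h(x) = x, Lfh = 0, Lgh = [1].
x = 0.2
u_nominal = np.array([-3.0])         # nominal controller pushes toward the boundary
u_safe = cbf_qp_single_constraint(u_nominal, Lfh=0.0, Lgh=np.array([1.0]), h=x)
print(u_safe)                        # [-0.2], i.e. clipped to -alpha * h
```

With several constraints active at once, for instance one per field-of-view limit, a small QP solver would take the place of this closed-form projection.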
|
|
WeCT12 |
315 |
Human-Robot Interaction 4 |
Regular Session |
|
11:15-11:20, Paper WeCT12.1 | |
Gesturing towards Efficient Robot Control: Exploring Sensor Placement and Control Modes for Mid-Air Human-Robot Interaction |
|
Mielke, Tonia | Otto-Von-Guericke University Magdeburg |
Heinrich, Florian | Otto-Von-Guericke University Magdeburg |
Hansen, Christian | Otto-Von-Guericke University Magdeburg |
Keywords: Design and Human Factors, Virtual Reality and Interfaces, Sensor-based Control
Abstract: While collaborative robots effectively combine robotic precision with human capabilities, traditional control methods such as button presses or hand guidance can be slow and physically demanding. This has led to an increasing interest in natural user interfaces that integrate hand gesture-based interactions for more intuitive and flexible robot control. Therefore, this paper systematically explores mid-air robot control by comparing position and rate control modes with different state-of-the-art and novel sensor placements. A user study was conducted to evaluate each combination in terms of accuracy, task duration, perceived workload, and physical exertion. Our results indicate that position control is more efficient than rate control. Traditional desk-mounted sensors can provide a good balance between accuracy and comfort. However, robot-mounted sensors are a viable alternative for short-term, accurate control with smaller spatial requirements. Leg-mounted sensors, while comfortable, pose challenges to hand-eye coordination. Based on these findings, we provide design implications for improving the usability and comfort of mid-air human-robot interaction. Future research should extend this evaluation to a wider range of tasks and environments.
|
|
11:20-11:25, Paper WeCT12.2 | |
Understanding Dynamic Human-Robot Proxemics in the Case of Four-Legged Canine-Inspired Robots |
|
Xu, Xiangmin | University of Glasgow |
Meng, Zhen | University of Glasgow |
Li, Liying Emma | University of Glasgow |
Khamis, Mohamed | University of Glasgow |
Zhao, Philip Guodong | University of Manchester, UK |
Bretin, Robin | University of Glasgow |
Keywords: Physical Human-Robot Interaction, Social HRI, Safety in HRI
Abstract: The integration of humanoid and animal-shaped robots into specialized domains, such as healthcare, multi-terrain operations, and psychotherapy, necessitates a deep understanding of proxemics—the study of spatial behavior that governs effective human-robot interactions. Unlike traditional robots in manufacturing or logistics, these robots must navigate complex human environments where maintaining appropriate physical and psychological distances is crucial for seamless interaction. This study explores the application of proxemics in human-robot interactions, focusing specifically on quadruped robots, which present unique challenges and opportunities due to their lifelike movement and form. Utilizing a motion capture system, we examine how different interaction postures of a canine robot influence human participants' proxemic behavior in dynamic scenarios. By capturing and analyzing position and orientation data, this research aims to identify key factors that affect proxemic distances and inform the design of socially acceptable robots. The findings underscore the importance of adhering to human psychological and physical distancing norms in robot design, ensuring that autonomous systems can coexist harmoniously with humans.
|
|
11:25-11:30, Paper WeCT12.3 | |
Autonomous Navigation in Crowded Space Using Multi-Sensory Data Fusion |
|
Ananna, Nourin Siddique | BRAC University |
Saif, Mollah Md | BRAC University |
Noor, Maisha | BRAC University |
Awishi, Ishrat Tasnim | BRAC University |
Rahman, Md. Khalilur | BRAC University |
Alam, Md Golam Rabilul | BRAC University |
Keywords: Human-Aware Motion Planning, Data Sets for Robot Learning, Sensor Fusion
Abstract: Autonomous navigation in crowded environments remains a significant challenge due to the highly dynamic and unpredictable nature of pedestrian movements. This paper presents a novel approach for socially-compliant crowd navigation by leveraging human pose tracking, trajectory prediction, and obstacle avoidance techniques. We introduce PoseTrajNet, an end-to-end autonomous agent navigation pipeline that integrates YOLOv8 for object detection, BlazePose for real-time human pose estimation, and a custom trajectory prediction model drawing on concepts from Social GANs. PoseTrajNet employs pose keypoints as socially-compliant features to anticipate pedestrian trajectories, enabling proactive path planning and dynamic safe radius adjustments for obstacle avoidance. Extensive evaluations on standard datasets demonstrate PoseTrajNet's effectiveness in seamless crowd navigation, outperforming baselines while adhering to social norms.
|
|
11:30-11:35, Paper WeCT12.4 | |
Feasibility-Aware Imitation Learning from Observation through a Hand-Mounted Demonstration Interface |
|
Takahashi, Kei | Nara Institute of Science and Technology |
Sasaki, Hikaru | Nara Institute of Science and Technology |
Matsubara, Takamitsu | Nara Institute of Science and Technology |
Keywords: Imitation Learning, Learning from Demonstration
Abstract: Imitation learning through a demonstration interface is expected to learn policies for robot automation from intuitive human demonstrations. However, due to the differences in human and robot movement characteristics, a human expert might unintentionally demonstrate an action that the robot cannot execute. We propose feasibility-aware behavior cloning from observation (FABCO). In the FABCO framework, the feasibility of each demonstration is assessed using the robot's pre-trained forward and inverse dynamics models. This feasibility information is provided as visual feedback to the demonstrators, encouraging them to refine their demonstrations. During policy learning, estimated feasibility serves as a weight for the demonstration data, improving both the data efficiency and the robustness of the learned policy. We experimentally validated FABCO's effectiveness by applying it to a pipette insertion task involving a pipette and a vial. Four participants assessed the impact of the feasibility feedback and the weighted policy learning in FABCO. Additionally, we used the NASA Task Load Index (NASA-TLX) to evaluate the workload induced by demonstrations with visual feedback.
|
|
11:35-11:40, Paper WeCT12.5 | |
Human-Robot Collaboration for the Remote Control of Mobile Humanoid Robots with Torso-Arm Coordination |
|
Boguslavskii, Nikita | Worcester Polytechnic Institute (WPI) |
Genua, Lorena Maria | Worcester Polytechnic Institute |
Li, Zhi | Worcester Polytechnic Institute |
Keywords: Telerobotics and Teleoperation, Human-Robot Collaboration, Human Factors and Human-in-the-Loop
Abstract: Humanoid robots are increasingly being deployed in various facilities, including hospitals and assisted living environments, where they are often remotely controlled by human operators. Their kinematic redundancy enhances reachability and manipulability, enabling them to navigate complex, cluttered environments and perform a wide range of tasks. However, this redundancy also presents significant control challenges, particularly in coordinating the movements of the robot's macro-micro structure (torso and arms). Therefore, we propose various human-robot collaborative (HRC) methods for coordinating the torso and arm of remotely controlled mobile humanoid robots, aiming to balance autonomy and human input to enhance system efficiency and task execution. The proposed methods include human-initiated approaches, where users manually control torso movements, and robot-initiated approaches, which autonomously coordinate torso and arm based on factors such as reachability, task goal, or inferred human intent. We conducted a user study with N=17 participants to compare the proposed approaches in terms of task performance, manipulability, and energy efficiency, and analyzed which methods were preferred by participants.
|
|
11:40-11:45, Paper WeCT12.6 | |
Soft Human-Robot Handover Using a Vision-Based Pipeline |
|
Castellani, Chiara | Istituto Italiano Di Tecnologia |
Turco, Enrico | Istituto Italiano Di Tecnologia |
Bo, Valerio | Istituto Italiano Di Tecnologia |
Malvezzi, Monica | University of Siena |
Prattichizzo, Domenico | University of Siena |
Costante, Gabriele | University of Perugia |
Pozzi, Maria | University of Siena |
Keywords: Grasping, Soft Robot Applications, Physical Human-Robot Interaction
Abstract: Handing over objects is an essential task in human-robot collaborative scenarios. Previous studies have predominantly employed rigid grippers to perform the handover, focusing their efforts on the generation of grasps that avoid physical contact with people. In this paper, instead, we present a vision-based open-palm handover solution where a soft robotic hand deliberately exploits contact with the human hand for improved grasp success and robustness. In particular, the human-robot physical interaction allows the robotic hand to slide over the human palm surface and firmly cage the object. The identification of the human hand plane and the object pose is achieved through a versatile perception pipeline that exploits a single RGB-D camera. Through several experimental trials we show that the system achieves successful grasps over multiple objects with different geometries and textures. We also conduct a comparative analysis between the proposed soft handover method and a baseline approach, evaluating their robustness to uncertainties on the object position. Lastly, a user study with 30 participants is conducted to evaluate the users' perception of the human-robot interaction during the handover. The obtained results highlight the effectiveness of the proposed pipeline with different users and an overall users' preference for the soft handover.
|
|
WeCT13 |
316 |
Soft Robotic Grasping 2 |
Regular Session |
Chair: Wang, Wei | University of Wisconsin-Madison |
Co-Chair: Vikas, Vishesh | University of Alabama |
|
11:15-11:20, Paper WeCT13.1 | |
Utilizing Bioinspired Soft Modular Appendages for Grasping and Locomotion in Multi-Legged Robots on Ground and Underwater |
|
Siddiquee, Abu Nayem Md. Asraf | University of Notre Dame |
Ozkan-Aydin, Yasemin | University of Notre Dame |
Keywords: Soft Robot Applications, Biologically-Inspired Robots, Soft Sensors and Actuators
Abstract: Soft robots can adapt to their environments, which makes them suitable for deployment in disaster areas and agricultural fields, where their mobility is constrained by complex terrain. One of the main challenges in developing soft terrestrial robots is that the robot must be soft enough to adapt to its environment, but also rigid enough to exert adequate force on the ground to locomote. In this letter, we report a pneumatically driven, soft modular appendage made of silicone for a terrestrial robot capable of generating specific mechanical movement to locomote in the desired direction. We used Finite Element Analysis (FEA) simulations to assess the soft leg's bending behavior, validated against the physical leg. In addition, we performed blocked force analysis to understand its force generation capabilities. We developed a soft-rigid-bodied tethered robot prototype and tested it in ground and underwater environments to evaluate its locomotion performance. The robot demonstrated successful forward and backward movement as well as left and right turns, both on the ground and underwater. We explored the object manipulation and transportation capability of the robot by adding two additional soft appendages as a gripper. The robot demonstrated its ability to effectively manipulate and transport objects of varying nature, including rigid items such as a 3D-printed plastic box and fragile objects like an egg. The maximum load-carrying capacity of the robot was also investigated both on the ground and in water. Our design approach provides a straightforward, cost-effective, and efficient method for creating versatile soft appendages for a robot that is capable of terradynamic locomotion. This approach showcases its potential applicability in underwater search and rescue missions.
|
|
11:20-11:25, Paper WeCT13.2 | |
Design of a Novel Pneumatic Soft Gripper for Robust Adaptive Grasping |
|
Sun, Xiantao | Anhui University |
Zhong, Mingsheng | Anhui University |
Tang, Zhouzheng | Anhui University |
Chen, Wenjie | Anhui University |
Chen, Weihai | Beihang University |
Keywords: Grippers and Other End-Effectors, Mechanism Design, Soft Sensors and Actuators
Abstract: Soft grippers have shown promising performance in safe and adaptive grasping tasks. However, they often suffer from limitations in grasping force. To address this challenge, this paper presents a novel pneumatic three-finger soft gripper to achieve robust adaptive grasping. The gripper consists of three identical fingers, each containing a pneumatic bending soft actuator and a pneumatic lateral soft actuator. The bending actuator features a tilted pneumatic network structure, which provides superior bending performance compared to traditional vertical pneumatic network structures. The lateral actuator is equipped with three deflection chambers at the finger root to mimic the lateral motions of a human finger. Kinematic and static models are established to predict the bending angle and grasping force of the soft finger under pressurized air. The performance of the proposed soft finger is analyzed through finite element simulations, and the effect of the chamber tilt angle is also examined. The theoretical and simulation results are compared to verify the validity of the analytical models. Finally, the proposed soft gripper is fabricated by 3D printing and molding. Experimental results show that the gripper is capable of grasping various objects of different sizes, shapes, materials, and weights, and can perform dexterous manipulation tasks, such as cap unscrewing. The proposed soft gripper exhibits significant potential for applications in robust robotic grasping tasks.
|
|
11:25-11:30, Paper WeCT13.3 | |
Hybrid Gripper with Passive Pneumatic Soft Joints for Grasping Deformable Thin Objects |
|
Tran, Duy | Ha Noi University of Science and Technology |
Ly, Hoang Hiep | Hanoi University of Science and Technology |
Nguyen, Thuan | Hanoi University of Science and Technology |
Mac, Thi Thoa | HUST |
Nguyen, Anh | University of Liverpool |
Ta, Tung D. | The University of Tokyo |
Keywords: Mechanism Design, Grippers and Other End-Effectors, Soft Robot Materials and Design
Abstract: Grasping a variety of objects remains a key challenge in the development of versatile robotic systems. The human hand is remarkably dexterous, capable of grasping and manipulating objects with diverse shapes, mechanical properties, and textures. Inspired by how humans use two fingers to pick up thin and large objects such as fabric or sheets of paper, we aim to develop a gripper optimized for grasping such deformable objects. Observing how the soft and flexible fingertip joints of the hand approach and grasp thin materials, we propose a hybrid gripper design that incorporates both soft and rigid components. The gripper utilizes a soft pneumatic ring wrapped around a rigid revolute joint to create a flexible two-fingered gripper. Experiments were conducted to characterize and evaluate the gripper's performance in handling sheets of paper and other objects. Compared to rigid grippers, the proposed design improves grasping efficiency and reduces the gripping distance by up to a factor of eight.
|
|
11:30-11:35, Paper WeCT13.4 | |
Dexterous Three-Finger Gripper Based on Offset Trimmed Helicoids |
|
Guan, Qinghua | Harbin Institute of Technology |
Cheng, Hung Hon | EPFL |
Hughes, Josie | EPFL |
Keywords: Grippers and Other End-Effectors, Soft Sensors and Actuators, Soft Robot Applications
Abstract: This study presents an innovative offset-trimmed helicoids (OTH) structure, featuring a tunable deformation center that emulates the flexibility of human fingers. This design significantly reduces the actuation force needed for larger elastic deformations, particularly when dealing with harder materials like thermoplastic polyurethane (TPU). The incorporation of two helically routed tendons within the finger enables both in-plane bending and lateral out-of-plane transitions, effectively expanding its workspace and allowing for variable curvature along its length. Compliance analysis indicates that the compliance at the fingertip can be fine-tuned by adjusting the mounting placement of the fingers. This customization enhances the gripper's adaptability to a diverse range of objects. By leveraging TPU's substantial elastic energy storage capacity, the gripper is capable of dynamically rotating objects at high speeds, achieving approximately 60° in just 15 milliseconds. The three-finger gripper, with its high dexterity across six degrees of freedom, has demonstrated the capability to successfully perform intricate tasks. One such example is the adept spinning of a rod within the gripper's grasp.
|
|
11:35-11:40, Paper WeCT13.5 | |
Improving Grip Stability Using Passive Compliant Microspine Arrays for Soft Robots in Unstructured Terrain |
|
Ervin, Lauren | University of Alabama |
Bezawada, Harish | The University of Alabama |
Vikas, Vishesh | University of Alabama |
Keywords: Compliant Joints and Mechanisms, Soft Robot Materials and Design, Field Robots
Abstract: Microspines are small spines, commonly found on insect legs, that reinforce surface interaction by engaging with asperities to increase shear force and traction. An array of such microspines, when integrated into the limbs or undercarriage of a robot, can provide the ability to maneuver uneven terrains, traverse inclines, and even climb walls. Meanwhile, the conformability and adaptability of soft robots make them ideal candidates for applications involving traversal of complex, unstructured terrains. However, there remains a real-life realization gap for soft locomotors pertaining to their transition from the controlled lab environment to the field, which can be bridged by improving grip stability through effective integration of microspines. In this research, a passive, compliant microspine stacked array design is proposed to enhance the locomotion capabilities of mobile soft robots. A microspine array integration method effectively addresses the stiffness mismatch between soft, compliant, and rigid components. Additionally, a reduction in complexity results from actuation of the surface-conformable soft limb using a single actuator. The two-row, stacked microspine array configuration offers improved gripping capabilities on steep and irregular surfaces. This design is incorporated into three different robot configurations - the baseline without microspines and two others with different combinations of microspine arrays. Field experiments are conducted on surfaces of varying surface roughness and non-uniformity - concrete, brick, compact sand, and tree roots. Experimental results demonstrate that the inclusion of microspine arrays increases planar displacement by an average of 10 times. The improved grip stability, repeatability, and terrain traversability are reflected in a decrease in the relative standard deviation of the locomotion gaits.
|
|
11:40-11:45, Paper WeCT13.6 | |
Hybrid Soft Pneumatic and Tendon Actuated Finger with Selective Locking Chain Link Joints |
|
Lin, Keng-Yu | University of Wisconsin Madison |
Stonecipher, Jack | University of Wisconsin-Madison |
Rusch, Zach | University of Wisconsin-Madison |
Wang, Wei | University of Wisconsin-Madison |
Wehner, Michael | University of Wisconsin, Madison |
Keywords: Soft Robot Applications, Grippers and Other End-Effectors, Grasping
Abstract: Rigid robots excel in structured conditions, but struggle in more unpredictable or populated environments. Soft robots address these difficulties, but the compliance which gives them their inherent safety also limits their ability to apply desired forces. Jamming/locking reduces this back-drivability but does not allow for directional application of force. We present a hybrid system including pneumatic and tendon actuation as well as a system of cable-driven locking modules, able to lock individual joints. Combining these mechanisms yields a device which can behave as: a soft finger, a fully-rigid finger, and a locally-locking finger which mimics a traditional rigid-link robot. This finger is able to switch between these behaviors on-the-fly, allowing it to adapt to unexpected scenarios, critical for social robots. Using these modes and the ability to adapt in real time, our finger is able to complete common household tasks that are difficult for current robots. We characterize the finger's ability to resist force in three actuation modes (Pneumatic, Cable, and Locked), its ability to apply force, and its ability to actuate in 31 different configurations (plus a static all-locked configuration). We also present demonstrations in which the finger conforms to the shape of a computer mouse and then clicks a mouse button, and conforms to the shape of a heavy door handle and then pulls it open. We present the design, fabrication, and characterization of this finger as a demonstration of the underlying concept, which can be broadly applied to social robotics.
|
|
WeCT14 |
402 |
Reconfigurable Robots |
Regular Session |
|
11:15-11:20, Paper WeCT14.1 | |
Enabling Framework for Constant Complexity Model in Autonomous Inter-Reconfigurable Robots (I) |
|
Wan, Ash Yaw Sang | Singapore University of Technology and Design |
Le, Anh Vu | Communication and Signal Processing Research Group Faculty of El |
Moo, Chee Gen | Singapore University of Technology and Design |
Sivanantham, Vinu | Singapore University of Technology and Design |
Elara, Mohan Rajesh | Singapore University of Technology and Design |
Keywords: Cellular and Modular Robots, Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents
Abstract: In reconfigurable robotics, intra-reconfiguration enables a robot to change its functional abilities, while inter-reconfiguration manipulates the specification limits of the robot hardware. Although the versatility of inter-reconfigurable robots is desirable in advanced autonomous systems, the O(n^3) computational time complexity of existing algorithms becomes a challenge when multiple modular robots combine and reconfigure into a larger structure for autonomous navigation tasks. This has limited the potential of inter-reconfiguration in terms of expansion, versatility, and robustness. In this paper, a navigation framework with non-complex transformation states is proposed for inter-reconfigurable robots performing combining and splitting operations. Simulations show a reduction in the complexity of the framework's reconfiguration states from O(n) to constant time O(1) over a considerable number of robot agents. Additionally, a set of inter-reconfigurable robots, Wasp Biggie, was used as a fully functional centralized planner system to demonstrate the proof of concept in experiments. These experiments showed consistently favorable CPU consumption while performing navigation and reconfiguration.
|
|
11:20-11:25, Paper WeCT14.2 | |
Improving Coverage Performance of a Size-Reconfigurable Robot Based on Overlapping and Reconfiguration Reduction Criteria |
|
Muthugala Arachchige, Viraj Jagathpriya Muthugala | Singapore University of Technology and Design |
Samarakoon Mudiyanselage, Bhagya Prasangi Samarakoon | Singapore University of Technology and Design |
Wijegunawardana, Isira Damsith | Singapore University of Technology |
Elara, Mohan Rajesh | Singapore University of Technology and Design |
Keywords: Motion and Path Planning, Planning under Uncertainty, Neural and Fuzzy Control
Abstract: Size reconfigurable robots have been introduced for coverage applications to improve performance. The size reconfiguration ability allows a robot to access narrow areas in a smaller size while covering open spaces in a larger size, improving productivity. This paper proposes a novel coverage path planning (CPP) method consisting of an Overlapping Reduction Criterion (ORC) and a Reconfiguration Reduction Criterion (RRC) for a size-reconfigurable robot to improve performance in dynamic workspaces. A Glasius Bio-inspired Neural Network (GBNN) is adapted to guide the robot toward unvisited cells considering neural activity variation. The size variation is managed by utilizing a collection of grid maps generated for various size configurations of the robot. In the decision-making process of next-movement selection, the RRC and ORC penalize movements that require size reconfigurations or create isolated unvisited regions, thereby reducing reconfigurations and overlap. According to the results, the proposed CPP method surpasses the state of the art by significant margins in terms of the performance indexes of reconfiguration count, overlap, path distance, and coverage time.
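To illustrate the general idea of penalizing reconfigurations and overlap during next-cell selection, here is a hedged sketch only; the scoring terms, weights, and data layout are assumptions, not the paper's GBNN formulation.

```python
def select_next_cell(candidates, current_size, w_rrc=0.5, w_orc=0.3):
    """Pick the next cell by its neural activity, penalizing moves that require
    a size reconfiguration (RRC) or would leave isolated unvisited cells (ORC).

    candidates: list of dicts with keys 'id', 'activity' (neural activity of the
    cell), 'required_size', and 'isolates_region' (bool).
    """
    best_id, best_score = None, float('-inf')
    for c in candidates:
        score = c['activity']
        if c['required_size'] != current_size:
            score -= w_rrc                      # reconfiguration penalty (RRC)
        if c['isolates_region']:
            score -= w_orc                      # isolation/overlap penalty (ORC)
        if score > best_score:
            best_id, best_score = c['id'], score
    return best_id

cands = [
    {'id': 'A', 'activity': 0.9, 'required_size': 'small', 'isolates_region': False},
    {'id': 'B', 'activity': 0.8, 'required_size': 'large', 'isolates_region': False},
    {'id': 'C', 'activity': 0.4, 'required_size': 'large', 'isolates_region': True},
]
print(select_next_cell(cands, current_size='large'))    # 'B'
```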
|
|
11:25-11:30, Paper WeCT14.3 | |
CoCube: A Tabletop Modular Multi-Robot Platform for Education and Research |
|
Liang, Shuai | Fudan University |
Zhu, Songyi | Shanghai Artificial Intelligence Laboratory |
Tang, Zhonghan | University of Science and Technology of China |
Li, Chenhui | Shanghai Artificial Intelligence Laboratory |
Wu, Wenjie | DynaLab |
Han, Jialing | Fudan University |
Lin, Zemin | Shanghai Jiaotong University |
You, Zhongrui | Shanghai Artificial Intelligence Laboratory |
Maloney, John | MicroBlocks |
Romagosa Carrasquer, Bernat | SAP |
Zhao, Bin | Northwestern Polytechnical University |
Wang, Zhigang | Shanghai AI Laboratory |
Zhang, Zhinan | Shanghai Jiao Tong University |
Li, Xuelong | Northwestern Polytechnical University |
Keywords: Multi-Robot Systems, Education Robotics, Cellular and Modular Robots
Abstract: This paper presents CoCube, a tabletop modular robotics platform designed for robotics education and multi-robot algorithm research. CoCube is characterized by its low cost, low floors, high ceilings and wide walls, offering flexibility and broad applicability across various use cases. The platform comprises four key components: CoCube robots, which integrate wireless communication, movement and interaction; CoModules, which provide versatile external functionality; CoMaps, which enable high-precision localization via microdot patterns on regular printed paper; and CoTags for interaction. CoCube operates on MicroBlocks, a blocks programming language for physical computing inspired by Scratch, a widely-used coding language with a simple visual interface that makes programming accessible to young learners. It offers users both flexibility and ease of use, with advanced API support for more complex applications. This paper details the design of the CoCube platform and demonstrates its potential in both educational and research contexts.
|
|
11:30-11:35, Paper WeCT14.4 | |
Loopy Movements: Emergence of Rotation in a Multicellular Robot |
|
Smith, Trevor | West Virginia University |
Gu, Yu | West Virginia University |
Keywords: Cellular and Modular Robots, Swarm Robotics, Biologically-Inspired Robots
Abstract: Unlike most human-engineered systems, many biological systems rely on emergent behaviors from low-level interactions, enabling greater diversity and superior adaptation to complex, dynamic environments. This study explores emergent decentralized rotation in the Loopy multicellular robot, composed of homogeneous, physically linked, 1-degree-of-freedom cells. Inspired by biological systems like sunflowers, Loopy uses simple local interactions—diffusion, reaction, and active transport of simulated chemicals, called morphogens—without centralized control or knowledge of its global morphology. Through these interactions, the robot self-organizes to achieve coordinated rotational motion and forms lobes—local protrusions created by clusters of motor cells. This study investigates how these interactions drive Loopy’s rotation, the impact of its morphology, and its resilience to actuator failures. Our findings reveal two distinct behaviors: 1) inner valleys between lobes rotate faster than the outer peaks, contrasting with rigid body dynamics, and 2) cells rotate in the opposite direction of the overall morphology. The experiments show that while Loopy’s morphology does not affect its angular velocity relative to its cells, larger lobes increase cellular rotation and decrease morphology rotation relative to the environment. Even with up to one-third of its actuators disabled and significant morphological changes, Loopy maintains its rotational abilities, highlighting the potential of decentralized, bio-inspired strategies for resilient and adaptable robotic systems.
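The decentralized mechanism of diffusion, reaction, and active transport of a morphogen around a ring of cells can be caricatured with a few lines of purely local updates. This is an illustrative sketch under assumed coefficients and a simple linear reaction; Loopy's actual rules, lobed patterns, and rotation behavior come from the specific model in the paper.

```python
import numpy as np

def morphogen_step(u, D=0.2, transport=0.1, feed=0.01, decay=0.05):
    """One explicit update of a single morphogen on a ring of cells.

    Each cell only talks to its two neighbors: diffusion (periodic discrete
    Laplacian), directional active transport around the ring, and a local
    reaction (constant production with linear decay). No cell knows the
    global morphology.
    """
    diffusion = np.roll(u, 1) + np.roll(u, -1) - 2.0 * u
    advection = np.roll(u, 1) - u        # material handed on from one neighbor
    reaction = feed - decay * u
    return u + D * diffusion + transport * advection + reaction

rng = np.random.default_rng(0)
u = rng.random(36)                       # 36 cells, random initial concentration
for _ in range(300):
    u = morphogen_step(u)
print(u.round(3))                        # concentration profile after 300 local updates
```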
|
|
11:35-11:40, Paper WeCT14.5 | |
Enhancing Connection Strength in Freeform Modular Reconfigurable Robots through Holey Sphere and Gripper Mechanisms |
|
Wang, Peiqi | The Chinese University of Hong Kong, Shenzhen |
Liang, Guanqi | The Chinese University of Hong Kong, Shenzhen |
Zhao, Da | The Chinese University of Hong Kong |
Lam, Tin Lun | The Chinese University of Hong Kong, Shenzhen |
Keywords: Cellular and Modular Robots, Mechanism Design, Distributed Robot Systems
Abstract: Freeform modular self-reconfigurable robot (MSRR) systems overcome traditional docking limitations, enabling rapid and continuous connections between modules in any direction. Recent advancements in freeform MSRR technology have significantly enhanced connectivity and mobility. However, limitations in connector strength and operational efficiency in existing designs restrict performance. This paper proposes a rigid freeform connector and a rigid magnetic track design to improve the connection and motion performance of the SnailBot. Each SnailBot is equipped with a multi-channel rope-driven gripper, a metal spherical shell with densely distributed circular holes on the back, and a rigid chain design conforming to the spherical surface. This combination allows each SnailBot to move precisely along the surface of a peer, facilitated by the ferromagnetic spherical shell and magnetic track. The integration of the gripper and spherical shell hole array provides robust inter-module connections in any position and orientation. The effectiveness of these designs has been validated through a series of experiments and analyses, demonstrating improved connection and motion performance in the SnailBot dual-mode connector system and expanding its potential applications and functional capabilities.
|
|
WeCT15 |
403 |
Bimanual Manipulation 2 |
Regular Session |
Chair: Asfour, Tamim | Karlsruhe Institute of Technology (KIT) |
Co-Chair: Gupta, Satyandra K. | University of Southern California |
|
11:15-11:20, Paper WeCT15.1 | |
Learning Dexterous Bimanual Catch Skills through Adversarial-Cooperative Heterogeneous-Agent Reinforcement Learning |
|
Kim, Taewoo | Electronics and Telecommunications Research Institute |
Yoon, Youngwoo | Electronics and Telecommunications Research Institute |
Kim, Jaehong | ETRI |
Keywords: Bimanual Manipulation, Reinforcement Learning, Multifingered Hands
Abstract: Robotic catching has traditionally focused on single-handed systems, which are limited in their ability to handle larger or more complex objects. In contrast, bimanual catching offers significant potential for improved dexterity and object handling but introduces new challenges in coordination and control. In this paper, we propose a novel framework for learning dexterous bimanual catching skills using Heterogeneous-Agent Reinforcement Learning (HARL). Our approach introduces an adversarial reward scheme, where a throw agent increases the difficulty of throws by adjusting throwing speed, while a catch agent learns to coordinate both hands to catch objects under these evolving conditions. We evaluate the framework in simulated environments using 15 different objects, demonstrating robustness and versatility in handling diverse objects. Our method achieved approximately a 2x increase in catching reward compared to single-agent baselines across 15 diverse objects.
|
|
11:20-11:25, Paper WeCT15.2 | |
Flat'n'Fold: A Diverse Multi-Modal Dataset for Garment Perception and Manipulation |
|
Zhuang, Lipeng | University of Glasgow |
Fan, Shiyu | University of Glasgow |
Ru, Yingdong | University of Glasgow |
Audonnet, Florent | University of Glasgow |
Henderson, Paul | University of Glasgow |
Aragon-Camarasa, Gerardo | University of Glasgow |
Keywords: Data Sets for Robotic Vision, Data Sets for Robot Learning, Bimanual Manipulation
Abstract: We present Flat'n'Fold, a novel large-scale dataset for garment manipulation that addresses critical gaps in existing datasets. Comprising 1,212 human and 887 robot demonstrations of flattening and folding 44 unique garments across 8 categories, Flat'n'Fold surpasses prior datasets in size, scope, and diversity. Our dataset uniquely captures the entire manipulation process from crumpled to folded states, providing synchronized multi-view RGB-D images, point clouds, and action data, including hand or gripper positions and rotations. We quantify the dataset's diversity and complexity compared to existing benchmarks and show that our dataset features natural and diverse real-world human and robot demonstrations in terms of visual and action information. To showcase Flat'n'Fold's utility, we establish new benchmarks for grasping point prediction and subtask decomposition. Our evaluation of state-of-the-art models on these tasks reveals significant room for improvement. This underscores Flat'n'Fold's potential to drive advances in robotic perception and manipulation of deformable objects. Our dataset can be downloaded at https://cvas-ug.github.io/flat-n-fold
|
|
11:25-11:30, Paper WeCT15.3 | |
TWIN: Two-Handed Intelligent Benchmark for Bimanual Manipulation |
|
Grotz, Markus | University of Washington (UW) |
Shridhar, Mohit | University of Washington |
Chao, Yu-Wei | NVIDIA |
Asfour, Tamim | Karlsruhe Institute of Technology (KIT) |
Fox, Dieter | University of Washington |
Keywords: Bimanual Manipulation, Software Tools for Benchmarking and Reproducibility, Imitation Learning
Abstract: Bimanual manipulation is challenging due to the precise spatial and temporal coordination required between two arms. While there exist several real-world bimanual systems, there is a lack of simulated benchmarks with a large task diversity for systematically studying bimanual capabilities across a wide range of tabletop tasks. This paper addresses the gap by presenting a benchmark for bimanual manipulation. A key functionality is the ability to autonomously generate training data without the need for human demonstrations to the robot. We open-source our code and benchmark, which comprises 13 new tasks with 23 unique task variations, each requiring a high degree of coordination and adaptability. To initiate the benchmark, we extended multiple state-of-the-art techniques to the domain of bimanual manipulation. The project website with code is available at: http://bimanual.github.io.
|
|
11:30-11:35, Paper WeCT15.4 | |
Active Vision Might Be All You Need: Exploring Active Vision in Bimanual Robotic Manipulation |
|
Chuang, Ian | University of California, Davis |
Lee, Andrew | University of California, Davis |
Gao, Dechen | University of California, Davis |
Naddaf Shargh, Mohammad Mahdi | University of California - Davis |
Soltani, Iman | University of California, Davis |
Keywords: Perception for Grasping and Manipulation, Dual Arm Manipulation, Dexterous Manipulation
Abstract: Imitation learning has demonstrated significant potential in performing high-precision manipulation tasks using visual feedback. However, it is common practice in imitation learning for cameras to be fixed in place, resulting in issues like occlusion and limited field of view. Furthermore, cameras are often placed in broad, general locations, without an effective viewpoint specific to the robot's task. In this work, we investigate the utility of active vision (AV) for imitation learning and manipulation, in which, in addition to the manipulation policy, the robot learns an AV policy from human demonstrations to dynamically change the robot's camera viewpoint to obtain better information about its environment and the given task. We introduce AV-ALOHA, a new bimanual teleoperation robot system with AV, an extension of the ALOHA 2 robot system, incorporating an additional 7-DoF robot arm that only carries a stereo camera and is solely tasked with finding the best viewpoint. This camera streams stereo video to an operator wearing a virtual reality (VR) headset, allowing the operator to control the camera pose using head and body movements. The system provides an immersive teleoperation experience, with bimanual first-person control, enabling the operator to dynamically explore and search the scene and simultaneously interact with the environment. We conduct imitation learning experiments of our system both in real-world and in simulation, across a variety of tasks that emphasize viewpoint planning. Our results demonstrate the effectiveness of human-guided AV for imitation learning, showing significant improvements over fixed cameras in tasks with limited visibility. Project website: https://soltanilara.github.io/av-aloha/
|
|
11:35-11:40, Paper WeCT15.5 | |
Force-Conditioned Diffusion Policies for Compliant Sheet Separation Tasks in Bimanual Robotic Cells |
|
Shukla, Rishabh | University of Southern California |
Talan, Raj | University of Southern California |
Moode, Samrudh | University of Southern California |
Dhanaraj, Neel | University of Southern California |
Kang, Jeon Ho | University of Southern California |
Gupta, Satyandra K. | University of Southern California |
Keywords: Learning from Demonstration, Bimanual Manipulation, Disassembly
Abstract: Disassembly is a critical challenge in maintenance and service tasks, particularly in high-precision operations such as electric vehicle (EV) battery recycling. Tasks like prying open sealed battery covers require precise manipulation and controlled force application. In our approach, we collect human demonstrations using a motion capture system, enabling the robot to learn from human-expert disassembly strategies. These demonstrations train a bimanual robotic system in which one arm exerts force with a specialized tool while the other manipulates and removes sealed components. Our method builds on a diffusion-based policy and integrates real-time force sensing to adapt its actions as contact conditions change. We decompose the demonstrations into distinct sub-tasks and apply data augmentation, thereby reducing the number of demonstrations needed and mitigating potential task failures. Our results show that the proposed method, even with a small dataset, achieves a high task success rate and efficiency compared to a standard diffusion technique. We demonstrate in a real-world application that the bimanual system effectively executes chiseling and peeling actions to separate a bonded sheet from a substrate.
|
|
11:40-11:45, Paper WeCT15.6 | |
A Comparison of Imitation Learning Algorithms for Bimanual Manipulation |
|
Drolet, Michael | Technische Universität Darmstadt |
Stepputtis, Simon | Carnegie Mellon University |
Kailas, Siva | Carnegie Mellon University |
Jain, Ajinkya | Intrinsic Innovation LLC |
Peters, Jan | Technische Universität Darmstadt |
Schaal, Stefan | Google X |
Ben Amor, Heni | Arizona State University |
Keywords: Imitation Learning, Bimanual Manipulation, Learning from Demonstration
Abstract: Amidst the wide popularity of imitation learning algorithms in robotics, their properties regarding hyperparameter sensitivity, ease of training, data efficiency, and performance have not been well-studied in high-precision industry-inspired environments. In this work, we demonstrate the limitations and benefits of prominent imitation learning approaches and analyze their capabilities regarding these properties. We evaluate each algorithm on a complex bimanual manipulation task involving an over-constrained dynamics system in a setting involving multiple contacts between the manipulated object and the environment. While we find that imitation learning is well suited to solve such complex tasks, not all algorithms are equal in terms of handling environmental and hyperparameter perturbations, training requirements, performance, and ease of use. We investigate the empirical influence of these key characteristics by employing a carefully designed experimental procedure and learning environment.
|
|
WeCT16 |
404 |
Grasping 2 |
Regular Session |
Chair: Jia, Yan-Bin | Iowa State University |
Co-Chair: Spenko, Matthew | Illinois Institute of Technology |
|
11:15-11:20, Paper WeCT16.1 | |
Trajectory Optimization for Dynamically Grasping Irregular Objects |
|
Vu, Minh Nhat | TU Wien, Austria |
Grander, Florian | EGGER Holzwerkstoffe Brilon GmbH |
Nguyen, Anh | University of Liverpool |
Unger, Christoph | TU Wien |
Keywords: Industrial Robots, Motion and Path Planning
Abstract: This paper presents a novel trajectory optimization framework for grasping a thin object with the Schunk SDH2 hand mounted on a KUKA robot. Unlike a conventional grasping task, we aim to achieve a "dynamic grasp" of the object, which requires continuous movement during the grasping process. The trajectory framework comprises two phases. Firstly, within a specified time limit of 10 seconds, initial offline trajectories are computed for a seamless motion from an initial configuration of the robot to grasp the object and deliver it to a pre-defined target location. Secondly, fast online trajectory optimization is implemented to update robot trajectories in real time within 100 milliseconds. This helps to mitigate pose estimation errors from the vision system. To account for model inaccuracies, disturbances, and other non-modeled effects, trajectory tracking controllers for both the robot and the gripper are implemented to execute the optimal trajectories from the proposed framework. Simulation and experimental results effectively demonstrate the performance of the trajectory planning framework in real-world scenarios.
|
|
11:20-11:25, Paper WeCT16.2 | |
DistillGrasp: Integrating Features Correlation with Knowledge Distillation for Depth Completion of Transparent Objects |
|
Huang, Yiheng | Guangdong University of Technology |
Chen, Junhong | Guangdong University of Technology |
Michiels, Nick | Hasselt University - Flanders Make - Expertise Centre for Digita |
Asim, Muhammad | Guangdong University of Technology |
Claesen, Luc | Hasselt University |
Liu, Wenyin | Guangdong University of Technology |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Grasping
Abstract: Due to the visual properties of reflection and refraction, RGB-D cameras cannot accurately capture the depth of transparent objects, leading to incomplete depth maps. To fill in the missing points, recent studies tend to explore new visual features and design complex networks to reconstruct the depth; however, these approaches tremendously increase computation, and the correlation of different visual features remains a problem. To this end, we propose an efficient depth completion network named DistillGrasp, which distills knowledge from the teacher branch to the student branch. Specifically, in the teacher branch, we design a position correlation block (PCB) that leverages RGB images as the query and key to search for the corresponding values, guiding the model to establish correct correspondence between two features and transfer it to the transparent areas. For the student branch, we propose a consistent feature correlation module (CFCM) that retains the reliable regions of RGB images and depth maps respectively according to the consistency and adopts a CNN to capture the pairwise relationship for depth completion. To avoid the student branch only learning regional features from the teacher branch, we devise a distillation loss that not only considers the distance loss but also the object structure and edge information. Extensive experiments conducted on the ClearGrasp dataset show that our teacher network outperforms state-of-the-art methods in terms of accuracy and generalization, and the student network achieves competitive results with a higher speed of 48 FPS. In addition, the significant improvement in a real-world robotic grasping system illustrates the effectiveness and robustness of our proposed system.
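As a rough illustration of a distillation-style objective that mixes a distance term with an edge/structure term (in the spirit of, but not identical to, the loss described above; the weights, gradient operator, and synthetic data are assumptions):

```python
import numpy as np

def image_gradients(d):
    """Forward-difference gradients of a depth map of shape (H, W)."""
    gx = np.diff(d, axis=1, append=d[:, -1:])
    gy = np.diff(d, axis=0, append=d[-1:, :])
    return gx, gy

def distillation_loss(student, teacher, w_dist=1.0, w_edge=0.5):
    """Distance term plus an edge/structure term between student and teacher depth."""
    dist = np.mean(np.abs(student - teacher))
    sgx, sgy = image_gradients(student)
    tgx, tgy = image_gradients(teacher)
    edge = np.mean(np.abs(sgx - tgx)) + np.mean(np.abs(sgy - tgy))
    return w_dist * dist + w_edge * edge

teacher = np.tile(np.linspace(0.5, 1.5, 64), (48, 1))    # smooth synthetic depth map
student = teacher + 0.02 * np.random.default_rng(1).standard_normal((48, 64))
print(distillation_loss(student, teacher))
```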
|
|
11:25-11:30, Paper WeCT16.3 | |
Real-Time Grasp Quality in Boundary-Constrained Granular Swarm Robots |
|
Mulroy, Declan | Illinois Institute of Technology |
Cañones Bonham, David Francesc | Illinois Institute of Technology |
Spenko, Matthew | Illinois Institute of Technology |
Srivastava, Ankit | Illinois Institute of Technology |
Keywords: Grasping, Swarm Robotics, Motion Control
Abstract: Soft robotic grippers offer advantages over rigid end effectors but are typically coupled to a rigid robot for locomotion. In contrast, this paper details a soft robot for both locomotion and grasping. The system is a type of boundary-constrained granular swarm robot, which is composed of a closed-loop series of active (capable of locomotion) sub-robots. Prior work has shown how this type of robot is capable of locomotion and grasping. For this paper, we propose a new grasping strategy and demonstrate real-time grasp quality evaluation using pressure sensors and the Ferrari-Canny grasp metric. The grasping strategy leverages gradient-based control via distance functions and dynamic system planning to achieve desired robot geometries for effective grasping. Previous research primarily used pull tests to evaluate grasping efficacy, which lacked real-time feedback on grasp quality. Simulated and experimental results confirm the effectiveness of this method.
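The Ferrari-Canny (epsilon) quality used for real-time grasp evaluation measures how large a wrench-space ball around the origin fits inside the convex hull of the contact wrenches. The sketch below computes this generic metric for planar point contacts with discretized friction cones; it is a textbook-style illustration, not the robot's pressure-sensor pipeline.

```python
import numpy as np
from scipy.spatial import ConvexHull

def planar_contact_wrenches(points, normals, mu=0.5):
    """Planar point contacts with Coulomb friction: each contact contributes two
    friction-cone edge forces; a wrench is (fx, fy, torque about the origin)."""
    wrenches = []
    for p, n in zip(points, normals):
        n = np.asarray(n, dtype=float) / np.linalg.norm(n)
        t = np.array([-n[1], n[0]])               # in-plane tangent
        for f in (n + mu * t, n - mu * t):        # friction cone edges
            tau = p[0] * f[1] - p[1] * f[0]       # 2D cross product
            wrenches.append([f[0], f[1], tau])
    return np.array(wrenches)

def epsilon_quality(wrenches):
    """Radius of the largest wrench-space ball around the origin contained in
    the convex hull of the contact wrenches (0 if the origin lies outside)."""
    hull = ConvexHull(wrenches)
    offsets = -hull.equations[:, -1]              # facets satisfy A @ w + b <= 0
    return max(0.0, float(offsets.min()))

# Antipodal planar grasp of a unit disc: expected quality 1/3 for mu = 0.5.
pts = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
nrm = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
print(epsilon_quality(planar_contact_wrenches(pts, nrm)))
```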
|
|
11:30-11:35, Paper WeCT16.4 | |
Learning Dual-Arm Coordination for Grasping Large Flat Objects |
|
Wang, Yongliang | University of Groningen |
Kasaei, Hamidreza | University of Groningen |
Keywords: Dexterous Manipulation, Bimanual Manipulation, Dual Arm Manipulation
Abstract: Grasping large flat objects, such as books or keyboards lying horizontally, presents significant challenges for single-arm robotic systems, often requiring extra actions like pushing objects against walls or moving them to the edge of a surface to facilitate grasping. In contrast, dual-arm manipulation, inspired by human dexterity, offers a more refined solution by directly coordinating both arms to lift and grasp the object without the need for complex repositioning. In this paper, we propose a model-free deep reinforcement learning (DRL) framework to enable dual-arm coordination for grasping large flat objects. We utilize a large scale grasp pose detection model as a backbone to extract high-dimensional features from input images, which are then used as the state representation in a reinforcement learning (RL) model. A CNN-based Proximal Policy Optimization (PPO) algorithm with shared Actor-Critic layers is employed to learn coordinated dual-arm grasp actions. The system is trained and tested in Isaac Gym and deployed to real robots. Experimental results demonstrate that our policy can effectively grasp large flat objects without requiring additional maneuvers. Furthermore, the policy exhibits strong generalization capabilities, successfully handling unseen objects. Importantly, it can be directly transferred to real robots without fine-tuning, consistently outperforming baseline methods.
|
|
11:35-11:40, Paper WeCT16.5 | |
QDGset: A Large Scale Grasping Dataset Generated with Quality-Diversity |
|
Huber, Johann | ISIR, Sorbonne Université |
Hélénon, François | Sorbonne Université |
Kappel, Mathilde | Institut Des Systèmes Intelligents Et De Robotique |
Páez Ubieta, Ignacio de Loyola | University of Alicante |
Gil, Pablo | University of Alicante |
Puente, Santiago | University of Alicante |
Ben Amar, Faiz | Université Pierre Et Marie Curie, Paris 6 |
Doncieux, Stéphane | Sorbonne University |
Keywords: Grasping, Data Sets for Robot Learning, Evolutionary Robotics
Abstract: Recent advances in AI have led to significant results in robotic learning, but skills like grasping remain partially solved. Many recent works exploit synthetic grasping datasets to learn to grasp unknown objects. However, those datasets were generated using simple grasp sampling methods using priors. Recently, Quality-Diversity (QD) algorithms have been proven to make grasp sampling significantly more efficient. In this work, we extend QDG-6DoF, a QD framework for generating object-centric grasps, to scale up the production of synthetic grasping datasets. We propose a data augmentation method that combines the transformation of object meshes with transfer learning from previous grasping repertoires. The conducted experiments show that this approach reduces the number of required evaluations per discovered robust grasp by up to 20%. We used this approach to generate QDGset, a dataset of 6DoF grasp poses that contains about 3.5 and 4.5 times more grasps and objects, respectively, than the previous state-of-the-art. Our method allows anyone to easily generate data, eventually contributing to a large-scale collaborative dataset of synthetic grasps.
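One simple way to picture mesh-transform augmentation of object-centric grasps (purely illustrative, not the QDG-6DoF code): under a uniform scaling of the object mesh, grasp orientations are preserved and grasp translations scale with the object, giving cheap candidate grasps that can then be re-evaluated.

```python
import numpy as np

def scale_object_and_grasps(vertices, grasp_poses, scale):
    """Uniformly scale an object mesh and transfer its object-centric grasps.

    vertices:    (N, 3) mesh vertices in the object frame.
    grasp_poses: list of 4x4 homogeneous gripper poses in the object frame.
    Under uniform scaling the grasp orientation is unchanged and the grasp
    translation scales with the object surface.
    """
    scaled_vertices = np.asarray(vertices) * scale
    scaled_grasps = []
    for T in grasp_poses:
        T_new = np.asarray(T).copy()
        T_new[:3, 3] *= scale         # move the grasp along with the surface
        scaled_grasps.append(T_new)
    return scaled_vertices, scaled_grasps

# Toy usage: one grasp 10 cm above the object origin, object enlarged by 20%.
verts = np.random.default_rng(0).random((100, 3)) - 0.5
grasp = np.eye(4)
grasp[:3, 3] = [0.0, 0.0, 0.10]
_, new_grasps = scale_object_and_grasps(verts, [grasp], scale=1.2)
print(new_grasps[0][:3, 3])           # [0.   0.   0.12]
```

Transferred grasps like these would normally serve only as seeds that still get re-evaluated in simulation, which is one way the number of evaluations per discovered robust grasp can drop.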
|
|
11:40-11:45, Paper WeCT16.6 | |
Patch Tree: Exploiting the Gauss Map and Principal Component Analysis for Robotic Grasping |
|
Jia, Yan-Bin | Iowa State University |
Xue, Yuechuan | Amazon.com |
Tang, Ling | Iowa State University |
Keywords: Grasping, In-Hand Manipulation
Abstract: Grasp planning must consider an object's local geometry (at the finger contacts), for the range of applicable wrenches under friction, and its global geometry, for force closure and grasp quality. Most everyday objects have curved surfaces unamenable to a pure combinatorial approach but treatable with tools from differential geometry. Our idea is to "discretize" such a surface in a top-down fashion into elementary patches (e-patches), each consisting of points that would yield close enough wrenches. Preprocessing based on Gaussian curvature decomposes the surface into strictly convex, strictly concave, ruled, and saddle patches. The Gauss map guides the subdivision of any patch with a large variation in the contact force direction, with the aid of a Platonic solid. Principal component analysis (PCA) further subdivides any patch that has a large variation in torque. The final structure is called a patch tree, which stores e-patches at its leaves, and force or torque ranges at its internal nodes. Grasp synthesis and optimization operate on the patch tree with a stack to efficiently prune away non-promising finger placements. Simulations and experiments with a Shadow Hand have been conducted on everyday items. The patch tree exhibits different levels of surface granularity. It holds good promise for efficient planning of finger gaits to carry out grasping and tool manipulation.
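The PCA subdivision step can be pictured generically: sample torques over a candidate patch, look at the spread along the principal axes, and split the patch when the variation is too large. The following is a hedged sketch; the sampling, threshold, and tree bookkeeping are assumptions rather than the paper's algorithm.

```python
import numpy as np

def torque_spread(points, normals, center=np.zeros(3)):
    """Principal-component spread of contact torques tau = (p - c) x n over the
    sampled points of a surface patch."""
    torques = np.cross(points - center, normals)
    torques = torques - torques.mean(axis=0)
    cov = torques.T @ torques / max(len(torques) - 1, 1)
    return np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending variances

def needs_subdivision(points, normals, tol=1e-3):
    """Split the patch when torque variation along the leading axis is large."""
    return bool(torque_spread(points, normals)[0] > tol)

# Toy usage: a small near-flat patch versus the same patch spread much wider.
rng = np.random.default_rng(2)
small_pts = np.column_stack([rng.uniform(-0.01, 0.01, 50),
                             rng.uniform(-0.01, 0.01, 50),
                             np.ones(50)])
flat_nrm = np.tile([0.0, 0.0, 1.0], (50, 1))
wide_pts = small_pts * np.array([30.0, 30.0, 1.0])
print(needs_subdivision(small_pts, flat_nrm))   # False
print(needs_subdivision(wide_pts, flat_nrm))    # True
```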
|
|
WeCT17 |
405 |
Localization 4 |
Regular Session |
Chair: Napp, Nils | Cornell University |
|
11:15-11:20, Paper WeCT17.1 | |
Improved Bag-Of-Words Image Retrieval with Geometric Constraints for Ground Texture Localization |
|
Wilhelm, Aaron | Cornell University |
Napp, Nils | Cornell University |
Keywords: Localization, SLAM, Mapping
Abstract: Ground texture localization using a downward-facing camera offers a low-cost, high-precision localization solution that is robust to dynamic environments and requires no environmental modification. We present a significantly improved bag-of-words (BoW) image retrieval system for ground texture localization, achieving substantially higher accuracy for global localization and higher precision and recall for loop closure detection in SLAM. Our approach leverages an approximate k-means (AKM) vocabulary with soft assignment, and exploits the consistent orientation and constant scale constraints inherent to ground texture localization. Identifying the different needs of global localization vs. loop closure detection for SLAM, we present both high-accuracy and high-speed versions of our algorithm. We test the effect of each of our proposed improvements through an ablation study and demonstrate our method's effectiveness for both global localization and loop closure detection. With numerous ground texture localization systems already using BoW, our method can readily replace other generic BoW systems in their pipeline and immediately improve their results.
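For orientation, a minimal TF-IDF bag-of-words retrieval scorer is sketched below (generic, not the paper's AKM vocabulary with soft assignment): descriptors are assumed to be already quantized to visual-word ids, database images become normalized TF-IDF vectors, and a query is ranked by cosine similarity. The consistent-orientation and constant-scale constraints discussed above would enter at the descriptor and geometric-verification stages, which this sketch omits.

```python
import numpy as np

def build_tfidf(word_ids_per_image, vocab_size):
    """word_ids_per_image: one array of visual-word ids per database image."""
    n_images = len(word_ids_per_image)
    counts = np.zeros((n_images, vocab_size))
    for i, words in enumerate(word_ids_per_image):
        np.add.at(counts[i], words, 1.0)
    df = (counts > 0).sum(axis=0)                          # document frequency
    idf = np.log(n_images / np.maximum(df, 1.0))
    tfidf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0) * idf
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    return tfidf / np.maximum(norms, 1e-12), idf

def rank_database(tfidf_db, idf, query_words, vocab_size):
    """Cosine-similarity scores of the query against every database image."""
    q = np.zeros(vocab_size)
    np.add.at(q, query_words, 1.0)
    q = q / max(q.sum(), 1.0) * idf
    q = q / max(np.linalg.norm(q), 1e-12)
    return tfidf_db @ q

vocab_size = 8
db = [np.array([0, 0, 1, 3]), np.array([2, 2, 4]), np.array([0, 1, 3, 3])]
tfidf_db, idf = build_tfidf(db, vocab_size)
scores = rank_database(tfidf_db, idf, np.array([0, 3, 3]), vocab_size)
print(np.argmax(scores))                                   # 2: the best match
```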
|
|
11:20-11:25, Paper WeCT17.2 | |
Improving Indoor Localization Accuracy by Using an Efficient Implicit Neural Map Representation |
|
Kuang, Haofei | University of Bonn |
Pan, Yue | University of Bonn |
Zhong, Xingguang | University of Bonn |
Wiesmann, Louis | University of Bonn |
Behley, Jens | University of Bonn |
Stachniss, Cyrill | University of Bonn |
Keywords: Localization, Mapping, Deep Learning Methods
Abstract: Globally localizing a mobile robot in a known map is often a foundation for enabling robots to navigate and operate autonomously. In indoor environments, traditional Monte Carlo localization based on occupancy grid maps is considered the gold standard, but its accuracy is limited by the representation capabilities of the occupancy grid map. In this paper, we address the problem of building an effective map representation that allows us to accurately perform probabilistic global localization. To this end, we propose an implicit neural map representation that is able to capture positional and directional geometric features from 2D LiDAR scans to efficiently represent the environment, and we learn a neural network that is able to predict both the non-projective signed distance and a direction-aware projective distance for an arbitrary point in the mapped environment. This combination of a neural map representation with a lightweight neural network allows us to design an efficient observation model within a conventional Monte Carlo localization framework for pose estimation of a robot in real time. We evaluated our approach to indoor localization on a publicly available dataset for global localization, and the experimental results indicate that our approach is able to more accurately localize a mobile robot than other localization approaches employing occupancy or existing neural map representations. In contrast to other approaches employing an implicit neural map representation for 2D LiDAR localization, our approach allows us to perform real-time pose tracking after convergence and near real-time global localization. The code of our approach is available at: https://github.com/PRBonn/enm-mcl.
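To make the observation-model idea concrete, here is a generic Monte Carlo localization weight update that scores each particle by how close the endpoints of its transformed scan fall to the zero level set of a distance field. An analytic circular map stands in for the learned network, and all poses, noise values, and map parameters are illustrative assumptions.

```python
import numpy as np

def map_sdf(points, center=np.array([1.0, 0.0]), radius=2.0):
    """Signed distance to a circular wall, standing in for the learned map."""
    return np.linalg.norm(points - center, axis=1) - radius

def particle_weights(particles, scan_xy, sigma=0.1):
    """particles: (N, 3) poses [x, y, yaw]; scan_xy: (M, 2) scan endpoints in the
    sensor frame. A particle is likely when the endpoints of its transformed
    scan fall on the zero level set of the map's distance field."""
    weights = np.empty(len(particles))
    for i, (x, y, yaw) in enumerate(particles):
        R = np.array([[np.cos(yaw), -np.sin(yaw)],
                      [np.sin(yaw),  np.cos(yaw)]])
        world_pts = scan_xy @ R.T + np.array([x, y])
        d = map_sdf(world_pts)
        weights[i] = np.exp(-0.5 * np.sum((d / sigma) ** 2))
    return weights / weights.sum()

# Simulate a scan taken from the true pose (0, 0, 0) inside the circular room.
angles = np.linspace(-np.pi, np.pi, 60, endpoint=False)
ranges = np.cos(angles) + np.sqrt(np.cos(angles) ** 2 + 3.0)   # ray hits on the wall
scan = np.column_stack([ranges * np.cos(angles), ranges * np.sin(angles)])
particles = np.array([[0.0, 0.0, 0.0], [0.4, 0.0, 0.0], [0.0, 0.0, 0.3]])
print(particle_weights(particles, scan))    # the weight concentrates on the true pose
```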
|
|
11:25-11:30, Paper WeCT17.3 | |
Semantic and Feature Guided Uncertainty Quantification of Visual Localization for Autonomous Vehicles |
|
Wu, Qiyuan | Cornell University |
Campbell, Mark | Cornell University |
Keywords: Localization, Sensor Fusion, Deep Learning for Visual Perception
Abstract: The uncertainty quantification of sensor measurements coupled with deep learning networks is crucial for many robotics systems, especially for safety-critical applications such as self-driving cars. This paper develops an uncertainty quantification approach in the context of visual localization for autonomous driving, where locations are selected based on images. Key to our approach is learning the measurement uncertainty with a light-weight sensor error model, which maps both image features and semantic information to a 2-dimensional error distribution. Our approach enables uncertainty estimation conditioned on the specific context of the matched image pair, implicitly capturing other critical, unannotated factors (e.g., city vs. highway, dynamic vs. static scenes, winter vs. summer) in a latent manner. We demonstrate the accuracy of our uncertainty prediction framework using the Ithaca365 dataset, which includes variations in lighting and weather (sunny, night, snowy). We evaluate both the uncertainty quantification of the sensor+network and Bayesian localization filters that use a unique sensor gating method. Results show that the measurement error does not follow a Gaussian distribution under poor weather and lighting conditions and is better predicted by our Gaussian Mixture model.
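The closing claim, that a Gaussian mixture predicts the error better than a single Gaussian, is typically checked with a negative log-likelihood computation of the observed error under the predicted distribution; the mixture parameters below are invented purely for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_nll(error_xy, weights, means, covs):
    """Negative log-likelihood of a 2D localization error under a predicted
    Gaussian mixture (weights sum to 1; means are (K, 2); covs are (K, 2, 2))."""
    likelihood = sum(w * multivariate_normal.pdf(error_xy, mean=m, cov=c)
                     for w, m, c in zip(weights, means, covs))
    return -np.log(max(likelihood, 1e-12))

# Example: a two-component mixture capturing heavy-tailed error in bad weather.
weights = [0.8, 0.2]
means = [np.zeros(2), np.zeros(2)]
covs = [0.1 * np.eye(2), 2.0 * np.eye(2)]
print(gmm_nll(np.array([0.5, -0.3]), weights, means, covs))
```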
|
|
11:30-11:35, Paper WeCT17.4 | |
LiLoc: Lifelong Localization Using Adaptive Submap Joining and Egocentric Factor Graph |
|
Fang, Yixin | Southeast University |
Li, Yanyan | Technical University of Munich |
Qian, Kun | Southeast University |
Tombari, Federico | Technische Universität München |
Wang, Yue | Zhejiang University |
Lee, Gim Hee | National University of Singapore |
Keywords: Localization, Mapping, SLAM
Abstract: This paper proposes a versatile graph-based lifelong localization framework, LiLoc, which enhances its timeliness by maintaining a single central session while improving accuracy through multi-modal factors between the central and subsidiary sessions. First, an adaptive submap joining strategy is employed to generate prior submaps (keyframes and poses) for the central session and to provide priors for subsidiaries when constraints are needed for robust localization. Next, a coarse-to-fine pose initialization for subsidiary sessions is performed using vertical recognition and ICP refinement in the global coordinate frame. To elevate the accuracy of subsequent localization, we propose an egocentric factor graph (EFG) module that integrates IMU preintegration, LiDAR odometry and scan match factors in a joint optimization manner. Specifically, the scan match factors are constructed by a novel propagation model that efficiently distributes the prior constraints as edges to the relevant prior pose nodes, weighted by noise based on keyframe registration errors. Additionally, the framework supports flexible switching between two modes, relocalization (RLM) and incremental localization (ILM), based on the proposed overlap-based mechanism to select or update the prior submaps from the central session. The proposed LiLoc is tested on public and custom datasets, demonstrating accurate localization performance against state-of-the-art methods. Our codes will be publicly available on https://github.com/Yixin-F/LiLoc.
|
|
11:35-11:40, Paper WeCT17.5 | |
ReFeree: Radar-Based Lightweight and Robust Localization Using Feature and Free Space |
|
Kim, Hogyun | Inha University |
Choi, Byunghee | Inha University |
Choi, Euncheol | Inha University |
Cho, Younggun | Inha University |
Keywords: Localization, SLAM, Field Robots
Abstract: Place recognition plays an important role in achieving robust long-term autonomy. Real-world robots face a wide range of weather conditions (e.g., overcast, heavy rain, and snow), and most sensors (e.g., cameras and LiDAR), which essentially operate within or near the visible electromagnetic spectrum, are sensitive to adverse weather, making reliable localization difficult. In contrast, radar is gaining traction because its longer electromagnetic wavelengths are less affected by environmental changes and are largely weather-independent. In this work, we propose a radar-based lightweight and robust place recognition method. We achieve rotational invariance and a lightweight footprint by using a one-dimensional ring-shaped descriptor, and robustness by mitigating the impact of false detections through the opposite noise characteristics of free space and features. In addition, the initial heading can be estimated, which can assist in building a SLAM pipeline that combines odometry and registration while respecting onboard computing constraints. The proposed method was rigorously validated across various scenarios (i.e., single-session, multi-session, and different weather conditions). In particular, we show that our descriptor achieves reliable place recognition performance in extreme environments that lack structural information, such as the OORD dataset.
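A toy sketch of how a one-dimensional, rotation-invariant ring descriptor can be formed from a polar radar scan by aggregating over azimuth per range ring; the threshold and the exact feature/free-space statistic are placeholders, not the authors' formulation:

```python
import numpy as np

def ring_descriptor(polar_scan, power_threshold=0.2):
    """Build a 1D ring-shaped descriptor from a polar radar scan.

    polar_scan: (A, R) array, A azimuth bins x R range bins of return power.
    Aggregating over azimuth makes the descriptor invariant to yaw rotation.
    A free-space count per ring (returns below threshold) is kept alongside
    the feature count, loosely following the feature/free-space idea.
    """
    features = (polar_scan >= power_threshold).sum(axis=0)   # per range ring
    free_space = (polar_scan < power_threshold).sum(axis=0)
    return features / (features + free_space)                # occupancy ratio per ring

def descriptor_distance(d1, d2):
    """Simple L1 distance used for place retrieval."""
    return np.abs(d1 - d2).sum()
```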
|
|
11:40-11:45, Paper WeCT17.6 | |
On the Consistency of Multi-Robot Cooperative Localization: A Transformation-Based Approach |
|
Hao, Ning | Harbin Institute of Technology |
He, Fenghua | Harbin Institute of Technology |
Tian, Chungeng | Harbin Institute of Technology |
Hou, Yi | Harbin Institute of Technology |
Keywords: Localization, SLAM, Multi-Robot Systems
Abstract: This paper investigates the inconsistency problem caused by the mismatch of observability properties commonly found in multi-robot cooperative localization (CL) and simultaneous localization and mapping (SLAM). To address this issue, we propose a transformation-based approach that introduces a linear time-varying transformation to ensure the transformed system possesses a state-independent unobservable subspace. Consequently, its observability properties remain unaffected by the linearization points. We establish the relationship between the unobservable subspaces of the original and transformed systems, guiding the design of the time-varying transformation. We then present a novel estimator based on this method, referred to as the Transformed EKF (T-EKF), which utilizes the transformed system for state estimation, thereby ensuring correct observability and thus consistency. The proposed approach has been extensively validated through both Monte Carlo simulations and real-world experiments, demonstrating better performance in terms of both accuracy and consistency compared to state-of-the-art methods.
|
|
WeCT18 |
406 |
Software Tools 2 |
Regular Session |
Chair: Wauters, Jolan | Ghent University |
|
11:15-11:20, Paper WeCT18.1 | |
Chemistry3D: Robotic Interaction Toolkit for Chemistry Experiments |
|
Li, Shoujie | Tsinghua Shenzhen International Graduate School |
Huang, Yan | Tsinghua University |
Guo, Changqing | Tsinghua University |
Wu, Tong | Tsinghua University |
Zhang, Jiawei | Tsinghua University |
Zhang, Linrui | Tsinghua University |
Ding, Wenbo | Tsinghua University |
Keywords: Software Tools for Benchmarking and Reproducibility, Software Architecture for Robotic and Automation, Methods and Tools for Robot System Design
Abstract: The advent of simulation engines has revolutionized learning and operational efficiency for robots, offering cost-effective and swift pipelines. However, the lack of a universal simulation platform tailored for chemical scenarios impedes progress in robotic manipulation and visualization of reaction processes. Addressing this void, we present Chemistry3D, an innovative toolkit that integrates extensive chemical and robotic knowledge. Chemistry3D not only enables robots to perform chemical experiments but also provides real-time visualization of temperature, color, and pH changes during reactions. Built on the NVIDIA Omniverse platform, Chemistry3D offers interfaces for robot operation, visual inspection, and liquid flow control, facilitating the simulation of special objects such as liquids and transparent entities. Leveraging this toolkit, we have devised RL tasks, object detection, and robot operation scenarios. Additionally, to discern disparities between the rendering engine and the real world, we conducted transparent object detection experiments using Sim2Real, validating the toolkit's exceptional simulation performance. The source code is available at https://github.com/huangyan28/Chemistry3D, and a related tutorial can be found at https://www.omni-chemistry.com.
|
|
11:20-11:25, Paper WeCT18.2 | |
Introducing KUGE: A Simultaneous Control Co-Design Architecture and Its Application to Aerial Robotics Development |
|
Wauters, Jolan | Ghent University |
Lefebvre, Tom | Ghent University |
Crevecoeur, Guillaume | Ghent University |
Keywords: Methods and Tools for Robot System Design, Optimization and Optimal Control, Aerial Systems: Applications
Abstract: The increasing complexity of tasks performed by hybrid aerial robotic systems, such as tail-sitters, demands a more integrated approach to their design. Traditional sequential design methods fall short because they separate the control system design from the conceptual design, limiting the potential for discovering coupled solutions. This disjointed process constrains the design space, making it difficult to optimize both the control performance and system dynamics simultaneously. In response to this limitation, there has been growing interest in mission-specific dynamic design procedures, which aim to address specific operational challenges by integrating control and design early in the development process. The multi-disciplinary approach of control co-design (CCD) expands the design space by solving control and system design problems concurrently. The recently introduced DAIMYO framework demonstrated that combining multi-fidelity modelling with a nested CCD approach can tackle the sim-to-real gap. However, DAIMYO’s reliance on Bayesian optimization to account for the computational cost increase of a nested formulation limits its scalability. To address these issues, we propose KUGE, a simultaneous CCD strategy that reduces computational complexity and overcomes dimensionality restrictions through a combined effort of stochastic optimization and Gaussian processes. We validate the effectiveness of KUGE by applying it to the dynamic design of a tail-sitter, showing that it is competitive with the DAIMYO architecture while offering greater computational efficiency.
|
|
11:25-11:30, Paper WeCT18.3 | |
HEROES: Unreal Engine-Based Human and Emergency Robot Operation Education System |
|
Chaudhary, Anav | Purdue University |
Tiwari, Kshitij | Purdue University |
Bera, Aniket | Purdue University |
Keywords: Simulation and Animation, Planning under Uncertainty, Task and Motion Planning
Abstract: Training and preparing first responders and humanitarian robots for Mass Casualty Incidents (MCIs) often poses a challenge owing to the lack of realistic and easily accessible test facilities. While such facilities can offer realistic post-MCI scenarios for the training and education of first responders and humanitarian robots, they are often hard to access owing to logistical constraints. To overcome this challenge, we present HEROES, a versatile Unreal Engine-based simulator for designing novel training simulations for humans and emergency robots in such urban search and rescue operations. The proposed HEROES simulator is capable of generating synthetic datasets for machine learning pipelines that are used for training robot navigation. This work addresses the necessity for a comprehensive training platform in the robotics community, ensuring pragmatic and efficient preparation for real-world emergency scenarios. The strengths of our simulator lie in its adaptability, scalability, and ability to facilitate collaboration between robot developers and first responders, fostering synergy in developing effective strategies for search and rescue operations in MCIs. We conducted a preliminary user study in which HEROES received an average score of 8.1 out of 10 for its ability to generate sufficiently varied environments and 7.8 out of 10 for the usefulness of the simulation environment. HEROES has been integrated with ROS and has been used to train an RL model for a real robot as a proof of concept.
|
|
11:30-11:35, Paper WeCT18.4 | |
On the Necessity of Real-Time Principles in GPU-Driven Autonomous Robots |
|
Ali, Syed | University of North Carolina at Chapel Hill |
Angelopoulos, Angelos | University of North Carolina at Chapel Hill |
Massey, Denver | University of North Carolina at Chapel Hill |
Haddix, Sarah Barnes | The University of North Carolina at Chapel Hill |
Georgiev, Alexander | University of North Carolina at Chapel Hill |
Goh, Joseph | University of North Carolina at Chapel Hill |
Wagle, Rohan | University of North Carolina at Chapel Hill |
Sarathy, Prakash | Northrop Grumman |
Anderson, James | University of North Carolina at Chapel Hill |
Alterovitz, Ron | University of North Carolina at Chapel Hill |
Keywords: Software Architecture for Robotic and Automation, Software, Middleware and Programming Environments, Robot Safety
Abstract: Robot autonomy is driving an ever-increasing demand for computational power, including on-board multi-core CPUs and accelerators such as GPUs, to enable fast perception, planning, control, and more. Careful scheduling of these computational tasks on the CPU cores and GPUs is important to prevent locking up the finite computational capacity in ways that hinder other critical workloads; delays in computing time-critical tasks like obstacle detection and control can have huge negative consequences for autonomous robots, potentially resulting in damage, substantial financial loss, or even loss of life. In this paper, we leverage recent advances from real-time systems research. We apply TimeWall, a component-based real-time framework, to the computational components of an autonomous drone and experimentally show that the timeliness and safe operation properties of a drone are preserved even in the presence of increasing interfering computational processes.
|
|
11:35-11:40, Paper WeCT18.5 | |
HPRM: High-Performance Robotic Middleware for Intelligent Autonomous Systems |
|
Kwok, Jacky | University of California, Berkeley |
Li, Shulu | UC Berkeley, Fudan University |
Lohstroh, Marten | UC Berkeley |
Lee, Edward A. | UC Berkeley |
Keywords: Software Architecture for Robotic and Automation, Computer Architecture for Robotic and Automation, Software, Middleware and Programming Environments
Abstract: The rise of intelligent autonomous systems, especially in robotics and autonomous agents, has created a critical need for robust communication middleware that can ensure real-time processing of extensive sensor data. Current robotics middleware like Robot Operating System (ROS) 2 faces challenges with nondeterminism and high communication latency when dealing with large data across multiple subscribers on a multi-core compute platform. To address these issues, we present High-Performance Robotic Middleware (HPRM), built on top of the deterministic coordination language Lingua Franca (LF). HPRM employs optimizations including an in-memory object store for efficient zero-copy transfer of large payloads, adaptive serialization to minimize serialization overhead, and an eager protocol with real-time sockets to reduce handshake latency. Benchmarks show HPRM achieves up to 114x lower latency than ROS2 when broadcasting large messages to multiple nodes. We then demonstrate the benefits of HPRM by integrating it with the CARLA simulator and running reinforcement learning agents along with object detection workloads. In the CARLA autonomous driving application, HPRM attains 91.1% lower latency than ROS2. The deterministic coordination semantics of HPRM, combined with its optimized IPC mechanisms, enable efficient and predictable real-time communication for intelligent autonomous systems. Code and videos can be found on our project page: https://hprm-robotics.github.io/HPRM
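HPRM itself is built on Lingua Franca, but the zero-copy idea behind its in-memory object store can be illustrated with Python's standard shared-memory facility: the payload is written once, and a subscriber maps the same buffer instead of receiving a serialized copy. Both ends run in one process below purely so the example is self-contained; a real middleware would add lifetime management and notification.

```python
import numpy as np
from multiprocessing import shared_memory

if __name__ == "__main__":
    image = np.random.rand(1080, 1920, 3).astype(np.float32)

    # "Publish": one copy of the payload into a named shared-memory block.
    shm = shared_memory.SharedMemory(create=True, size=image.nbytes)
    src = np.ndarray(image.shape, dtype=image.dtype, buffer=shm.buf)
    src[:] = image

    # "Subscribe": another process would attach by name and map the same
    # buffer directly, with no serialization or per-subscriber copy.
    reader = shared_memory.SharedMemory(name=shm.name)
    view = np.ndarray(image.shape, dtype=image.dtype, buffer=reader.buf)
    print(np.allclose(view, image))

    reader.close()
    shm.close()
    shm.unlink()
```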
|
|
11:40-11:45, Paper WeCT18.6 | |
CusADi: A GPU Parallelization Framework for Symbolic Expressions and Optimal Control |
|
Jeon, Se Hwan | Massachusetts Institute of Technology |
Hong, Seungwoo | MIT (Massachusetts Institute of Technology) |
Lee, Ho Jae | Massachusetts Institute of Technology |
Khazoom, Charles | Massachusetts Institute of Technology |
Kim, Sangbae | Massachusetts Institute of Technology |
Keywords: Software Tools for Robot Programming, Optimization and Optimal Control, Reinforcement Learning
Abstract: The parallelism afforded by GPUs presents significant advantages in training controllers through reinforcement learning (RL). However, integrating model-based optimization into this process remains challenging due to the complexity of formulating and solving optimization problems across thousands of instances. In this work, we present CusADi, an extension of the CasADi symbolic framework to support the parallelization of arbitrary closed-form expressions on GPUs with CUDA. We also formulate a closed-form approximation for solving general optimal control problems, enabling large-scale parallelization and evaluation of MPC controllers. Our results show a ten-fold speedup relative to a similar MPC implementation on the CPU, and we demonstrate the use of CusADi for various applications, including parallel simulation, parameter sweeps, and policy training.
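For readers unfamiliar with the CasADi side of this, the snippet below defines a closed-form symbolic expression and evaluates it over many instances with CasADi's built-in map construct; CusADi's CUDA code generation is not reproduced here, and the toy dynamics are a placeholder:

```python
import casadi as ca
import numpy as np

# Symbolic single-instance expression: one Euler step of a torque-driven pendulum.
x = ca.SX.sym("x", 2)          # state: [angle, angular velocity]
u = ca.SX.sym("u")             # control torque
dt = 0.01
x_next = ca.vertcat(x[0] + dt * x[1],
                    x[1] + dt * (u - 9.81 * ca.sin(x[0])))
step = ca.Function("step", [x, u], [x_next])

# Evaluate the same closed-form expression over many instances at once.
N = 4096
batched = step.map(N)
X = np.random.randn(2, N)
U = np.random.randn(1, N)
X_next = batched(X, U)
print(X_next.shape)            # (2, 4096)
```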
|
|
WeCT19 |
407 |
System Design |
Regular Session |
|
11:15-11:20, Paper WeCT19.1 | |
Learning Optimal Design Manifolds to Design More Practical Robotic Systems |
|
Baumgärtner, Jan | Karlsruhe Institute of Technology |
Puchta, Alexander | Karlsruhe Institute of Technology |
Fleischer, Jürgen | Karlsruhe Institute of Technology (KIT) |
Keywords: Methods and Tools for Robot System Design, Optimization and Optimal Control, Representation Learning
Abstract: This paper introduces the optimal design manifold as a novel approach for understanding and optimizing the design of robotic systems. Existing optimization frameworks often jointly optimize design and behavior but lack insight into why specific designs are optimal for given tasks. Additionally, a functionally optimal design may not always be the most practical to build, and practicality cannot always be captured by an objective function. By defining and learning the optimal design manifold, which represents the space of all optimal solutions, we provide a systematic method for exploring the design space and selecting the most practical optimal design. We apply the optimal design manifold to robot cell layout optimization, robot design optimization, and multi-camera placement and demonstrate its effectiveness in enhancing design choices by enabling a deeper understanding of what makes a design optimal.
|
|
11:20-11:25, Paper WeCT19.2 | |
Monotone Subsystem Decomposition for Efficient Multi-Objective Robot Design |
|
Wilhelm, Andrew | Cornell University |
Napp, Nils | Cornell University |
Keywords: Methods and Tools for Robot System Design, Optimization and Optimal Control, Formal Methods in Robotics and Automation
Abstract: Automating design minimizes errors, accelerates the design process, and reduces cost. However, automating robot design is challenging due to recursive constraints, multiple design objectives, and cross-domain design complexity possibly spanning multiple abstraction layers. Here we look at the problem of component selection, a combinatorial optimization problem in which a designer, given a robot model, must select compatible components from an extensive catalog. The goal is to satisfy high-level task specifications while optimally balancing trade-offs between competing design objectives. In this paper, we extend our previous constraint programming approach to multi-objective design problems and propose the novel technique of monotone subsystem decomposition to efficiently compute a Pareto front of solutions for large-scale problems. We prove that subsystems can be optimized for their Pareto fronts and, under certain conditions, these results can be used to determine a globally optimal Pareto front. Furthermore, subsystems serve as an intuitive design abstraction and can be reused across various design problems. Using an example quadcopter design problem, we compare our method to a linear programming approach and demonstrate our method scales better for large catalogs, solving a multi-objective problem of 10^25 component combinations in seconds. We then expand the original problem and solve a task-oriented, multi-objective design problem to build a fleet of quadcopters to deliver packages. We compute a Pareto front of solutions in seconds where each solution contains an optimal component-level design and an optimal package delivery schedule for each quadcopter.
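The multi-objective component-selection setting can be illustrated with a plain Pareto-dominance filter over a tiny, hypothetical catalog; the paper's subsystem decomposition and constraint programming are what make this tractable at 10^25 combinations, which this sketch does not attempt:

```python
import numpy as np
from itertools import product

def pareto_front(costs):
    """Return indices of non-dominated points (all objectives minimized)."""
    costs = np.asarray(costs, dtype=float)
    keep = []
    for i, c in enumerate(costs):
        dominated = np.any(np.all(costs <= c, axis=1) & np.any(costs < c, axis=1))
        if not dominated:
            keep.append(i)
    return keep

# Hypothetical quadcopter catalog: (mass [kg], cost [$]) per motor and battery.
motors = [(0.10, 25.0), (0.15, 18.0), (0.08, 40.0)]
batteries = [(0.30, 30.0), (0.45, 22.0)]
designs = [(m[0] + b[0], m[1] + b[1]) for m, b in product(motors, batteries)]
print([designs[i] for i in pareto_front(designs)])
```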
|
|
11:25-11:30, Paper WeCT19.3 | |
Robust Reinforcement Learning-Based Locomotion for Resource-Constrained Quadrupeds with Exteroceptive Sensing |
|
Plozza, Davide | ETH Zürich |
Apostol, Patricia | ETH Zürich |
Joseph, Paul | ETH Zürich |
Schläpfer, Simon | ETH Zurich |
Magno, Michele | ETH Zurich |
Keywords: Engineering for Robotic Systems, Legged Robots, Reinforcement Learning
Abstract: Compact quadrupedal robots are proving increasingly suitable for deployment in real-world scenarios. Their smaller size fosters easy integration into human environments. Nevertheless, real-time locomotion on uneven terrains remains challenging, particularly due to the high computational demands of terrain perception. This paper presents a robust reinforcement learning-based exteroceptive locomotion controller for resource-constrained small-scale quadrupeds in challenging terrains, which exploits real-time elevation mapping, supported by a careful depth sensor selection. We concurrently train both a policy and a state estimator, which together provide an odometry source for elevation mapping, optionally fused with visual-inertial odometry (VIO). We demonstrate the importance of positioning an additional time-of-flight sensor for maintaining robustness even without VIO, thus having the potential to free up computational resources. We experimentally demonstrate that the proposed controller can flawlessly traverse steps up to 17.5 cm in height and achieve an 80% success rate on 22.5 cm steps, both with and without VIO. The proposed controller also achieves accurate forward and yaw velocity tracking of up to 1.0 m/s and 1.5 rad/s respectively. We open-source our training code at github.com/ETH-PBL/elmap-rl-controller.
|
|
11:30-11:35, Paper WeCT19.4 | |
AeroSafe: Mobile Indoor Air Purification Using Aerosol Residence Time Analysis and Robotic Cough Emulator Testbed |
|
Tonmoy, Tanjid | University of California San Diego |
Malladi, Rahath | Plaksha University |
Singh, Kaustubh | Plaksha University |
Forsad, Al Hossain | University of Massachusetts |
Gupta, Rajesh Kumar | Halicioglu Data Science Institute, UC San Diego |
Martinez, Andres Tejada | University of Florida |
Rahman, Tauhidur | University of California San Diego |
Keywords: Software-Hardware Integration for Robot Systems, Deep Learning Methods, Sensor-based Control
Abstract: Indoor air quality plays an essential role in the safety and well-being of occupants, especially in the context of airborne diseases. This paper introduces AeroSafe, a novel approach aimed at enhancing the efficacy of indoor air purification systems through a robotic cough emulator testbed and a digital-twins-based aerosol residence time analysis. Current portable air filters often overlook the concentrations of respiratory aerosols generated by coughs, posing a risk, particularly in high-exposure environments like healthcare facilities and public spaces. To address this gap, we present a robotic dual-agent physical emulator comprising a manoeuvrable mannequin simulating cough events and a portable air purifier autonomously responding to aerosols. The generated data from this emulator trains a digital twins model, combining a physics-based compartment model with a machine learning approach, using Long Short-Term Memory (LSTM) networks and graph convolution layers. Experimental results demonstrate the model's ability to predict aerosol concentration dynamics with a mean residence time prediction error within 35 seconds. The proposed system's real-time intervention strategies outperform static air filter placement, showcasing its potential in mitigating airborne pathogen risks.
|
|
11:35-11:40, Paper WeCT19.5 | |
Remote Inspection Techniques: A Review of Autonomous Robotic Inspection for Marine Vessels (I) |
|
Andersen, Rasmus Eckholdt | Technical University of Denmark |
Brogaard, Rune Y. | Explicit Aps |
Boukas, Evangelos | Technical University of Denmark |
Keywords: Field Robots, Aerial Systems: Applications, Deep Learning Methods
Abstract: Due to the harsh environment and heavy use that modern marine vessels are subjected to, they are required to undergo periodic inspections to determine their current condition. The use of autonomous remote inspection systems can alleviate some of the dangers and shortcomings associated with manual inspection. While there has been research on the use of robotic platforms, none of the works in the literature evaluates the current state of the art with respect to the specifications of the classification societies, who are the most important stakeholders among the end users. The aim of this paper is to provide an overview of the existing literature and evaluate the works individually in collaboration with classification societies. The papers included in this review are either directly developed for, or have properties potentially transferable to, the marine vessel inspection process. To structure the review, an expertise-engineering separation is proposed based on the contributions of the individual paper. This separation shows which part of the inspection process has received the most attention, as well as where the shortcomings of each approach lie. The findings indicate that while there are promising approaches, there is still a gap between the classification societies’ requirements and the state of the art. Our results indicate that there is quality work in the literature, but there is a lack of integrated development activities that achieve sufficient completeness.
|
|
11:40-11:45, Paper WeCT19.6 | |
Toward Fully Automated Aviation: PIBOT, a Humanoid Robot Pilot, for Human-Centric Aircraft Cockpits |
|
Min, Sungjae | Korea Advanced Institute of Science and Technology (KAIST) |
Kang, Gyuree | Korea Advanced Institute of Science and Technology (KAIST) |
Kim, Hyungjoo | Korea Advanced Institute of Science and Technology (KAIST) |
Shim, David Hyunchul | KAIST |
Keywords: Humanoid Robot Systems, AI-Enabled Robotics, Engineering for Robotic Systems
Abstract: Humanoid robots have been considered ideal for automating daily tasks, though most research has centered on bipedal locomotion. Many activities we do routinely, such as driving a car, require real-time system manipulation as well as substantial field-specific knowledge. Recent breakthroughs in natural language processing, particularly with large language models (LLMs), are empowering humanoid robots to access and process vast information sources and operate systems with an unprecedented level of autonomy. This article introduces PIBOT, a humanoid robot that can pilot unmodified general aviation (GA) aircraft, physically manipulating instruments while following strict rules of the air and verbally communicating with copilots and air traffic controllers (ATCs). Building on these capabilities, we developed an LLM-based task planner that interprets natural language commands, translating them into action sequences. Then, the behavior decision module breaks tasks into precise limb movements, enabling humanlike control of cockpit instruments. In a series of rigorous simulations, PIBOT demonstrates its capabilities to successfully take off and land an airplane from a cold-and-dark start, showcasing its potential for a fully autonomous robot pilot.
|
|
WeCT20 |
408 |
Human-Aware Robot Motion |
Regular Session |
Chair: Murphey, Todd | Northwestern University |
Co-Chair: Carlone, Luca | Massachusetts Institute of Technology |
|
11:15-11:20, Paper WeCT20.1 | |
Sampling-Based Grasp and Collision Prediction for Assisted Teleoperation |
|
Manschitz, Simon | Honda Research Institute Europe |
Güler, Berk | TU Darmstadt |
Ma, Wei | Honda Research Institute Europe |
Ruiken, Dirk | Honda Research Institute Europe |
Keywords: Telerobotics and Teleoperation
Abstract: Shared autonomy allows for combining the global planning capabilities of a human operator with the strengths of a robot such as repeatability and accurate control. In a real-time teleoperation setting, one possibility for shared autonomy is to let the human operator decide on the rough movement and to let the robot make fine adjustments, e.g., when the view of the operator is occluded. We present a learning-based concept for shared autonomy that aims at supporting the human operator in a real-time teleoperation setting. At every step, our system tracks the target pose set by the human operator as accurately as possible while at the same time satisfying a set of constraints which influence the robot’s behavior. An important characteristic is that the constraints can be dynamically activated and deactivated, which allows the system to provide task-specific assistance. Since the system must generate robot commands in real-time, solving an optimization problem in every iteration is not feasible. Instead, we sample potential target configurations and use neural networks to predict the constraint costs for each configuration. By evaluating each configuration in parallel, our system is able to select the target configuration which satisfies the constraints and has the minimum distance to the operator’s target pose with minimal delay. We evaluate the framework with a pick-and-place task on a bi-manual setup with two Franka Emika Panda robot arms with Robotiq grippers.
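A schematic of the sample-score-select loop described above, with the learned constraint-cost networks replaced by stand-in callables and a toy planar arm; none of this is the authors' code:

```python
import numpy as np

def select_target(q_samples, operator_pose, fk, constraint_predictors, thresholds):
    """Pick the sampled configuration that satisfies all active constraints and
    is closest (in end-effector space) to the operator's commanded pose.

    fk(q) -> end-effector position; constraint_predictors are stand-ins for
    learned cost networks (one per constraint), evaluated in parallel in the paper.
    """
    best_q, best_dist = None, np.inf
    for q in q_samples:
        costs = [pred(q) for pred in constraint_predictors]
        if any(c > t for c, t in zip(costs, thresholds)):
            continue                                  # violates an active constraint
        dist = np.linalg.norm(fk(q) - operator_pose)
        if dist < best_dist:
            best_q, best_dist = q, dist
    return best_q

# Toy usage with a planar 2-DoF arm and a single "stay above the table" constraint.
fk = lambda q: np.array([np.cos(q[0]) + np.cos(q[0] + q[1]),
                         np.sin(q[0]) + np.sin(q[0] + q[1])])
collision_cost = lambda q: max(0.0, -fk(q)[1])        # penalize going below y = 0
samples = [np.random.uniform(-np.pi, np.pi, 2) for _ in range(256)]
print(select_target(samples, np.array([1.2, 0.8]), fk, [collision_cost], [0.0]))
```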
|
|
11:20-11:25, Paper WeCT20.2 | |
Inverse Mixed Strategy Games with Generative Trajectory Models |
|
Sun, Max Muchen | Northwestern University |
Trautman, Peter | Honda Research Institute |
Murphey, Todd | Northwestern University |
Keywords: Human-Aware Motion Planning, Path Planning for Multiple Mobile Robots or Agents, Probabilistic Inference
Abstract: Game-theoretic models are effective tools for modeling multi-agent interactions, especially when robots need to coordinate with humans. However, applying these models requires inferring their specifications from observed behaviors---a challenging task known as the inverse game problem. Existing inverse game approaches often struggle to account for behavioral uncertainty and measurement noise, and to leverage both offline and online data. To address these limitations, we propose an inverse game method that integrates a generative trajectory model into a differentiable mixed-strategy game framework. By representing the mixed strategy with a conditional variational autoencoder (CVAE), our method can infer high-dimensional, multi-modal behavior distributions from noisy measurements while adapting in real-time to new observations. We extensively evaluate our method in a simulated navigation benchmark, where the observations are generated by an unknown game model. Despite the model mismatch, our method can infer Nash-optimal actions comparable to those of the ground-truth model and the oracle inverse game baseline, even in the presence of uncertain agent objectives and noisy measurements.
|
|
11:25-11:30, Paper WeCT20.3 | |
AToM: Adaptive Theory-Of-Mind-Based Human Motion Prediction in Long-Term Human-Robot Interactions |
|
Liao, Yuwen | Nanyang Technological University |
Cao, Muqing | Carnegie Mellon University |
Xu, Xinhang | Nanyang Technological University |
Xie, Lihua | Nanyang Technological University |
Keywords: Social HRI, Intention Recognition, Long term Interaction
Abstract: Humans learn from observations and experiences to adjust their behaviours towards better performance. Interacting with such dynamic humans is challenging, as the robot needs to predict their behaviour accurately for safe and efficient operation. Long-term interactions with dynamic humans have not been extensively studied by prior works. We propose an adaptive human prediction model based on the Theory-of-Mind (ToM), a fundamental social-cognitive ability that enables humans to infer others’ behaviours and intentions. We formulate the human internal belief about others using a game-theoretic model, which predicts the future motions of all agents in a navigation scenario. To estimate an evolving belief, we use an Unscented Kalman Filter to update the behavioural parameters in the human internal model. Our formulation provides unique interpretability to dynamic human behaviours by inferring how the human predicts the robot. We demonstrate through long-term experiments in both simulations and real-world settings that our prediction effectively promotes safety and efficiency in downstream robot planning. Code will be available at https://github.com/centiLinda/AToM-human-prediction.git.
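As a generic illustration of estimating behavioural parameters with an Unscented Kalman Filter (here via the filterpy library, which the paper does not necessarily use), with a deliberately simplified goal-seeking human model and made-up parameters:

```python
import numpy as np
from filterpy.kalman import UnscentedKalmanFilter, MerweScaledSigmaPoints

# Hidden state: hypothetical behavioural parameters (preferred speed, goal gain).
# The paper's internal game-theoretic model is far richer than this stand-in.
pos, goal, dt = np.array([0.0, 0.0]), np.array([5.0, 0.0]), 0.1

def fx(theta, dt):
    return theta                      # parameters drift slowly; random-walk model

def hx(theta):
    speed, gain = theta
    direction = (goal - pos) / np.linalg.norm(goal - pos)
    return pos + dt * speed * gain * direction   # predicted next human position

points = MerweScaledSigmaPoints(n=2, alpha=0.1, beta=2.0, kappa=0.0)
ukf = UnscentedKalmanFilter(dim_x=2, dim_z=2, dt=dt, fx=fx, hx=hx, points=points)
ukf.x = np.array([1.0, 1.0])          # initial parameter guess
ukf.P *= 0.5
ukf.R *= 0.05                         # noise on the observed position
ukf.Q *= 1e-3                         # slow parameter drift

observed_next_pos = np.array([0.12, 0.01])
ukf.predict()
ukf.update(observed_next_pos)
print(ukf.x)                          # refined behavioural parameters
```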
|
|
11:30-11:35, Paper WeCT20.4 | |
Learning Dynamic Weight Adjustment for Spatial-Temporal Trajectory Planning in Crowd Navigation |
|
Cao, Muqing | Carnegie Mellon University |
Xu, Xinhang | Nanyang Technological University |
Yang, Yizhuo | Nanyang Technological University |
Li, Jianping | Nanyang Technological University |
Jin, Tongxing | Nanyang Technological University |
Wang, Pengfei | Nanyang Technological University |
Hung, Tzu-Yi | Delta Electronics |
Lin, Guosheng | Nanyang Technological University |
Xie, Lihua | Nanyang Technological University |
Keywords: Human-Aware Motion Planning, Motion and Path Planning, Reinforcement Learning
Abstract: Robot navigation in dense human crowds poses a significant challenge due to the complexity of human behavior in dynamic and obstacle-rich environments. In this work, we propose a dynamic weight adjustment scheme using a neural network to predict the optimal weights of objectives in an optimization-based motion planner. We adopt a spatial-temporal trajectory planner and incorporate diverse objectives to achieve a balance among safety, efficiency, and goal achievement in complex and dynamic environments. We design the network structure, observation encoding, and reward function to effectively train the policy network using reinforcement learning, allowing the robot to adapt its behavior in real time based on environmental and pedestrian information. Simulation results show improved safety compared to the fixed-weight planner and the state-of-the-art learning-based methods, and verify the ability of the learned policy to adaptively adjust the weights based on the observed situations. The feasibility of the approach is demonstrated in a navigation task using an autonomous delivery robot across a crowded corridor over a 300 m distance.
|
|
11:35-11:40, Paper WeCT20.5 | |
COLLAGE: COLLAborative Human-Agent Interaction Generation Using Hierarchical Latent Diffusion and Language Models |
|
Daiya, Divyanshu | Purdue University |
Conover, Damon | DEVCOM Army Research Laboratory |
Bera, Aniket | Purdue University |
Keywords: Human and Humanoid Motion Analysis and Synthesis, Motion and Path Planning, Modeling and Simulating Humans
Abstract: We propose a novel framework, COLLAGE, for generating collaborative agent-object-agent interactions by leveraging large language models (LLMs) and hierarchical motion-specific vector-quantized variational autoencoders (VQ-VAEs). Our model addresses the lack of rich datasets in this domain by incorporating the knowledge and reasoning abilities of LLMs to guide a generative diffusion model. The hierarchical VQ-VAE architecture captures different motion-specific characteristics at multiple levels of abstraction, avoiding redundant concepts and enabling efficient multi-resolution representation. We introduce a diffusion model that operates in the latent space and incorporates LLM-generated motion planning cues to guide the denoising process, resulting in prompt-specific motion generation with greater control and diversity. Experimental results on the CORE-4D and InterHuman datasets demonstrate the effectiveness of our approach in generating realistic and diverse collaborative human-object-human interactions, outperforming state-of-the-art methods. Our work opens up new possibilities for modeling complex interactions in various domains, such as robotics, graphics, and computer vision.
|
|
11:40-11:45, Paper WeCT20.6 | |
Long-Term Human Trajectory Prediction Using 3D Dynamic Scene Graphs |
|
Gorlo, Nicolas | Massachusetts Institute of Technology |
Schmid, Lukas M. | Massachusetts Institute of Technology (MIT) |
Carlone, Luca | Massachusetts Institute of Technology |
Keywords: Human and Humanoid Motion Analysis and Synthesis, Datasets for Human Motion, Long term Interaction
Abstract: We present a novel approach for long-term human trajectory prediction in indoor human-centric environments, which is essential for long-horizon robot planning in these environments. State-of-the-art human trajectory prediction methods are limited by their focus on collision avoidance and short-term planning, and their inability to model complex interactions of humans with the environment. In contrast, our approach overcomes these limitations by predicting sequences of human interactions with the environment and using this information to guide trajectory predictions over a horizon of up to 60s. We leverage Large Language Models (LLMs) to predict interactions with the environment by conditioning the LLM prediction on rich contextual information about the scene. This information is given as a 3D Dynamic Scene Graph that encodes the geometry, semantics, and traversability of the environment into a hierarchical representation. We then ground these interaction sequences into multi-modal spatio-temporal distributions over human positions using a probabilistic approach based on continuous-time Markov Chains. To evaluate our approach, we introduce a new semi-synthetic dataset of long-term human trajectories in complex indoor environments, which also includes annotations of human-object interactions. We show in thorough experimental evaluations that our approach achieves a 54% lower average negative log-likelihood (NLL) and a 26.5% lower Best-of-20 displacement error compared to the best non-privileged (i.e., evaluated in a zero-shot fashion on the dataset) baselines for a time horizon of 60s.
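The continuous-time Markov chain grounding step can be illustrated with a small, hypothetical generator matrix over interaction states; the distribution over states at the 60 s horizon follows from the matrix exponential:

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical interaction states: 0 = at desk, 1 = at kitchen, 2 = at sofa.
# Q is a CTMC generator: off-diagonal entries are transition rates (1/s),
# and each row sums to zero.
Q = np.array([[-0.020, 0.015, 0.005],
              [0.010, -0.030, 0.020],
              [0.005, 0.010, -0.015]])

def state_distribution(p0, Q, t):
    """Distribution over interaction states after t seconds: p(t) = p0 expm(Q t)."""
    return p0 @ expm(Q * t)

p0 = np.array([1.0, 0.0, 0.0])           # human currently at the desk
print(state_distribution(p0, Q, 60.0))   # distribution at the 60 s horizon
```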
|
|
WeCT21 |
410 |
Robot Foundation Models 2 |
Regular Session |
Co-Chair: Zhu, Yuke | The University of Texas at Austin |
|
11:15-11:20, Paper WeCT21.1 | |
LUMOS: Language-Conditioned Imitation Learning with World Models |
|
Nematollahi, Iman | University of Freiburg |
DeMoss, Branton | University of Oxford |
L Chandra, Akshay | University of Freiburg |
Hawes, Nick | University of Oxford |
Burgard, Wolfram | University of Technology Nuremberg |
Posner, Ingmar | Oxford University |
Keywords: Imitation Learning, Reinforcement Learning
Abstract: We introduce LUMOS, a language-conditioned multi-task imitation learning framework for robotics. LUMOS learns skills by practicing them over many long-horizon rollouts in the latent space of a learned world model and transfers these skills zero-shot to a real robot. By learning on-policy in the latent space of the learned world model, our algorithm mitigates the policy-induced distribution shift from which most offline imitation learning methods suffer. LUMOS learns from unstructured play data with fewer than 1% hindsight language annotations but is steerable with language commands at test time. We achieve this coherent long-horizon performance by combining latent planning with both image- and language-based hindsight goal relabeling during training, and by optimizing an intrinsic reward defined in the latent space of the world model over multiple time steps, effectively reducing covariate shift. In experiments on the difficult long-horizon CALVIN benchmark, LUMOS outperforms prior learning-based methods with comparable approaches on chained multi-task evaluations. To the best of our knowledge, we are the first to learn language-conditioned continuous visuomotor control for a real-world robot within an offline world model. Videos, dataset and code are available at http://lumos.cs.uni-freiburg.de.
|
|
11:20-11:25, Paper WeCT21.2 | |
LIMT: Language-Informed Multi-Task Visual World Models |
|
Aljalbout, Elie | University of Zurich |
Sotirakis, Nikolaos | Technical University of Munich |
van der Smagt, Patrick | Volkswagen Group |
Karl, Maximilian | Foundation Robotics Labs |
Chen, Nutan | Volkswagen Group |
Keywords: Reinforcement Learning, Representation Learning, Machine Learning for Robot Control
Abstract: Most recent successes in robot reinforcement learning involve learning a specialized single-task agent. However, robots capable of performing multiple tasks can be much more valuable in real-world applications. Multi-task reinforcement learning can be very challenging due to the increased sample complexity and the potentially conflicting task objectives. Previous work on this topic is dominated by model-free approaches. The latter can be very sample inefficient even when learning specialized single-task agents. In this work, we focus on model-based multi-task reinforcement learning. We propose a method for learning multi-task visual world models, leveraging pre-trained language models to extract semantically meaningful task representations. These representations are used by the world model and policy to reason about task similarity in dynamics and behavior. Our results highlight the benefits of using language-driven task representations for world models and a clear advantage of model-based multi-task learning over the more common model-free paradigm.
|
|
11:25-11:30, Paper WeCT21.3 | |
Towards Robust Autonomous Driving: Conditional Multimodal Large Language Models for Fine-Grained Perception |
|
Sun, Fengzhao | University of Science and Technology of China |
Yu, Jun | University of Science and Technology of China |
Zhang, Yunxiang | University of Science and Technology of China |
Hou, Jiaming | Harbin Institute of Technology |
Lu, Xilong | University of Science and Technology |
Song, Heng | China Railway No.4 Engineering Group Co., Ltd |
Gao, Fang | Guangxi University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, AI-Based Methods
Abstract: Multimodal large language models (MLLMs) have shown remarkable performance across various visual understanding tasks. However, most existing MLLMs still lack image detail perception, limiting their effectiveness in tasks that require detailed visual information. In this paper, we introduce Percept-DriveLM, a novel MLLM designed to tackle the fine-grained perception challenges in autonomous driving tasks. At the core of our model is the Visual Fusion Module, which integrates several innovative components: a dynamic resolution mechanism that combines both high and low resolution features, and an RoI conditional mechanism to incorporate object/region-level features identified by offline detectors, further refining the model's fine-grained perception abilities. Trained in a two-stage process, our model demonstrates exceptional performance, outperforming existing MLLMs with comparable parameter sizes and excelling in both autonomous driving perception and general vision-language tasks. The effectiveness of our approach is validated through extensive empirical studies. Code will be available at https://github.com/DebuggerSunfz/PerceptDriveLM.
|
|
11:30-11:35, Paper WeCT21.4 | |
Automated Hybrid Reward Scheduling Via Large Language Models for Robotic Skill Learning |
|
Huang, Changxin | Shenzhen University |
Liang, Junyang | Shenzhen University |
Chang, Yanbin | Shenzhen University |
Xu, Jingzhao | Shenzhen University |
Li, Jianqiang | Shenzhen University, |
Keywords: Reinforcement Learning, Machine Learning for Robot Control
Abstract: Enabling a high-degree-of-freedom robot to learn specific skills is a challenging task due to the complexity of robotic dynamics. Reinforcement learning (RL) has emerged as a promising solution; however, addressing such problems requires the design of multiple reward functions to account for various constraints in robotic motion. Existing approaches typically sum all reward components indiscriminately to optimize the RL value function and policy. We argue that this uniform inclusion of all reward components in policy optimization is inefficient and limits the robot’s learning performance. To address this, we propose an Automated Hybrid Reward Scheduling (AHRS) framework based on Large Language Models (LLMs). This paradigm dynamically adjusts the learning intensity of each reward component throughout the policy optimization process, enabling robots to acquire skills in a gradual and structured manner. Specifically, we design a multi-branch value network, where each branch corresponds to a distinct reward component. During policy optimization, each branch is assigned a weight that reflects its importance, and these weights are automatically computed based on rules designed by LLMs. The LLM generates a rule set in advance, derived from the task description, and during training, it selects a weight calculation rule from the library based on language prompts that evaluate the performance of each branch. Experimental results demonstrate that the AHRS method achieves an average 6.48% performance improvement across multiple high-degree-of-freedom robotic tasks.
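A minimal sketch of the hybrid reward scheduling idea: per-branch estimates are combined with weights chosen by a rule, here a hard-coded stand-in for the LLM-selected rule; the branch names and numbers are illustrative only:

```python
import numpy as np

# Hypothetical reward branches and an LLM-authored scheduling rule: early in
# training emphasize balance, later shift weight toward velocity tracking and energy.
BRANCHES = ["balance", "velocity_tracking", "energy"]
RULES = {
    "early": {"balance": 0.6, "velocity_tracking": 0.3, "energy": 0.1},
    "late":  {"balance": 0.2, "velocity_tracking": 0.5, "energy": 0.3},
}

def scheduled_advantage(branch_advantages, progress):
    """Combine per-branch advantage estimates with weights chosen by the
    currently selected rule (here a simple training-progress switch)."""
    rule = RULES["early"] if progress < 0.5 else RULES["late"]
    w = np.array([rule[b] for b in BRANCHES])
    return float(np.dot(w / w.sum(), branch_advantages))

print(scheduled_advantage(np.array([0.8, -0.1, 0.3]), progress=0.2))
print(scheduled_advantage(np.array([0.8, -0.1, 0.3]), progress=0.9))
```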
|
|
11:35-11:40, Paper WeCT21.5 | |
RT-Affordance: Affordances Are Versatile Intermediate Representations for Robot Manipulation |
|
Nasiriany, Soroush | The University of Texas at Austin |
Kirmani, Sean | Google DeepMind |
Ding, Tianli | Google |
Smith, Laura | UC Berkeley |
Zhu, Yuke | The University of Texas at Austin |
Driess, Danny | TU Berlin |
Sadigh, Dorsa | Stanford University |
Xiao, Ted | Google DeepMind |
Keywords: Imitation Learning, Big Data in Robotics and Automation, Deep Learning Methods
Abstract: We explore how intermediate policy representations can facilitate generalization by providing guidance on how to perform manipulation tasks. Existing representations such as language, goal images, and trajectory sketches have been shown to be helpful, but these representations either do not provide enough context or provide over-specified context that yields less robust policies. We propose conditioning policies on affordances, which capture the pose of the robot at key stages of the task. Affordances offer expressive yet lightweight abstractions, are easy for users to specify, and facilitate efficient learning by transferring knowledge from large internet datasets. Our method, RT-Affordance, is a hierarchical model that first proposes an affordance plan given the task language, and then conditions the policy on this affordance plan to perform manipulation. Our model can flexibly bridge heterogeneous sources of supervision including large web datasets and robot trajectories. We additionally train our model on cheap-to-collect in-domain affordance images, allowing us to learn new tasks without collecting any additional costly robot trajectories. We show on a diverse set of novel tasks how RT-Affordance exceeds the performance of existing methods by over 50%, and we empirically demonstrate that affordances are robust to novel settings. Videos available at https://snasiriany.me/rt-affordance
|
|
11:40-11:45, Paper WeCT21.6 | |
A Real-To-Sim-To-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards |
|
Patel, Shivansh | University of Illinois Urbana Champaign |
Yin, Xinchen | University of Illinois Urbana Champaign |
Huang, Wenlong | Stanford University |
Garg, Shubham | Amazon |
Nayyeri, Hooshang | Amazon |
Fei-Fei, Li | Stanford University |
Lazebnik, Svetlana | University of Illinois |
Li, Yunzhu | Columbia University |
Keywords: Machine Learning for Robot Control, Sensorimotor Learning, Deep Learning in Grasping and Manipulation
Abstract: Task specification for robotic manipulation in open-world environments is challenging, requiring flexible and adaptive objectives that align with human intentions and can evolve through iterative feedback. We introduce Iterative Keypoint Reward (IKER), a visually grounded, Python-based reward function that serves as a dynamic task specification. Our framework leverages VLMs to generate and refine these reward functions for multi-step manipulation tasks. Given RGB-D observations and free-form language instructions, we sample keypoints in the scene and generate a reward function conditioned on these keypoints. IKER operates on the spatial relationships between keypoints, leveraging commonsense priors about the desired behaviors, and enabling precise SE(3) control. We reconstruct real-world scenes in simulation and use the generated rewards to train reinforcement learning (RL) policies, which are then deployed into the real world, forming a real-to-sim-to-real loop. Our approach demonstrates notable capabilities across diverse scenarios, including both prehensile and non-prehensile tasks, showcasing multi-step task execution, spontaneous error recovery, and on-the-fly strategy adjustments. The results highlight IKER's effectiveness in enabling robots to perform multi-step tasks in dynamic environments through iterative reward shaping.
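Per the abstract, the rewards IKER generates are plain Python functions over scene keypoints; the toy example below shows the general shape of such a function for a hypothetical placement task (the task, keypoint names, and weights are invented here, not generated by the authors' VLM pipeline):

```python
import numpy as np

def reward(keypoints):
    """Example keypoint-based reward for 'place the mug next to the plate'.

    keypoints: dict of named 3D points tracked in the scene. The structure
    (a plain Python function over keypoints) mirrors the kind of reward the
    paper describes; the specific terms here are illustrative.
    """
    mug, plate = keypoints["mug_handle"], keypoints["plate_center"]
    target = plate + np.array([0.15, 0.0, 0.0])       # 15 cm to the side of the plate
    dist_term = -np.linalg.norm(mug - target)          # move mug toward the target
    height_term = -abs(mug[2] - plate[2])              # keep it at table height
    return dist_term + 0.5 * height_term

print(reward({"mug_handle": np.array([0.40, 0.05, 0.02]),
              "plate_center": np.array([0.30, 0.00, 0.02])}))
```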
|
|
WeCT22 |
411 |
Imitation Learning 2 |
Regular Session |
Chair: Johns, Edward | Imperial College London |
Co-Chair: Kaelbling, Leslie | MIT |
|
11:15-11:20, Paper WeCT22.1 | |
Learning Task Specifications from Demonstrations As Probabilistic Automata |
|
Baert, Mattijs | Ghent University |
Leroux, Sam | Ghent University |
Simoens, Pieter | Ghent University - Imec |
Keywords: Learning from Demonstration, Imitation Learning, Task Planning
Abstract: Specifying tasks for robotic systems traditionally requires coding expertise, deep domain knowledge, and significant time investment. While learning from demonstration offers a promising alternative, existing methods often struggle with tasks of longer horizons. To address this limitation, we introduce a computationally efficient approach for learning probabilistic deterministic finite automata (PDFA) that capture task structures and expert preferences directly from demonstrations. Our approach infers sub-goals and their temporal dependencies, producing an interpretable task specification that domain experts can easily understand and adjust. We validate our method through experiments involving object manipulation tasks, showcasing how our method enables a robot arm to effectively replicate diverse expert strategies while adapting to changing conditions.
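A stripped-down sketch of estimating PDFA transition probabilities by counting symbol transitions in demonstrations; it identifies states with the last emitted sub-goal symbol, which is a simplification of the inference described in the abstract:

```python
from collections import defaultdict

def learn_pdfa(demonstrations):
    """Estimate a probabilistic deterministic finite automaton from demonstrations.

    Each demonstration is a sequence of discrete sub-goal symbols. States are
    identified with the last symbol seen (a simplifying assumption; the paper
    infers sub-goals and temporal dependencies rather than taking them as given).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for demo in demonstrations:
        state = "start"
        for symbol in demo:
            counts[state][symbol] += 1
            state = symbol
    return {s: {sym: c / sum(nxt.values()) for sym, c in nxt.items()}
            for s, nxt in counts.items()}

demos = [["reach", "grasp", "lift", "place"],
         ["reach", "grasp", "place"],
         ["reach", "grasp", "lift", "place"]]
for state, transitions in learn_pdfa(demos).items():
    print(state, transitions)
```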
|
|
11:20-11:25, Paper WeCT22.2 | |
Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments |
|
Etukuru, Haritheja | New York University |
Naka, Norihito | New York University |
Hu, Zijin | New York University |
Lee, Seungjae | Seoul National University |
Mehu, Julian | Hello Robot Inc |
Edsinger, Aaron | Hello Robot |
Paxton, Chris | Meta AI |
Chintala, Soumith | Facebook AI Research |
Pinto, Lerrel | New York University |
Shafiullah, Nur Muhammad (Mahi) | New York University |
Keywords: Imitation Learning, Big Data in Robotics and Automation, Learning from Demonstration
Abstract: Robot models, particularly those trained with large amounts of data, have recently shown a plethora of real-world manipulation and navigation capabilities. Several independent efforts have shown that given sufficient training data in an environment, robot policies can generalize to demonstrated variations in that environment. However, needing to finetune robot models to every new environment stands in stark contrast to models in language or vision that can be deployed zero-shot for open-world problems. In this work, we present Robot Utility Models (RUMs), a framework for training and deploying zero-shot robot policies that can directly generalize to new environments without any finetuning. To create RUMs efficiently, we develop new tools to quickly collect data for mobile manipulation tasks, integrate such data into a policy with multi-modal imitation learning, and deploy policies on-device on Hello Robot Stretch, a cheap commodity robot, with an external mLLM verifier for retrying. We train five such utility models for opening cabinet doors, opening drawers, picking up napkins, picking up paper bags, and reorienting fallen objects. Our system, on average, achieves 90% success rate in unseen, novel environments interacting with unseen objects. Moreover, the utility models can also succeed in different robot and camera set-ups with no further data, training, or fine-tuning. Primary among our lessons are the importance of training data over training algorithm and policy class, guidance about data scaling, necessity for diverse yet high-quality demonstrations, and a recipe for robot introspection and retrying to improve performance on individual environments.
|
|
11:25-11:30, Paper WeCT22.3 | |
R+X: Retrieval and Execution from Everyday Human Videos |
|
Papagiannis, Georgios | Imperial College London |
Di Palo, Norman | Imperial College London |
Vitiello, Pietro | Imperial College London |
Johns, Edward | Imperial College London |
Keywords: Learning from Demonstration, Imitation Learning, Continual Learning
Abstract: We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods. Appendix and videos are available at https://www.robot-learning.uk/r-plus-x.
|
|
11:30-11:35, Paper WeCT22.4 | |
ARCap: Collecting High-Quality Human Demonstrations for Robot Learning with Augmented Reality Feedback |
|
Chen, Sirui | Stanford University |
Wang, Chen | Stanford University |
Nguyen, Kaden | Stanford University |
Fei-Fei, Li | Stanford University |
Liu, Karen | Stanford University |
Keywords: Imitation Learning, Virtual Reality and Interfaces, Dexterous Manipulation
Abstract: Recent progress in imitation learning from human demonstrations has shown promising results in teaching robots manipulation skills. To further scale up training datasets, recent works start to use portable data collection devices without the need for physical robot hardware. However, due to the absence of on-robot feedback during data collection, the data quality depends heavily on user expertise, and many devices are limited to specific robot embodiments. We propose ARCap, a portable data collection system that provides visual feedback through augmented reality (AR) and haptic warnings to guide users in collecting high-quality demonstrations. Through extensive user studies, we show that ARCap enables novice users to collect robot-executable data that matches robot kinematics and avoids collisions with the scenes. With data collected from ARCap, robots can perform challenging tasks, such as manipulation in cluttered environments and long-horizon cross-embodiment manipulation. ARCap is fully open-source and easy to calibrate; all components are built from off-the-shelf products. More details can be found on our website: https://stanford-tml.github.io/ARCap
|
|
11:35-11:40, Paper WeCT22.5 | |
XMoP: Whole-Body Control Policy for Zero-Shot Cross-Embodiment Neural Motion Planning |
|
Rath, Prabin Kumar | Arizona State University |
Gopalan, Nakul | Arizona State University |
Keywords: Learning from Demonstration, Whole-Body Motion Planning and Control, Collision Avoidance
Abstract: Classical manipulator motion planners work across different robot embodiments. However, they plan on a pre-specified static environment representation and are not scalable to unseen dynamic environments. Neural Motion Planners (NMPs) are an appealing alternative to conventional planners as they incorporate different environmental constraints to learn motion policies directly from raw sensor observations. Contemporary state-of-the-art NMPs can successfully plan across different environments. However, none of the existing NMPs generalize across robot embodiments. In this paper we propose Cross-Embodiment Motion Policy (XMoP), a neural policy for learning to plan over a distribution of manipulators. XMoP implicitly learns to satisfy kinematic constraints for a distribution of robots and zero-shot transfers the planning behavior to unseen robotic manipulators within this distribution. We achieve this generalization by formulating a whole-body control policy that is trained on planning demonstrations from over three million procedurally sampled robotic manipulators in different simulated environments. Despite being completely trained on synthetic embodiments and environments, our policy exhibits strong sim-to-real generalization across manipulators with different kinematic variations and degrees of freedom with a single set of frozen policy parameters. We evaluate XMoP on 7 commercial manipulators and show successful cross-embodiment motion planning, achieving an average 70% success rate on baseline benchmarks. Furthermore, we demonstrate sim-to-real deployment on two unseen manipulators solving novel planning problems across three real-world domains even with dynamic obstacles.
|
|
11:40-11:45, Paper WeCT22.6 | |
KALM: Keypoint Abstraction Using Large Models for Object-Relative Imitation Learning |
|
Fang, Xiaolin | MIT |
Huang, Bo-Ruei | National Taiwan University |
Mao, Jiayuan | MIT |
Shone, Jasmine | MIT |
Tenenbaum, Joshua | Massachusetts Institute of Technology |
Lozano-Perez, Tomas | MIT |
Kaelbling, Leslie | MIT |
Keywords: Learning from Demonstration, Deep Learning in Grasping and Manipulation, Imitation Learning
Abstract: Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have been proven effective as a succinct representation for capturing essential object features, and for establishing a reference frame in action prediction, enabling data-efficient learning of robot skills. However, their manual design nature and reliance on additional human labels limit their scalability. In this paper, we propose KALM, a framework that leverages large pre-trained vision-language models (LMs) to automatically generate task-relevant and cross-instance consistent keypoints. KALM distills robust and consistent keypoints across views and objects by generating proposals using LMs and verifies them against a small set of robot demonstration data. Based on the generated keypoints, we can train keypoint-conditioned policy models that predict actions in keypoint-centric frames, enabling robots to generalize effectively across varying object poses, camera views, and object instances with similar functional shapes. Our method demonstrates strong performance in the real world, adapting to different tasks and environments from only a handful of demonstrations while requiring no additional labels. Videos can be found at https://kalm-il.github.io/.
|
|
WeCT23 |
412 |
Autonomous Vehicle Perception 5 |
Regular Session |
Chair: Chun, Il Yong | Sungkyunkwan University |
|
11:15-11:20, Paper WeCT23.1 | |
AutoSplat: Constrained Gaussian Splatting for Autonomous Driving Scene Reconstruction |
|
Khan, Mustafa | University of Toronto |
Fazlali, Hamidreza | Noah's Ark Lab |
Sharma, Dhruv | Huawei Research Canada |
Cao, Tongtong | Noah's Ark Lab, Huawei Technologies |
Bai, Dongfeng | Noah's Ark Lab, Huawei Technologies |
Ren, Yuan | Noah's Ark Lab, Huawei Technologies Canada Inc |
Liu, Bingbing | Huawei Technologies |
Keywords: Autonomous Agents, Simulation and Animation, AI-Enabled Robotics
Abstract: Realistic scene reconstruction and view synthesis are essential for advancing autonomous driving systems by simulating safety-critical scenarios. 3D Gaussian Splatting excels in real-time rendering and static scene reconstructions but struggles with modeling driving scenarios due to complex backgrounds, dynamic objects, and sparse camera views. We propose AutoSplat, a framework employing Gaussian splatting to realistically reconstruct autonomous driving scenes. By imposing geometric constraints on Gaussians representing the road and sky regions, our method enables multi-view consistent simulation of challenging scenarios, including lane changes. Leveraging 3D templates, we introduce a reflected Gaussian consistency constraint to supervise both the visible and unseen side of foreground objects. Moreover, to model the dynamic appearance of foreground objects, we estimate temporally-dependent residual spherical harmonics for each foreground Gaussian. Extensive experiments on Pandaset and KITTI demonstrate that AutoSplat outperforms state-of-the-art methods in scene reconstruction and novel view synthesis across diverse driving scenarios. Our project page can be found here: https://autosplat.github.io/
|
|
11:20-11:25, Paper WeCT23.2 | |
Diffusion-Based Generative Models for 3D Occupancy Prediction in Autonomous Driving |
|
Wang, Yunshen | Beijing University of Posts and Telecommunications |
Liu, Yicheng | Tsinghua University |
Yuan, Tianyuan | Tsinghua University |
Mao, Yucheng | University of Science and Technology Beijing |
Liang, Yingshi | Beijing University of Posts and Telecommunications |
Yang, Xiuyu | Tsinghua University |
Zhang, Honggang | Beijing University of Posts and Telecommunications |
Zhao, Hang | Tsinghua University |
Keywords: Autonomous Agents, Deep Learning for Visual Perception, Semantic Scene Understanding
Abstract: Accurately predicting 3D occupancy grids from visual inputs is critical for autonomous driving, but current discriminative methods struggle with noisy data, incomplete observations, and the complex structures inherent in 3D scenes. In this work, we reframe 3D occupancy prediction as a generative modeling task using diffusion models, which learn the underlying data distribution and incorporate 3D scene priors. This approach enhances prediction consistency, noise robustness, and better handles the intricacies of 3D spatial structures. Our extensive experiments show that diffusion-based generative models outperform state-of-the-art discriminative approaches, delivering more realistic and accurate occupancy predictions, especially in occluded or low-visibility regions. Moreover, the improved predictions significantly benefit downstream planning tasks, highlighting the practical advantages of our method for real-world autonomous driving applications.
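The generative reframing described above can be illustrated with a toy reverse-diffusion loop over an occupancy grid. The snippet below is a minimal sketch, not the authors' implementation: the noise schedule, grid shape, and the placeholder denoiser (which in the paper would be a learned network conditioned on camera features) are all illustrative assumptions.
```python
# Toy sketch of framing occupancy prediction as iterative denoising (DDPM-style).
# The "denoiser" here is a placeholder; in the paper it would be a learned network
# conditioned on camera features. All names, shapes, and values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
T = 50                                   # number of diffusion steps
betas = np.linspace(1e-4, 0.05, T)       # noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def denoiser(x_t, t, cond):
    """Placeholder for a learned network predicting the noise added at step t."""
    return 0.1 * (x_t - cond)            # dummy: pull the noisy grid toward the conditioning

def sample_occupancy(cond, shape=(16, 16, 4)):
    """Reverse diffusion: start from Gaussian noise, iteratively denoise to a grid."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_hat = denoiser(x, t, cond)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return (x > 0.0).astype(np.uint8)    # threshold logits into a binary occupancy grid

occ = sample_occupancy(cond=np.zeros((16, 16, 4)))
print(occ.shape, occ.mean())
```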
|
|
11:25-11:30, Paper WeCT23.3 | |
Interactive4D: Interactive 4D LiDAR Segmentation |
|
Fradlin, Ilya | RWTH Aachen |
Zulfikar, Idil Esen | RWTH Aachen |
Yilmaz, Kadir | RWTH Aachen University |
Kontogianni, Theodora | ETH Zurich |
Leibe, Bastian | RWTH Aachen University |
Keywords: Object Detection, Segmentation and Categorization, Human-Robot Collaboration, Deep Learning for Visual Perception
Abstract: Interactive segmentation has an important role in facilitating the annotation process of future LiDAR datasets. Existing approaches sequentially segment individual objects at each LiDAR scan, repeating the process throughout the entire sequence, which is redundant and ineffective. In this work, we propose interactive 4D segmentation, a new paradigm that allows segmenting multiple objects on multiple LiDAR scans simultaneously, and Interactive4D, the first interactive 4D segmentation model that segments multiple objects on superimposed consecutive LiDAR scans in a single iteration by utilizing the sequential nature of LiDAR data. While performing interactive segmentation, our model leverages the entire space-time volume, leading to more efficient segmentation. Operating on the 4D volume, it directly provides consistent instance IDs over time and also simplifies tracking annotations. Moreover, we show that click simulations are crucial for successful model training on LiDAR point clouds. To this end, we design a click simulation strategy that is better suited for the characteristics of LiDAR data. To demonstrate its accuracy and effectiveness, we evaluate Interactive4D on multiple LiDAR datasets, where Interactive4D achieves a new state-of-the-art by a large margin. We publicly release the code and models at https://vision.rwth-aachen.de/Interactive4D.
|
|
11:30-11:35, Paper WeCT23.4 | |
Robust Scene Change Detection Using Visual Foundation Models and Cross-Attention Mechanisms |
|
Lin, Chun-Jung | The University of Adelaide |
Garg, Sourav | University of Adelaide |
Chin, Tat-Jun | The University of Adelaide |
Dayoub, Feras | The University of Adelaide |
Keywords: Semantic Scene Understanding, Deep Learning for Visual Perception, Environment Monitoring and Management
Abstract: We present a novel method for scene change detection that leverages the robust feature extraction capabilities of a visual foundational model, DINOv2, and integrates full-image cross-attention to address key challenges such as varying lighting, seasonal variations, and viewpoint differences. In order to effectively learn correspondences and mis-correspondences between an image pair for the change detection task, we propose to a) “freeze” the backbone in order to retain the generality of dense foundation features, and b) employ “full-image” cross-attention to better tackle the viewpoint variations between the image pair. We evaluate our approach on two benchmark datasets, VL-CMU-CD and PSCD, along with their viewpoint-varied versions. Our experiments demonstrate significant improvements in F1-score, particularly in scenarios involving geometric changes between image pairs. The results indicate our method’s superior generalization capabilities over existing state-of-the-art approaches, showing robustness against photometric and geometric variations as well as better overall generalization when fine-tuned to adapt to new environments. Detailed ablation studies further validate the contributions of each component in our architecture. Our source code is available at: https://github.com/ChadLin9596/Robust-Scene-Change-Detection.
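To make the "frozen backbone plus full-image cross-attention" recipe concrete, the sketch below shows one cross-attention pass between the patch tokens of an image pair. It is an illustrative stand-in, not the released code: the token grid size, feature dimension, random projections, and single-head formulation are assumptions.
```python
# Minimal numpy sketch of "full-image" cross-attention between two frozen feature maps.
# Shapes, projection matrices, and the softmax temperature are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(feat_a, feat_b, d_k=64):
    """feat_a, feat_b: (H*W, C) patch tokens from a frozen backbone (e.g. DINOv2)."""
    rng = np.random.default_rng(0)
    C = feat_a.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((C, d_k)) / np.sqrt(C) for _ in range(3))
    q, k, v = feat_a @ Wq, feat_b @ Wk, feat_b @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)   # every token of A attends to all of B
    return attn @ v                                   # B's content re-sampled into A's layout

tokens_a = np.random.rand(14 * 14, 384)   # e.g. a 14x14 patch grid with 384-dim features
tokens_b = np.random.rand(14 * 14, 384)
fused = cross_attend(tokens_a, tokens_b)
print(fused.shape)   # (196, 64): per-token correspondence features for a change-detection head
```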
|
|
11:35-11:40, Paper WeCT23.5 | |
LaB-CL: Localized and Balanced Contrastive Learning for Improving Parking Slot Detection |
|
Jeong, U Jin | Sungkyunkwan University |
Roh, Sumin | Sungkyunkwan University |
Chun, Il Yong | Sungkyunkwan University |
Keywords: Object Detection, Segmentation and Categorization, Representation Learning, AI-Based Methods
Abstract: Parking slot detection is an essential technology in autonomous parking systems. In general, the classification problem of parking slot detection consists of two tasks, a task determining whether localized candidates are junctions of parking slots or not, and the other that identifies a shape of detected junctions. Both classification tasks can easily face biased learning toward the majority class, degrading classification performances. Yet, the data imbalance issue has been overlooked in parking slot detection. We propose the first supervised contrastive learning framework for parking slot detection, Localized and Balanced Contrastive Learning for improving parking slot detection (LaB-CL). The proposed LaB-CL framework uses two main approaches. First, we propose to include class prototypes to consider representations from all classes in every mini batch, from the local perspective. Second, we propose a new hard negative sampling scheme that selects local representations with high prediction error. Experiments with the benchmark dataset demonstrate that the proposed LaB-CL framework can outperform existing parking slot detection methods.
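The two ideas named in the abstract, class prototypes in every mini-batch and error-based hard negative sampling, can be sketched as follows. This is a hedged illustration rather than the authors' loss: the temperature, the prototype handling, and the top-fraction rule in hard_negatives are assumptions.
```python
# Sketch of a prototype-augmented supervised contrastive loss plus hard-negative selection.
# Not the authors' code; class counts, temperature, and the sampling rule are assumptions.
import numpy as np

def supcon_with_prototypes(feats, labels, prototypes, temperature=0.1):
    """feats: (N, D) L2-normalized local representations; prototypes: (K, D), one per class."""
    bank = np.concatenate([feats, prototypes], axis=0)            # every batch sees all classes
    bank_labels = np.concatenate([labels, np.arange(prototypes.shape[0])])
    sims = feats @ bank.T / temperature
    loss = 0.0
    for i in range(feats.shape[0]):
        pos = (bank_labels == labels[i])
        pos[i] = False                                            # exclude self-similarity
        log_prob = sims[i] - np.log(np.exp(np.delete(sims[i], i)).sum())
        loss += -log_prob[pos].mean()
    return loss / feats.shape[0]

def hard_negatives(pred_probs, labels, top_frac=0.25):
    """Pick local representations with the highest prediction error as hard negatives."""
    err = 1.0 - pred_probs[np.arange(len(labels)), labels]        # 1 - p(true class)
    k = max(1, int(top_frac * len(labels)))
    return np.argsort(-err)[:k]

rng = np.random.default_rng(0)
f = rng.standard_normal((32, 16)); f /= np.linalg.norm(f, axis=1, keepdims=True)
protos = rng.standard_normal((3, 16)); protos /= np.linalg.norm(protos, axis=1, keepdims=True)
y = rng.integers(0, 3, 32)
p = rng.random((32, 3)); p /= p.sum(axis=1, keepdims=True)
print(supcon_with_prototypes(f, y, protos), hard_negatives(p, y)[:5])
```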
|
|
11:40-11:45, Paper WeCT23.6 | |
LiCROcc: Teach Radar for Accurate Semantic Occupancy Prediction Using LiDAR and Camera |
|
Ma, Yukai | Zhejiang University |
Mei, Jianbiao | Zhejiang University |
Yang, Xuemeng | Shanghai Artificial Intelligence Laboratory |
Wen, Licheng | Shanghai AI Laboratory |
Xu, Weihua | Zhejiang University |
Zhang, Jiangning | Zhejiang University |
Zuo, Xingxing | Caltech |
Shi, Botian | Shanghai AI Laboratory |
Liu, Yong | Zhejiang University |
Keywords: AI-Enabled Robotics, Sensor Fusion, Deep Learning for Visual Perception
Abstract: Semantic Scene Completion (SSC) is pivotal in autonomous driving perception, frequently confronted with the complexities of weather and illumination changes. The long-term strategy involves fusing multi-modal information to bolster the system's robustness. Radar, increasingly utilized for 3D target detection, is gradually replacing LiDAR in autonomous driving applications, offering a robust sensing alternative. In this paper, we focus on the potential of 3D radar in semantic scene completion, pioneering cross-modal refinement techniques for improved robustness against weather and illumination changes and enhancing SSC performance. Regarding model architecture, we propose a three-stage tight fusion approach on BEV to realize a fusion framework for point clouds and images. Based on this foundation, we designed three cross-modal distillation modules—CMRD, BRD, and PDD. Our approach enhances the performance in radar-only (R-LiCROcc) and radar-camera (RC-LiCROcc) settings by distilling to them the rich semantic and structural information of the fused features of LiDAR and camera. Finally, our LC-Fusion, R-LiCROcc and RC-LiCROcc achieve the best performance on the nuScenes-Occupancy dataset, with mIoU exceeding the baseline by 22.9%, 44.1%, and 15.5%, respectively. The project page is available at https://hr-zju.github.io/LiCROcc/.
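As a rough illustration of distilling fused LiDAR-camera features into a radar branch, the sketch below pulls a "student" BEV feature map toward a "teacher" map with combined MSE and cosine terms. The loss weights, shapes, and the simple per-cell formulation are assumptions; the paper's CMRD, BRD, and PDD modules are considerably more structured.
```python
# Illustrative cross-modal feature distillation on BEV grids: a radar "student" feature map
# is pulled toward fused LiDAR-camera "teacher" features. Weights and shapes are assumptions.
import numpy as np

def distill_loss(student_bev, teacher_bev, w_mse=1.0, w_cos=0.5):
    """student_bev, teacher_bev: (H, W, C) BEV feature maps."""
    mse = np.mean((student_bev - teacher_bev) ** 2)
    s = student_bev.reshape(-1, student_bev.shape[-1])
    t = teacher_bev.reshape(-1, teacher_bev.shape[-1])
    cos = np.sum(s * t, axis=1) / (np.linalg.norm(s, axis=1) * np.linalg.norm(t, axis=1) + 1e-8)
    return w_mse * mse + w_cos * np.mean(1.0 - cos)   # push both magnitude and direction together

rng = np.random.default_rng(0)
print(distill_loss(rng.standard_normal((64, 64, 32)), rng.standard_normal((64, 64, 32))))
```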
|
|
WeDT1 |
302 |
Autonomous Vehicles 1 |
Regular Session |
Co-Chair: Li, Xiaopeng | University of Wisconsin-Madison |
|
15:15-15:20, Paper WeDT1.1 | |
Diverse Controllable Diffusion Policy with Signal Temporal Logic |
|
Meng, Yue | Massachusetts Institute of Technology |
Fan, Chuchu | Massachusetts Institute of Technology |
Keywords: Autonomous Agents, Autonomous Vehicle Navigation, Machine Learning for Robot Control
Abstract: Generating realistic simulations is critical for autonomous system applications such as self-driving and human-robot interactions. However, driving simulators nowadays still have difficulty in generating controllable, diverse, and rule-compliant behaviors for road participants: Rule-based models cannot produce diverse behaviors and require careful tuning, whereas learning-based methods imitate the policy from data but are not designed to follow the rules explicitly. Besides, the real-world datasets are by nature "single-outcome", making it hard for learning methods to generate diverse behaviors. In this paper, we leverage Signal Temporal Logic (STL) and Diffusion Models to learn controllable, diverse, and rule-aware policies. We first calibrate the STL on the real-world data, then generate diverse synthetic data using trajectory optimization, and finally learn the rectified diffusion policy on the augmented dataset. We test on the NuScenes dataset and our approach can achieve the most diverse rule-compliant trajectories compared to other baselines, with a runtime 1/17 that of the second-best approach. In closed-loop testing, our approach reaches the highest diversity, rule satisfaction rate, and the lowest collision rate. Our method can generate varied characteristics conditional on different STL parameters in testing. A case study on human-robot encounter scenarios shows our approach can generate diverse and close-to-oracle trajectories. The annotation tool, augmented dataset, and code are available at https://github.com/mengyuest/pSTL-diffusion-policy.
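A small example of the kind of STL scoring such a pipeline relies on: the robustness of an "always" formula is the worst per-step margin, and a conjunction takes the minimum of its parts. The specific rules (a speed limit and a minimum headway) and the numbers below are illustrative assumptions, not the calibrated formulas from the paper.
```python
# Minimal sketch of evaluating Signal Temporal Logic (STL) robustness for a trajectory.
# The rules (headway >= d_min, speed <= v_max) and the profiles are illustrative only.
import numpy as np

def robustness_always(margins):
    """rho(G phi) = min over time of the per-step margin (positive iff satisfied)."""
    return float(np.min(margins))

def driving_rule_robustness(speed, headway, v_max=15.0, d_min=5.0):
    # Conjunction of two "always" rules: rho(phi1 and phi2) = min(rho(phi1), rho(phi2)).
    return min(robustness_always(v_max - speed), robustness_always(headway - d_min))

t = np.linspace(0.0, 8.0, 80)
speed = 12.0 + 2.0 * np.sin(t)            # ego speed profile [m/s]
headway = 20.0 - 1.0 * t                  # shrinking gap to the lead vehicle [m]
rho = driving_rule_robustness(speed, headway)
print("STL robustness:", rho, "->", "rule-compliant" if rho >= 0 else "violated")
```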
|
|
15:20-15:25, Paper WeDT1.2 | |
Dual-Conditioned Temporal Diffusion Modeling for Driving Scene Generation |
|
Bai, Xiangyu | Northeastern University |
Luo, Yedi | Northeastern University |
Jiang, Le | Northeastern University |
Ostadabbas, Sarah | Northeastern University |
Keywords: Autonomous Vehicle Navigation, Deep Learning for Visual Perception, Visual Learning
Abstract: Diffusion models have proven effective at generating high-quality images from learned distributions, but their application to the temporal domain, especially for driving scenarios, remains underexplored. Our work addresses key challenges in existing simulations, such as limited data quality, diversity, and high costs, by extending diffusion models to generate realistic long driving videos. We introduce the Dual-conditioned Temporal Diffusion Model (DcTDM), an open-source method that incorporates dual conditioning to enforce temporal consistency by guiding frame transitions. Alongside DcTDM, we present DriveSceneDDM, a comprehensive driving video dataset featuring textual scene descriptions, dense depth maps, and canny edge data. We evaluate DcTDM using common video quality metrics, demonstrating its superior performance over other video diffusion models by producing long, temporally consistent driving videos up to 40s, achieving over 25% improvement in consistency and frame quality.
|
|
15:25-15:30, Paper WeDT1.3 | |
RL-OGM-Parking: Lidar OGM-Based Hybrid Reinforcement Learning Planner for Autonomous Parking |
|
Wang, Zhitao | Shanghai Jiao Tong University |
Chen, Zhe | Shanghai Jiao Tong University |
Jiang, Mingyang | Shanghai Jiao Tong University |
Qin, Tong | Shanghai Jiao Tong University |
Yang, Ming | Shanghai Jiao Tong University |
Keywords: Autonomous Vehicle Navigation, Autonomous Agents, Reinforcement Learning
Abstract: Autonomous parking has become a critical application in automatic driving research and development. Parking operations often suffer from limited space and complex environments, requiring accurate perception and precise maneuvering. Traditional rule-based parking algorithms struggle to adapt to diverse and unpredictable conditions, while learning-based algorithms lack consistent and stable performance in various scenarios. Therefore, a hybrid approach that combines the stability of rule-based methods with the generalizability of learning-based methods is necessary. Recently, reinforcement learning (RL) based policies have shown robust capability in planning tasks. However, the simulation-to-reality (sim-to-real) transfer gap seriously hinders real-world deployment. To address these problems, we employ a hybrid policy, consisting of a rule-based Reeds-Shepp (RS) planner and a learning-based reinforcement learning (RL) planner. A real-time LiDAR-based Occupancy Grid Map (OGM) representation is adopted to bridge the sim-to-real gap, allowing the hybrid policy to be applied to real-world systems seamlessly. We conducted extensive experiments both in the simulation environment and real-world scenarios, and the results demonstrate that the proposed method outperforms pure rule-based and learning-based methods. The real-world experiments further validate the feasibility and efficiency of the proposed method.
|
|
15:30-15:35, Paper WeDT1.4 | |
Multi-Task Invariant Representation Imitation Learning for Autonomous Driving |
|
Peng, Jinghan | East China Normal University |
Yu, Xing | East China Normal University |
Wang, Jingwen | East China Normal University |
Tian, Lili | East China Normal University |
Dehui, Du | East China Normal University |
Keywords: Autonomous Vehicle Navigation, Imitation Learning, Representation Learning
Abstract: Imitation learning is a promising approach to acquiring autonomous driving policies by mimicking human driver behaviors. However, a major drawback of existing driving policies derived from imitation learning is their proneness to capturing spurious correlations, owing to the lack of an explicit causal model. Deploying such policies in unpredictable real-world environments poses severe risks, as spurious correlations may result in flawed decisions that compromise safety. To tackle this challenge, we introduce a novel approach called Multi-Task Invariant Representation Imitation Learning (MIRIL). MIRIL combines invariant learning with imitation learning to identify cross-environment invariant causal representations from driving demonstrations in various scenarios. These representations are then fed into multiple downstream branches for multi-task learning, including policy learning, perception prediction, invariant representation learning, and transition dynamics learning. Through the multi-task learning approach, the model not only makes consistent driving decisions across different environments but also perceives the vehicle's surroundings, thereby improving adaptability and robustness in diverse driving conditions. This enables MIRIL to effectively handle a wide range of driving scenarios, ensuring safety and efficiency. Supported by clear metrics, this paper details our comprehensive experimental setup, including datasets, benchmarks, and comparative analyses, underscoring the capability of MIRIL to significantly boost system generalization and excel in decision-making.
|
|
15:35-15:40, Paper WeDT1.5 | |
Occ-LLM: Enhancing Autonomous Driving with Occupancy-Based Large Language Models |
|
Xu, Tianshuo | Hong Kong University of Science and Technology (Guangzhou) |
Lu, Hao | HKUST-GZ |
Yan, Xu | Chinese University of Hong Kong, Shenzhen |
Cai, Yingjie | Huawei |
Liu, Bingbing | Huawei Technologies |
Chen, Yingcong | The University of Science and Technology (Guangzhou) |
Keywords: Autonomous Vehicle Navigation, Big Data in Robotics and Automation, Deep Learning Methods
Abstract: Large Language Models (LLMs) have made substantial advancements in the field of robotics and autonomous driving. This study presents the first Occupancy-based Large Language Model (Occ-LLM), which represents a pioneering effort to integrate LLMs with an occupancy representation. To effectively encode occupancy as input for the LLM and address the category imbalances associated with occupancy, we propose the Motion Separation Variational Autoencoder (MS-VAE). This innovative approach utilizes prior knowledge to distinguish dynamic objects from static scenes before inputting them into a tailored Variational Autoencoder (VAE). This separation enhances the model's capacity to concentrate on dynamic trajectories while effectively reconstructing static scenes. The efficacy of Occ-LLM has been validated across key tasks, including 4D occupancy forecasting, self-ego planning, and occupancy-based scene question answering. Comprehensive evaluations demonstrate that Occ-LLM significantly surpasses existing state-of-the-art methodologies, achieving gains of about 6% in Intersection over Union (IoU) and 4% in mean Intersection over Union (mIoU) for the task of 4D occupancy forecasting. These findings highlight the transformative potential of Occ-LLM in reshaping current paradigms within robotics and autonomous driving.
|
|
15:40-15:45, Paper WeDT1.6 | |
DISC: Dataset for Analyzing Driving Styles in Simulated Crashes for Mixed Autonomy |
|
Senthil Kumar, Sandip Sharan | University of Maryland, College Park |
Thalapanane, Sandeep | University of Maryland, College Park |
Appiya Dilipkumar Peethambari, Guru Nandhan | University of Maryland College Park |
Sri hari, Sourang | University of Maryland College Park |
Zheng, Laura | University of Maryland, College Park |
Lin, Ming C. | University of Maryland at College Park |
Keywords: Autonomous Vehicle Navigation, Virtual Reality and Interfaces, Data Sets for Robot Learning
Abstract: Handling pre-crash scenarios is still a major challenge for self-driving cars due to limited data and human-driving behavior datasets. We introduce DISC, one of the first datasets designed to capture various driving styles and behaviors in pre-crash scenarios for mixed autonomy analysis. DISC includes over 8 classes of driving styles/behaviors from hundreds of drivers navigating a simulated vehicle through a virtual city, encountering rare-event traffic scenarios. This dataset enables the classification of pre-crash human driving behaviors in unsafe conditions, supporting individualized trajectory prediction based on observed driving patterns. It offers the potential to improve autonomous vehicle safety by accounting for diverse human driving behaviors in stressful traffic and rare accident scenarios, which are otherwise difficult or risky to capture. By utilizing a VR-based driving simulator, TRAVERSE, data was collected through a driver-centric study involving human drivers encountering 12 simulated accident scenarios. This dataset fills a critical gap in human-centric driving data for rare events involving interactions with autonomous vehicles. It enables autonomous systems to better react to human drivers and optimize trajectory prediction in mixed autonomy environments involving both human-driven and self-driving cars. It includes essential data such as acceleration, braking, and vehicle pose, providing a foundation for machine-learning models in autonomous vehicles. In addition, individual driving behaviors are classified through a set of standardized questionnaires, carefully designed to identify and categorize driving behavior. We correlate data features with driving behaviors, showing that the simulated environment reflects real-world driving styles. DISC is the first dataset to capture how various driving styles respond to accident scenarios, offering significant potential to enhance autonomous vehicle safety and driving behavior analysis in mixed autonomy environments.
|
|
15:45-15:50, Paper WeDT1.7 | |
Real-World Automated Vehicle Longitudinal Stability Analysis: Controller Design and Field Test |
|
Ma, Ke | University of Wisconsin-Madison |
Zhang, Yuqin | Chang’an University |
Zhou, Hang | University of Wisconsin-Madison |
Liang, Zhaohui | University of Wisconsin Madison |
Li, Xiaopeng | University of Wisconsin-Madison |
Keywords: Autonomous Vehicle Navigation, Integrated Planning and Control, Robust/Adaptive Control
Abstract: Although extensive research has been conducted on modeling the stable longitudinal controller of automated vehicles (AVs) to dampen traffic oscillations, the real-world performance of these controllers in actual vehicles remains uncertain. In the operation of real-world AVs, the delay between actual dynamics and the commands prevents the controller's command from being effectively implemented to dampen traffic oscillations. Thus, this study adapts the designed controllers within an AV test platform to compare the theoretically stable conditions with the actual oscillation dampening performance. Initially, we compute the stable conditions for both the traditional car-following controller, which assumes no delay, and the longitudinal controller that accounts for the dynamic response of the vehicle. Through empirical experiments, we demonstrate that the longitudinal controller predicts vehicle stability more accurately than conventional car-following controller, showing an improvement from an average prediction accuracy rate of 0.59 to 0.91. Also, the experiments uncover specific delays inherent in dynamics systems, with a response delay of 0.34 seconds. Our work makes two principal contributions to the field of AV control systems. First, it empirically validates that the longitudinal model, which accounts for the vehicle's dynamic responses, offers a more precise representation of vehicular behavior. Second, the relatively brief response delay identified expands the stability region, thereby enhancing vehicle control and safety. The longitudinal controller is critical for enhancing AV performance and reliability in dampening traffic oscillations.
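The role of the actuation delay can be illustrated with a toy simulation that applies the same car-following law with and without a delayed command buffer. The gains, the scenario, and the reuse of the 0.34 s figure quoted in the abstract are purely illustrative assumptions, not the paper's controller.
```python
# Toy sketch of why actuation delay matters for longitudinal control: the same car-following
# law is simulated with and without a response delay. All gains and numbers are illustrative.
import numpy as np

def simulate(delay_steps, dt=0.02, T=30.0, kp=0.45, kv=0.25, s_star=20.0):
    n = int(T / dt)
    lead_v = 15.0 + np.where(np.arange(n) * dt > 5.0, -3.0, 0.0)    # lead car brakes at t = 5 s
    x_lead, x_ego, v_ego = 40.0, 0.0, 15.0
    u_buf = [0.0] * delay_steps                                      # commands waiting to take effect
    gaps = []
    for k in range(n):
        gap = x_lead - x_ego
        u_buf.append(kp * (gap - s_star) + kv * (lead_v[k] - v_ego)) # car-following command
        a = u_buf.pop(0)                                             # delayed command actually applied
        v_ego += a * dt
        x_ego += v_ego * dt
        x_lead += lead_v[k] * dt
        gaps.append(gap)
    return np.array(gaps)

print("min gap, no delay    :", simulate(0).min())
print("min gap, 0.34 s delay:", simulate(round(0.34 / 0.02)).min())
```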
|
|
WeDT2 |
301 |
Learning-Based SLAM 1 |
Regular Session |
|
15:15-15:20, Paper WeDT2.1 | |
RoDyn-SLAM: Robust Dynamic Dense RGB-D SLAM with Neural Radiance Fields |
|
Jiang, Haochen | Fudan University |
Xu, Yueming | Fudan University |
Li, Kejie | The University of Oxford |
Feng, Jianfeng | Fudan University |
Zhang, Li | Fudan University |
Keywords: SLAM, Deep Learning Methods, Visual Learning
Abstract: Leveraging neural implicit representation to conduct dense RGB-D SLAM has been studied in recent years. However, this approach relies on a static environment assumption and does not work robustly within a dynamic environment due to the inconsistent observation of geometry and photometry. To address the challenges presented in dynamic environments, we propose a novel dynamic SLAM framework with neural radiance field. Specifically, we introduce a motion mask generation method to filter out the invalid sampled rays. This design effectively fuses the optical flow mask and semantic mask to enhance the precision of motion mask. To further improve the accuracy of pose estimation, we have designed a divide-and-conquer pose optimization algorithm that distinguishes between keyframes and non-keyframes. The proposed edge warp loss can effectively enhance the geometry constraints between adjacent frames. Extensive experiments are conducted on the two challenging datasets, and the results show that RoDyn-SLAM achieves state-of-the-art performance among recent neural RGB-D methods in both accuracy and robustness.
|
|
15:20-15:25, Paper WeDT2.2 | |
HS-SLAM: Hybrid Representation with Structural Supervision for Improved Dense SLAM |
|
Gong, Ziren | University of Bologna |
Tosi, Fabio | University of Bologna |
Zhang, Youmin | University of Bologna |
Mattoccia, Stefano | University of Bologna |
Poggi, Matteo | University of Bologna |
Keywords: SLAM, Mapping, Localization
Abstract: NeRF-based SLAM has recently achieved promising results in tracking and reconstruction. However, existing methods face challenges in providing sufficient scene representation, capturing structural information, and maintaining global consistency in scenes undergoing significant movement or suffering from forgetting. To this end, we present HS-SLAM to tackle these problems. To enhance scene representation capacity, we propose a hybrid encoding network that combines the complementary strengths of hash-grid, tri-planes, and one-blob, improving the completeness and smoothness of reconstruction. Additionally, we introduce structural supervision by sampling patches of non-local pixels rather than individual rays to better capture the scene structure. To ensure global consistency, we implement an active global bundle adjustment (BA) to eliminate camera drift and mitigate accumulated errors. Experimental results demonstrate that HS-SLAM outperforms the baselines in tracking and reconstruction accuracy while maintaining the efficiency required for robotics.
|
|
15:25-15:30, Paper WeDT2.3 | |
Gassidy: Gaussian Splatting SLAM in Dynamic Environments |
|
Wen, Long | Technical University of Munich |
Li, Shixin | Technical University of Munich |
Zhang, Yu | Technical University of Munich |
Huang, Yuhong | Technische Universität München |
Lin, Jianjie | Technische Universität München |
Pan, Fengjunjie | Technical University of Munich |
Bing, Zhenshan | Technical University of Munich |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: SLAM, Localization, Mapping
Abstract: 3D Gaussian Splatting (3DGS) allows flexible adjustments to scene representation, enabling continuous optimization of scene quality during dense visual simultaneous localization and mapping (SLAM) in static environments. However, 3DGS faces challenges in handling environmental disturbances from dynamic objects with irregular movement, leading to degradation in both camera tracking accuracy and map reconstruction quality. To address this challenge, we develop an RGB-D dense SLAM which is called Gaussian Splatting SLAM in Dynamic Environments (Gassidy). This approach calculates Gaussians to generate rendering loss flows for each environmental component based on a designed photometric-geometric loss function. To distinguish and filter environmental disturbances, we iteratively analyze rendering loss flows to detect features characterized by changes in loss values between dynamic objects and static components. This process ensures a clean environment for accurate scene reconstruction. Compared to state-of-the-art SLAM methods, experimental results on open datasets show that Gassidy improves camera tracking precision by up to 97.9% and enhances map quality by up to 6%.
|
|
15:30-15:35, Paper WeDT2.4 | |
Large-Scale Gaussian Splatting SLAM |
|
Xin, Zhe | Meituan |
Wu, Chenyang | University of Science and Technology of China |
Huang, Penghui | Meituan |
Zhang, Yanyong | University of Science and Technology of China |
Mao, Yinian | Meituan-Dianping Group |
Huang, Guoquan (Paul) | University of Delaware |
Keywords: SLAM, Mapping, Deep Learning for Visual Perception
Abstract: The recently developed Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown encouraging and impressive results for visual SLAM. However, most representative methods require RGB-D sensors and are only available for indoor environments. The robustness of reconstruction in large-scale outdoor scenarios remains unexplored. This paper introduces a large-scale 3DGS-based visual SLAM with stereo cameras, termed LSG-SLAM. The proposed LSG-SLAM employs a multi-modality strategy to estimate prior poses under large view changes. In tracking, we introduce feature-alignment warping constraints to alleviate the adverse effects of appearance similarity in rendering losses. For the scalability of large-scale scenarios, we introduce continuous Gaussian Splatting submaps to tackle unbounded scenes with limited memory. Loops are detected between GS submaps by place recognition and the relative pose between looped keyframes is optimized utilizing rendering and feature warping losses. After the global optimization of camera poses and Gaussian points, a structure refinement module enhances the reconstruction quality. With extensive evaluations on the EuRoC and KITTI datasets, LSG-SLAM achieves superior performance over existing Neural, 3DGS-based, and even traditional approaches.
|
|
15:35-15:40, Paper WeDT2.5 | |
OpenGS-SLAM: Open-Set Dense Semantic SLAM with 3D Gaussian Splatting for Object-Level Scene Understanding |
|
Yang, Dianyi | Beijing Institute of Technology |
Gao, Yu | Beijing Institude of Technology |
Wang, Xihan | Beijing Institute of Technology |
Yue, Yufeng | Beijing Institute of Technology |
Yang, Yi | Beijing Institute of Technology |
Fu, Mengyin | Beijing Institute of Technology |
Keywords: SLAM, Semantic Scene Understanding, RGB-D Perception
Abstract: Recent advancements in 3D Gaussian Splatting have significantly improved the efficiency and quality of dense semantic SLAM. However, previous methods are generally constrained by limited-category pre-trained classifiers and implicit semantic representation, which hinder their performance in open-set scenarios and restrict 3D object-level scene understanding. To address these issues, we propose OpenGS-SLAM, an innovative framework that utilizes 3D Gaussian representation to perform dense semantic SLAM in open-set environments. Our system integrates explicit semantic labels derived from 2D foundational models into the 3D Gaussian framework, facilitating robust 3D object-level scene understanding. We introduce Gaussian Voting Splatting to enable fast 2D label map rendering and scene updating. Additionally, we propose a Confidence-based 2D Label Consensus method to ensure consistent labeling across multiple views. Furthermore, we employ a Segmentation Counter Pruning strategy to improve the accuracy of semantic scene representation. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our method in scene understanding, tracking, and mapping, achieving 10× faster semantic rendering and 2× lower storage costs compared to existing methods.
|
|
15:40-15:45, Paper WeDT2.6 | |
SAP-SLAM: Semantic-Assisted Perception SLAM with 3D Gaussian Splatting |
|
Yang, Yuheng | Shenzhen International Graduate School, Tsinghua University |
Lin, Yudong | Tsinghua University |
Yang, Wenming | Tsinghua University |
Wang, Guijin | Tsinghua University |
Liao, Qingmin | Tsinghua University |
Keywords: SLAM, RGB-D Perception, Mapping
Abstract: The integration of 3D Gaussians has introduced a novel scene representation in Simultaneous Localization and Mapping (SLAM), characterized by explicit representation and differentiable rendering capabilities that enhance scene reconstruction and understanding. However, most current SLAM systems only exploit the basic representational capacity of 3D Gaussians, neglecting their potential to offer richer information and facilitate higher-dimensional scene comprehension. Furthermore, these systems often struggle with reconstruction when encountering rapid camera movements or missing depth. Drawing inspiration from 3D language fields, which explore the intrinsic relationships among scene objects, we propose SAP-SLAM, a dense SLAM system that combines robust tracking, high-fidelity reconstruction, and advanced semantic understanding. Our approach leverages pre-trained visual models to extract semantic features, which are then fused, dimensionally reduced, and encoded into the 3D Gaussian model for optimization and rendering. The integration of these features improves the system's semantic comprehension and scene representation, ultimately enabling the creation of high-precision 3D semantic maps. Additionally, we introduce a semantic-guided Gaussian densification and pruning strategy, which uses semantic consistency to prioritize attention on poorly reconstructed areas, greatly improving performance in complex scenarios. SAP-SLAM achieves competitive results on both synthetic and real-world datasets, demonstrating superior capabilities in semantic understanding and reconstruction.
|
|
15:45-15:50, Paper WeDT2.7 | |
Gaussian-LIC: Real-Time Photo-Realistic SLAM with Gaussian Splatting and LiDAR-Inertial-Camera Fusion |
|
Lang, Xiaolei | Zhejiang University |
Li, Laijian | Zhejiang University |
Wu, Chenming | Baidu Research |
Zhao, Chen | Baidu Inc |
Liu, Lina | Zhejiang University |
Liu, Yong | Zhejiang University |
Lv, Jiajun | Zhejiang University |
Zuo, Xingxing | Caltech |
Keywords: Mapping, Sensor Fusion, SLAM
Abstract: In this paper, we present a real-time photo-realistic SLAM method based on marrying Gaussian Splatting with LiDAR-Inertial-Camera SLAM. Most existing radiance-field-based SLAM systems mainly focus on bounded indoor environments, equipped with RGB-D or RGB sensors. However, they are prone to decline when expanding to unbounded scenes or encountering adverse conditions, such as violent motions and changing illumination. In contrast, oriented to general scenarios, our approach additionally tightly fuses LiDAR, IMU, and camera for robust pose estimation and photo-realistic online mapping. To compensate for regions unobserved by the LiDAR, we propose to integrate both the triangulated visual points from images and LiDAR points for initializing 3D Gaussians. In addition, the modeling of the sky and varying camera exposure have been realized for high-quality rendering. Notably, we implement our system purely with C++ and CUDA, and meticulously design a series of strategies to accelerate the online optimization of the Gaussian-based scene representation. Extensive experiments demonstrate that our method outperforms its counterparts while maintaining real-time capability. Impressively, regarding photo-realistic mapping, our method with our estimated poses even surpasses all the compared approaches that utilize privileged ground-truth poses for mapping. Our code will be released on project page https://xingxingzuo.github.io/gaussian_lic.
|
|
WeDT3 |
303 |
Planning for Autonomous Racing |
Regular Session |
Chair: Miao, Fei | University of Connecticut |
Co-Chair: Laine, Forrest | Vanderbilt University |
|
15:15-15:20, Paper WeDT3.1 | |
Risk-Averse Model Predictive Control for Racing in Adverse Conditions |
|
Lew, Thomas | Toyota Research Institute |
Greiff, Marcus | Toyota Research Institute |
Djeumou, Franck | University of Texas, Austin |
Suminaka, Makoto | Toyota Research Institute |
Thompson, Michael | Toyota Research Institute |
Subosits, John | Toyota Research Institute |
Keywords: Planning under Uncertainty, Optimization and Optimal Control, Robot Safety
Abstract: Model predictive control (MPC) algorithms can be sensitive to model mismatch when used in challenging nonlinear control tasks. In particular, the performance of MPC for vehicle control at the limits of handling suffers when the underlying model overestimates the vehicle’s performance capability. In this work, we propose a risk-averse MPC framework that explicitly accounts for uncertainty over friction limits and tire parameters. Our approach leverages a sample-based approximation of an optimal control problem with a conditional value at risk (CVaR) constraint. This sample-based formulation enables planning with a set of expressive vehicle dynamics models using different tire parameters. Moreover, this formulation enables efficient numerical resolution via sequential quadratic programming and GPU parallelization. Experiments on a Lexus LC 500 show that risk-averse MPC unlocks reliable performance, while a deterministic baseline that plans using a single dynamics model may lose control of the vehicle in adverse road conditions.
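A minimal sketch of the sample-based CVaR idea, assuming a batch of constraint residuals (one per sampled friction/tire model, with g <= 0 meaning safe): CVaR at level alpha is the mean of the worst (1 - alpha) tail, which a risk-averse planner constrains instead of the mean. The numbers and the residual distribution below are made up for illustration.
```python
# Sample-based conditional value at risk (CVaR) over constraint residuals.
# The residual distribution is synthetic; only the CVaR mechanics are being illustrated.
import numpy as np

def cvar(samples, alpha=0.9):
    """Conditional value at risk: expected value of the worst (1 - alpha) tail."""
    var = np.quantile(samples, alpha)                 # value at risk (alpha-quantile)
    tail = samples[samples >= var]
    return float(tail.mean())

rng = np.random.default_rng(0)
# Constraint residuals, e.g. friction-limit violation, one per sampled dynamics model.
g = rng.normal(loc=-0.2, scale=0.15, size=256)
print("mean residual    :", g.mean())                 # a mean-based planner may look safe
print("CVaR_0.9 residual:", cvar(g, 0.9))             # a risk-averse planner keeps this <= 0
```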
|
|
15:20-15:25, Paper WeDT3.2 | |
Kineto-Dynamical Planning and Accurate Execution of Minimum-Time Maneuvers on Three-Dimensional Circuits |
|
Piccinini, Mattia | Technical University of Munich |
Taddei, Sebastiano | University of Trento, Politecnico Di Bari |
Betz, Johannes | Technical University of Munich |
Biral, Francesco | University of Trento |
Keywords: Motion and Path Planning, Autonomous Vehicle Navigation, Optimization and Optimal Control
Abstract: Online planning and execution of minimum-time maneuvers on three-dimensional (3D) circuits is an open challenge in autonomous vehicle racing. In this paper, we present an artificial race driver (ARD) to learn the vehicle dynamics, plan and execute minimum-time maneuvers on a 3D track. ARD integrates a novel kineto-dynamical (KD) vehicle model for trajectory planning with economic nonlinear model predictive control (E-NMPC). We use a high-fidelity vehicle simulator (VS) to compare the closed-loop ARD results with a minimum-lap-time optimal control problem (MLT-VS), solved offline with the same VS. Our ARD sets lap times close to the MLT-VS, and the new KD model outperforms a literature benchmark. Finally, we study the vehicle trajectories, to assess the re-planning capabilities of ARD under execution errors. A video with the main results is available as supplementary material.
|
|
15:25-15:30, Paper WeDT3.3 | |
Safety Guaranteed Robust Multi-Agent Reinforcement Learning with Hierarchical Control for Connected and Automated Vehicles |
|
Zhang, Zhili | University of Connecticut |
Ahmad, H M Sabbir | Boston University |
Sabouni, Ehsan | Boston University |
Sun, Yanchao | JPMorgan Chase |
Huang, Furong | University of Maryland |
Li, Wenchao | Boston University |
Miao, Fei | University of Connecticut |
Keywords: Integrated Planning and Control, Reinforcement Learning, Planning under Uncertainty
Abstract: We address the problem of coordination and control of Connected and Automated Vehicles (CAVs) in the presence of imperfect observations in mixed traffic environments. A commonly used approach is learning-based decision-making, such as reinforcement learning (RL). However, most existing safe RL methods suffer from two limitations: (i) they assume accurate state information, and (ii) safety is generally defined over the expectation of the trajectories. It remains challenging to design optimal coordination among multiple agents while ensuring hard safety constraints under system state uncertainties (e.g., those that arise from noisy sensor measurements, communication, or state estimation methods) at every time step. We propose a safety-guaranteed hierarchical coordination and control scheme called Safe-RMM to address the challenge. Specifically, the high-level coordination policy of CAVs in a mixed traffic environment is trained by the Robust Multi-Agent Proximal Policy Optimization (RMAPPO) method. Though trained without uncertainty, our method leverages a worst-case Q network to ensure the model's robust performance when state uncertainties are present during testing. The low-level controller is implemented using model predictive control (MPC) with robust Control Barrier Functions (CBFs) to guarantee safety through their forward invariance property. We compare our method with baselines in different road networks in the CARLA simulator. Results show that our method provides the best evaluated safety and efficiency in challenging mixed traffic environments with uncertainties.
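To illustrate the forward-invariance property mentioned for the low-level CBF controller, the sketch below derives a speed cap from the discrete-time condition h_next >= (1 - gamma) h with barrier h = gap - d_min, and clips a high-level command to it. The kinematic model, gamma, and numbers are assumptions; the paper uses MPC with robust CBFs rather than this one-line filter.
```python
# Hedged sketch of a discrete-time control barrier function (CBF) acting as a safety filter.
# Barrier h = gap - d_min; requiring h_next >= (1 - gamma) * h yields an upper bound on speed.

def cbf_speed_limit(gap, v_lead, d_min=5.0, gamma=0.3, dt=0.1):
    """Max ego speed so that h_next = gap + (v_lead - v)*dt - d_min >= (1 - gamma)*(gap - d_min)."""
    return v_lead + gamma * (gap - d_min) / dt

def cbf_filter(v_cmd, gap, v_lead, **kw):
    """Clip the high-level (e.g. RL) velocity command to the CBF-safe range."""
    return min(v_cmd, cbf_speed_limit(gap, v_lead, **kw))

# An RL policy asks for 14 m/s while only 6 m behind a 10 m/s leader: the CBF caps it.
print(cbf_filter(v_cmd=14.0, gap=6.0, v_lead=10.0))   # -> 13.0 m/s
```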
|
|
15:30-15:35, Paper WeDT3.4 | |
Does Bilevel Optimization Result in More Competitive Racing Behavior? |
|
Cinar, Andrew | Vanderbilt University |
Laine, Forrest | Vanderbilt University |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents
Abstract: Two-vehicle racing is a natural example of a competitive dynamic game. As with most dynamic games, there are many ways in which the underlying solution concept can be structured, resulting in different equilibrium concepts. The assumed solution concept influences the behaviors of two interacting players in racing. For example, blocking behavior emerges naturally in leader-follower play, but to achieve this in Nash play the costs would have to be chosen specifically to trigger this behavior. In this work, we develop a novel model for competitive two-player vehicle racing, represented as an equilibrium problem, complete with simplified aerodynamic drag and drafting effects, as well as position-dependent collision-avoidance responsibility. We use our model to explore how different solution concepts affect competitiveness. We develop a solution method for bilevel optimization problems, enabling a large-scale empirical study comparing bilevel strategies (either as leader or follower), a Nash equilibrium strategy, and a single-player constant-velocity baseline. We find that the choice of strategy significantly affects competitive performance and safety.
|
|
15:35-15:40, Paper WeDT3.5 | |
Gate-Aware Online Planning for Two-Player Autonomous Drone Racing |
|
Zhao, Fangguo | Zhejiang University |
Mei, Jiahao | Zhejiang University of Technology |
Zhou, Jin | Zhejiang University |
Chen, Yuanyi | Zhejiang University |
Chen, Jiming | Zhejiang University |
Li, Shuo | Zhejiang University |
Keywords: Motion and Path Planning, Aerial Systems: Mechanics and Control
Abstract: The flying speed of autonomous quadrotors has increased significantly in the field of autonomous drone racing. However, most research primarily focuses on the aggressive flight of a single quadrotor, simplifying the racing gate traversal problem to a waypoint passing problem that neglects the orientations of the racing gates or implicitly considers the waypoint direction during path planning. In this paper, we propose a systematic method called Pairwise Model Predictive Control (PMPC) that can guide two quadrotors online to navigate racing gates with minimal time and without collisions. The flight task is initially simplified as a point-mass model waypoint passing problem to provide time optimal reference through an efficient two-step velocity search method. Subsequently, we utilize the spatial configuration of the racing track to compute the optimal heading at each gate, maximizing the visibility of subsequent gates for the quadrotors. To address varying gate orientations, we introduce a novel Magnetic Induction Line-based spatial curve to guide the quadrotors through racing gates of different orientations. Furthermore, we formulate a nonlinear optimization problem that uses the point-mass trajectory as initial values and references to enhance solving efficiency. The feasibility of the proposed method is validated through both simulation and real-world experiments. In real-world tests, the two quadrotors achieved a top speed of 6.1 m/s on a 7-waypoint racing track within a compact flying arena of 5 m × 4 m × 2 m.
|
|
15:40-15:45, Paper WeDT3.6 | |
TC-Driver: A Trajectory Conditioned Reinforcement Learning Approach to Zero-Shot Autonomous Racing (I) |
|
Ghignone, Edoardo | ETH |
Baumann, Nicolas | ETH |
Magno, Michele | ETH Zurich |
Keywords: Reinforcement Learning, Wheeled Robots, Deep Learning Methods
Abstract: Autonomous racing challenges perception, planning, and control algorithms, serving as a testbed for general autonomous driving. While traditional methods like MPC can generate optimal control sequences, they are sensitive to modeling parameter accuracy. This paper introduces TC-Driver, a Reinforcement Learning (RL) approach for robust control in autonomous racing, addressing tire parameter modeling inaccuracies. TC-Driver is conditioned by a trajectory from any high-level planner, combining RL’s learning capabilities with the reliability of traditional planning. Trained under varying tire conditions, it aims to generalize across different model parameters, enhancing real-world racing performance. Experimental results show TC-Driver improves generalization robustness compared to a state-of-the-art end-to-end architecture. It achieves a 29-fold improvement in crash ratio when facing model mismatch and successfully transfers to unseen tracks with new features, while the baseline fails. In physical deployment, TC-Driver demonstrates zero-shot Sim2Real capabilities, outperforming end-to-end agents 10-fold in crash ratio while maintaining similar driving characteristics in reality as in simulation. This hybrid RL architecture leverages traditional planning methods’ reliability while exploiting RL’s ability to handle model uncertainties, offering a robust solution for autonomous racing challenges.
|
|
15:45-15:50, Paper WeDT3.7 | |
Er.autopilot 1.1: A Software Stack for Autonomous Racing on Oval and Road Course Tracks (I) |
|
Raji, Ayoub | University of Modena and Reggio Emilia |
Caporale, Danilo | Centro Di Ricerca E. Piaggio |
Gatti, Francesco | Hipert Srl |
Toschi, Alessandro | University of Modena and Reggio Emilia |
Musiu, Nicola | University of Modena and Reggio Emilia |
Verucchi, Micaela | University of Modena and Reggio Emilia |
Prignoli, Francesco | University of Modena and Reggio Emilia |
Malatesta, Davide | Technology Innovation Institute -Autonomous Robotics Research Ce |
Jesus, André Fialho | Technology Innovation Institute - Autonomous Robotics Research C |
Finazzi, Andrea | Korea Advanced Institute of Science and Technology |
Amerotti, Francesco | Università Di Pisa |
Bagni, Fabio | Hipert Srl |
Mascaro, Eugenio | University of Modena and Reggio Emilia |
Musso, Pietro | University of Modena and Reggio Emilia |
Bertogna, Marko | Unimore |
Keywords: Software Architecture for Robotic and Automation, Motion and Path Planning, Sensor Fusion
Abstract: In its first two seasons, the Indy Autonomous Challenge (IAC) organized a series of autonomous racing events across some of the most renowned oval racetracks, introducing various challenges including high-speed solo runs, static obstacle avoidance, and complex head-to-head passing competitions. In 2023, the challenge expanded to include a time-trial event on the iconic F1 Monza road course. This paper outlines the complete software architecture utilized by team TII Unimore Racing (formerly TII EuroRacing), er.autopilot 1.1, encompassing all modules necessary for static obstacle avoidance, active overtakes, achieving speeds over 75 m/s (270 km/h), and navigating complex road course tracks. Building on the previous version, this updated stack integrates new features such as LiDAR-based localization, lateral velocity estimation, a radar-based local controller for safe pull-overs, and refined vehicle modeling for the Model Predictive Controller. We present the overall results along with insights and lessons learned from the first two seasons, during which the team consistently achieved the podium.
|
|
WeDT4 |
304 |
Sensor Fusion 3 |
Regular Session |
Chair: Chen, Boyuan | Duke University |
|
15:15-15:20, Paper WeDT4.1 | |
FlatFusion: Delving into Details of Sparse Transformer-Based Camera-LiDAR Fusion for Autonomous Driving |
|
Zhu, Yutao | Shanghai Jiao Tong University |
Jia, Xiaosong | University of California, Berkeley |
Yang, Xinyu | Carnegie Mellon University |
Yan, Junchi | Shanghai Jiao Tong University |
Keywords: Autonomous Vehicle Navigation, Sensor Fusion, Object Detection, Segmentation and Categorization
Abstract: The integration of data from various sensor modalities (e.g. camera and LiDAR) constitutes a prevalent methodology within the ambit of autonomous driving scenarios. Recent advancements in efficient point cloud transformers have underscored the efficacy of integrating information in sparse formats. When it comes to fusion, since image patches are dense in pixel space with ambiguous depth, it necessitates additional design considerations for effective fusion. In this paper, we conduct a comprehensive exploration of design choices for transformer-based sparse camera-LiDAR fusion. This investigation encompasses strategies for image-to-3D and LiDAR-to-2D mapping, attention neighbor grouping, single modal tokenizer, and micro-structure of Transformer. By amalgamating the most effective principles uncovered through our investigation, we introduce FlatFusion, a carefully designed framework for sparse camera-LiDAR fusion. Notably, FlatFusion significantly outperforms state-of-the-art sparse Transformer-based methods, including UniTR, CMT, and SparseFusion, achieving 73.7 NDS on the nuScenes validation set with 10.1 FPS with PyTorch.
|
|
15:20-15:25, Paper WeDT4.2 | |
A2DO: Adaptive Anti-Degradation Odometry with Deep Multi-Sensor Fusion for Autonomous Navigation |
|
Lai, Hui | Fudan University |
Chen, Qi | Fudan University |
Zhang, Junping | Fudan University |
Pu, Jian | Fudan University |
Keywords: Sensor Fusion, Localization, SLAM
Abstract: Accurate localization is essential for the safe and effective navigation of autonomous vehicles, and Simultaneous Localization and Mapping (SLAM) is a cornerstone technology in this context. However, the performance of the SLAM system can deteriorate under challenging conditions such as low light, adverse weather, or obstructions due to sensor degradation. We present A2DO, a novel end-to-end multi-sensor fusion odometry system that enhances robustness in these scenarios through deep neural networks. A2DO integrates LiDAR and visual data, employing a multi-layer, multi-scale feature encoding module augmented by an attention mechanism to mitigate sensor degradation dynamically. The system is pre-trained extensively on simulated datasets covering a broad range of degradation scenarios and fine-tuned on a curated set of real-world data, ensuring robust adaptation to complex scenarios. Our experiments demonstrate that A2DO maintains superior localization accuracy and robustness across various degradation conditions, showcasing its potential for practical implementation in autonomous vehicle systems.
|
|
15:25-15:30, Paper WeDT4.3 | |
Tunable Virtual IMU Frame by Weighted Averaging of Multiple Non-Collocated IMUs |
|
Gao, Yizhou | University of Toronto |
Barfoot, Timothy | University of Toronto |
Keywords: Sensor Fusion, Visual-Inertial SLAM, Localization
Abstract: We present a new method to combine several rigidly connected but physically separated IMUs through a weighted average into a single virtual IMU (VIMU). This has the benefits of (i) reducing process noise through averaging, and (ii) allowing for tuning the location of the VIMU. The VIMU can be placed to be coincident with, for example, a camera frame or GNSS frame, thereby offering a quality-of-life improvement for users. Specifically, our VIMU removes the need to consider any lever-arm terms in the propagation model. We also present a quadratic programming method for selecting the weights to minimize the noise of the VIMU while still selecting the placement of its reference frame. We tested our method in simulation and validated it on a real dataset. The results show that our averaging technique works for IMUs with large separation and performance gain is observed in both the simulation and the real experiment compared to using only a single IMU.
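The variance-reduction half of the approach can be sketched with the classic minimum-variance weighted average: minimizing w^T diag(var) w subject to the weights summing to one yields inverse-variance weights. The placement constraints that tie the weights to the IMU lever arms (the quadratic-programming part of the paper) are omitted here, and the noise values are illustrative.
```python
# Small sketch of the variance-reduction idea behind a virtual IMU: averaging N gyro
# measurements with weights that minimize combined noise. Numbers are illustrative, and the
# frame-placement constraints of the paper's quadratic program are deliberately left out.
import numpy as np

def min_variance_weights(variances):
    """argmin_w w^T diag(var) w  s.t.  sum(w) = 1   ->   w_i proportional to 1 / var_i."""
    inv = 1.0 / np.asarray(variances)
    return inv / inv.sum()

gyro_var = np.array([2.5e-5, 4.0e-5, 9.0e-5])          # per-IMU gyro noise variances [(rad/s)^2]
w = min_variance_weights(gyro_var)
fused_var = float(w @ (np.diag(gyro_var) @ w))          # variance of the weighted average
print("weights:", np.round(w, 3), "fused variance:", fused_var, "best single IMU:", gyro_var.min())
```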
|
|
15:30-15:35, Paper WeDT4.4 | |
WildFusion: Multimodal Implicit 3D Reconstructions in the Wild |
|
Liu, Yanbaihui | Duke University |
Chen, Boyuan | Duke University |
Keywords: Sensor Fusion, Mapping, Field Robots
Abstract: We propose WildFusion, a novel approach for 3D scene reconstruction in unstructured, in-the-wild environments using multimodal implicit neural representations. WildFusion integrates signals from LiDAR, RGB camera, contact microphones, tactile sensors, and IMU. This multimodal fusion generates comprehensive, continuous environmental representations, including pixel-level geometry, color, semantics, and traversability. Through real-world experiments on legged robot navigation in challenging forest environments, WildFusion demonstrates improved route selection by accurately predicting traversability. Our results highlight its potential to advance robotic navigation and 3D mapping in complex outdoor terrains.
|
|
15:35-15:40, Paper WeDT4.5 | |
Steering Prediction Via a Multi-Sensor System for Autonomous Racing |
|
Zhou, Zhuyun | University of Burgundy (Université De Bourgogne), France |
Wu, Zongwei | University of Wurzburg |
Bolli, Florian | University of Zurich |
Boutteau, Rémi | Université De Rouen Normandie |
Yang, Fan | Univ. Bourgogne Franche-Comté |
Timofte, Radu | University of Wurzburg |
Ginhac, Dominique | Univ Burgundy |
Delbruck, Tobi | Univ. of Zurich & ETH Zurich |
Keywords: Sensor Fusion, Deep Learning for Visual Perception, Intelligent Transportation Systems
Abstract: Autonomous racing has rapidly gained research attention. Traditionally, racing cars rely on 2D LiDAR as their primary visual system. In this work, we explore the integration of an event camera with the existing system to provide enhanced temporal information. Our goal is to fuse the 2D LiDAR data with event data in an end-to-end learning framework for steering prediction, which is crucial for autonomous racing. To the best of our knowledge, this is the first study addressing this challenging research topic. We start by creating a multisensor dataset specifically for steering prediction. Using this dataset, we establish a benchmark by evaluating various SOTA fusion methods. Our observations reveal that existing methods often incur substantial computational costs. To address this, we apply low-rank techniques to propose a novel, efficient, and effective fusion design. We introduce a new fusion learning policy to guide the fusion process, enhancing robustness against misalignment. Our fusion architecture provides better steering prediction than LiDAR alone, significantly reducing the RMSE from 7.72 to 1.28. Compared to the second-best fusion method, our work represents only 11% of the learnable parameters while achieving better accuracy. The source code, dataset, and benchmark will be released to promote future research.
|
|
15:40-15:45, Paper WeDT4.6 | |
Are Doppler Velocity Measurements Useful for Spinning Radar Odometry? |
|
Lisus, Daniil | University of Toronto |
Burnett, Keenan | University of Toronto |
Yoon, David Juny | University of Toronto |
Poulton, Richard | Navtech Radar |
Marshall, John | Navtech Radar |
Barfoot, Timothy | University of Toronto |
Keywords: Autonomous Vehicle Navigation, Sensor Fusion, Range Sensing
Abstract: Spinning, frequency-modulated continuous-wave (FMCW) radars with 360 degree coverage have been gaining popularity for autonomous-vehicle navigation. However, unlike 'fixed' automotive radar, commercially available spinning radar systems typically do not produce radial velocities due to the lack of repeated measurements in the same direction and the fundamental hardware setup. To make these radial velocities observable, we modified the firmware of a commercial spinning radar to use triangular frequency modulation. In this paper, we develop a novel way to use this modulation to extract radial Doppler velocity measurements from consecutive azimuths of a radar intensity scan, without any data association. We show that these noisy, error-prone measurements contain enough information to provide good ego-velocity estimates, and incorporate these estimates into different modern odometry pipelines. We extensively evaluate the pipelines on over 110 km of driving data in progressively more geometrically challenging autonomous-driving environments. We show that Doppler velocity measurements improve odometry in well-defined geometric conditions and enable it to continue functioning even in severely geometrically degenerate environments, such as long tunnels.
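As a rough illustration of how per-azimuth Doppler returns constrain ego-motion, the sketch below fits a 2D ego-velocity to radial speeds by linear least squares; the sensor model v_r = -d · v and the synthetic data are assumptions, and a robust loss would be needed in practice to reject returns from moving objects.

```python
import numpy as np

def ego_velocity_from_doppler(azimuths, radial_speeds):
    """Least-squares 2D ego-velocity from per-azimuth Doppler returns.

    Assumes each static return satisfies v_r = -[cos(a), sin(a)] . v_ego
    (sign convention is an assumption); in practice dynamic points would be
    down-weighted with a robust loss.
    """
    d = np.stack([np.cos(azimuths), np.sin(azimuths)], axis=1)   # unit bearing vectors
    v_ego, *_ = np.linalg.lstsq(-d, radial_speeds, rcond=None)
    return v_ego

# Synthetic check: vehicle moving at 5.0 m/s forward and 0.5 m/s laterally.
rng = np.random.default_rng(0)
az = rng.uniform(0.0, 2.0 * np.pi, 400)
true_v = np.array([5.0, 0.5])
vr = -np.stack([np.cos(az), np.sin(az)], axis=1) @ true_v + rng.normal(0.0, 0.2, 400)
print(ego_velocity_from_doppler(az, vr))   # approximately [5.0, 0.5]
```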
|
|
WeDT5 |
305 |
Aerial Robots: Mechanics and Control 2 |
Regular Session |
Chair: Khorrami, Farshad | New York University Tandon School of Engineering |
Co-Chair: Garcia de Marina, Hector | Universidad De Granada |
|
15:15-15:20, Paper WeDT5.1 | |
Skater: A Novel Bi-Modal Bi-Copter Robot for Adaptive Locomotion in Air and Diverse Terrain |
|
Lin, Junxiao | Zhejiang University |
Zhang, Ruibin | Zhejiang University |
Pan, Neng | Zhejiang University |
Xu, Chao | Zhejiang University |
Gao, Fei | Zhejiang University |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control, Motion Control
Abstract: In this letter, we present a novel bi-modal bi-copter robot called Skater, which is adaptable to air and various ground surfaces. Skater consists of a bi-copter moving along its longitudinal direction with two passive wheels on both sides. Using a longitudinally arranged bi-copter as the unified actuation system for both aerial and ground modes, the robot not only retains a compact and lightweight mechanism but also possesses exceptional terrain-traversing capability and strong steering capacity. Moreover, leveraging the vectored thrust characteristic of bi-copters, the Skater can actively generate the centripetal force needed for steering, enabling it to achieve stable movement even on slippery surfaces. Furthermore, we model the comprehensive dynamics of the Skater, analyze its differential flatness, and introduce a controller using nonlinear model predictive control for trajectory tracking. The outstanding performance of the system is verified by extensive real-world experiments and benchmark comparisons.
|
|
15:20-15:25, Paper WeDT5.2 | |
Inverse Kinematics on Guiding Vector Fields for Robot Path Following |
|
Zhou, Yu | INRIA |
Bautista, Jesús | Universidad De Granada |
Yao, Weijia | Hunan University |
Garcia de Marina, Hector | Universidad De Granada |
Keywords: Aerial Systems: Mechanics and Control, Motion Control, Autonomous Vehicle Navigation
Abstract: Inverse kinematics is a fundamental technique for motion and positioning control in robotics, typically applied to end-effectors. In this paper, we extend the concept of inverse kinematics to guiding vector fields for path following in autonomous mobile robots. The desired path is defined by its implicit equation, i.e., by a collection of points belonging to one or more zero-level sets. These level sets serve as a reference to construct an error signal that drives the guiding vector field toward the desired path, enabling the robot to converge and travel along the path by following such a vector field. We start with the formal exposition on how inverse kinematics can be applied to guiding vector fields for single-integrator robots in an m-dimensional Euclidean space. Then, we leverage inverse kinematics to ensure that the level-set error signal behaves as a linear system, facilitating control over the robot's transient motion toward the desired path and allowing for the injection of feed-forward signals to induce precise motion behavior along the path. We then propose solutions to the theoretical and practical challenges of applying this technique to unicycles with constant speeds to follow 2D paths with precise transient control. We finish by validating the predicted theoretical results through real flights with fixed-wing drones.
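For context, the snippet below shows the classic guiding-vector-field construction for a 2D implicit path (a circle), the kind of field the paper's inverse-kinematics formulation builds upon. It is a textbook sketch with assumed gains, not the authors' method.

```python
import numpy as np

def gvf_circle(p, radius=1.0, k=1.0):
    """Classic 2D guiding vector field for the circle phi(p) = ||p||^2 - r^2 = 0:
    chi = tau - k * phi * n, with n the gradient of phi and tau its 90-degree
    rotation. Following chi converges to and circulates the path (textbook
    construction, gains assumed)."""
    phi = p[0] ** 2 + p[1] ** 2 - radius ** 2
    n = np.array([2.0 * p[0], 2.0 * p[1]])        # gradient of phi
    tau = np.array([-n[1], n[0]])                 # tangential (path-following) term
    return tau - k * phi * n

# Integrate a unit-speed single-integrator robot along the field.
p = np.array([2.0, 0.0])
for _ in range(3000):
    v = gvf_circle(p)
    p = p + 0.001 * v / (np.linalg.norm(v) + 1e-9)
print(np.linalg.norm(p))   # ~1.0: the robot has converged onto the circle
```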
|
|
15:25-15:30, Paper WeDT5.3 | |
Dragonfly Drone: A Novel Tilt-Rotor Aerial Platform with Body-Morphing Capability |
|
Hameed, Syed Waqar | NTU |
Liew Jun Jie, Alex | Nanyang Technological University |
Nursultan, Imanberdiyev | Agency for Science, Technology and Research (A*STAR) |
Camci, Efe | Institute for Infocomm Research |
Yau, Wei-Yun | I2R |
Feroskhan, Mir | Nanyang Technological University |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications, Grasping
Abstract: The development of unmanned aerial vehicles (UAVs) with extended maneuverability has unlocked new applications such as complex inspection tasks at height. In this work, we introduce the Dragonfly drone, a novel tilt-rotor body-morphing UAV, capable of altering its shape and orientation without compromising its position tracking. Unlike most existing UAV designs that only target decoupling position and orientation control, Dragonfly can also perform unique body-morphing in flight, featuring all six degrees of freedom in every morphology. This enables navigation into tight gaps with irregular shapes, conforming to obstacles of varying geometries, and maintaining physical contact with uneven surfaces. Such capabilities make our design particularly effective for complex inspection tasks at height, such as pipe or bridge inspection. Our contributions include the mechanical design of the system, the modeling and control strategies employed, and the real-robot experiments with a prototype platform. See the Dragonfly drone in action: https://youtu.be/YxoV_Qt_5XE.
|
|
15:30-15:35, Paper WeDT5.4 | |
An Omnidirectional Non-Tethered Aerial Prototype with Fixed Uni-Directional Thrusters |
|
Hamandi, Mahmoud | New York University Abu Dhabi |
Ali, Abdullah Mohamed | New York University Abu Dhabi |
Kyriakopoulos, Kostas | New York University - Abu Dhabi |
Tzes, Anthony | New York University Abu Dhabi |
Khorrami, Farshad | New York University Tandon School of Engineering |
Keywords: Aerial Systems: Mechanics and Control, Product Design, Development and Prototyping, Aerial Systems: Applications
Abstract: This paper presents the world's first functional prototype of an omnidirectional multi-rotor aerial vehicle with fixed uni-directional thrusters and an on-board power source. An optimization algorithm computes the positions and orientations of the propellers in the body frame of the prototype to achieve the omnidirectional capability, while minimizing the platform's weight and the required thrust to hover at any orientation, in addition to other construction requirements. The effect of the aerodynamic interaction between the different propellers is identified experimentally, and the ensuing results are included in the optimization algorithm to avoid such interactions during flight. The prototype's performance is assessed in real experiments demonstrating the decoupling between the forces and moments of the drone, its ability to concurrently track independent positions and orientations, and its ability to hover at a fixed position while rotating.
|
|
15:35-15:40, Paper WeDT5.5 | |
TrofyBot: A Transformable Rolling and Flying Robot with High Energy Efficiency |
|
Lai, Mingwei | Zhejiang University |
Ye, Yuqian | Stanford University |
Wu, Hanyu | ETH Zurich |
Xuan, Chice | Huzhou Institute of Zhejiang University, Huzhou |
Zhang, Ruibin | Zhejiang University |
Ren, Qiuyu | Zhejiang University |
Xu, Chao | Zhejiang University |
Gao, Fei | Zhejiang University |
Cao, Yanjun | Zhejiang University, Huzhou Institute of Zhejiang University |
Keywords: Aerial Systems: Mechanics and Control, Dynamics, Motion Control
Abstract: Terrestrial and aerial bimodal vehicles have gained significant interest due to their energy efficiency and versatile maneuverability across different domains. However, most existing passive-wheeled bimodal vehicles rely on attitude regulation to generate forward thrust, which inevitably wastes energy on producing lifting force. In this work, we propose a novel passive-wheeled bimodal vehicle called TrofyBot that can rapidly change the thrust direction with a single servo motor and a transformable parallelogram linkage mechanism (TPLM). Cooperating with a bidirectional force generation module (BFGM) for motors to produce bidirectional thrust, the robot achieves flexible mobility as a differentially driven rover on the ground. This design achieves 95.37% energy-saving efficiency in terrestrial locomotion, allowing the robot to move continuously on the ground for more than two hours in the current setup. Furthermore, the design obviates the need for attitude regulation and therefore provides a stable sensor field of view (FoV). We model the bimodal dynamics for the system, analyze its differential flatness property, and design a controller based on hybrid model predictive control for trajectory tracking. A prototype is built and extensive experiments are conducted to verify the design and the proposed controller, which achieves high energy efficiency and seamless transition between modes.
|
|
15:40-15:45, Paper WeDT5.6 | |
Dense Fixed-Wing Swarming Using Receding-Horizon NMPC |
|
Madabushi, Varun | Georgia Institute of Technology |
Kopel, Yocheved | The Johns Hopkins University Applied Physics Laboratory |
Polevoy, Adam | Johns Hopkins University Applied Physics Lab |
Moore, Joseph | Johns Hopkins University |
Keywords: Aerial Systems: Mechanics and Control, Distributed Robot Systems, Swarm Robotics
Abstract: In this paper, we present an approach for controlling a team of agile fixed-wing aerial vehicles in close proximity to one another. Our approach relies on receding-horizon nonlinear model predictive control (NMPC) to plan maneuvers across an expanded flight envelope to enable inter-agent collision avoidance. To facilitate robust collision avoidance and characterize the likelihood of inter-agent collisions, we compute a statistical bound on the probability of the system leaving a tube around the planned nominal trajectory. Finally, we propose a metric for evaluating highly dynamic swarms and use this metric to evaluate our approach. We successfully demonstrated our approach through both simulation and hardware experiments, and to our knowledge, this is the first time close-quarters swarming has been achieved with physical aerobatic fixed-wing vehicles.
|
|
15:45-15:50, Paper WeDT5.7 | |
HPA-MPC: Hybrid Perception-Aware Nonlinear Model Predictive Control for Quadrotors with Suspended Loads |
|
Sarvaiya, Mrunal | Agile Robotics and Perception Lab, NYU |
Li, Guanrui | New York University |
Loianno, Giuseppe | New York University |
Keywords: Aerial Systems: Applications, Aerial Systems: Perception and Autonomy
Abstract: Quadrotors equipped with cable-suspended loads represent a versatile, low-cost, and energy efficient solution for aerial transportation, construction, and manipulation tasks. However, their real-world deployment is hindered by several challenges. The system is difficult to control because it is nonlinear, underactuated, involves hybrid dynamics due to slack-taut cable modes, and evolves on complex configuration spaces. Additionally, it is crucial to estimate the full state and the cable’s mode transitions in real-time using on-board sensors and computation. To address these challenges, we present a novel Hybrid Perception-Aware Nonlinear Model Predictive Control (HPA-MPC) control approach for quadrotors with suspended loads. Our method considers the complete hybrid system dynamics and includes a perception-aware cost to ensure the payload remains visible in the robot’s camera during navigation. Furthermore, the full state and hybrid dynamics’ transitions are estimated using onboard sensors. Experimental results demonstrate that our approach enables stable load tracking control, even during slack-taut transitions, and operates entirely onboard. The experiments also show that the perception-aware term effectively keeps the payload in the robot’s camera field of view when a human operator interacts with the load.
|
|
WeDT6 |
307 |
Perception for Grasping and Manipulation |
Regular Session |
Chair: Dudek, Gregory | McGill University |
Co-Chair: Zhi, Weiming | Carnegie Mellon University |
|
15:15-15:20, Paper WeDT6.1 | |
Unifying Representation and Calibration with 3D Foundation Models |
|
Zhi, Weiming | Carnegie Mellon University |
Tang, Haozhan | Carnegie Mellon University |
Zhang, Tianyi | Carnegie Mellon University |
Johnson-Roberson, Matthew | Carnegie Mellon University |
Keywords: Perception for Grasping and Manipulation, Deep Learning for Visual Perception
Abstract: Representing the environment is a central challenge in robotics, and is essential for effective decision-making. Traditionally, before capturing images with a manipulator-mounted camera, users need to calibrate the camera using a specific external marker, such as a checkerboard or AprilTag. However, recent advances in computer vision have led to the development of 3D foundation models. These are large, pre-trained neural networks that can establish fast and accurate multi-view correspondences with very few images, even in the absence of rich visual features. This paper advocates for the integration of 3D foundation models into scene representation approaches for robotic systems equipped with manipulator-mounted RGB cameras. Specifically, we propose the Joint Calibration and Representation (JCR) method. JCR uses RGB images, captured by a manipulator-mounted camera, to simultaneously construct an environmental representation and calibrate the camera relative to the robot's end-effector, in the absence of specific calibration markers. The resulting 3D environment representation is aligned with the robot's coordinate frame and maintains physically accurate scales. We demonstrate that JCR can build effective scene representations using a low-cost RGB camera attached to a manipulator, without prior calibration.
|
|
15:20-15:25, Paper WeDT6.2 | |
ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models |
|
Dey, Sombit | InSAIT, Sofia University |
Zaech, Jan-Nico | Sofia University |
Nikolov, Nikolay | Imperial College London |
Van Gool, Luc | ETH Zurich |
Paudel, Danda Pani | ETH Zurich |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Deep Learning Methods
Abstract: Recent progress in large language models and access to large-scale robotic datasets has sparked a paradigm shift in robotics models, transforming them into generalists able to adapt to various tasks, scenes, and robot modalities. A large step for the community is the availability of open Vision-Language-Action models, which showcase strong performance in a wide variety of tasks. In this work, we study the visual generalization capabilities of three existing robotic foundation models, and propose a corresponding evaluation framework. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. This is potentially caused by limited variations in the training data and/or catastrophic forgetting, leading to domain limitations in the vision foundation models. We further explore OpenVLA, which uses two pre-trained vision foundation models and is, therefore, expected to generalize to out-of-domain experiments. However, we showcase catastrophic forgetting by DINO-v2 in OpenVLA through its failure to fulfill the task of depth regression. To overcome the aforementioned issue of visual catastrophic forgetting, we propose a gradual backbone reversal approach founded on model merging. This enables OpenVLA – which requires the adaptation of the visual backbones during initial training – to regain its visual generalization ability. Regaining this capability enables our ReVLA model to improve over OpenVLA by a factor of 77% and 66% for grasping and lifting in visual OOD tasks. We will make our source code and OOD evaluation framework publicly available.
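A minimal sketch of the model-merging idea behind a gradual backbone reversal: linearly interpolate each backbone parameter between its original pre-trained value and its fine-tuned value, raising the pre-trained share over stages. The dictionary names and schedule below are illustrative assumptions, not the ReVLA recipe.

```python
import numpy as np

def merge_backbone(pretrained, finetuned, alpha):
    """Interpolate every backbone parameter; alpha is the share of the original
    pre-trained weights and is increased over stages in a gradual-reversal
    schedule (names and schedule are illustrative assumptions)."""
    return {name: alpha * pretrained[name] + (1.0 - alpha) * finetuned[name]
            for name in pretrained}

# Toy parameter dictionaries standing in for a DINO-v2-style visual backbone.
pre = {"blocks.0.weight": np.ones((4, 4)), "blocks.0.bias": np.zeros(4)}
fin = {"blocks.0.weight": 0.2 * np.ones((4, 4)), "blocks.0.bias": 0.5 * np.ones(4)}

for alpha in (0.25, 0.5, 1.0):   # gradually revert toward the pre-trained backbone
    merged = merge_backbone(pre, fin, alpha)
    print(alpha, merged["blocks.0.weight"][0, 0])
```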
|
|
15:25-15:30, Paper WeDT6.3 | |
MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation Via Diffusion Model |
|
Liu, Jian | National Engineering Research Center of Robot Vision Perception |
Sun, Wei | Hunan University |
Yang, Hui | Hunan University |
Zheng, Jin | Central South University |
Geng, Zichen | The University of Western Australia |
Rahmani, Hossein | Lancaster University |
Mian, Ajmal | University of Western Australia |
Keywords: Perception for Grasping and Manipulation, Semantic Scene Understanding, Computer Vision for Automation
Abstract: Object pose estimation is a core means for robots to understand and interact with their environment. For this task, monocular category-level methods are attractive as they require only a single RGB camera. However, current methods rely on shape priors or CAD models of the intra-class known objects. We propose a diffusion-based monocular category-level 9D object pose generation method, MonoDiff9D. Our motivation is to leverage the probabilistic nature of diffusion models to alleviate the need for shape priors, CAD models, or depth sensors for intra-class unknown object pose estimation. We first estimate coarse depth via DINOv2 from the monocular image in a zero-shot manner and convert it into a point cloud. We then fuse the global features of the point cloud with the input image and use the fused features along with the encoded time step to condition MonoDiff9D. Finally, we design a transformer-based denoiser to recover the object pose from Gaussian noise. Extensive experiments on two popular benchmark datasets show that MonoDiff9D achieves state-of-the-art monocular category-level 9D object pose estimation accuracy without the need for shape priors or CAD models at any stage. Our code will be made public at https://github.com/CNJianLiu/MonoDiff9D.
|
|
15:30-15:35, Paper WeDT6.4 | |
A Full-Optical Pre-Touch Dual-Modal and Dual-Mechanism (PDM²) Sensor for Robotic Grasping |
|
Fang, Cheng | Texas A&M University |
Yan, Zhiyu | Texas A&M University |
Guo, Fengzhi | Texas A&M University |
Li, Shuangliang | Texas A&M University |
Song, Dezhen | Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) |
Zou, Jun | Texas A&M University |
Keywords: Perception for Grasping and Manipulation, Range Sensing, Grasping
Abstract: We report a new full-optical pre-touch dual-modal and dual-mechanism (PDM²) sensor based on an air-coupled fiber-tip surface micromachined optical ultrasound transducer (SMOUT). Compared with the ring-shaped piezoelectric acoustic receivers in previous PDM² sensors, the acoustic signal received by the new fiber-tip SMOUT is read out optically, which is naturally resistant to surrounding electromagnetic interference (EMI) and makes complex grounding and shielding unnecessary. In addition, the new fiber-tip SMOUT receiver has a much smaller size, which makes it possible to further miniaturize the sensor package into a more compact structure. For verification, a prototype of the full-optical PDM² sensor has been designed, fabricated, and characterized. The experimental results show that even with the much smaller acoustic receiver, the new sensor can still achieve ranging and material/structure sensing performance comparable to the previous ones. Therefore, the new full-optical PDM² sensor design promises to provide a practical and miniaturized solution for ranging and material/structure sensing to assist robotic grasping of unknown objects.
|
|
15:35-15:40, Paper WeDT6.5 | |
Learning Active Tactile Perception through Belief-Space Control |
|
Tremblay, Jean-François | McGill University |
Meger, David Paul | McGill University |
Hogan, Francois | Massachusetts Institute of Technology |
Dudek, Gregory | McGill University |
Keywords: Perception for Grasping and Manipulation, Model Learning for Control, Planning under Uncertainty
Abstract: Robots operating in an open world will encounter novel objects with unknown physical properties, such as mass, friction, or size. These robots will need to sense these properties through interaction prior to performing downstream tasks with the objects. We propose a method that autonomously learns tactile exploration policies by developing a generative world model that is leveraged to 1) estimate the object's physical parameters using a differentiable Bayesian filtering algorithm and 2) develop an exploration policy using an information-gathering model predictive controller. We evaluate our method on three simulated tasks where the goal is to estimate a desired object property (mass, height or toppling height) through physical interaction. We find that our method is able to discover policies that efficiently gather information about the desired property in an intuitive manner. Finally, we validate our method on a real robot system for the height estimation task, where our method is able to successfully learn and execute an information-gathering policy from scratch.
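As a toy stand-in for the belief update at the heart of such a pipeline, the snippet below runs a discrete Bayes filter over a grid of mass hypotheses given noisy acceleration measurements from pushes with a known force; the observation model and numbers are assumptions for illustration, not the paper's differentiable filter.

```python
import numpy as np

def bayes_update(belief, mass_grid, observed_accel, force, noise_std=0.2):
    """One discrete Bayes-filter update of a belief over object mass after a
    push with known force and a noisy acceleration measurement (a = F / m).
    A toy stand-in for a learned, differentiable Bayesian filter."""
    predicted_accel = force / mass_grid
    likelihood = np.exp(-0.5 * ((observed_accel - predicted_accel) / noise_std) ** 2)
    posterior = belief * likelihood
    return posterior / posterior.sum()

mass_grid = np.linspace(0.2, 5.0, 200)             # mass hypotheses in kg
belief = np.ones_like(mass_grid) / len(mass_grid)  # uniform prior
for accel in (2.6, 2.4, 2.5):                      # three pushes with F = 5 N
    belief = bayes_update(belief, mass_grid, accel, force=5.0)
print(mass_grid[belief.argmax()])                  # ~2.0 kg, since 5 N / 2 kg = 2.5 m/s^2
```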
|
|
15:40-15:45, Paper WeDT6.6 | |
Detection of Fast-Moving Objects with Neuromorphic Hardware |
|
Ziegler, Andreas | University of Tübingen |
Vetter, Karl | University of Tübingen |
Gossard, Thomas | University of Tübingen |
Tebbe, Jonas | University of Tübingen |
Otte, Sebastian | University of Lübeck |
Zell, Andreas | University of Tübingen |
Keywords: Neurorobotics, Object Detection, Segmentation and Categorization, Machine Learning for Robot Control
Abstract: Neuromorphic Computing (NC) and Spiking Neural Networks (SNNs) in particular are often viewed as the next generation of neural networks. NC is a novel bio-inspired paradigm for energy-efficient neural computation, often relying on SNNs in which neurons communicate via spikes in a sparse, event-based manner. This communication via spikes can be exploited by neuromorphic hardware implementations very effectively and results in drastic reductions of energy consumption and latency in contrast to regular GPU-based neural networks. In recent years, neuromorphic hardware has become more accessible and the support of learning frameworks has improved. However, available hardware is still partly experimental, and it is not transparent what these solutions are effectively capable of, how they integrate into real-world robotics applications, and how they realistically benefit energy efficiency and latency. In this work, we provide the robotics research community with an overview of what is possible with SNNs on neuromorphic hardware focusing on real-time processing. We introduce a benchmark of three popular neuromorphic hardware devices for the task of event-based object detection. Moreover, we show that an SNN on neuromorphic hardware is able to run in real time in a closed-loop robotic system embedded within a challenging table tennis robot scenario.
|
|
15:45-15:50, Paper WeDT6.7 | |
Grasp, See, and Place: Efficient Unknown Object Rearrangement with Policy Structure Prior |
|
Xu, Kechun | Zhejiang University |
Zhou, Zhongxiang | Zhejiang University |
Wu, Jun | Zhejiang University |
Lu, Haojian | Zhejiang University |
Xiong, Rong | Zhejiang University |
Wang, Yue | Zhejiang University |
Keywords: Manipulation Planning, Deep Learning in Robotics and Automation, Grasping, Intelligent and Flexible Manufacturing
Abstract: We focus on the task of unknown object rearrangement, where a robot is supposed to re-configure the objects into a desired goal configuration specified by an RGB-D image. Recent works explore unknown object rearrangement systems by incorporating learning-based perception modules. However, they are sensitive to perception error, and pay less attention to task-level performance. In this paper, we aim to develop an effective system for unknown object rearrangement amidst perception noise. We theoretically reveal that noisy perception impacts grasping and placing in a decoupled way, and show that such a decoupled structure is valuable for improving task optimality. We propose GSP, a dual-loop system with the decoupled structure as a prior. For the inner loop, we learn a see policy for self-confident in-hand object matching. For the outer loop, we learn a grasp policy aware of object matching and grasp capability guided by task-level rewards. We leverage the foundation model CLIP for object matching, policy learning and self-termination. A series of experiments indicate that GSP can conduct unknown object rearrangement with higher completion rates and fewer steps.
|
|
WeDT7 |
309 |
Perception 2 |
Regular Session |
Chair: Li, Xiaopeng | University of Wisconsin-Madison |
|
15:15-15:20, Paper WeDT7.1 | |
LightStereo: Channel Boost Is All You Need for Efficient 2D Cost Aggregation |
|
Guo, Xianda | School of Computer Science, Wuhan University |
Zhang, Chenming | Waytous |
Zhang, Youmin | University of Bologna |
Zheng, Wenzhao | Tsinghua University |
Nie, Dujun | Huazhong University of Science and Technology |
Poggi, Matteo | University of Bologna |
Chen, Long | Chinese Academy of Sciences |
Keywords: Computer Vision for Transportation
Abstract: We present LightStereo, a cutting-edge stereo-matching network crafted to accelerate the matching process. Departing from conventional methodologies that rely on aggregating computationally intensive 4D costs, LightStereo adopts the 3D cost volume as a lightweight alternative. While similar approaches have been explored previously, our breakthrough lies in enhancing performance through a dedicated focus on the channel dimension of the 3D cost volume, where the distribution of matching costs is encapsulated. Our exhaustive exploration yields a range of strategies to amplify the capacity of this pivotal dimension, ensuring both precision and efficiency. We compare the proposed LightStereo with existing state-of-the-art methods across various benchmarks, which demonstrate its superior performance in speed, accuracy, and resource utilization. LightStereo achieves a competitive EPE metric on the SceneFlow datasets while requiring only 22 GFLOPs and 17 ms of runtime, and ranks 1st on KITTI 2015 among real-time models. Our comprehensive analysis reveals the effect of 2D cost aggregation for stereo matching, paving the way for real-world applications of efficient stereo systems. Code is available at https://github.com/XiandaGuo/OpenStereo.
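To illustrate the kind of lightweight volume a 2D aggregation network operates on, the sketch below builds a correlation-based 3D cost volume of shape (disparity, height, width), so that disparity can be treated as the channel dimension; it is an illustrative re-implementation under assumed shapes, not the authors' code.

```python
import numpy as np

def correlation_cost_volume(feat_l, feat_r, max_disp):
    """Correlation-based 3D cost volume of shape (D, H, W): one matching score
    per disparity, letting a 2D network aggregate over the disparity channel
    (illustrative sketch, not the LightStereo implementation).

    feat_l, feat_r : (C, H, W) left/right feature maps
    """
    _, H, W = feat_l.shape
    volume = np.zeros((max_disp, H, W), dtype=feat_l.dtype)
    for d in range(max_disp):
        if d == 0:
            volume[d] = (feat_l * feat_r).mean(axis=0)
        else:
            # correlate left pixels at column x with right pixels at column x - d
            volume[d, :, d:] = (feat_l[:, :, d:] * feat_r[:, :, :-d]).mean(axis=0)
    return volume

feat_l = np.random.rand(8, 32, 64).astype(np.float32)
feat_r = np.random.rand(8, 32, 64).astype(np.float32)
print(correlation_cost_volume(feat_l, feat_r, max_disp=24).shape)   # (24, 32, 64)
```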
|
|
15:20-15:25, Paper WeDT7.2 | |
SurfaceAug: Toward Versatile, Multimodally Consistent Ground Truth Sampling |
|
Rubel, Ryan | University of Southern California |
Clark, Nathan | Noblis, Inc |
Dudash, Andrew | Noblis |
Keywords: Computer Vision for Transportation, Object Detection, Segmentation and Categorization, AI-Enabled Robotics
Abstract: Despite recent advances in both model architectures and data augmentation, multimodal object detectors still barely outperform their LiDAR-only counterparts. This shortcoming has been attributed to a lack of sufficiently powerful multimodal data augmentation. To address this, we present SurfaceAug, a novel ground truth sampling algorithm. SurfaceAug pastes objects by resampling both images and point clouds, enabling object-level transformations in both modalities. We evaluate our algorithm by training a multimodal detector on KITTI and compare its performance to previous works. We show experimentally that SurfaceAug demonstrates promising improvements on car detection tasks.
|
|
15:25-15:30, Paper WeDT7.3 | |
Uncertainty-Guided Enhancement on Driving Perception System Via Foundation Models |
|
Yang, Yunhao | University of Texas at Austin |
Hu, Yuxin | Cruise |
Ye, Mao | The University of Texas, Austin |
Zhang, Zaiwei | Cruise |
Lu, Zhichao | Cruise LLC |
Xu, Yi | Northeastern University |
Topcu, Ufuk | The University of Texas at Austin |
Snyder, Ben | Cruise |
Keywords: Computer Vision for Transportation, Calibration and Identification, Probability and Statistical Methods
Abstract: Multimodal foundation models offer promising advancements for enhancing driving perception systems, but their high computational and financial costs pose challenges. We develop a method that leverages foundation models to refine predictions from existing driving perception models---such as enhancing object classification accuracy---while minimizing the frequency of using these resource-intensive models. The method quantitatively characterizes uncertainties in the perception model's predictions and engages the foundation model only when these uncertainties exceed a pre-specified threshold. Specifically, it characterizes uncertainty by calibrating the perception model’s confidence scores into theoretical lower bounds on the probability of correct predictions using conformal prediction. Then, it sends images to the foundation model and queries for refining the predictions only if the theoretical bound of the perception model's outcome is below the threshold. Additionally, we propose a temporal inference mechanism that enhances prediction accuracy by integrating historical predictions, leading to tighter theoretical bounds. The method demonstrates a 10 to 15 percent improvement in prediction accuracy and reduces the number of queries to the foundation model by 50 percent, based on quantitative evaluations from driving datasets.
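A minimal sketch of the gating idea: calibrate a threshold from the perception model's confidence on held-out data and query the foundation model only when a test prediction does not clear it. The split-conformal recipe below is in the spirit of the paper's uncertainty characterization, not its exact formulation, and all names and numbers are assumptions.

```python
import numpy as np

def conformal_threshold(cal_conf_true, alpha=0.1):
    """Split-conformal threshold from calibration-set softmax scores of the
    *true* class. Predictions whose top score clears 1 - qhat retain roughly
    (1 - alpha) coverage; everything else gets escalated (illustrative sketch)."""
    scores = np.sort(1.0 - np.asarray(cal_conf_true))      # nonconformity scores
    n = len(scores)
    rank = int(np.ceil((n + 1) * (1.0 - alpha))) - 1        # conformal quantile index
    return scores[min(rank, n - 1)]

def should_query_foundation_model(top_score, qhat):
    # Escalate to the expensive foundation model only when the cheap perception
    # model's confidence does not clear the calibrated bar.
    return top_score < 1.0 - qhat

rng = np.random.default_rng(0)
cal = rng.beta(8, 2, size=500)                              # toy calibration confidences
qhat = conformal_threshold(cal, alpha=0.1)
print(should_query_foundation_model(0.45, qhat))            # low confidence -> query
print(should_query_foundation_model(0.97, qhat))            # high confidence -> keep prediction
```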
|
|
15:30-15:35, Paper WeDT7.4 | |
Complementary Information Guided Occupancy Prediction Via Multi-Level Representation Fusion |
|
Xu, Rongtao | Institute of Automation, Chinese Academy of Sciences, Beijing, C |
Lin, Jinzhou | Beijing University of Posts and Communications |
Zhou, Jialei | Tongji University |
Dong, Jiahua | Shenyang Institute of Automation Chinese Academy of Sciences |
Wang, Changwei | Casia |
Wang, Ruisheng | University of Calgary |
Guo, Li | BUPT |
Xu, Shibiao | Beijing University of Posts and Telecommunications |
Liang, Xiaodan | Sun Yat-Sen University |
Keywords: Computer Vision for Transportation, Computer Vision for Automation
Abstract: Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving, aiming to infer complete 3D scene geometry and semantics from 2D images. Almost all existing methods focus on improving performance through structural modifications, such as lightweight backbones and complex cascaded frameworks, achieving good yet limited performance. Few studies explore the problem from the perspective of representation fusion, leaving the rich diversity of features in 2D images underutilized. Motivated by this, we propose CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion. CIGOcc extracts segmentation, graphics, and depth features from an input image and introduces a deformable multi-level fusion mechanism to fuse these three multi-level features. Additionally, CIGOcc incorporates knowledge distilled from SAM to further enhance prediction accuracy. Without increasing training costs, CIGOcc achieves state-of-the-art performance on the SemanticKITTI benchmark. The code is provided in the supplementary material and will be released at https://github.com/VitaLemonTea1/CIGOcc.
|
|
15:35-15:40, Paper WeDT7.5 | |
Domain Adaptation-Based Crossmodal Knowledge Distillation for 3D Semantic Segmentation |
|
Kang, Jialiang | Peking University |
Wang, Jiawen | Peking University |
Luo, Dingsheng | Peking University |
Keywords: Computer Vision for Transportation, Sensor Fusion, Visual Learning
Abstract: Semantic segmentation of 3D LiDAR data plays a pivotal role in autonomous driving. Traditional approaches rely on extensive annotated data for point cloud analysis, incurring high costs and time investments. In contrast, real-world image datasets offer abundant availability and substantial scale. To mitigate the burden of annotating 3D LiDAR point clouds, we propose two crossmodal knowledge distillation methods: Unsupervised Domain Adaptation Knowledge Distillation (UDAKD) and Feature and Semantic-based Knowledge Distillation (FSKD). Leveraging readily available spatio-temporally synchronized data from cameras and LiDARs in autonomous driving scenarios, we directly apply a pretrained 2D image model to unlabeled 2D data. Through crossmodal knowledge distillation with known 2D-3D correspondence, we actively align the output of the 3D network with the corresponding points of the 2D network, thereby obviating the necessity for 3D annotations. Our focus is on preserving modality-general information while filtering out modality-specific details during crossmodal distillation. To achieve this, we deploy self-calibrated convolution on 3D point clouds as the foundation of our domain adaptation module. Rigorous experimentation validates the effectiveness of our proposed methods, consistently surpassing the performance of state-of-the-art approaches in the field. Code will be released upon publication.
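The core distillation step can be pictured as pulling 3D point features toward the frozen 2D features they project onto. The snippet below computes such an alignment loss for known 2D-3D correspondences; it is illustrative of the general objective, not the exact UDAKD/FSKD losses, and all shapes are assumptions.

```python
import numpy as np

def crossmodal_distill_loss(feat_3d, feat_2d, correspondences):
    """Mean squared alignment loss between 3D point features (student) and the
    2D pixel features they project onto (frozen teacher). `correspondences`
    maps each point index to a (row, col) pixel (illustrative sketch).

    feat_3d : (N, C) student features for N LiDAR points
    feat_2d : (H, W, C) frozen teacher feature map
    """
    rows, cols = correspondences[:, 0], correspondences[:, 1]
    teacher = feat_2d[rows, cols]                  # (N, C) matched pixel features
    return np.mean((feat_3d - teacher) ** 2)

rng = np.random.default_rng(0)
feat_2d = rng.normal(size=(48, 64, 16))
corr = np.stack([rng.integers(0, 48, 100), rng.integers(0, 64, 100)], axis=1)
feat_3d = feat_2d[corr[:, 0], corr[:, 1]] + 0.1 * rng.normal(size=(100, 16))
print(crossmodal_distill_loss(feat_3d, feat_2d, corr))   # ~0.01 for this toy data
```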
|
|
15:40-15:45, Paper WeDT7.6 | |
Nonlinear Motion-Guided and Spatio-Temporal Aware Network for Unsupervised Event-Based Optical Flow |
|
Liu, Zuntao | Northeastern University of China |
Zhuang, Hao | Northeastern University |
Jiang, Junjie | Northeastern University |
Song, Yuhang | Northeastern University - China |
Fang, Zheng | Northeastern University |
Keywords: Computer Vision for Transportation, Visual Learning, Deep Learning for Visual Perception
Abstract: Event cameras have the potential to capture continuous motion information over time and space, making them well-suited for optical flow estimation. However, most existing learning-based methods for event-based optical flow adopt frame-based techniques, ignoring the spatio-temporal characteristics of events. Additionally, these methods assume linear motion between consecutive events within the loss time window, which increases optical flow errors in long-time sequences. In this work, we observe that rich spatio-temporal information and accurate nonlinear motion between events are crucial for event-based optical flow estimation. Therefore, we propose E-NMSTFlow, a novel unsupervised event-based optical flow network focusing on long-time sequences. We propose a Spatio-Temporal Motion Feature Aware (STMFA) module and an Adaptive Motion Feature Enhancement (AMFE) module, both of which utilize rich spatio-temporal information to learn spatio-temporal data associations. Meanwhile, we propose a nonlinear motion compensation loss that utilizes the accurate nonlinear motion between events to improve the unsupervised learning of our network. Extensive experiments demonstrate the effectiveness and superiority of our method. Remarkably, our method ranks first among unsupervised learning methods on the MVSEC and DSEC-Flow datasets.
|
|
15:45-15:50, Paper WeDT7.7 | |
V2X-DG: Domain Generalization for Vehicle-To-Everything Cooperative Perception |
|
Li, Baolu | Cleveland State University |
Xu, Zongzhe | Carnegie Mellon University |
Li, Jinlong | Cleveland State University |
Liu, Xinyu | Cleveland State University |
Fang, Jianwu | Xian Jiaotong University |
Li, Xiaopeng | University of Wisconsin-Madison |
Yu, Hongkai | Cleveland State University |
Keywords: Computer Vision for Transportation, Intelligent Transportation Systems, Deep Learning for Visual Perception
Abstract: LiDAR-based Vehicle-to-Everything (V2X) cooperative perception has demonstrated its impact on the safety and effectiveness of autonomous driving. Since current cooperative perception algorithms are trained and tested on the same dataset, the generalization ability of cooperative perception systems remains underexplored. This paper is the first work to study the Domain Generalization problem for LiDAR-based V2X cooperative perception (V2X-DG) based on four widely-used open source datasets: OPV2V, V2XSet, V2V4Real and DAIR-V2X. Our research seeks to sustain high performance not only within the source domain but also across other unseen domains, achieved solely through training on the source domain. To this end, we propose Cooperative Mixup Augmentation based Generalization (CMAG) to improve the model generalization capability by simulating the unseen cooperation, which is designed compactly for the domain gaps in cooperative perception. Furthermore, we propose a constraint for the regularization of the robust generalized feature representation learning: Cooperation Feature Consistency (CFC), which aligns the intermediately fused features of the generalized cooperation by CMAG and the early fused features of the original cooperation in the source domain. Extensive experiments demonstrate that our approach achieves significant performance gains when generalizing to other unseen datasets while it also maintains strong performance on the source dataset.
|
|
WeDT8 |
311 |
Representation Learning 3 |
Regular Session |
|
15:15-15:20, Paper WeDT8.1 | |
IMOST: Incremental Memory Mechanism with Online Self-Supervision for Continual Traversability Learning |
|
Ma, Kehui | Shanghai Jiao Tong University |
Sun, Zhen | Shanghai Jiao Tong University |
Xiong, Chaoran | Shanghai Jiao Tong University |
Zhu, Qiumin | Shanghai Jiao Tong University |
Wang, Kewei | Shanghai Jiao Tong University |
Pei, Ling | Shanghai Jiao Tong University |
Keywords: Visual Learning, Incremental Learning, Learning from Demonstration
Abstract: Traversability estimation is the foundation of path planning for a general navigation system. However, complex and dynamic environments pose challenges for the latest methods using self-supervised learning (SSL) techniques. Firstly, existing SSL-based methods generate sparse annotations lacking detailed boundary information. Secondly, their strategies focus on hard samples for rapid adaptation, leading to forgetting and biased predictions. In this work, we propose IMOST, a continual traversability learning framework composed of two key modules: incremental dynamic memory (IDM) and self-supervised annotation (SSA). By mimicking human memory mechanisms, IDM allocates novel data samples to new clusters according to an information expansion criterion. It also updates clusters based on a diversity rule, ensuring a representative characterization of the new scene. This mechanism enhances scene-aware knowledge diversity while maintaining a compact memory capacity. The SSA module, integrating FastSAM, utilizes point prompts to generate complete annotations in real time, which reduces training complexity. Furthermore, IMOST has been successfully deployed on a quadruped robot, with performance evaluated during the online learning process. Experimental results on both public and self-collected datasets demonstrate that our IMOST outperforms the current state-of-the-art method, maintaining robust recognition capabilities and adaptability across various scenarios. The code is available at https://github.com/SJTU-MKH/OCLTrav.
|
|
15:20-15:25, Paper WeDT8.2 | |
SparseDrive: End-To-End Autonomous Driving Via Sparse Scene Representation |
|
Sun, Wenchao | Tsinghua University |
Lin, Xuewu | Horizon |
Shi, Yining | Tsinghua University |
Zhang, Chuang | Tsinghua University |
Wu, Haoran | Tsinghua University |
Zheng, Sifa | Tsinghua University |
Keywords: Imitation Learning, Computer Vision for Transportation
Abstract: The well-established modular autonomous driving system is decoupled into different standalone tasks, e.g. perception, prediction and planning, suffering from information loss and error accumulation across modules. In contrast, end-to-end paradigms unify multi-tasks into a fully differentiable framework, allowing for optimization in a planning-oriented spirit. Despite the great potential of end-to-end paradigms, both the performance and efficiency of existing methods are not satisfactory, particularly in terms of planning safety. We attribute this to the computationally expensive BEV (bird's eye view) features and the straightforward design for prediction and planning. To this end, we explore the sparse representation and review the task design for end-to-end autonomous driving, proposing a new paradigm named SparseDrive. Concretely, SparseDrive consists of a symmetric sparse perception module and a parallel motion planner. The sparse perception module unifies detection, tracking and online mapping with a symmetric model architecture, learning a fully sparse representation of the driving scene. For motion prediction and planning, we review the great similarity between these two tasks, leading to a parallel design for the motion planner. Based on this parallel design, which models planning as a multi-modal problem, we propose a hierarchical planning selection strategy, which incorporates a collision-aware rescore module, to select a rational and safe trajectory as the final planning output. With such effective designs, SparseDrive surpasses previous state-of-the-art methods by a large margin in the performance of all tasks, while achieving much higher training and inference efficiency.
|
|
15:25-15:30, Paper WeDT8.3 | |
MOTION TRACKS: A Unified Representation for Human-Robot Transfer in Few-Shot Imitation Learning |
|
Ren, Juntao | Cornell University |
Sundaresan, Priya | Stanford University |
Sadigh, Dorsa | Stanford University |
Choudhury, Sanjiban | Cornell University |
Bohg, Jeannette | Stanford University |
Keywords: Imitation Learning, Learning from Demonstration, Transfer Learning
Abstract: Teaching robots to autonomously complete everyday tasks remains a challenge. Imitation Learning (IL) is a powerful approach that imbues robots with skills via demonstrations, but is limited by the labor-intensive process of collecting teleoperated robot data. Human videos offer a scalable alternative, but it remains difficult to directly train IL policies from them due to the lack of robot action labels. To address this, we propose to represent actions as short-horizon 2D trajectories on an image. These actions, or motion tracks, capture the predicted direction of motion for both human hands and robot end-effectors. We instantiate an IL policy called Motion Track Policy (MT-π) which receives image observations and outputs motion tracks as actions. By leveraging this unified, cross-embodiment action space, MT-π completes tasks with high success given just minutes of human video and limited additional robot demonstrations. At test time, we predict motion tracks from two camera views, recovering 6DoF trajectories via multi-view synthesis. MT-π achieves an average success rate of 86.5% across 4 real-world tasks, outperforming state-of-the-art IL baselines which do not leverage human data or our action space by 40%, and generalizes to scenarios seen only in human videos. Code and videos are available on our website (https://portal-cornell.github.io/motion_track_policy/).
|
|
15:30-15:35, Paper WeDT8.4 | |
Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation |
|
Wu, Kun | Syracuse University |
Zhu, Yichen | Midea Group |
Li, Jinming | Shanghai University |
Wen, Junjie | East China Normal University |
Liu, Ning | Beijing Innovation Center of Humanoid Robotics |
Xu, Zhiyuan | Midea Group |
Tang, Jian | Midea Group (Shanghai) Co., Ltd |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation
Abstract: Learning visuomotor policy for multi-task robotic manipulation has been a long-standing challenge for the robotics community. The difficulty lies in the diversity of action space: typically, a goal can be accomplished in multiple ways, resulting in a multimodal action distribution for a single task. The complexity of action distribution escalates as the number of tasks increases. In this work, we propose Discrete Policy, a robot learning method for training universal agents capable of multi-task manipulation skills. Discrete Policy employs vector quantization to map action sequences into a discrete latent space, facilitating the learning of task-specific codes. These codes are then reconstructed into the action space conditioned on observations and language instruction. We evaluate our method on both simulation and multiple real-world embodiments, including both single-arm and bimanual robot settings. We demonstrate that our proposed Discrete Policy outperforms a well-established Diffusion Policy baseline and many state-of-the-art approaches, including ACT, Octo, and OpenVLA. For example, in a real-world multi-task training setting with five tasks, Discrete Policy achieves an average success rate that is 26% higher than Diffusion Policy and 15% higher than OpenVLA. As the number of tasks increases to 12, the performance gap between Discrete Policy and Diffusion Policy widens to 32.5%, further showcasing the advantages of our approach. Our work empirically demonstrates that learning multi-task policies within the latent space is a vital step toward achieving general-purpose agents.
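At the center of such a method is nearest-neighbour vector quantization of continuous action embeddings against a learned codebook. The sketch below shows that single operation with assumed shapes; the training losses, encoder, and decoder are omitted, and the code is illustrative rather than the paper's implementation.

```python
import numpy as np

def vector_quantize(latents, codebook):
    """Nearest-neighbour vector quantization of action-chunk embeddings, the
    basic operation behind learning a discrete latent action space.

    latents  : (B, D) continuous action-sequence embeddings
    codebook : (K, D) learned code vectors
    returns  : code indices (B,) and quantized vectors (B, D)
    """
    # Squared Euclidean distance between every latent and every code.
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))      # 16 task-specific codes
latents = rng.normal(size=(4, 8))        # embeddings of 4 action sequences
idx, quantized = vector_quantize(latents, codebook)
print(idx, quantized.shape)              # 4 code indices and a (4, 8) quantized batch
```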
|
|
15:35-15:40, Paper WeDT8.5 | |
AnyCar to Anywhere: Learning Universal Dynamics Model for Agile and Adaptive Mobility |
|
Xiao, Wenli | Carnegie Mellon University |
Xue, Haoru | University of California Berkeley |
Tao, Tony | Carnegie Mellon University |
Kalaria, Dvij | Carnegie Mellon University |
Dolan, John M. | Carnegie Mellon University |
Shi, Guanya | Carnegie Mellon University |
Keywords: Representation Learning, Machine Learning for Robot Control, Data Sets for Robot Learning
Abstract: Recent works in the robot learning community have successfully introduced generalist models capable of controlling various robot embodiments across a wide range of tasks, such as navigation and locomotion. However, achieving agile control, which pushes the limits of robotic performance, still relies on specialist models that require extensive parameter tuning. To leverage generalist-model adaptability and flexibility while achieving specialist-level agility, we propose AnyCar, a transformer-based generalist dynamics model designed for agile control of various wheeled robots. To collect training data, we unify multiple simulators and leverage different physics backends to simulate vehicles with diverse sizes, scales, and physical properties across various terrains. With robust training and real-world fine-tuning, our model enables precise adaptation to different vehicles, even in the wild and under large state estimation errors. In real-world experiments, AnyCar shows both few-shot and zero-shot generalization across a wide range of vehicles and environments, where our model, combined with a sampling-based MPC, outperforms specialist models by up to 54%. These results represent a key step toward building a foundation model for agile wheeled robot control. AnyCar is fully open-source to support further research.
|
|
15:40-15:45, Paper WeDT8.6 | |
UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation |
|
Tang, Yihe | Stanford University |
Huang, Wenlong | Stanford University |
Wang, Yingke | Stanford University |
Li, Chengshu | Stanford University |
Yuan, Roy | Stanford University |
Zhang, Ruohan | Stanford University |
Wu, Jiajun | Stanford University |
Fei-Fei, Li | Stanford University |
Keywords: Representation Learning, Deep Learning for Visual Perception, Sensorimotor Learning
Abstract: Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing methods for visual affordance prediction often rely on manually annotated data or are conditioned only on a predefined set of tasks. We introduce Unsupervised Affordance Distillation (UAD), a method for distilling affordance knowledge from foundation models into a task-conditioned affordance model without any manual annotations. By leveraging the complementary strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset with detailed pairs. Training only a lightweight task-conditioned decoder atop frozen features, UAD exhibits notable generalization to in-the-wild robotic scenes as well as to various human activities despite only being trained on rendered objects in simulation. Using affordances provided by UAD as the observation space, we show an imitation learning policy that demonstrates promising generalization to unseen object instances, object categories, and even variations in task instructions after training on as few as 10 demonstrations.
|
|
15:45-15:50, Paper WeDT8.7 | |
Learning Dynamics of a Ball with Differentiable Factor Graph and Roto-Translational Invariant Representations |
|
Xiao, Qingyu | Georgia Institute of Technology |
Wu, Zixuan | Georgia Institute of Technology |
Gombolay, Matthew | Georgia Institute of Technology |
Keywords: Representation Learning, Probabilistic Inference, SLAM
Abstract: Robots in dynamic environments need fast, accurate models of how objects move in their environments to support agile planning. In sports such as ping pong, analytical models often struggle to accurately predict ball trajectories with spins due to complex aerodynamics, elastic behaviors, and the challenges of modeling sliding and rolling friction. On the other hand, despite the promise of data-driven methods, machine learning struggles to make accurate, consistent predictions without precise input. In this paper, we propose an end-to-end learning framework that can jointly train a dynamics model and a factor graph estimator. Our approach leverages a Gram-Schmidt (GS) process to extract roto-translational invariant representations to improve the model performance, which further reduces the validation error compared to a data augmentation method. Additionally, we propose a network architecture that enhances nonlinearity by using self-multiplicative bypasses in the layer connections. By leveraging these novel methods, our proposed approach predicts the ball's position with an RMSE of 37.2 mm of the paddle radius at the apex after the first bounce, and 71.5 mm after the second bounce.
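One plausible way to build a roto-translational invariant trajectory representation with a Gram-Schmidt-style frame is sketched below: translate by the first point, align the first axis with the initial horizontal velocity, and keep gravity as the second axis, making the encoding invariant to horizontal translations and rotations about gravity. This is an assumed construction for illustration, not necessarily the paper's.

```python
import numpy as np

def invariant_frame(points):
    """Express a ball trajectory in a frame built from its own motion: origin at
    the first point, first axis along the initial horizontal velocity, second
    axis vertical, third completing the right-handed frame.

    points : (T, 3) positions with z as the vertical axis
    """
    rel = points - points[0]
    v0 = points[1] - points[0]
    e1 = np.array([v0[0], v0[1], 0.0])
    e1 = e1 / np.linalg.norm(e1)
    e2 = np.array([0.0, 0.0, 1.0])          # gravity axis, already orthogonal to e1
    e3 = np.cross(e1, e2)
    R = np.stack([e1, e2, e3], axis=1)      # columns form the new basis
    return rel @ R

# Two copies of the same throw, rotated about gravity and shifted, map to the
# same representation.
t = np.linspace(0.0, 0.5, 6)[:, None]
traj = np.hstack([3.0 * t, 1.0 * t, 2.0 * t - 4.9 * t ** 2])
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
traj2 = traj @ Rz.T + np.array([1.0, -2.0, 0.3])
print(np.allclose(invariant_frame(traj), invariant_frame(traj2)))   # True
```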
|
|
WeDT9 |
312 |
Motion Planning 5 |
Regular Session |
Chair: Xiao, Xuesu | George Mason University |
|
15:15-15:20, Paper WeDT9.1 | |
Semi-Supervised Active Learning for Semantic Segmentation in Unknown Environments Using Informative Path Planning |
|
Rückin, Julius | University of Bonn |
Magistri, Federico | University of Bonn |
Stachniss, Cyrill | University of Bonn |
Popovic, Marija | TU Delft |
Keywords: Motion and Path Planning, Deep Learning for Visual Perception, Semantic Scene Understanding
Abstract: Semantic segmentation enables robots to perceive and reason about their environments beyond geometry. Most such systems build upon deep learning approaches. As autonomous robots are commonly deployed in initially unknown environments, pre-training on static datasets cannot always capture the variety of domains and limits the robot’s perception performance during missions. Recently, self-supervised and fully supervised active learning methods emerged to improve robotic vision. These approaches rely on large in-domain pre-training datasets or require substantial human labelling effort. We propose a planning method for semi-supervised active learning of semantic segmentation that substantially reduces human labelling requirements compared to fully supervised approaches. We leverage an adaptive map-based planner guided towards the frontiers of unexplored space with high model uncertainty, collecting training data for human labelling. A key aspect of our approach is to combine the sparse high-quality human labels with pseudo labels automatically extracted from highly certain environment map areas. Experimental results show that our method reaches segmentation performance close to fully supervised approaches with drastically reduced human labelling effort while outperforming self-supervised approaches.
|
|
15:20-15:25, Paper WeDT9.2 | |
FutureNet-LOF: Joint Trajectory Prediction and Lane Occupancy Field Prediction with Future Context Encoding |
|
Wang, Mingkun | Peking University |
Ren, Xiaoguang | Academy of Military Sciences |
Jin, Ruochun | National University of Defense Technology |
Li, Minglong | National University of Defense Technology |
Zhang, Xiaochuan | Academy of Military Science |
Yu, Changqian | Meituan |
Wang, Mingxu | Fudan University |
Yang, Wenjing | State Key Laboratory of High Performance Computing (HPCL), Schoo |
Keywords: Motion and Path Planning, Computer Vision for Transportation, Computer Vision for Automation
Abstract: Most prior motion prediction endeavors in autonomous driving have inadequately encoded future scenarios, leading to predictions that may fail to accurately capture the diverse movements of agents (e.g., vehicles or pedestrians). To address this, we propose FutureNet, which explicitly integrates initially predicted trajectories into the future scenario and further encodes these future contexts to enhance subsequent forecasting. Additionally, most previous motion forecasting works have focused on predicting independent futures for each agent. However, safe and smooth autonomous driving requires accurately predicting the diverse future behaviors of numerous surrounding agents jointly in complex dynamic environments. Given that all agents occupy certain potential travel spaces and possess lane driving priority, we propose Lane Occupancy Field (LOF), a new representation with lane semantics for motion forecasting in autonomous driving. LOF can simultaneously capture the joint probability distribution of all road participants' future spatial-temporal positions. Due to the high compatibility between lane occupancy field prediction and trajectory prediction, we propose a novel network for joint prediction of these two tasks. Our approach ranks 1st on two large-scale motion forecasting benchmarks: Argoverse 1 and Argoverse 2, while it is also the champion method of the CVPR 2024 Argoverse 2 motion forecasting challenge.
|
|
15:25-15:30, Paper WeDT9.3 | |
Hierarchical Reinforcement Learning for Safe Mapless Navigation with Congestion Estimation |
|
Gao, Jianqi | Harbin Institute of Technology (Shenzhen) |
Pang, Xizheng | Harbin Institute of Technology, Shenzhen |
Liu, Qi | Northeastern University |
Li, Yanjie | Harbin Institute of Technology (Shenzhen) |
Keywords: Motion and Path Planning, Collision Avoidance, Reinforcement Learning
Abstract: Reinforcement learning-based mapless navigation holds significant potential. However, it faces challenges in indoor environments with local minima area. This paper introduces a safe mapless navigation framework utilizing hierarchical reinforcement learning (HRL) to enhance navigation through such areas. The high-level policy creates a sub-goal to direct the navigation process. Notably, we have developed a sub-goal update mechanism that considers environment congestion, efficiently avoiding the entrapment of the robot in local minimum areas. The low-level motion planning policy, trained through safe reinforcement learning, outputs real-time control instructions based on acquired sub-goal. Specifically, to enhance the robot's environmental perception, we introduce a new obstacle encoding method that evaluates the impact of obstacles on the robot's motion planning. To validate the performance of our HRL-based navigation framework, we conduct simulations in office, home, and restaurant environments. The findings demonstrate that our HRL-based navigation framework excels in both static and dynamic scenarios. Finally, we implement the HRL-based navigation framework on a TurtleBot3 robot for physical validation experiments, which exhibits its strong generalization capabilities.
|
|
15:30-15:35, Paper WeDT9.4 | |
Hierarchical End-To-End Autonomous Driving: Integrating BEV Perception with Deep Reinforcement Learning |
|
Lu, Siyi | Central South University |
He, Lei | Tsinghua University |
Li, Shengbo Eben | Tsinghua University |
Luo, Yugong | Tsinghua University |
Wang, Jianqiang | Tsinghua University |
Li, Keqiang | Tsinghua University |
Keywords: Integrated Planning and Control, Reinforcement Learning, Vision-Based Navigation
Abstract: End-to-end autonomous driving offers a streamlined alternative to the traditional modular pipeline, integrating perception, prediction, and planning within a single framework. While Deep Reinforcement Learning (DRL) has recently gained traction in this domain, existing approaches often overlook the critical connection between feature extraction of DRL and perception. In this paper, we bridge this gap by mapping the DRL feature extraction network directly to the perception phase, enabling clearer interpretation through semantic segmentation. By leveraging Bird’s-Eye-View (BEV) representations, we propose a novel DRL-based end-to-end driving framework that utilizes multi-sensor inputs to construct a unified three-dimensional understanding of the environment. This BEV-based system extracts and translates critical environmental features into high-level abstract states for DRL, facilitating more informed control. Extensive experimental evaluations demonstrate that our approach not only enhances interpretability but also significantly outperforms state-of-the-art methods in autonomous driving control tasks, reducing the collision rate by 20%. The code of our approach is publicly available at https://github.com/CBDES-e2e/PEDe2e-driving
|
|
15:35-15:40, Paper WeDT9.5 | |
Multi-Goal Motion Memory |
|
Lu, Yuanjie | George Mason University |
Das, Dibyendu | George Mason University |
Plaku, Erion | U.S. National Science Foundation |
Xiao, Xuesu | George Mason University |
Keywords: Integrated Planning and Learning, Motion and Path Planning, Deep Learning Methods
Abstract: Autonomous mobile robots (e.g., warehouse logistics robots) often need to traverse complex, obstacle-rich, and changing environments to reach multiple fixed goals (e.g., warehouse shelves). Traditional motion planners need to calculate the entire multi-goal path from scratch in response to changes in the environment, which consumes substantial computing resources. This process is not only time-consuming but may also fail to meet real-time requirements in application scenarios that require rapid response to environmental changes. In this paper, we provide a novel Multi-Goal Motion Memory technique that allows robots to use previous planning experiences to accelerate future multi-goal planning in changing environments. Specifically, our technique predicts collision-free and dynamically-feasible trajectories and distances between goal pairs to guide the sampling process to build a roadmap, to inform a Traveling Salesman Problem (TSP) solver to compute a tour, and to efficiently produce motion plans. Experiments conducted with a vehicle and a snake-like robot in obstacle-rich environments show that the proposed Motion Memory technique can substantially accelerate planning speed by up to 90%. Furthermore, the solution quality is comparable to state-of-the-art algorithms and even better in some environments.
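A minimal sketch of how predicted goal-to-goal distances could inform a tour computation, using a greedy nearest-neighbour heuristic as a stand-in for the TSP solver mentioned above; the distance matrix here is random and purely illustrative.

import numpy as np

def tour_from_predicted_distances(dist, start=0):
    # Greedy nearest-neighbour tour over goals given a (learned) pairwise
    # distance matrix; a real system would feed such predictions to a proper TSP solver.
    n = len(dist)
    unvisited = set(range(n)) - {start}
    tour = [start]
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda j: dist[last][j])
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# Example with a random symmetric "predicted" distance matrix over 5 goals.
rng = np.random.default_rng(0)
d = rng.random((5, 5))
d = (d + d.T) / 2
np.fill_diagonal(d, 0)
print(tour_from_predicted_distances(d))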
|
|
15:40-15:45, Paper WeDT9.6 | |
Dual-BEV Nav: Dual-Layer BEV-Based Heuristic Path Planning for Robotic Navigation in Unstructured Outdoor Environments |
|
Zhang, Jianfeng | East China Normal University |
Dong, Hanlin | East China Normal University |
Yang, Jian | Information Engineering University |
Liu, Jiahui | Fujian Normal University |
Huang, Shibo | East China Normal University |
Li, Ke | Information Engineering University |
Tang, Xuan | East China Normal University |
Wei, Xian | East China Normal University |
You, Xiong | Information Engineering University |
Keywords: Integrated Planning and Learning, Imitation Learning, Vision-Based Navigation
Abstract: Path planning with strong environmental adaptability plays a crucial role in robotic navigation in unstructured outdoor environments, especially in the case of low-quality location and map information. The path planning ability of a robot depends on the identification of the traversability of global and local ground areas. In real-world scenarios, the complexity of outdoor open environments makes it difficult for robots to identify the traversability of ground areas that lack a clearly defined structure. Moreover, most existing methods have rarely analyzed the integration of local and global traversability identifications in unstructured outdoor scenarios. To address this problem, we propose a novel method, Dual-BEV Nav, first introducing Bird’s Eye View (BEV) representations into local planning to generate high-quality traversable paths. Then, these paths are projected into the global traversability probability map generated by the global BEV planning model to obtain the optimal path. By integrating the traversability from both local and global BEV, we establish a dual-layer BEV heuristic planning paradigm, enabling long-distance navigation in unstructured outdoor environments. We test our approach through both public dataset evaluations and real-world robot deployments, yielding promising results. Compared to baselines, Dual-BEV Nav improved temporal distance prediction accuracy by up to 18.26%. In the real-world deployment, under conditions significantly different from the training set and with notable occlusions in the global BEV, Dual-BEV Nav successfully completed a 65-meter outdoor navigation run. Further analysis demonstrates that the local BEV representation significantly enhances the rationality of the planning, while the global BEV probability map ensures the robustness of the overall plan.
|
|
15:45-15:50, Paper WeDT9.7 | |
Risk-Aware Integrated Task and Motion Planning for Versatile Snake Robots under Localization Failures |
|
M. Jasour, Ashkan | MIT |
Daddi, Guglielmo | Politecnico Di Torino |
Endo, Masafumi | Keio University |
Vaquero, Tiago | JPL, Caltech |
Paton, Michael | Jet Propulsion Laboratory |
Strub, Marlin Polo | NASA Jet Propulsion Laboratory |
Corpino, Sabrina | Politecnico Di Torino |
Ingham, Michel | NASA-JPL |
Ono, Masahiro | California Institute of Technology |
Thakker, Rohan | NASA's Jet Propulsion Laboratory, Caltech |
Keywords: Planning under Uncertainty, Task and Motion Planning, Biologically-Inspired Robots
Abstract: Snake robots enable mobility through extreme terrains and confined environments in terrestrial and space applications. However, robust perception and localization for snake robots remain an open challenge due to the proximity of the sensor payload to the ground coupled with a limited field of view. To address this issue, we propose Blind-motion with Intermittently Scheduled Scans (BLISS) which combines proprioception-only mobility with intermittent scans to be resilient against both localization failures and collision risks. BLISS is formulated as an integrated task and motion planning (TAMP) problem that leads to a chance-constrained hybrid partially observable Markov decision process (CC-HPOMDP), known to be computationally intractable due to the curse of history. Our novelty lies in reformulating CC-HPOMDP as a tractable, convex mixed integer linear program. This allows us to solve BLISS-TAMP significantly faster and jointly derive optimal task-motion plans. Simulations and hardware experiments on the EELS snake robot show our method achieves over an order of magnitude computational improvement compared to state-of-the-art POMDP planners and > 50% better navigation time optimality versus classical two-stage planners.
|
|
WeDT10 |
313 |
Multi-Robot Systems 2 |
Regular Session |
Co-Chair: Bhattacharya, Sourabh | Iowa State University |
|
15:15-15:20, Paper WeDT10.1 | |
Formation Rotation and Assignment: Avoiding Obstacles in Multi-Robot Scenarios |
|
Zhang, Zhan | Northwestern Polytechnical University |
Li, Yan | Northwestern Polytechnical University |
Gu, Zhiyang | School of Automation, Northwestern Polytechnical University |
Wang, Zhong | Northwestern Polytechnical University |
Keywords: Multi-Robot Systems, Cooperating Robots, Collision Avoidance
Abstract: Current formation assignment and optimization methods frequently overlook the influence of rotational dynamics, limiting their operational flexibility. Additionally, these methods typically neglect the impact of obstacles, which may also hinder their effectiveness in obstacle-rich environments. To address these limitations, this paper proposes a novel approach that incorporates both rotation and assignment into the formation optimization of multi-robot systems. This approach allows for dynamic adjustment of the formation orientation and introduces a collaborative obstacle avoidance strategy. This strategy is specifically designed to assess and integrate the influence of obstacles into the optimization process, thereby enhancing the ability to maneuver around obstacles. Simulation experiments, including scenarios involving the encirclement of stationary and moving targets, validate the effectiveness of the proposed algorithm. The proposed algorithm outperforms non-rotational methods in maintaining formations under the influence of various types of obstacles while encircling targets. Furthermore, real-world flight experiments demonstrate the robustness and feasibility of the algorithm.
|
|
15:20-15:25, Paper WeDT10.2 | |
A Streamlined Heuristic for the Problem of Min-Time Coverage in Constricted Environments (I) |
|
Kim, Young-In | ISyE, Georgia Tech |
Reveliotis, Spiridon | Georgia Institute of Technology |
Keywords: Robotics in Hazardous Fields, Planning, Scheduling and Coordination, Optimization and Optimal Control
Abstract: The problem of min-time coverage in constricted environments concerns the employment of robotic fleets to support routine inspection and service operations within well-structured but constricted environments. In our previous work we have provided a detailed definition of this problem, specifying the objectives and the constraints involved, a Mixed Integer Programming (MIP) formulation for it, a formal analysis of its worst-case computational complexity, and additional structural properties of the optimal solutions that enable a partial relaxation of the original MIP formulation which preserves optimal performance. We have further employed these structural results towards the development of a construction heuristic for this problem. But while the worst-case computational complexity of the construction heuristic is polynomial with respect to the size of the problem-defining elements, its practical scalability has been limited by the requirement to formulate and solve a large number of linear programming formulations. In order to address this issue, this work presents a modified version of the heuristic that significantly reduces the computational times involved. Furthermore, we develop a local search method that further improves the solution obtained from the modified heuristic.
|
|
15:25-15:30, Paper WeDT10.3 | |
Scalable Multi-Agent Surveillance: A Kernel-Based Approach |
|
Mandal, Shashwata | Iowa State University |
Bhattacharya, Sourabh | Iowa State University |
Keywords: Motion and Path Planning, Computational Geometry, Multi-Robot Systems
Abstract: In this work, we address the deployment problem for a team of mobile guards that tries to maintain a line-of-sight with an unpredictable mobile intruder. First, we present a computationally efficient strategy for generating a set of points, called 'kernel points', that covers the entire polygon. We then introduce a polygon partitioning technique based on the location of the kernel points. Next, we propose control laws for a free guard to track an intruder in general polygonal environments based on the analysis of a pursuit-evasion game around a single corner. Finally, we present several variations of the proposed control laws that include capture and search, and illustrate the improvement in the overall visual footprint of the team of mobile guards based on extensive simulations.
|
|
15:30-15:35, Paper WeDT10.4 | |
Contingency Formation Planning for Interactive Drone Light Shows |
|
Au, Tsz-Chiu | Texas State University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Swarm Robotics
Abstract: One of the most appealing applications of drone swarms is drone light shows, in which a group of drones displays an animation by showing a sequence of light patterns in the sky. In this paper, we consider using drone swarms as video game platforms and utilize planning techniques to display pixels in animations correctly while providing a fast response to user inputs. We devise a new sampling algorithm to solve a contingency formation planning problem, which aims to find a contingency formation plan such that drones can always move to the correct positions to display every possible future frame regardless of the user inputs in the future. The algorithm provides interactivity by preemptively relocating hidden drones, which move in stealth mode to the locations of all possible future frames. Our experiments show that the size of the frame buffer and the ratio between the number of drones and the number of pixels can greatly affect the performance of our system.
|
|
15:35-15:40, Paper WeDT10.5 | |
Design of a Formation Control System to Assist Human Operators in Flying a Swarm of Robotic Blimps |
|
Wu, Tianfu | Hong Kong University of Science and Technology |
Fu, Jiaqi | Beijing Jiaotong University |
Meng, Wugang | Hong Kong University of Science and Technology |
Cho, Sungjin | Sunchon National University |
Zhan, Huanzhe | Emory University |
Zhang, Fumin | Hong Kong University of Science and Technology |
Keywords: Swarm Robotics, Aerial Systems: Applications, Autonomous Vehicle Navigation
Abstract: Formation control is essential for swarm robotics, enabling coordinated behavior in complex environments. In this paper, we introduce a novel formation control system for an indoor blimp swarm using a specialized leader-follower approach enhanced with a dynamic leader-switching mechanism. This strategy allows any blimp to take on the leader role, distributing maneuvering demands across the swarm and enhancing overall formation stability. Only the leader blimp is manually controlled by a human operator, while follower blimps use onboard monocular cameras and a laser altimeter for relative position and altitude estimation. A leader-switching scheme is proposed to assist the human operator in maintaining the stability of the swarm, especially when a sharp turn is performed. Experimental results confirm that the leader-switching mechanism effectively maintains stable formations and adapts to dynamic indoor environments while assisting the human operator.
|
|
15:40-15:45, Paper WeDT10.6 | |
Multi-Agent Exploration with Similarity Score Map and Topological Memory |
|
Lee, Eun Sun | Seoul National University |
Kim, Young Min | Seoul National University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Vision-Based Navigation, Multi-Robot Systems
Abstract: Multi-robot exploration can be a collaborative solution for navigating a large-scale area. However, it is not trivial to optimally assign tasks among agents because the state dynamically changes while the local observations of multiple agents concurrently update the global map. Furthermore, the individual robots may not have access to accurate relative poses of others or global layouts. We propose an efficient spatial abstraction for multi-agent exploration based on topological graph memories. Each agent creates a topological graph, a lightweight spatial representation whose nodes contain minimal image features. The information in graphs is aggregated to compare individual nodes and is used to update the similarity scores in real-time. Then, the agents effectively fulfill distributed task goals by examining the dynamic similarity scores of frontier nodes. We further exploit extracted visual features to refine the relative poses among topological graphs. Our proposed pipeline can efficiently explore large-scale areas among various scene and robot configurations without sharing precise geometric information.
|
|
15:45-15:50, Paper WeDT10.7 | |
DREAM: Decentralized Real-Time Asynchronous Probabilistic Trajectory Planning for Collision-Free Multi-Robot Navigation in Cluttered Environments |
|
Şenbaşlar, Baskın | NVIDIA |
Sukhatme, Gaurav | University of Southern California |
Keywords: Collision Avoidance, Multi-Robot Systems, Motion and Path Planning, Probabilistic Trajectory Planning
Abstract: Collision-free navigation in cluttered environments with static and dynamic obstacles is essential for many multi-robot tasks. Dynamic obstacles may also be interactive, i.e., their behavior varies based on the behavior of other entities. We propose a novel representation for interactive behavior of dynamic obstacles and a decentralized real-time multi-robot trajectory planning algorithm allowing inter-robot collision avoidance as well as static and dynamic obstacle avoidance. Our planner simulates the behavior of dynamic obstacles, accounting for interactivity. We account for the perception inaccuracy of static and prediction inaccuracy of dynamic obstacles. We handle asynchronous planning between teammates and message delays, drops, and re-orderings. We evaluate our algorithm in simulations using 25400 random cases and compare it against three state-of-the-art baselines using 2100 random cases. Our algorithm achieves up to 1.68x success rate using as low as 0.28x time in single-robot, and up to 2.15x success rate using as low as 0.36x time in multi-robot cases compared to the best baseline. We implement our planner on real quadrotors to show its real-world applicability.
|
|
WeDT11 |
314 |
Foundation Models for Manipulation |
Regular Session |
Chair: Rivera, Corban | Johns Hopkins University Applied Physics Lab |
Co-Chair: Kober, Jens | TU Delft |
|
15:15-15:20, Paper WeDT11.1 | |
Enhancing the LLM-Based Robot Manipulation through Human-Robot Collaboration |
|
Liu, Haokun | The University of Tokyo |
Zhu, Yaonan | University of Tokyo |
Kato, Kenji | National Center for Geriatrics and Gerontology |
Tsukahara, Atsushi | Shinshu University |
Kondo, Izumi | National Center for Geriatrics and Gerontology |
Aoyama, Tadayoshi | Nagoya University |
Hasegawa, Yasuhisa | Nagoya University |
Keywords: AI-Enabled Robotics, Human-Robot Collaboration
Abstract: Large Language Models (LLMs) are gaining popularity in the field of robotics. However, LLM-based robots are limited to simple, repetitive motions due to the poor integration between language models, robots, and the environment. This paper proposes a novel approach to enhance the performance of LLM-based autonomous manipulation through Human-Robot Collaboration (HRC). The approach involves using a prompted GPT-4 language model to decompose high-level language commands into sequences of motions that can be executed by the robot. The system also employs a YOLO-based perception algorithm, providing visual cues to the LLM, which aids in planning feasible motions within the specific environment. Additionally, an HRC method is proposed by combining teleoperation and Dynamic Movement Primitives (DMP), allowing the LLM-based robot to learn from human guidance. Real-world experiments have been conducted using the Toyota Human Support Robot for manipulation tasks. The outcomes indicate that tasks requiring complex trajectory planning and reasoning over environments can be efficiently accomplished through the incorporation of human demonstrations.
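A minimal sketch of the kind of command decomposition pipeline described above: an LLM is prompted with a primitive library and perception output, and replies with a sequence of primitive calls. The primitive set, prompt wording, and the `call_llm` client are illustrative assumptions, not the authors' prompt or system.

PRIMITIVES = ["move_to(obj)", "grasp(obj)", "place(obj, location)", "handover()"]

def build_prompt(command, detections):
    # Compose a decomposition prompt from the high-level command and the
    # objects reported by the perception module.
    return (
        "You control a mobile manipulator. Available motion primitives:\n"
        + "\n".join(f"- {p}" for p in PRIMITIVES)
        + f"\nObjects visible (from the perception module): {', '.join(detections)}\n"
        + f"Task: {command}\n"
        + "Reply with one primitive call per line."
    )

def plan(command, detections, call_llm):
    # `call_llm` is any text-in/text-out LLM client supplied by the user.
    reply = call_llm(build_prompt(command, detections))
    return [line.strip() for line in reply.splitlines() if line.strip()]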
|
|
15:20-15:25, Paper WeDT11.2 | |
In-Context Learning Enables Robot Action Prediction in LLMs |
|
Yin, Yida | University of California, Berkeley |
Wang, Zekai | University of California, Berkeley |
Sharma, Yuvan | University of California, Berkeley |
Niu, Dantong | University of California, Berkeley |
Darrell, Trevor | UC Berkeley |
Herzig, Roei | Tel Aviv University |
Keywords: Deep Learning Methods, Learning from Demonstration
Abstract: Recently, Large Language Models (LLMs) have achieved remarkable success using in-context learning (ICL) in the language domain. However, leveraging the ICL capabilities within off-the-shelf LLMs to directly predict robot actions remains largely unexplored. In this paper, we introduce RobotPrompt, a framework that enables off-the-shelf text-only LLMs to directly predict robot actions through ICL without training. Our approach first heuristically identifies keyframes that capture important moments from an episode. Next, we extract end-effector actions from these keyframes as well as the estimated initial object poses, and both are converted into textual descriptions. Finally, we construct a structured template to form ICL demonstrations from these textual descriptions and a task instruction. This enables an LLM to directly predict robot actions at test time. Through extensive experiments and analysis, RobotPrompt shows stronger performance over zero-shot and ICL baselines in simulated and real-world settings.
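A minimal sketch of building textual in-context demonstrations from keyframe end-effector actions and initial object poses, as outlined above; the demonstration format and field names are illustrative assumptions, not the paper's template.

def format_demo(keyframe_actions, object_pose):
    # Turn keyframe end-effector actions and an initial object pose into a
    # textual demonstration for in-context learning.
    lines = [f"object pose: {object_pose}"]
    for i, (xyz, gripper_closed) in enumerate(keyframe_actions):
        lines.append(f"step {i}: move to {xyz}, gripper {'close' if gripper_closed else 'open'}")
    return "\n".join(lines)

def build_icl_prompt(demos, task_instruction, current_object_pose):
    # Stack several demonstrations, then ask the LLM to predict actions for the new pose.
    prompt = f"Task: {task_instruction}\n\n"
    for d in demos:
        prompt += "Demonstration:\n" + d + "\n\n"
    prompt += f"Now, object pose: {current_object_pose}\nPredict the next steps:"
    return prompt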
|
|
15:25-15:30, Paper WeDT11.3 | |
UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models |
|
Yu, Qiaojun | Shanghai Jiao Tong University |
Huang, Siyuan | Shanghai Jiao Tong University |
Yuan, Xibin | Shanghai Jiao Tong University |
Jiang, Zhengkai | Tencent |
Hao, Ce | University of California, Berkeley |
Li, Xin | Shanghai Jiao Tong University |
Chang, Haonan | Rutgers University |
Wang, Junbo | Shanghai Jiao Tong University |
Liu, Liu | Hefei University of Technology |
Li, Hongsheng | Chinese University of Hong Kong |
Gao, Peng | Shanghai AI Lab |
Lu, Cewu | Shanghai Jiao Tong University |
Keywords: Perception for Grasping and Manipulation, Deep Learning for Visual Perception, Deep Learning in Grasping and Manipulation
Abstract: Previous studies on robotic manipulation are based on a limited understanding of the underlying 3D motion constraints and affordances. To address these challenges, we propose a comprehensive paradigm, termed UniAff, that integrates 3D object-centric manipulation and task understanding in a unified formulation. Specifically, we constructed a dataset labeled with manipulation-related key attributes, comprising 900 articulated objects from 19 categories and 600 tools from 12 categories. Furthermore, we leverage MLLMs to infer object-centric representations for manipulation tasks, including affordance recognition and reasoning about 3D motion constraints. Comprehensive experiments in both simulation and real-world settings indicate that UniAff significantly improves the generalization of robotic manipulation for tools and articulated objects. We hope that UniAff will serve as a general baseline for unified robotic manipulation tasks in the future. Images, videos, dataset and code are published on the project website at: https://sites.google.com/view/uni-aff/home.
|
|
15:30-15:35, Paper WeDT11.4 | |
ConceptAgent: LLM-Driven Precondition Grounding and Tree Search for Robust Task Planning and Execution |
|
Rivera, Corban | Johns Hopkins University Applied Physics Lab |
Byrd, Grayson | Johns Hopkins University |
Paul, William | Johns Hopkins University Applied Physics Lab |
Feldman, Tyler | Johns Hopkins University Applied Physics Laboratory |
Booker, Meghan | Princeton University |
Holmes, Emma | Johns Hopkins University Applied Physics Lab |
Handelman, David | American Android Corp |
Kemp, Bethany | Johns Hopkins Applied Physics Laboratory |
Badger, Andrew | JHUAPL |
Schmidt, Aurora | Johns Hopkins University Applied Physic Laboratory |
Jatavallabhula, Krishna Murthy | MIT |
de Melo, Celso | CCDC US Army Research Laboratory |
Seenivasan, Lalithkumar | Johns Hopkins University |
Unberath, Mathias | Johns Hopkins University |
Chellappa, Rama | Johns Hopkins University |
Keywords: Agent-Based Systems, Mobile Manipulation
Abstract: Robotic planning and execution in open-world environments is a complex problem due to the vast state spaces and high variability of task embodiment. Recent advances in perception algorithms, combined with Large Language Models (LLMs) for planning, offer promising solutions to these challenges, as the common sense reasoning capabilities of LLMs provide a strong heuristic for efficiently searching the action space. However, prior work fails to address the possibility of hallucinations from LLMs, which results in failures to execute the planned actions largely due to logical fallacies at high or low levels. To contend with automation failure due to such hallucinations, we introduce ConceptAgent, a natural language-driven robotic platform designed for task execution in unstructured environments. With a focus on scalability and reliability of LLM-based planning in complex state and action spaces, we present innovations designed to limit these shortcomings, including 1) Predicate Grounding to prevent and recover from infeasible actions, and 2) an embodied version of LLM-guided Monte Carlo Tree Search with self-reflection. ConceptAgent combines these planning enhancements with dynamic language-aligned 3D scene graphs and large multi-modal pretrained models to perceive, localize, and interact with its environment, enabling reliable task completion. In simulation experiments, ConceptAgent achieved a 19% task completion rate across three room layouts and 30 easy-level embodied tasks, outperforming other state-of-the-art LLM-driven reasoning baselines that scored 10.26% and 8.11% on the same benchmark. Additionally, ablation studies on moderate to hard embodied tasks revealed a 20% increase in task completion from the baseline agent to the fully enhanced ConceptAgent, highlighting the individual and combined contributions of Predicate Grounding and LLM-guided Tree Search to enable more robust automation in complex state and action spaces.
|
|
15:35-15:40, Paper WeDT11.5 | |
Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-Guided 3D Policy |
|
Garcia-Pinel, Ricardo | Inria |
Chen, Shizhe | Inria |
Schmid, Cordelia | Inria |
Keywords: Grippers and Other End-Effectors, Software Tools for Benchmarking and Reproducibility, Deep Learning Methods
Abstract: Generalizing language-conditioned robotic policies to new tasks remains a significant challenge, hampered by the lack of suitable simulation benchmarks. In this paper, we address this gap by introducing GemBench, a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies. GemBench incorporates seven general action primitives and four levels of generalization, spanning novel placements, rigid and articulated objects, and complex long-horizon tasks. We evaluate state-of-the-art approaches on GemBench and also introduce a new method. Our approach 3D-LOTUS leverages rich 3D information for action prediction conditioned on language. While 3D-LOTUS excels in both efficiency and performance on seen tasks, it struggles with novel tasks. To address this, we present 3D-LOTUS++, a framework that integrates 3D-LOTUS's motion planning capabilities with the task planning capabilities of LLMs and the object grounding accuracy of VLMs. 3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation. Code, dataset, real robot videos and trained models are available at https://www.di.ens.fr/willow/research/gembench/.
|
|
15:40-15:45, Paper WeDT11.6 | |
Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs |
|
Mavrogiannis, Angelos | University of Maryland, College Park |
Yuan, Dehao | University of Maryland, College Park |
Aloimonos, Yiannis | University of Maryland |
Keywords: AI-Based Methods, Computer Architecture for Robotic and Automation, Software, Middleware and Programming Environments
Abstract: There has been a lot of interest in grounding natural language to physical entities through visual context. While Vision Language Models (VLMs) can ground linguistic instructions to visual sensory information, they struggle with grounding non-visual attributes, like the weight of an object. Our key insight is that non-visual attribute detection can be effectively achieved by active perception guided by visual reasoning. To this end, we present a perception-action API that consists of VLMs and Large Language Models (LLMs) as backbones, together with a set of robot control functions. When prompted with this API and a natural language query, an LLM generates a program to actively identify attributes given an input image. Offline testing on the Odd-One-Out dataset demonstrates that our framework outperforms vanilla VLMs in detecting attributes like relative object location, size, and weight. Online testing in realistic household scenes on AI2-THOR and a real robot demonstration on a DJI RoboMaster EP robot highlight the efficacy of our approach.
|
|
15:45-15:50, Paper WeDT11.7 | |
ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models |
|
Ma, Runyu | TU Delft |
Luijkx, Jelle Douwe | Delft University of Technology |
Ajanovic, Zlatan | RWTH Aachen University |
Kober, Jens | TU Delft |
Keywords: AI-Based Methods, Reinforcement Learning
Abstract: In robot manipulation, Reinforcement Learning (RL) often suffers from low sample efficiency and uncertain convergence, especially in large observation and action spaces. Foundation Models (FMs) offer an alternative, demonstrating promise in zero-shot and few-shot settings. However, they can be unreliable due to limited physical and spatial understanding. We introduce ExploRLLM, a method that combines the strengths of both paradigms. In our approach, FMs improve RL convergence by generating policy code and efficient representations, while a residual RL agent compensates for the FMs' limited physical understanding. We show that ExploRLLM outperforms both policies derived from FMs and RL baselines in table-top manipulation tasks. Additionally, real-world experiments show that the policies exhibit promising zero-shot sim-to-real transfer. Supplementary material is available at https://explorllm.github.io.
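A minimal sketch of the residual idea described above: a foundation-model action is corrected by a small learned residual. The scaling factor, clipping range, and action dimensionality are illustrative assumptions, not the paper's values.

import numpy as np

def combined_action(fm_action, residual_policy, obs, scale=0.2):
    # Add a scaled learned residual to the foundation-model action and clip
    # to a normalized action range.
    residual = residual_policy(obs)
    combined = np.asarray(fm_action, dtype=float) + scale * np.asarray(residual, dtype=float)
    return np.clip(combined, -1.0, 1.0)

# Stand-in residual policy for a 2-D action space.
residual_policy = lambda obs: [0.05, -0.02]
print(combined_action([0.4, 0.1], residual_policy, obs=None))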
|
|
WeDT12 |
315 |
Robotics and Automation in Construction and Industry |
Regular Session |
Co-Chair: Werfel, Justin | Harvard University |
|
15:15-15:20, Paper WeDT12.1 | |
Physical Simulation with Force Feedback Aids Robot Factors Design |
|
Kaeser, Carina | Student |
Melenbrink, Nathan | Harvard University |
Karp, Allison | Harvard |
Werfel, Justin | Harvard University |
Keywords: Product Design, Development and Prototyping, Space Robotics and Automation, Haptics and Haptic Interfaces
Abstract: "Robot factors" design, analogous to ergonomics for humans, seeks to create devices and equipment that can be readily operated by robots, by considering typical capabilities of current robots throughout the design process. While a number of principles and heuristics for robot factors design have been identified, the successful design of hardware operable by autonomous robots often depends in practice on the designer's intuition about robot capabilities, developed through personal experience working with robots. Here we present a tool we have developed to help evaluate a potential device design for usability by a robot, by allowing a designer to in effect teleoperate a virtual robot and attempt the operation of the device. The tool uses a 3D physics-based simulation built in Unity, and a Phantom Omni / Geomagic Touch haptic device that controls the virtual robot's end-effector and provides force feedback. Through user studies, we show that the use of this tool can significantly improve a user's estimation of the suitability of a design for robot operation, in two case studies involving replacing a unit in a modular hardware system and unzipping a canvas bag. By incorporating the use of such a tool early in the design cycle, designers can more effectively develop equipment to be used by autonomous robots without themselves needing direct robotics experience; as a result, robots will be able to take on more tasks in the nearer term with current robot technology.
|
|
15:20-15:25, Paper WeDT12.2 | |
Environmental Map Learning with Multiple-Robots |
|
Shamshirgaran, Azin | University of California, Merced |
Carpin, Stefano | University of California, Merced |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, Multi-Robot Systems
Abstract: This paper explores decision-making processes in robotic systems tasked with reconstructing scalar fields through sensing in uncertain environments. Each robot must handle noisy perception and operate within specific environmental and physical constraints. The complexity increases in multi-agent scenarios, where robots must not only plan their actions but also anticipate the movements and strategies of other agents. Effective coordination is crucial to prevent collisions and minimize redundant tasks. To address this challenge, we propose an online, distributed multi-robot sampling algorithm that combines Monte Carlo Tree Search (MCTS) with Gaussian regression. In this approach, each robot iteratively selects its next sampling point while exchanging limited information with other robots and predicting their future actions. Predictions about other robots' future actions are computed with an MCTS that is recomputed at each iteration to incorporate all information collected up to that point. We evaluate the performance of our method across diverse environments and team sizes, comparing it to algorithmic alternatives.
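A minimal sketch of one informative-sampling step: fit a Gaussian process to the samples collected so far, then pick the candidate point with the highest predictive uncertainty that is not too close to points other robots are predicted to visit. A greedy selection is used here as a stand-in for the paper's MCTS, and the separation threshold is an illustrative assumption.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def next_sample_point(X_obs, y_obs, candidates, teammates_predicted, min_sep=1.0):
    # Greedy uncertainty-driven selection with a simple redundancy penalty.
    gp = GaussianProcessRegressor().fit(X_obs, y_obs)
    _, std = gp.predict(candidates, return_std=True)
    for other in teammates_predicted:
        too_close = np.linalg.norm(candidates - other, axis=1) < min_sep
        std[too_close] = 0.0                      # discourage sampling where a teammate will go
    return candidates[int(np.argmax(std))]

X_obs = np.array([[0.0, 0.0], [1.0, 1.0]])
y_obs = np.array([0.2, 0.8])
cands = np.array([[0.5, 0.5], [2.0, 2.0], [0.9, 1.1]])
print(next_sample_point(X_obs, y_obs, cands, teammates_predicted=[np.array([2.0, 2.0])]))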
|
|
15:25-15:30, Paper WeDT12.3 | |
SLABIM: A SLAM-BIM Coupled Dataset in HKUST Main Building |
|
Huang, Haoming | The Hong Kong University of Science and Technology |
Qiao, Zhijian | Hong Kong University of Science and Technology |
Yu, Zehuan | Hong Kong University of Science and Technology |
Liu, Chuhao | Hong Kong University of Science and Technology |
Shen, Shaojie | Hong Kong University of Science and Technology |
Zhang, Fumin | Hong Kong University of Science and Technology |
Yin, Huan | Hong Kong University of Science and Technology |
Keywords: Robotics and Automation in Construction, Data Sets for SLAM, Data Sets for Robotic Vision
Abstract: Existing indoor SLAM datasets primarily focus on robot sensing, often lacking building architectures. To address this gap, we design and construct the first dataset to couple SLAM and BIM, named SLABIM. This dataset provides BIM and SLAM-oriented sensor data, both modeling a university building at HKUST. The as-designed BIM is decomposed and converted for ease of use. We employ a multi-sensor suite for multi-session data collection and mapping to obtain the as-built model. All the related data are timestamped and organized, enabling users to deploy and test effectively. Furthermore, we deploy advanced methods and report the experimental results on three tasks: registration, localization and semantic mapping, demonstrating the effectiveness and practicality of SLABIM. We make our dataset open-source at https://github.com/HKUST-Aerial-Robotics/SLABIM.
|
|
15:30-15:35, Paper WeDT12.4 | |
Unified Adaptive and Cooperative Planning Using Multi-Task Coregionalized Gaussian Processes |
|
Booth, Lorenzo A. | University of California Merced |
Carpin, Stefano | University of California, Merced |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, Planning, Scheduling and Coordination
Abstract: For robots tasked with surveying the temporal dynamics of a changing environment, a choice must be made to observe novel regions of the environment or to re-survey previously visited regions, which may have changed. We present a novel multi-robot informative path planner (IPP) that combines an environmental and task kernel to direct mobile robots to gather samples from regions that would result in the greatest expected improvement in map accuracy. Our planner utilizes a multi-output Gaussian process to unify priors about the spatiotemporal environment along with priors about observational correlations between sensing vehicles. Additionally, we extend our analysis into an adaptive planning scenario and examine the performance under different planning configurations. We find that planning performance is largely driven by the choice of environmental priors, and that unrepresentative priors can be improved through adaptive planning.
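A minimal sketch of a multi-output (coregionalized) Gaussian process covariance of the kind described above: a task covariance B over the sensing vehicles is combined with one shared spatial kernel, K_joint = B ⊗ k(X, X). The kernel choice, lengthscale, and task covariance values are illustrative assumptions, not the paper's exact model.

import numpy as np

def rbf(X, Y, lengthscale=1.0):
    # Squared-exponential spatial kernel.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def coregionalized_cov(X, B, lengthscale=1.0):
    # Intrinsic coregionalization model: joint covariance over correlated outputs.
    return np.kron(B, rbf(X, X, lengthscale))

# Two correlated outputs (e.g., two sensing vehicles), three spatial points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = np.array([[1.0, 0.8], [0.8, 1.0]])
K = coregionalized_cov(X, B)
print(K.shape)  # (6, 6) joint covariance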
|
|
15:35-15:40, Paper WeDT12.5 | |
COIGAN: Controllable Object Inpainting through Generative Adversarial Network for Defect Synthesis in Data Augmentation |
|
Biancucci, Massimiliano | Università Politecnica Delle Marche |
Galdelli, Alessandro | Università Politecnica Delle Marche |
Narang, Gagan | Università Politecnica Delle Marche |
Pietrini, Rocco | Università Politecnica Delle Marche |
Mancini, Adriano | Università Politecnica Delle Marche |
Zingaretti, Primo | Università Politecnica Delle Marche |
Keywords: Robotics and Automation in Construction, AI-Enabled Robotics, Deep Learning Methods
Abstract: Predictive maintenance is a key aspect for the safety of critical infrastructure such as bridges, dams, and tunnels, where a failure can lead to catastrophic outcomes in terms of human lives and costs. The surge in Artificial Intelligence-driven visual robotic inspection methods necessitates high-quality datasets containing diverse defect classes with several instances under different conditions (e.g., material, illumination). In this context, we introduce a Controllable Object Inpainting Generative Adversarial Network (COIGAN) to synthetically generate realistic images that augment defect datasets. The effectiveness of the model is quantitatively validated by a Fréchet Inception Distance, which measures the similarity between the generated and training samples. To further evaluate the impact of COIGAN-generated images, a segmentation task was conducted, utilizing key performance metrics such as segmentation accuracy, mAP, mIoU, and F1 score, demonstrating that the synthetic images integrate seamlessly and produce results comparable to real defect images. Subsequently, COIGAN's generative capability was successfully used for the segmentation of a defect-free dataset by inpainting defects. The results showcase COIGAN's ability to learn defect patterns and apply them in new contexts, preserving the original features of the base image and allowing the creation of new datasets with a desired multi-class distribution. Specifically, in the context of predictive maintenance, COIGAN enriches datasets, enabling deep learning models to more effectively identify potential infrastructure anomalies. Project page: https://bit.ly/4bzxwqf.
|
|
15:40-15:45, Paper WeDT12.6 | |
Diffusion Based Robust LiDAR Place Recognition |
|
Krummenacher, Benjamin | ETH Zurich |
Frey, Jonas | ETH Zurich |
Tuna, Turcan | ETH Zurich, Robotic Systems Lab |
Vysotska, Olga | ETH Zurich |
Hutter, Marco | ETH Zurich |
Keywords: Robotics and Automation in Construction, Localization
Abstract: Mobile robots on construction sites require accurate pose estimation to perform autonomous surveying and inspection missions. Localization in construction sites is a particularly challenging problem due to the presence of repetitive features such as flat plastered walls and perceptual aliasing due to apartments with similar layouts across and within floors. In this paper, we focus on global re-positioning of a robot with respect to an accurate scanned mesh of the building solely using LiDAR data. In our approach, a neural network is trained on synthetic LiDAR point clouds generated by simulating a LiDAR in an accurate real-life large-scale mesh. We train a diffusion model with a PointNet++ backbone, which allows us to model multiple position candidates from a single LiDAR point cloud. The resulting model can successfully predict the global position of the LiDAR in confined and complex sites despite the adverse effects of perceptual aliasing. The learned distribution over potential global positions can provide a multi-modal position estimate. We evaluate our approach across five real-world datasets and show an average place recognition accuracy of 77% (2 m threshold) while outperforming baselines by a factor of 2 in mean error.
|
|
15:45-15:50, Paper WeDT12.7 | |
Enhancing Robotic Precision in Construction: A Modular Factor Graph-Based Framework to Deflection and Backlash Compensation Using High-Accuracy Accelerometers |
|
Kindle, Julien | ETH Zurich |
Loetscher, Michael | ETH Zurich, Hilti |
Alessandretti, Andrea | Hilti Group |
Cadena, Cesar | ETH Zurich |
Hutter, Marco | ETH Zurich |
Keywords: Robotics and Automation in Construction, Localization, Sensor Fusion
Abstract: Accurate positioning is crucial in the construction industry, where labor shortages highlight the need for automation. Robotic systems with long kinematic chains are required to reach complex workspaces, including floors, walls, and ceilings. These requirements significantly impact positioning accuracy due to effects such as deflection and backlash in various parts along the kinematic chain. In this letter, we introduce a novel approach that integrates deflection and backlash compensation models with high-accuracy accelerometers, significantly enhancing position accuracy. Our method employs a modular framework based on a factor graph formulation to estimate the state of the kinematic chain, leveraging acceleration measurements to inform the model. Extensive testing on publicly released datasets, reflecting real-world construction disturbances, demonstrates the advantages of our approach. The proposed method reduces the 95% error threshold in the xy-plane by 50% compared to the state-of-the-art Virtual Joint Method, and by 31% when incorporating base tilt compensation.
|
|
WeDT13 |
316 |
Manipulation and Locomotion Using Magnetic Fields |
Regular Session |
Chair: Tanner, Herbert G. | University of Delaware |
|
15:15-15:20, Paper WeDT13.1 | |
Open-Loop Position Control of a Miniature Magnetic Robot Using Two-Dimensional Divergence Control of a Magnetic Force |
|
Lee, Hakjoon | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Latifi Gharamaleki, Nader | DGIST |
Choi, Hongsoo | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Keywords: Micro/Nano Robots, Automation at Micro-Nano Scales, Actuation and Joint Mechanisms
Abstract: Miniature magnetic robots have attracted considerable attention as promising tools in biomedical applications due to their wireless actuation and precise controllability in a minimally invasive manner. Traditionally, magnetic microrobots have been controlled by globally applied magnetic torques and forces generated by external magnetic actuation systems (MASs), which typically require closed-loop control with real-time vision tracking—a challenging requirement in in-vivo environments. To address this issue, this paper suggests a novel open-loop control scheme for magnetic robots, using two-dimensional (2D) divergence control of a magnetic force generated by stationary electromagnets. Constraint equations for the currents applied to the electromagnets were established to achieve 2D divergence control of a magnetic force. Numerical simulation and experimental validations demonstrate that this approach can generate sufficient magnetic forces that either converge at or diverge from a target point, enabling effective open-loop position control of a miniature magnetic robot. Due to the absence of vision feedback and mechanical motions of magnets, the proposed control strategy could be more clinically applicable for medical applications of magnetic robots.
|
|
15:20-15:25, Paper WeDT13.2 | |
An Equilibrium Analysis of Magnetic Quadrupole Force Field with Applications to Microrobotic Swarm Coordination |
|
Faros, Ioannis | University of Delaware |
Tanner, Herbert G. | University of Delaware |
Keywords: Swarm Robotics, Planning, Scheduling and Coordination, Micro/Nano Robots
Abstract: Controlled microrobots in fluidic environments hold promise for precise drug delivery and cell manipulation, opening new ways for personalized healthcare. However, coordinating magnetic microrobot swarms presents significant challenges due to the complexity of the associated actuation mechanisms. While existing methods to achieve motion differentiation in collections of microrobots rely on design variations among them, the work reported here applies to homogeneous collectives and enables them to be steered as a whole or in fragments, by means of a common externally generated force field. This paper contributes to an emerging set of methods that enable swarm control through manipulation of these force fields. This paper in particular exploits the nature of force field equilibria in a quadrupole workspace configuration as a means of steering the swarm while maintaining its cohesion. The approach also enables splitting the swarm in two subgroups in order to direct each simultaneously to a different location.
|
|
15:25-15:30, Paper WeDT13.3 | |
Ensemble Control of a 2-DOF Parallel Link Arm in a Capsule Robot Using Oscillating External Magnetic Fields |
|
Zhao, Zihan | The University of Sheffield |
Hafez, Ahmed | University of Sheffield |
Miyashita, Shuhei | University of Sheffield |
Keywords: Medical Robots and Systems, Mechanism Design, Micro/Nano Robots
Abstract: Providing oral capsule robots with additional degrees of freedom (DOF), such as robotic arms, is crucial for enhancing their functionality within the body. However, a key challenge arises when using rotating magnetic fields to drive the motor within the robot, as the resulting torque causes the entire capsule to rotate. In this work, we propose a novel approach to actuate a 2 DOF parallel link robot arm integrated into a capsule robot, using external magnetic fields. Our method employs two identical magnetic motors we proposed in a previous study, each driven by an oscillating magnetic field, which alternates direction along a specific axis. By independently controlling the rotation of each motor through the same magnetic field, ensemble control is achieved. The symmetrically arranged motors exhibit different angular velocities, enabling dexterous movement of the robot arm. We further theoretically show that this approach significantly reduces the torque exerted on the robot compared to traditional rotating magnetic fields. Finally, we demonstrate the performance of the robot by moving its arms and the attached end-effector along a pre-defined trajectory.
|
|
15:30-15:35, Paper WeDT13.4 | |
Deep Reinforcement Learning-Based Semi-Autonomous Control for Magnetic Micro-Robot Navigation with Immersive Manipulation |
|
Mao, Yudong | Imperial College London |
Zhang, Dandan | Imperial College London |
Keywords: Micro/Nano Robots, Automation at Micro-Nano Scales
Abstract: Magnetic micro-robots have demonstrated immense potential in biomedical applications, such as in vivo drug delivery, non-invasive diagnostics, and cell-based therapies, owing to their precise maneuverability and small size. However, current micromanipulation techniques often rely solely on a two-dimensional (2D) microscopic view as sensory feedback, while traditional control interfaces do not provide an intuitive manner for operators to manipulate micro-robots. These limitations increase the cognitive load on operators, who must interpret limited feedback and translate it into effective control actions. To address these challenges, we propose a Deep Reinforcement Learning-Based Semi-Autonomous Control (DRL-SC) framework for magnetic micro-robot navigation in a simulated microvascular system. Our framework integrates Mixed Reality (MR) to facilitate immersive manipulation of micro-robots, thereby enhancing situational awareness and control precision. Simulation and experimental results demonstrate that our approach significantly improves navigation efficiency, reduces control errors, and enhances the overall robustness of the system in simulated microvascular environments.
|
|
15:35-15:40, Paper WeDT13.5 | |
OMASTAR: Optimal Magnetic Actuation System Arrangement |
|
Palanichamy, Veerash | McMaster University |
Saad, Hussein | McMaster University |
Giamou, Matthew | McMaster University |
Onaizah, Onaizah | McMaster University |
Keywords: Micro/Nano Robots, Surgical Robotics: Steerable Catheters/Needles, Optimization and Optimal Control
Abstract: Microrobots and other miniature robots are able to access millimeter-sized spaces and thus have the potential to solve many challenging problems in healthcare. However, clinical adoption of these robots is rare as these systems are often difficult to scale up. One such issue arises from the actuation systems used to remotely control magnetic microrobots, which tend to be bulky and obstruct the surgeons' workspaces. They also do not guarantee wide ranges of magnetic fields and forces in a large patient-sized workspace. In this paper, we present the design of a permanent magnet-based actuation system that fits within a 40 cm cube of space under an operating table. We also formulate a new set function maximization-based approach for efficiently designing E-optimal magnet arrangements with off-the-shelf convex solvers. Our optimization method is evaluated with synthetic data and a proof-of-concept of the system is simulated.
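A minimal sketch of E-optimal arrangement design via greedy set-function maximization, as a simplified stand-in for the convex-solver-based approach described above: each candidate magnet pose contributes a rank-one information term, and poses are added one at a time to maximize the smallest eigenvalue. The candidate representation and problem size are illustrative assumptions.

import numpy as np

def greedy_e_optimal(candidate_rows, k):
    # Greedily select k candidates maximizing the smallest eigenvalue of
    # A(S) = sum_i a_i a_i^T over the selected set S.
    d = candidate_rows.shape[1]
    selected = []
    remaining = list(range(len(candidate_rows)))
    A = np.zeros((d, d))
    for _ in range(k):
        best_j, best_val = None, -np.inf
        for j in remaining:
            a = candidate_rows[j][:, None]
            val = np.linalg.eigvalsh(A + a @ a.T)[0]   # smallest eigenvalue
            if val > best_val:
                best_j, best_val = j, val
        selected.append(best_j)
        remaining.remove(best_j)
        a = candidate_rows[best_j][:, None]
        A += a @ a.T
    return selected

rng = np.random.default_rng(1)
rows = rng.normal(size=(20, 3))   # candidate magnet "information" directions (illustrative)
print(greedy_e_optimal(rows, k=5))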
|
|
15:40-15:45, Paper WeDT13.6 | |
Measuring DNA Microswimmer Locomotion in Complex Flow Environments |
|
Imamura, Taryn | Carnegie Mellon University |
Kent, Teresa | Carnegie Mellon University |
Taylor, Rebecca | Carnegie Mellon University |
Bergbreiter, Sarah | Carnegie Mellon University |
Keywords: Micro/Nano Robots, Biologically-Inspired Robots, Automation at Micro-Nano Scales
Abstract: Microswimmers are sub-millimeter swimming robots that show potential as a platform for controllable locomotion in applications, including targeted cargo delivery and minimally invasive surgery. To be viable for these target applications, microswimmers will eventually need to be able to navigate environments with dynamic fluid flows and forces. Experimental studies with microswimmers towards this goal are currently rare because of the difficulty of isolating intentional microswimmer locomotion from environment-induced motion. In this work, we present a method for measuring microswimmer locomotion within a complex flow environment using fiducial microspheres. By tracking the particle motion of ferromagnetic and non-magnetic polystyrene fiducial microspheres, we capture the effect of fluid flow and magnetic field gradients on microswimmer trajectories. We then determine the field-driven translation of these microswimmers relative to fluid flow and demonstrate the effectiveness of this method by illustrating the motion of multiple microswimmers through different flows.
|
|
15:45-15:50, Paper WeDT13.7 | |
Position Regulation of a Conductive Nonmagnetic Object with Two Stationary Field Sources |
|
Dalton, Devin | University of Utah |
Tabor, Griffin | University of Utah |
Hermans, Tucker | University of Utah |
Abbott, Jake J. | University of Utah |
Keywords: Dexterous Manipulation, Force Control, Manipulation Planning, Space Robotics and Automation
Abstract: Recent research has shown that eddy currents induced by rotating magnetic dipole fields in conductive nonmagnetic objects can produce forces and torques that enable dexterous manipulation. This paradigm shows promise for application in the remediation of space debris. The induced force from each rotating-magnetic-dipole field source always includes a repulsive component, suggesting that the object should be surrounded by field sources to some degree to ensure the object does not leave the dexterous workspace during manipulation. In this paper, we show that it is possible to fully control the position of an object using just two stationary field sources, provided the object is near the midpoint between the field sources. A given position controller requires a low-level force controller. We propose two new force controllers and compare them with the state-of-the-art method from the literature. One of the new force controllers is particularly good at not inducing parasitic torques, which is hypothesized to be beneficial for future tasks manipulating rotating resident space objects. We perform experimental verification using numerical and physical simulators of microgravity.
|
|
WeDT14 |
402 |
Social Navigation 1 |
Regular Session |
Chair: Mavrogiannis, Christoforos | University of Michigan |
Co-Chair: Kästner, Linh | T-Mobile, TU Berlin |
|
15:15-15:20, Paper WeDT14.1 | |
From Cognition to Precognition: A Future-Aware Framework for Social Navigation |
|
Gong, Zeying | Hong Kong University of Science and Technology (Guangzhou) |
Hu, Tianshuai | The Hong Kong University of Science and Technology |
Qiu, Ronghe | The Hong Kong University of Science and Technology (Guangzhou) |
Liang, Junwei | HKUST (Guangzhou) |
Keywords: Vision-Based Navigation, Human-Aware Motion Planning
Abstract: To navigate safely and efficiently in crowded spaces, robots should not only perceive the current state of the environment but also anticipate future human movements. In this paper, we propose a reinforcement learning architecture, namely Falcon, to tackle socially-aware navigation by explicitly predicting human trajectories and penalizing actions that block future human paths. To facilitate realistic evaluation, we introduce a novel SocialNav benchmark containing two new datasets, Social-HM3D and Social-MP3D. This benchmark offers large-scale photo-realistic indoor scenes populated with a reasonable amount of human agents based on scene area size, incorporating natural human movements and trajectory patterns. We conduct a detailed experimental analysis with the state-of-the-art learning-based method and two classic rule-based path-planning algorithms on the new benchmark. The results demonstrate the importance of future prediction and our method achieves the best task success rate of 55% while maintaining about 90% personal space compliance. We will release our code and datasets.
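A minimal sketch of a reward term that penalizes robot actions blocking predicted future human paths, in the spirit of the approach described above; the clearance threshold and weight are illustrative assumptions, not the paper's reward design.

import numpy as np

def future_blocking_penalty(robot_next_pos, predicted_human_traj, clearance=0.5, weight=1.0):
    # Penalize a candidate robot position that comes within `clearance` of any
    # predicted future human waypoint.
    d = np.linalg.norm(
        np.asarray(predicted_human_traj, dtype=float) - np.asarray(robot_next_pos, dtype=float),
        axis=1,
    )
    return -weight * float(np.sum(np.maximum(0.0, clearance - d)))

# Predicted human path passes close to the candidate robot position (1.0, 1.0).
print(future_blocking_penalty([1.0, 1.0], [[0.5, 0.5], [0.9, 0.9], [1.3, 1.3]]))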
|
|
15:20-15:25, Paper WeDT14.2 | |
OLiVia-Nav: An Online Lifelong Vision Language Approach for Mobile Robot Social Navigation |
|
Narasimhan, Siddarth | University of Toronto |
Tan, Aaron Hao | University of Toronto |
Choi, Daniel | University of Toronto |
Nejat, Goldie | University of Toronto |
Keywords: Service Robotics, Human-Aware Motion Planning, Continual Learning
Abstract: Service robots in human-centered environments such as hospitals, office buildings, and long-term care homes need to navigate while adhering to social norms to ensure the safety and comfort of the people they are sharing the space with. Furthermore, they need to adapt to new social scenarios that can arise during robot navigation. In this paper, we present a novel Online Lifelong Vision Language architecture, OLiVia-Nav, which uniquely integrates vision-language models (VLMs) with an online lifelong learning framework for robot social navigation. We introduce a unique distillation approach, Social Context Contrastive Language Image Pre-training (SC-CLIP), to transfer the social reasoning capabilities of large VLMs to a lightweight VLM, in order for OLiVia-Nav to directly encode social and environment context during robot navigation. These encoded embeddings are used to generate and select socially compliant robot trajectories. The lifelong learning capabilities of SC-CLIP enable OLiVia-Nav to update the robot trajectory planning over time as new social scenarios are encountered. We conducted extensive real-world experiments in diverse social navigation scenarios. The results showed that OLiVia-Nav outperformed existing state-of-the-art DRL and VLM methods in terms of mean squared error, Hausdorff loss, and personal space violation duration. Ablation studies also verified the design choices for OLiVia-Nav.
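A minimal sketch of a CLIP-style contrastive distillation loss for aligning a lightweight student VLM's embeddings with those of a large teacher, in the spirit of the distillation step described above; this generic formulation and the temperature value are assumptions, not the paper's SC-CLIP loss.

import torch
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb, temperature=0.07):
    # Student embeddings should match the corresponding teacher embeddings
    # and differ from the other samples in the batch.
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / temperature
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)

student = torch.randn(8, 512)
teacher = torch.randn(8, 512)
print(distillation_loss(student, teacher))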
|
|
15:25-15:30, Paper WeDT14.3 | |
Arena 4.0: A Comprehensive ROS2 Development and Benchmarking Platform for Human-Centric Navigation Using Generative-Model-Based Environment Generation |
|
Shcherbyna, Volodymyr | Technical University Berlin |
Kästner, Linh | T-Mobile, TU Berlin |
Diaz, Diego | Technical University Berlin |
Nguyen Huu Truong, Giang | Hanoi University of Science and Technology |
Schreff, Maximilian Ho-Kyoung | Technical University Berlin |
Seeger, Tim | Technical University Berlin |
Kreutz, Jonas | Technical University Berlin |
Martban, Ahmed | Technical University Berlin |
Shen, Zhengcheng | TU Berlin |
Zeng, Huajian | Technical University Munich |
Soh, Harold | National University of Singapore |
Keywords: Software Tools for Benchmarking and Reproducibility, Simulation and Animation, Human-Aware Motion Planning
Abstract: Building upon the foundations laid by our previous work, this paper introduces Arena 4.0, a significant advancement of Arena 3.0, Arena-Bench, Arena 1.0, and Arena 2.0. Arena 4.0 provides three main novel contributions: 1) a generative-model-based world and scenario generation approach using large language models (LLMs) and diffusion models to dynamically generate complex, human-centric environments from text prompts or 2D floorplans that can be used for development and benchmarking of social navigation strategies. 2) A comprehensive 3D model database which can be extended with 3D assets and semantically linked and annotated using a variety of metrics for dynamic spawning and arrangements inside 3D worlds. 3) The complete migration towards ROS 2, which ensures operation with state-of-the-art hardware and functionalities for improved navigation, usability, and simplified transfer towards real robots. We evaluated the platform's performance through a comprehensive user study and its world generation capabilities for benchmarking, demonstrating significant improvements in usability and efficiency compared to previous versions. Arena 4.0 is openly available at https://github.com/Arena-Rosnav.
|
|
15:30-15:35, Paper WeDT14.4 | |
Active Inference-Based Planning for Safe Human-Robot Interaction: Concurrent Consideration of Human Characteristic and Rationality |
|
Nam, Youngim | Ulsan National Institute of Science and Technology |
Kwon, Cheolhyeon | Ulsan National Institute of Science and Technology |
Keywords: Human-Aware Motion Planning, Safety in HRI, Planning under Uncertainty
Abstract: This paper proposes a motion planning strategy for a robot to safely interact with humans exhibiting uncertain actions. Human actions are often encoded by internal states attributed to human characteristics and rationality. First, by leveraging a continuous level of rationality, we compute the belief on human rationality along with his/her characteristic. This systematically reasons about the uncertainty in the observed human action, thereby better assessing the potential safety risks during the interaction. Second, based on the computed belief over the human internal states, we formulate a Stochastic Model Predictive Control (SMPC) problem to plan the robot’s actions such that it safely achieves its goal while also actively inferring the human internal state. To cope with the expensive computation of the SMPC, we develop a sampling-based technique that efficiently evaluates the robot’s actions conditioned on human uncertainty. The experimental results demonstrate that the proposed strategy excels in human action prediction and significantly improves the safety and efficiency of Human-Robot Interaction (HRI).
|
|
15:35-15:40, Paper WeDT14.5 | |
Characterizing the Complexity of Social Robot Navigation Scenarios |
|
Stratton, Andrew | University of Michigan |
Hauser, Kris | University of Illinois at Urbana-Champaign |
Mavrogiannis, Christoforos | University of Michigan |
Keywords: Human-Aware Motion Planning, Performance Evaluation and Benchmarking, Human-Centered Robotics
Abstract: Social robot navigation algorithms are often demonstrated in overly simplified scenarios, prohibiting the extraction of practical insights about their relevance to real-world domains. Our key insight is that an understanding of the inherent complexity of a social robot navigation scenario could help characterize the limitations of existing navigation algorithms and provide actionable directions for improvement. Through an exploration of recent literature, we identify a series of factors contributing to the complexity of a scenario, disambiguating between contextual and robot-related ones. We then conduct a simulation study investigating how manipulations of contextual factors impact the performance of a variety of navigation algorithms. We find that dense and narrow environments correlate most strongly with performance drops, while the heterogeneity of agent policies and directionality of interactions have a less pronounced effect. This motivates a shift towards developing and testing algorithms under higher-complexity settings.
|
|
15:40-15:45, Paper WeDT14.6 | |
Domain Randomization for Learning to Navigate in Human Environments (Resubmission) |
|
Ah Sen, Nick | Monash University |
Kulic, Dana | Monash University |
Carreno, Pamela | Monash University |
Keywords: Human-Aware Motion Planning, Reinforcement Learning
Abstract: In shared human-robot environments, effective navigation requires robots to adapt to various pedestrian behaviors encountered in the real world. Most existing deep reinforcement learning algorithms for human-aware robot navigation typically assume that pedestrians adhere to a single walking behavior during training, limiting their practicality/performance in scenarios where pedestrians exhibit various types of behavior. In this work, we propose to enhance the generalization capabilities of human-aware robot navigation by employing Domain Randomization (DR) techniques to train navigation policies on a diverse range of simulated pedestrian behaviors with the hope of better generalization to the real world. We evaluate the effectiveness of our method by comparing the generalization capabilities of a robot navigation policy trained with and without DR, both in simulations and through a real-user study, focusing on adaptability to different pedestrian behaviors, performance in novel environments, and users' perceived comfort, sociability and naturalness. Our findings reveal that the use of DR significantly enhances the robot's social compliance in both simulated and real-life contexts.
|
|
WeDT15 |
403 |
Manipulation Planning |
Regular Session |
Chair: Cheng, Xianyi | Duke University |
Co-Chair: Shirai, Yuki | Mitsubishi Electric Research Laboratories |
|
15:15-15:20, Paper WeDT15.1 | |
Characterizing Manipulation Robustness through Energy Margin and Caging Analysis |
|
Dong, Yifei | KTH |
Cheng, Xianyi | Carnegie Mellon University |
Pokorny, Florian T. | KTH Royal Institute of Technology |
Keywords: Manipulation Planning, Dexterous Manipulation, Grasping
Abstract: To develop robust manipulation policies, quantifying robustness is essential. Evaluating robustness in general dexterous manipulation, nonetheless, poses significant challenges due to complex hybrid dynamics, the combinatorial explosion of possible contact interactions, global geometry, etc. This paper introduces "caging in motion", an approach for analyzing manipulation robustness through energy margins and caging-based analysis. Our method assesses manipulation robustness by measuring the energy margin to failure and extends traditional caging concepts for a global analysis of dynamic manipulation. This global analysis is facilitated by a kinodynamic planning framework that naturally integrates global geometry, contact changes, and robot compliance. We validate the effectiveness of our approach in simulation and real-world experiments across multiple dynamic manipulation scenarios, highlighting its potential to predict manipulation success and robustness.
|
|
15:20-15:25, Paper WeDT15.2 | |
Enhancing Adaptivity of Two-Fingered Object Reorientation Using Tactile-Based Online Optimization of Deconstructed Actions |
|
Huang, Qiyin | Tsinghua University |
Li, Tiemin | Tsinghua University |
Jiang, Yao | Tsinghua University |
Keywords: Manipulation Planning, Grippers and Other End-Effectors, Perception for Grasping and Manipulation
Abstract: Object reorientation is a critical task for robotic grippers, especially when manipulating objects within constrained environments. The task poses significant challenges for motion planning due to high-dimensional output actions and complex input information, including unknown object properties and nonlinear contact forces. Traditional approaches simplify the problem by reducing degrees of freedom, limiting contact forms, or acquiring environment/object information in advance, significantly compromising adaptability. To address these challenges, we deconstruct the complex output actions into three fundamental types based on tactile sensing: task-oriented actions, constraint-oriented actions, and coordinating actions. These actions are then optimized online using gradient optimization to enhance adaptability. Key contributions include simplifying contact state perception, decomposing complex gripper actions, and enabling online action optimization for handling unknown objects or environmental constraints. Experimental results demonstrate that the proposed method is effective across a range of everyday objects, regardless of environmental contact. Additionally, the method exhibits robust performance even in the presence of unknown contacts and nonlinear external disturbances.
|
|
15:25-15:30, Paper WeDT15.3 | |
A Full-Cycle Assembly Operation: From Digital Planning to Trajectory Execution Using a Robotic Arm |
|
Livnat, Dror | Tel Aviv University |
Lavi, Yuval | Tel Aviv University |
Halperin, Dan | Tel Aviv University |
Keywords: Manipulation Planning, Motion and Path Planning
Abstract: We present an end-to-end framework for planning tight assembly operations, where the input is a set of digital models and the output is a full execution plan for a physical robotic arm, including the trajectory placement and the grasping. The framework builds on our earlier results on tight assembly planning for free-flying objects and includes the following novel components: (i) the framework itself together with physical demonstrations, (ii) trajectory placement based on a novel dynamic pathwise IK, and (iii) post-processing of the free-flying paths to relax the tightness and smooth the path. The framework provides guarantees as to the quality of the outcome trajectory. For each component we provide the algorithmic details and a full open-source software package for reproducing the process. Lastly, we demonstrate the framework with tight and challenging assembly problems (as well as puzzles, which are designed to be hard to assemble), using a UR5e robotic arm in the real world and in simulation. See the figure at the top for a physical UR5e assembling the alpha-z puzzle (known to be considerably more complicated to assemble than the celebrated alpha puzzle). Full video clips of all the assembly demonstrations together with our open-source software are available at our project page: https://tau-cgl.github.io/Full-Cycle-Assembly-Operation/
|
|
15:30-15:35, Paper WeDT15.4 | |
Robust Nonprehensile Dynamic Object Transportation: A Closed-Loop Sensitivity Approach |
|
Teimoorzadeh, Ainoor | Technical University of Munich |
Pupa, Andrea | University of Modena and Reggio Emilia |
Selvaggio, Mario | Università Degli Studi Di Napoli Federico II |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Manipulation Planning, Planning under Uncertainty, Motion and Path Planning
Abstract: In this paper, we propose a closed-loop sensitivity-based approach to enhance the robustness of robotic nonprehensile dynamic manipulation tasks. The proposed method aims at fulfilling the transportation of an object, free to move on a tray-shaped robot end-effector, in the face of imperfectly known nominal dynamic parameters. The approach is built on taking the parameterized reference trajectory to be tracked as the optimization variable, minimizing a norm of the system's closed-loop sensitivity. The resulting optimal reference trajectory is inherently more robust to parametric variations of the object's dynamic properties compared to a baseline straight-trajectory execution. The tracking performance is assessed and validated through hardware experiments and an extensive simulation campaign demonstrating the superior robustness of our approach.
|
|
15:35-15:40, Paper WeDT15.5 | |
Hierarchical Contact-Rich Trajectory Optimization for Multi-Modal Manipulation Using Tight Convex Relaxations |
|
Shirai, Yuki | Mitsubishi Electric Research Laboratories |
Raghunathan, Arvind | Mitsubishi Electric Research Laboratories |
Jha, Devesh | Mitsubishi Electric Research Laboratories |
Keywords: Manipulation Planning, Multi-Contact Whole-Body Motion Planning and Control, Optimization and Optimal Control
Abstract: Designing trajectories for manipulation through contact is challenging, as it requires reasoning about object and robot trajectories as well as complex contact sequences simultaneously. In this paper, we present a novel framework for simultaneously designing trajectories of robots, objects, and contacts efficiently for contact-rich manipulation. We propose a hierarchical optimization framework in which a Mixed-Integer Linear Program (MILP) selects optimal contacts between robot and object using approximate dynamical constraints, and a NonLinear Program (NLP) then optimizes the trajectories of the robot(s) and object considering full nonlinear constraints. We present a convex relaxation of bilinear constraints using a binary encoding technique such that the MILP can provide tighter solutions with better computational complexity. The proposed framework is evaluated on various manipulation tasks where it can reason about complex multi-contact interactions while providing computational advantages. We also demonstrate our framework in hardware experiments using a bimanual robot system.
|
|
15:40-15:45, Paper WeDT15.6 | |
Constraining Gaussian Process Implicit Surfaces for Robot Manipulation Via Dataset Refinement |
|
Kumar, Abhinav | University of Michigan |
Mitrano, Peter | University of Michigan |
Berenson, Dmitry | University of Michigan |
Keywords: Manipulation Planning, Motion and Path Planning
Abstract: Model-based control faces fundamental challenges in partially-observable environments due to unmodeled obstacles. We propose an online learning and optimization method to identify and avoid unobserved obstacles online. Our method, Constraint Obeying Gaussian Implicit Surfaces (COGIS), infers contact data using a combination of visual input and state tracking, informed by predictions from a nominal dynamics model. We then fit a Gaussian process implicit surface (GPIS) to these data and refine the dataset through a novel method of enforcing constraints on the estimated surface. This allows us to design a Model Predictive Control (MPC) method that leverages the obstacle estimate to complete multiple manipulation tasks. By modeling the environment instead of attempting to directly adapt the dynamics, our method succeeds at both low-dimensional peg-in-hole tasks and high-dimensional deformable object manipulation tasks. Our method succeeds in 10/10 trials vs 1/10 for a baseline on a real-world cable manipulation task under partial observability of the environment.
|
|
WeDT16 |
404 |
Optimization and Trajectory Planning |
Regular Session |
Chair: Figueroa, Nadia | University of Pennsylvania |
Co-Chair: Zinage, Vrushabh | University of Texas at Austin |
|
15:15-15:20, Paper WeDT16.1 | |
Optimizing Complex Control Systems with Differentiable Simulators: A Hybrid Approach to Reinforcement Learning and Trajectory Planning |
|
Parag, Amit | Sintef Ocean AS |
Mansard, Nicolas | CNRS |
Misimi, Ekrem | SINTEF Ocean |
Keywords: Optimization and Optimal Control, Reinforcement Learning, Machine Learning for Robot Control
Abstract: Deep reinforcement learning (RL) often relies on simulators as abstract oracles to model interactions within complex environments. While differentiable simulators have recently emerged for multi-body robotic systems, they remain underutilized, despite their potential to provide richer information. This underutilization, coupled with the high computational cost of exploration-exploitation in high-dimensional state spaces, limits the practical application of RL in the real world. We propose a method that integrates learning with differentiable simulators to enhance the efficiency of exploration-exploitation. Our approach learns value functions, state trajectories, and control policies from locally optimal runs of a model-based trajectory optimizer. The learned value function acts as a proxy to shorten the preview horizon, while approximated state and control policies guide the trajectory optimization. We benchmark our algorithm on three classical control problems and a torque-controlled 7 degree-of-freedom robot manipulator arm, demonstrating faster convergence and a more efficient symbiotic relationship between learning and simulation for end-to-end training of complex, poly-articulated systems.
|
|
15:20-15:25, Paper WeDT16.2 | |
TransformerMPC: Accelerating Model Predictive Control Via Transformers |
|
Zinage, Vrushabh | University of Texas at Austin |
Khalil, Ahmed | The University of Texas at Austin |
Bakolas, Efstathios | The University of Texas at Austin |
Keywords: Optimization and Optimal Control, AI-Based Methods, Autonomous Agents
Abstract: In this paper, we address the problem of reducing the computational burden of Model Predictive Control (MPC) for real-time robotic applications. We propose TransformerMPC, a method that enhances the computational efficiency of MPC algorithms by leveraging the attention mechanism in transformers for both online constraint removal and better warm start initialization. Specifically, TransformerMPC accelerates the computation of optimal control inputs by selecting only the active constraints to be included in the MPC problem, while simultaneously providing a warm start to the optimization process. This approach ensures that the original constraints are satisfied at optimality. TransformerMPC is designed to be seamlessly integrated with any solver, irrespective of its implementation. To guarantee constraint satisfaction after removing inactive constraints, we perform an offline verification to ensure that the optimal control inputs generated by the solver meet all constraints. The effectiveness of TransformerMPC is demonstrated through extensive numerical simulations on complex robotic systems, achieving up to 35x improvement in runtime without any loss in performance.
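To make the constraint-removal-and-verify idea concrete, the following sketch is not the authors' TransformerMPC pipeline: the transformer-based active-constraint predictor and warm start are abstracted into a boolean `active_mask` (a hypothetical input), and `cvxpy` is assumed as the QP solver. It solves a reduced quadratic program over the predicted-active rows and falls back to the full problem whenever verification against all constraints fails.

```python
import numpy as np
import cvxpy as cp

def reduced_then_verified_qp(Q, q, A, b, active_mask, tol=1e-8):
    """Solve min 0.5 u^T Q u + q^T u  s.t.  A u <= b, first over the rows
    predicted active by a learned model, then verify all constraints and
    re-solve the full problem if any are violated. Q is assumed PSD."""
    u = cp.Variable(Q.shape[0])

    def solve(rows):
        cons = [A[rows] @ u <= b[rows]] if len(rows) else []
        cp.Problem(cp.Minimize(0.5 * cp.quad_form(u, Q) + q @ u), cons).solve()
        return u.value

    u_reduced = solve(np.flatnonzero(active_mask))
    if np.all(A @ u_reduced <= b + tol):
        return u_reduced                     # reduced problem feasible for the full set
    return solve(np.arange(A.shape[0]))      # fall back to the full constraint set
```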
|
|
15:25-15:30, Paper WeDT16.3 | |
A New Semidefinite Relaxation for Linear and Piecewise Affine Optimal Control with Time Scaling |
|
Yang, Lujie | MIT |
Marcucci, Tobia | Massachusetts Institute of Technology |
Parrilo, Pablo | MIT |
Tedrake, Russ | Massachusetts Institute of Technology |
Keywords: Optimization and Optimal Control, Motion and Path Planning
Abstract: We introduce a semidefinite relaxation for optimal control of linear systems with time scaling. These problems are inherently nonconvex, since the system dynamics involves bilinear products between the discretization time step and the system state and controls. The proposed relaxation is closely related to the standard second-order semidefinite relaxation for quadratic constraints, but we carefully select a subset of the possible bilinear terms and apply a change of variables to achieve empirically tight relaxations while keeping the computational load light. We further extend our method to handle piecewise-affine (PWA) systems by formulating the PWA optimal-control problem as a shortest-path problem in a graph of convex sets (GCS). In this GCS, different paths represent different mode sequences for the PWA system, and the convex sets model the relaxed dynamics within each mode. By combining a tight convex relaxation of the GCS problem with our semidefinite relaxation with time scaling, we can solve PWA optimal-control problems through a single semidefinite program.
|
|
15:30-15:35, Paper WeDT16.4 | |
C-Uniform Trajectory Sampling for Fast Motion Planning |
|
Poyrazoglu, Oguzhan Goktug | University of Minnesota |
Cao, Yukang | University of Minnesota |
Isler, Volkan | University of Minnesota |
Keywords: Optimization and Optimal Control, Motion and Path Planning, Collision Avoidance
Abstract: We study the problem of sampling robot trajectories and introduce the notion of C-Uniformity. As opposed to the standard method of uniformly sampling control inputs (which leads to biased samples of the configuration space), C-Uniform trajectories are generated by control actions that lead to uniform sampling of the configuration space. After presenting an intuitive closed-form solution to generate C-Uniform trajectories for the 1D random walker, we present a network-flow-based optimization method to precompute C-Uniform trajectories for general robot systems. We apply the notion of C-Uniformity to the design of Model Predictive Path Integral (MPPI) controllers. Through simulation experiments, we show that using C-Uniform trajectories significantly improves the performance of MPPI-style controllers, achieving up to a 40% coverage performance gain compared to the best baseline. We demonstrate the practical applicability of our method with an implementation on a 1/10th-scale racer.
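As a toy illustration of the notion (not the paper's closed form or network-flow formulation), one can compute, for the 1D random walker, per-state move probabilities that keep the state distribution uniform over the states reachable at each step, in contrast to the binomial concentration produced by uniformly sampling the ±1 actions. The sketch below simply balances probability mass from the lowest state upward.

```python
import numpy as np

def c_uniform_up_probs(k):
    """Probabilities of moving +1 for states s = -k, -k+2, ..., k at step k,
    chosen so that the next-step distribution is uniform over its k+2
    reachable states (assuming the current distribution is already uniform)."""
    states = np.arange(-k, k + 1, 2)
    p_state = 1.0 / (k + 1)      # current mass on each reachable state
    target = 1.0 / (k + 2)       # desired mass on each next-step state
    up = np.zeros(len(states))
    carry = 0.0                  # mass already sent upward by the state below
    for i in range(len(states)):
        down_mass = target - carry       # what state s-1 still needs from s
        up_mass = p_state - down_mass    # the rest of s's mass goes upward
        up[i] = up_mass / p_state
        carry = up_mass
    return states, up

# e.g. at step k=3 the up-probabilities are [0.2, 0.4, 0.6, 0.8]; applying them
# recursively keeps every reachable state equally likely at every step.
```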
|
|
15:35-15:40, Paper WeDT16.5 | |
ADMM-MCBF-LCA: A Layered Control Architecture for Safe Real-Time Navigation |
|
Srikanthan, Anusha | University of Pennsylvania |
Xue, Yifan | University of Pennsylvania |
Kumar, Vijay | University of Pennsylvania |
Matni, Nikolai | University of Pennsylvania |
Figueroa, Nadia | University of Pennsylvania |
Keywords: Optimization and Optimal Control, Integrated Planning and Control
Abstract: We consider the problem of safe real-time navigation of a robot in a dynamic environment with moving obstacles of arbitrary smooth geometries and input saturation constraints. We assume that the robot detects and models nearby obstacle boundaries with a short-range sensor and that this detection is error-free. This problem presents three main challenges: i) input constraints, ii) safety, and iii) real-time computation. To tackle all three challenges, we present a layered control architecture (LCA) consisting of an offline path library generation layer and an online path selection and safety layer. To overcome the limitations of reactive methods, our offline path library consists of feasible controllers, feedback gains, and reference trajectories. To handle the computational burden and safety, we solve online path selection and generate safe inputs at 100 Hz. Through simulations in Gazebo and on Fetch hardware in an indoor environment, we evaluate our approach against baselines that are layered, end-to-end, or reactive. Our experiments demonstrate that, among all algorithms, only our proposed LCA is able to safely complete tasks such as reaching a goal. When comparing metrics such as safety, input error, and success rate, we show that our approach generates safe and feasible inputs throughout the robot's execution.
|
|
15:40-15:45, Paper WeDT16.6 | |
Transformer-Based Model Predictive Control: Trajectory Optimization Via Sequence Modeling |
|
Celestini, Davide | Politecnico Di Torino |
Gammelli, Daniele | Stanford |
Guffanti, Tommaso | Stanford University |
D’Amico, Simone | Stanford University |
Capello, Elisa | Politecnico Di Torino CNR IEIIT |
Pavone, Marco | Stanford University |
Keywords: Optimization and Optimal Control, Deep Learning Methods, Machine Learning for Robot Control
Abstract: Model predictive control (MPC) has established itself as the primary methodology for constrained control, enabling general-purpose robot autonomy in diverse real-world scenarios. However, for most problems of interest, MPC relies on the recursive solution of highly non-convex trajectory optimization problems, leading to high computational complexity and a strong dependency on initialization. In this work, we present a unified framework to combine the main strengths of optimization-based and learning-based methods for MPC. Our approach entails embedding high-capacity, transformer-based neural network models within the optimization process for trajectory generation, whereby the transformer provides a near-optimal initial guess, or target plan, to a non-convex optimization problem. Our experiments, performed in simulation and in the real world on board a free-flyer platform, demonstrate the capabilities of our framework to improve MPC convergence and runtime. Compared to purely optimization-based approaches, results show that our approach can improve trajectory generation performance by up to 75%, reduce the number of solver iterations by up to 45%, and improve overall MPC runtime by 7x without loss in performance.
|
|
15:45-15:50, Paper WeDT16.7 | |
Experimental Validation of Sensitivity-Aware Trajectory Planning for a Redundant Robotic Manipulator under Payload Uncertainty |
|
Srour, Ali | CNRS |
Franchi, Antonio | University of Twente / Sapienza University of Rome |
Robuffo Giordano, Paolo | Irisa Cnrs Umr6074 |
Cognetti, Marco | LAAS-CNRS and Université Toulouse III - Paul Sabatier |
Keywords: Optimization and Optimal Control, Planning under Uncertainty, Manipulation Planning
Abstract: In this paper, we experimentally validate the recent concepts of closed-loop state and input sensitivity in the context of robust manipulation control for a robot manipulator. Our objective is to assess how optimizing trajectories with respect to sensitivity metrics can enhance the closed-loop system’s performance w.r.t. model uncertainties, such as those arising from payload variations during precise manipulation tasks. We conduct a series of experiments to validate our optimization approach across different trajectories, focusing primarily on evaluating the precision of the manipulator’s end-effector at critical moments where high accuracy is essential. Our findings offer valuable insights into improving the closed-loop robustness of the robot’s state and inputs against physical parametric uncertainties that could otherwise degrade the system performance.
|
|
WeDT17 |
405 |
Soft Robotics: Modeling, Control, and Learning |
Regular Session |
Chair: Zhang, Jianwei | University of Hamburg |
Co-Chair: Sun, Ye | University of Virginia |
|
15:15-15:20, Paper WeDT17.1 | |
Composite Learning Neural Network Tracking Control of Articulated Soft Robots |
|
Zou, Zhigang | Sun Yat-Sen University |
Li, Zhiwen | Sun Yat-Set University |
Li, Weibing | Sun Yat-Sen University |
Pan, Yongping | Peng Cheng Laboratory |
Keywords: Model Learning for Control, Compliant Joints and Mechanisms, Neural and Fuzzy Control
Abstract: Controlling articulated soft robots (ASRs) driven by variable stiffness actuators (VSAs) is challenging because they are highly nonlinear and difficult to model accurately. This paper proposes an efficient neural network (NN) learning control solution for ASRs driven by agonistic-antagonistic (AA) VSAs to guarantee tracking performance without exact robot models. Composite learning resorts to memory regressor extension to enhance adaptive parameter estimation such that parameter convergence can be guaranteed without the stringent condition of persistent excitation. In the proposed method, an NN-based controller is constructed for the position tracking of AA-VSA-driven ASRs, and an NN weight update law based on composite learning is developed to enhance online modeling and control capabilities. Experiments are carried out on an ASR with three degrees of freedom and qbmove Advance actuators (a kind of AA-VSA), which validate the effectiveness and superiority of the proposed method in terms of modeling and tracking accuracy compared with existing control methods.
|
|
15:20-15:25, Paper WeDT17.2 | |
Multi-Segment Soft Robot Control Via Deep Koopman-Based Model Predictive Control |
|
Lv, Lei | Tongji University |
Liu, Lei | Tsinghua University |
Bao, Lei | Beijing Soft Robot Tech Co., Ltd |
Sun, Fuchun | Tsinghua University |
Dong, Jiahong | Tsinghua University Affiliated Beijing Tsinghua Changgung Hospit |
Zhang, Jianwei | University of Hamburg |
Shan, Xuemei | Beijing Soft Robot Tech Co., Ltd |
Sun, Kai | Tsinghua University |
Huang, Hao | Beihang University |
Luo, Yu | Tsinghua University |
Keywords: Modeling, Control, and Learning for Soft Robots
Abstract: Soft robots, whose multiple segments of soft material bring flexibility and compliance, have advantages over rigid robots in safe interaction with the environment and dexterous operation. However, their high-dimensional, nonlinear, time-varying dynamics with infinite degrees of freedom make precise and dynamic control, such as trajectory tracking and position reaching, challenging. To address these challenges, we propose a Deep Koopman-based Model Predictive Control (DK-MPC) framework for handling multi-segment soft robots. We first employ a deep learning approach with sampled data to approximate the Koopman operator, which linearizes the high-dimensional nonlinear dynamics of the soft robot into a finite-dimensional linear representation. Second, this linearized model is utilized within a model predictive control framework to compute optimal control inputs that minimize the tracking error between the desired and actual state trajectories. Real-world experiments on the soft robot “Chordata” demonstrate that DK-MPC achieves high-precision control, showing the potential of DK-MPC for future applications to soft robots. More visualization results can be found at https://pinkmoon-io.github.io/DKMPC/.
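A minimal numpy sketch of the lifted-linear MPC idea described above (not the authors' DK-MPC implementation): it assumes a pre-trained lifting function `phi`, identified lifted-space matrices `A` and `B`, and a projection `C` back to the task space (all hypothetical names), and solves an unconstrained finite-horizon tracking problem in batch least-squares form, applying only the first input in receding-horizon fashion.

```python
import numpy as np

def koopman_mpc_step(phi, A, B, C, x, y_ref, horizon=20, r_weight=1e-2):
    """One receding-horizon step: lift the state, predict linearly, and apply
    the first input of the least-squares-optimal input sequence."""
    n, m = A.shape[0], B.shape[1]
    z0 = phi(x)                                            # lifted state
    # Batch prediction: z_k = A^k z0 + sum_{j<k} A^(k-1-j) B u_j for k = 1..horizon.
    F = np.vstack([np.linalg.matrix_power(A, k + 1) for k in range(horizon)])
    G = np.zeros((horizon * n, horizon * m))
    for k in range(horizon):
        for j in range(k + 1):
            G[k * n:(k + 1) * n, j * m:(j + 1) * m] = np.linalg.matrix_power(A, k - j) @ B
    Cs = np.kron(np.eye(horizon), C)                       # project every step to task space
    H, e = Cs @ G, np.tile(y_ref, horizon) - Cs @ F @ z0
    # Regularized least squares over the stacked input sequence.
    U = np.linalg.solve(H.T @ H + r_weight * np.eye(horizon * m), H.T @ e)
    return U[:m]                                           # apply only the first input
```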
|
|
15:25-15:30, Paper WeDT17.3 | |
Physics-Informed Split Koopman Operators for Data-Efficient Soft Robotic Simulation |
|
Ristich, Eron | Arizona State University |
Zhang, Lei | Arizona State University |
Ren, Yi | Arizona State University |
Sun, Jiefeng | Arizona State University |
Keywords: Modeling, Control, and Learning for Soft Robots, Model Learning for Control, Dynamics
Abstract: Koopman operator theory provides a powerful data-driven technique for modeling nonlinear dynamical systems in a linear framework, in comparison to computationally expensive and highly nonlinear physics-based simulations. However, Koopman operator-based models for soft robots are very high dimensional and require considerable amounts of data to properly resolve. Inspired by physics-informed techniques from machine learning, we present a novel physics-informed Koopman operator identification method that improves simulation accuracy for small dataset sizes. Through Strang splitting, the method takes advantage of both continuous and discrete Koopman operator approximation to obtain information both from trajectory and phase space data. The method is validated on a tendon-driven soft robotic arm, showing orders of magnitude improvement over standard methods in terms of the shape error. We envision this method can significantly reduce the data requirement of Koopman operators for systems with partially known physical models, and thus reduce the cost of obtaining data. More info: https://sunrobotics.lab.asu.edu/blog/2024/ristich-icra-2025/
|
|
15:30-15:35, Paper WeDT17.4 | |
Robust Swimming Controller for Soft Robots Via Drop-Out Learning |
|
Monica, Josephine | Cornell University |
Campbell, Mark | Cornell University |
Keywords: Soft Robot Applications, Reinforcement Learning, Robust/Adaptive Control
Abstract: A novel framework is developed for training a robotic fish to learn how to swim, even in the presence of degradations or failures in its actuators. Underwater robots, particularly soft fish-inspired designs, have gained significant attention due to their distinct benefits, including superior maneuverability, energy efficiency, versatile applications, and seamless integration with marine environments. However, their material properties and actuators can degrade, leading to premature system failures. In this paper, we introduce the concept of actuator drop-out during training to enable the robot to learn how to swim even when one or more actuators are degraded or non-functional. A Soft Actor-Critic deep reinforcement learning architecture is used to learn a policy, with actuator degradations/failures introduced during training. A four-actuator koi fish is modeled and simulated using the FishGym environment. Navigation-based validation tests show little degradation with one actuator failure, and much more robust swimming behaviors and performance compared to training with no failures, even when two or three actuators fail. These results will improve long-term operational reliability, ensuring robot fish functionality even in challenging underwater conditions.
|
|
15:35-15:40, Paper WeDT17.5 | |
Optimal Gait Control for a Tendon-Driven Soft Quadruped Robot by Model-Based Reinforcement Learning |
|
Niu, Xuezhi | Uppsala University |
Tan, Kaige | KTH Royal Institute of Technology |
Gurdur Broo, Didem | Uppsala University |
Feng, Lei | KTH Royal Institute of Technology |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Sensors and Actuators, Reinforcement Learning
Abstract: This study presents an innovative approach to optimal gait control for a soft quadruped robot enabled by four compressible tendon-driven soft actuators. Soft quadruped robots, compared to their rigid counterparts, are widely recognized for offering enhanced safety, lower weight, and simpler fabrication and control mechanisms. However, their highly deformable structure introduces nonlinear dynamics, making precise gait locomotion control complex. To solve this problem, we propose a novel model-based reinforcement learning (MBRL) method. The study employs a multi-stage approach, including state space restriction, data-driven surrogate model training, and MBRL development. Compared to benchmark methods, the proposed approach significantly improves the efficiency and performance of gait control policies. The developed policy is both robust and adaptable to the robot's deformable morphology. The study concludes by highlighting the practical applicability of these findings in real-world scenarios.
|
|
15:40-15:45, Paper WeDT17.6 | |
Physics-Guided Deep Learning Enabled Surrogate Modeling for Pneumatic Soft Robots |
|
Beaber, Sameh I. | University of Virginia |
Liu, Zhen | University of Virginia |
Sun, Ye | University of Virginia |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Applications
Abstract: Soft robots, built from soft and compliant materials, have advanced significantly in recent years toward safe and adaptable operation and interaction with dynamic environments. Modeling the complex, nonlinear behaviors and controlling the deformable structures of soft robots present challenges. This study aims to establish a physics-guided deep learning (PGDL) computational framework that integrates physical models into a deep learning framework as surrogate models for soft robots. Once trained, these models can replace computationally expensive numerical simulations to shorten computation time and enable real-time control. This PGDL framework is among the first to integrate first-principles physics of soft robots into deep learning toward highly accurate yet computationally affordable models for soft robot modeling and control. The proposed framework has been implemented and validated using three different pneumatic soft fingers with different behaviors and geometries, along with two training and testing approaches, to demonstrate its effectiveness and generalizability. The results showed that the mean square error (MSE) of the predicted deformed curvature and of the maximum and minimum deformation at various loading conditions was as low as 10^-4 mm^2. The proposed PGDL framework is constructed from first-principles physics and is intrinsically applicable to various conditions by carefully considering the governing equations, auxiliary equations, and the corresponding boundary and initial conditions.
|
|
15:45-15:50, Paper WeDT17.7 | |
Learning-Based Nonlinear Model Predictive Control of Articulated Soft Robots Using Recurrent Neural Networks |
|
Schaefke, Hendrik | Leibniz University Hannover |
Habich, Tim-Lukas | Leibniz University Hannover |
Muhmann, Christian | Leibniz University Hannover |
Ehlers, Simon F. G. | Leibniz University Hannover |
Seel, Thomas | Leibniz Universität Hannover |
Schappler, Moritz | Institute of Mechatronic Systems, Leibniz Universitaet Hannover |
Keywords: Modeling, Control, and Learning for Soft Robots, Machine Learning for Robot Control, Optimization and Optimal Control
Abstract: Soft robots pose difficulties in terms of control, requiring novel strategies to effectively manipulate their compliant structures. Model-based approaches face challenges due to the high dimensionality and nonlinearities such as hysteresis effects. In contrast, learning-based approaches provide nonlinear models of different soft robots based only on measured data. In this paper, recurrent neural networks (RNNs) predict the behavior of an articulated soft robot (ASR) with five degrees of freedom (DoF). RNNs based on gated recurrent units (GRUs) are compared to the more commonly used long short-term memory (LSTM) networks and show better accuracy. The recurrence enables capturing hysteresis effects that are inherent in soft robots due to viscoelasticity or friction but cannot be captured by simple feedforward networks. The data-driven model is used within a nonlinear model predictive control (NMPC) scheme, with a focus on the correct handling of the RNN's hidden states. A training approach is presented that allows measured values to be utilized in each control cycle. This enables accurate predictions over short horizons based on sensor data, which is crucial for closed-loop NMPC. The proposed learning-based NMPC enables trajectory tracking with an average error of 1.2 deg in experiments with the pneumatic five-DoF ASR.
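For reference, a bare-bones GRU one-step dynamics model of the kind described above might look like the PyTorch sketch below; the layer sizes and residual output head are assumptions rather than the paper's architecture, and the returned hidden state is what an NMPC loop would carry across control cycles so that predictions start from the latest measurements.

```python
import torch
import torch.nn as nn

class GRUDynamics(nn.Module):
    """Predict the next robot state from (state, control) sequences."""
    def __init__(self, n_state=5, n_ctrl=5, n_hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_state + n_ctrl, n_hidden, batch_first=True)
        self.head = nn.Linear(n_hidden, n_state)

    def forward(self, states, controls, h0=None):
        x = torch.cat([states, controls], dim=-1)   # (batch, T, n_state + n_ctrl)
        out, h = self.gru(x, h0)
        return states + self.head(out), h           # residual next-state prediction + hidden state
```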
|
|
WeDT18 |
406 |
Surgical Robotics: Steerable Catheters/Needles 1 |
Regular Session |
Co-Chair: Chitalia, Yash | University of Louisville |
|
15:15-15:20, Paper WeDT18.1 | |
Towards a Tendon-Assisted Magnetically Steered (TAMS) Robotic Stylet for Brachytherapy |
|
Kheradmand, Pejman | University of Louisville |
Moradkhani, Behnam | University of Louisville |
Jella, Harshith | University of Louisville |
Sowards, Keith | Department of Radiation Oncology, University of Louisville |
Silva, Scott | University of Louisville |
Chitalia, Yash | University of Louisville |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Medical Robots and Systems, Mechanism Design
Abstract: Interstitial brachytherapy requires up to 20 straight needles to surround and irradiate deep-seated tumors, but may offer sub-optimal radiation dosage in cases of advanced cancers. A steerable stylet can be used to guide the needle within the tissue, improving procedure accuracy and reducing the number of needles required for each operation. This work introduces the design of a novel tendon-assisted magnetically steered (TAMS) robotic stylet to steer commercially available brachytherapy needles. The dual-actuation modality (magnetic and tendon-driven) allows for increased bending compliance while retaining axial rigidity at extremely small diameters (OD: 1.4 mm), key properties for steering hollow needles from within their lumen. We also develop a two-tube Cosserat rod model that estimates the behavior of the TAMS robot and needle assembly under actuation from tendons, external magnetic fields, and finally combined magnet+tendon forces. We validate our model in free space and demonstrate the capability of the TAMS robot and dual-actuation modalities to steer brachytherapy needles to high curvatures inside phantom tissue.
|
|
15:20-15:25, Paper WeDT18.2 | |
VascularPilot3D: Toward a 3D Fully Autonomous Navigation for Endovascular Robotics |
|
Song, Jingwei | University of Michigan |
Yang, Keke | United Imaging |
Chen, Han | Shanghai United Imaging Medical High-Tech Research Institute Co |
Liu, Jiayi | United Imaging |
Gu, Yinan | Shanghai United Imaging Healthcare High Tech Research Institute |
Hui, Qianxin | Shanghai United Imaging Healthcare Advance Technology Research I |
Huang, Yanqi | Shanghai United Imaging Healthcare Co., LTD |
Li, Meng | Shanghai United Imaging Healthcare Co., Ltd |
Zhang, Zheng | 1. the Institute of Medical Imaging Technology, School of Biomed |
Cao, Tuoyu | United Imaging Healthcare |
Ghaffari, Maani | University of Michigan |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Vision-Based Navigation, Computer Vision for Medical Robotics
Abstract: This research reports VascularPilot3D, the first 3D fully autonomous endovascular robot navigation system. As an exploration toward autonomous guidewire navigation, VascularPilot3D is developed as a complete navigation system based on intra-operative imaging systems (fluoroscopic X-ray in this study) and typical endovascular robots. VascularPilot3D adopts previously researched fast 3D-2D vessel registration algorithms and guidewire segmentation methods as its perception modules. We additionally propose three modules: a topology-constrained 2D-3D instrument end-point lifting method, a tree-based fast path planning algorithm, and a prior-free endovascular navigation strategy. VascularPilot3D is compatible with most mainstream endovascular robots. Ex-vivo experiments validate that VascularPilot3D achieves a 100% success rate across 25 trials. It reduces the human surgeon's overall control loops by 18.38%. VascularPilot3D is promising for general clinical autonomous endovascular navigation.
|
|
15:25-15:30, Paper WeDT18.3 | |
Weakly-Supervised Learning Via Multi-Lateral Decoder Branching for Tool Segmentation in Robot-Assisted Cardiovascular Catheterization |
|
Omisore, Olatunji Mumini | Shenzhen Institute of Advanced Technology, Chinese Academy of Sc |
Akinyemi, Toluwanimi | Shenzhen Institute of Advanced Technology |
Nguyen, Anh | University of Liverpool |
Wang, Lei | Shenzhen Institutes of Advanced Technology, Chinese Academy of S |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Object Detection, Segmentation and Categorization, Medical Robots and Systems
Abstract: Robot-assisted catheterization has garnered considerable attention for its potential in treating cardiovascular diseases. However, advancing surgeon-robot collaboration still requires further research, particularly on task-specific automation. For instance, automated tool segmentation can assist surgeons in visualizing and tracking endovascular tools during procedures. While learning-based models have demonstrated state-of-the-art segmentation performance, generating ground-truth labels for fully-supervised methods is labor-intensive, time-consuming, and costly. In this study, we developed a weakly-supervised learning method based on multi-lateral pseudo labeling for tool segmentation in cardiovascular angiogram datasets. The method utilizes a modified U-Net architecture featuring one encoder and multiple laterally branched decoders. The decoders generate diverse pseudo labels under different perturbations to augment the available partial annotation for model training. A mixed loss function with shared consistency was adapted for this purpose. The weakly-supervised model was trained end-to-end and validated using partially annotated angiogram data from three cardiovascular catheterization procedures. Validation results show that the weakly-supervised model performs close to fully-supervised models. Furthermore, the proposed multi-lateral approach outperforms three well-known weakly-supervised learning methods, offering the highest segmentation performance across the three angiogram datasets. Numerous ablation studies confirmed the model’s consistent performance under different settings. Finally, the model was applied to tool segmentation in robot-assisted catheterization experiments. The model enhanced visualization with high connectivity indices for the guidewire and catheter, and a mean segmentation time of 35.26 ms per frame. This study provides a fast, stable, and less expensive method for tool segmentation and visualization in robotic catheterization.
|
|
15:30-15:35, Paper WeDT18.4 | |
Image-Based Compliance Control for Robotic Steering of a Ferromagnetic Guidewire |
|
Hu, An | University of Toronto |
Sun, Chen | University of Toronto |
Dmytriw, Adam | Neurovascular Centre, Divisions of Therapeutic Neuroradiology An |
Xiao, Nan | Beijing Institute of Technology |
Sun, Yu | University of Toronto |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Visual Servoing, Compliance and Impedance Control
Abstract: Robotic steering of magnetic guidewires has shown great potential in accelerating endovascular interventions, enhancing the success rate of time-sensitive surgeries such as stroke treatment. Incomplete state feedback of the guidewire from 2D perspective images and unknown interactions with the surrounding vessel wall raise challenges in modeling and steering control. These two factors, however, are commonly overlooked by existing works. In this paper, 2D perspective images of the guidewire, which comply with prevalent medical imaging modalities, are used as the only feedback. A model-based external force observer is proposed that allows the guidewire to perceive the unknown interactions, and a compliance controller is subsequently designed to handle the external force while steering the guidewire. Experiments conducted in a human-sized phantom demonstrate how the compliance controller preserves stability and safety by adapting to the estimated external force.
|
|
15:35-15:40, Paper WeDT18.5 | |
Towards Evaluating the User Comfort and Experience of a Novel Steerable Drilling Robotic System in Pedicle Screw Fixation Procedures: A User Study |
|
Sharma, Susheela | University of Texas at Austin |
Racz, Frigyes Samuel | The University of Texas at Austin |
Go, Sarah | University of Texas at Austin |
Kapuria, Siddhartha | University of Texas at Austin |
Rezayof, Omid | University of Texas at Austin |
Amadio, Jordan P. | University of Texas Dell Medical School |
Khadem, Mohsen | University of Edinburgh |
Millán, José del R. | The University of Texas at Austin |
Alambeigi, Farshid | University of Texas at Austin |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Medical Robots and Systems, Human-Robot Collaboration
Abstract: Aiming at developing a safe, intuitive, and collaborative steerable drilling robotic system for pedicle screw fixation procedures, in this paper we leverage our recently developed steerable drilling robotic framework and develop a collaborative drilling mode to control this system. In this control mode, a user first positions a concentric tube steerable drilling robot (CT-SDR) in the workspace and aligns it based on a pre-planned trajectory. Next, the CT-SDR is directly controlled by the user through an admittance mode to perform a drilling procedure and create a J-shaped tunnel. To evaluate the user comfort and intuitiveness of the drilling procedure using this system and the proposed control interface, we performed a user study with 11 subjects who had no prior experience with the system. The results of this study were analyzed using various qualitative and quantitative metrics.
|
|
15:40-15:45, Paper WeDT18.6 | |
Minimally Invasive Endotracheal Inside-Out Flexible Needle Driving System towards Microendoscope-Guided Robotic Tracheostomy |
|
Lin, Botao | The Chinese University of Hong Kong |
Yuan, Sishen | The Chinese University of Hong Kong |
Zhang, Tinghua | The Chinese University of Hong Kong |
Zhang, Tao | Chinese University of Hong Kong |
Hao, Ruoyi | The Chinese University of Hong Kong |
Yuan, Wu | The Chinese University of Hong Kong |
Lim, Chwee Ming | National University of Singapore |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles
Abstract: Open tracheostomy (OT) is considered the traditional approach and the gold standard for treating patients with airway obstruction. However, OT has many unavoidable drawbacks, including strict performing scenarios, significant scarring, and the risk of surgeon infection. Percutaneous dilation tracheostomy (PDT) has emerged, with advantages including lower cost, smaller scarring, and better protection of surgeons from infection by aerosols. However, the outside-in puncture manner of PDT carries a risk of piercing the posterior tracheal wall and the esophagus with uncontrolled force. Additionally, locating tracheal rings and determining the puncture site externally can be challenging for certain patients, such as those who are obese or have undergone neck surgery, while this procedure typically relies on palpation and the surgeon's expertise. Hence, to improve the safety and simplicity of tracheostomy, a minimally invasive endotracheal inside-out flexible needle-driving system towards microendoscope-guided robotic tracheostomy (MERT) is proposed in this paper. Guided by an optical coherence tomography (OCT) probe and a microendoscope, the robot is inserted into the trachea and performs an inside-out puncture using a flexible needle. The robot can work through a standard endotracheal tube (ETT), and the puncture direction of the flexible needle is variable. Kinematics and statics models of the flexible needle have been derived, and the minimum position errors generated in the kinematics and statics validation experiments are 0.57 ± 0.21 mm and 0.27 ± 0.21 mm, respectively. Finally, a porcine trachea puncture experiment is carried out, and the feasibility of the proposed system is verified.
|
|
15:45-15:50, Paper WeDT18.7 | |
Comparison of Classical, Neural Network and Hybrid Models for Hysteretic Single-Tendon Catheter Kinematics |
|
Wang, Yuan | Boston Children's Hospital, Harvard Medical School |
Dupont, Pierre | Children's Hospital Boston, Harvard Medical School |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Kinematics, Deep Learning Methods
Abstract: While robotic control of catheter motion can improve tip positioning accuracy, hysteresis arising from tendon friction and flexural deformation degrades kinematic modeling accuracy. In this paper, we compare the capabilities of three types of models for representing the forward and inverse kinematic maps of a clinical single-tendon cardiac catheter. Classical hysteresis models, neural networks and hybrid combinations of the two are included. Our results show that modeling accuracy is best when models are trained using motions corresponding to the anticipated clinical motions. For sinusoidal motions, recurrent neural network models provide the best performance. For point-to-point motions, however, a simple backlash model can provide comparable performance to a recurrent neural network.
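As an illustration of the simplest classical hysteresis model mentioned above, the sketch below implements a standard backlash (play) operator; the dead-band half-width is a hypothetical parameter that would be fit to tendon-displacement and tip-angle data, and this is not the paper's calibrated catheter model.

```python
def backlash(inputs, half_width, y0=0.0):
    """Classical backlash (play) operator: the output stays put until the
    input moves more than `half_width` away from it, then follows with lag."""
    y, out = y0, []
    for x in inputs:
        if x - y > half_width:
            y = x - half_width
        elif y - x > half_width:
            y = x + half_width
        out.append(y)
    return out

# e.g. backlash([0, 1, 2, 1, 0], half_width=0.5) -> [0, 0.5, 1.5, 1.5, 0.5]
```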
|
|
WeDT19 |
407 |
Novel Methods for Mapping and Localization |
Regular Session |
Co-Chair: Kim, Ayoung | Seoul National University |
|
15:15-15:20, Paper WeDT19.1 | |
Fieldscale: Locality-Aware Field-Based Adaptive Rescaling for Thermal Infrared Image |
|
Gil, Hyeonjae | SNU |
Jeon, Myung-Hwan | UIUC |
Kim, Ayoung | Seoul National University |
Keywords: Computer Vision for Transportation, Recognition, Deep Learning for Visual Perception
Abstract: Thermal infrared (TIR) cameras are emerging as promising sensors in safety-related fields due to their robustness against external illumination. However, RAW TIR images have 14 bits of pixel depth and need to be rescaled to 8 bits for general applications. Previous works utilize a global 1D look-up table to compute the pixel-wise gain based solely on intensity, which degrades image quality by failing to consider the local nature of heat. We propose Fieldscale, a rescaling method based on locality-aware 2D fields in which both the intensity value and the spatial context of each pixel within an image are embedded. It can adaptively determine the pixel gain for each region and produce spatially consistent 8-bit rescaled images with minimal information loss and high visibility. Consistent performance improvements on image quality assessment and two other downstream tasks support the effectiveness and usability of Fieldscale. All code is publicly available to facilitate research advances in this field. https://github.com/hyeonjaegil/fieldscale
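For context, the global-rescaling baseline that the paper improves upon can be sketched as below; this is the naive approach, not Fieldscale, and the percentile clipping bounds are illustrative assumptions. Every pixel of the 14-bit frame receives the same gain regardless of local thermal context.

```python
import numpy as np

def global_minmax_rescale(raw14, low_pct=1.0, high_pct=99.0):
    """Map a 14-bit RAW TIR frame to 8 bits with one global gain/offset."""
    lo, hi = np.percentile(raw14, [low_pct, high_pct])   # robust global range
    scaled = (raw14.astype(np.float32) - lo) / max(hi - lo, 1e-6)
    return np.clip(scaled * 255.0, 0.0, 255.0).astype(np.uint8)
```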
|
|
15:20-15:25, Paper WeDT19.2 | |
Evaluating Global Geo-Alignment for Precision Learned Autonomous Vehicle Localization Using Aerial Data |
|
Yang, Yi | Nuro Inc |
Zhao, Xuran | Nuro |
Zhao, Haicheng Charles | Nuro |
Yuan, Shumin | Nuro AI |
Bateman, Samuel | Nuro |
Huang, Tiffany A. | Mercedes-Benz Research & Development North America |
Beall, Chris | Georgia Institute of Technology |
Maddern, Will | Nuro |
Keywords: Localization, Mapping, Intelligent Transportation Systems
Abstract: Recently, there has been growing interest in the use of aerial and satellite map data for autonomous vehicles, primarily due to its potential for significant cost reduction and enhanced scalability. Despite the advantages, aerial data also comes with challenges such as a sensor-modality gap and a viewpoint difference gap. Learned localization methods have shown promise for overcoming these challenges to provide precise metric localization for autonomous vehicles. Most learned localization methods rely on coarsely aligned ground truth, or implicit consistency-based methods, to learn the localization task; however, in this paper we find that improving the alignment between aerial data and autonomous vehicle sensor data at training time is critical to the performance of a learning-based localization system. We compare two data alignment methods using a factor graph framework and, using these methods, we then evaluate the effects of closely aligned ground truth on learned localization accuracy through ablation studies. Finally, we evaluate a learned localization system using the data alignment methods on a comprehensive (1,600 km) autonomous vehicle dataset and demonstrate localization error below 0.3 m and 0.5°, sufficient for autonomous vehicle applications.
|
|
15:25-15:30, Paper WeDT19.3 | |
Under Pressure: Altimeter-Aided ICP for 3D Maps Consistency |
|
Dubois, William | Université Laval |
Samson, Nicolas | Université Laval |
Daum, Effie | Université Laval |
Laconte, Johann | French National Research Institute for Agriculture, Food and The |
Pomerleau, Francois | Université Laval |
Keywords: Localization, Mapping, Field Robots
Abstract: We propose a novel method to enhance the accuracy of the Iterative Closest Point (ICP) algorithm by integrating altitude constraints from a barometric pressure sensor. While ICP is widely used in mobile robotics for Simultaneous Localization and Mapping (SLAM), it is susceptible to drift, especially in underconstrained environments such as vertical shafts. To address this issue, we propose to augment ICP with altimeter measurements, reliably constraining drifts along the gravity vector. To demonstrate the potential of altimetry in SLAM, we offer an analysis of calibration procedures and noise sensitivity of various pressure sensors, improving measurements to centimeter-level accuracy. Leveraging this accuracy, we propose a novel ICP formulation that integrates altitude measurements along the gravity vector, thus simplifying the optimization problem to 3 degrees of freedom (DOF). Experimental results from real-world deployments demonstrate that our method reduces vertical drift by 84% and improves overall localization accuracy compared to state-of-the-art methods in non-planar environments.
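As a reference point for the altitude constraint discussed above, barometric pressure is commonly converted to altitude with the standard-atmosphere formula; the sketch below uses the ISA constants (not the paper's sensor calibration), under which roughly 1 hPa of pressure change corresponds to about 8 m of altitude near sea level.

```python
import numpy as np

def pressure_to_altitude(p_pa, p0_pa=101325.0, t0_k=288.15):
    """International Standard Atmosphere pressure-to-altitude conversion."""
    L = 0.0065      # temperature lapse rate [K/m]
    R = 8.31446     # universal gas constant [J/(mol K)]
    g = 9.80665     # standard gravity [m/s^2]
    M = 0.0289644   # molar mass of dry air [kg/mol]
    return (t0_k / L) * (1.0 - (np.asarray(p_pa, dtype=float) / p0_pa) ** (R * L / (g * M)))
```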
|
|
15:30-15:35, Paper WeDT19.4 | |
Neural Ranging Inertial Odometry |
|
Wang, Si | Zhejiang University |
Shen, Bingqi | Zhejiang University |
Wang, Fei | Beijing Institute of Electronic System Engineering |
Cao, Yanjun | Zhejiang University, Huzhou Institute of Zhejiang University |
Xiong, Rong | Zhejiang University |
Wang, Yue | Zhejiang University |
Keywords: Localization, Range Sensing, Deep Learning Methods
Abstract: Ultra-wideband (UWB) has shown promising potential in GPS-denied localization thanks to its lightweight and drift-free characteristics, while the accuracy is limited in real scenarios due to its sensitivity to sensor arrangement and non-Gaussian pattern induced by multi-path or multi-signal interference, which commonly occurs in many typical applications like long tunnels. We introduce a novel neural fusion framework for ranging inertial odometry which involves a graph attention UWB network and a recurrent neural inertial network. Our graph net learns scene-relevant ranging patterns and adapts to any number of anchors or tags, realizing accurate positioning without calibration. Additionally, the integration of least squares and the incorporation of nominal frame enhance overall performance and scalability. The effectiveness and robustness of our methods are validated through extensive experiments on both public and self-collected datasets, spanning indoor, outdoor, and tunnel environments. The results demonstrate the superiority of our proposed IR-ULSG in handling challenging conditions, including scenarios outside the convex envelope and cases where only a single anchor is available.
|
|
15:35-15:40, Paper WeDT19.5 | |
Robust Preintegrated Wheel Odometry for Off-Road Autonomous Ground Vehicles |
|
Potokar, Easton | Carnegie Mellon Uiversity |
McGann, Daniel | Carnegie Mellon University |
Kaess, Michael | Carnegie Mellon University |
Keywords: Localization, Wheeled Robots, Field Robots
Abstract: Wheel odometry is not often used in state estimation for off-road vehicles due to frequent wheel slippage, varying wheel radii, and the 3D motion of the vehicle not fitting with the 2D nature of integrated wheel odometry. This paper attempts to overcome these issues by proposing a novel 3D preintegration of wheel encoder measurements on manifold. Our method additionally estimates wheel slip, radii, and baseline online to improve accuracy and robustness. Further, due to the preintegration, many measurements can be summarized into a single motion constraint using first-order updates for wheel slippage and intrinsics, allowing for efficient usage in an optimization-based state estimation framework. While our method can be used with any sensors in a factor graph framework, we validate its effectiveness and observability of parameters in a vision-wheel-odometry system (VWO) in a Monte Carlo simulation. Additionally, we illustrate its accuracy and demonstrate it can be used to overcome other sensor failures in real-world off-road scenarios in both a VWO and visual-inertial-wheel odometry (VIWO) system.
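To contrast with the paper's 3D on-manifold preintegration with online slip and radius estimation, plain 2D differential-drive dead reckoning from encoder ticks looks like the sketch below (parameter names are illustrative); this is the formulation whose planar assumption and fixed wheel radii break down off-road.

```python
import numpy as np

def integrate_wheel_odometry(ticks_l, ticks_r, r_l, r_r, baseline, ticks_per_rev):
    """Integrate left/right encoder tick counts into 2D poses (x, y, theta)."""
    x, y, th = 0.0, 0.0, 0.0
    poses = [(x, y, th)]
    for dtl, dtr in zip(np.diff(ticks_l), np.diff(ticks_r)):
        dl = 2.0 * np.pi * r_l * dtl / ticks_per_rev    # left wheel arc length
        dr = 2.0 * np.pi * r_r * dtr / ticks_per_rev    # right wheel arc length
        ds, dth = 0.5 * (dl + dr), (dr - dl) / baseline
        x += ds * np.cos(th + 0.5 * dth)                # midpoint integration
        y += ds * np.sin(th + 0.5 * dth)
        th += dth
        poses.append((x, y, th))
    return poses
```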
|
|
15:40-15:45, Paper WeDT19.6 | |
Air-Ground Collaboration with SPOMP: Semantic Panoramic Online Mapping and Planning (I) |
|
Miller, Ian | Burro |
Cladera, Fernando | University of Pennsylvania |
Smith, Trey | NASA Ames Research Center |
Taylor, Camillo Jose | University of Pennsylvania |
Kumar, Vijay | University of Pennsylvania |
Keywords: Field Robots, Multi-Robot Systems, Aerial Systems: Perception and Autonomy
Abstract: Mapping and navigation have gone hand-in-hand since long before robots existed. Maps are a key form of communication, allowing someone who has never been somewhere to nonetheless navigate that area successfully. In the context of multirobot systems, the maps and information that flow between robots are necessary for effective collaboration, whether those robots are operating concurrently, sequentially, or completely asynchronously. In this article, we argue that maps must go beyond encoding purely geometric or visual information to enable increasingly complex autonomy, particularly between robots. We propose a framework for multirobot autonomy, focusing in particular on air and ground robots operating in outdoor 2.5-D environments. We show that semantic maps can enable the specification, planning, and execution of complex collaborative missions, including localization in Global Positioning System (GPS)-denied settings. A distinguishing characteristic of this work is that we strongly emphasize field experiments and testing, and by doing so demonstrate that these ideas can work at scale in the real world. We also perform extensive simulation experiments to validate our ideas at even larger scales. We believe that these experiments and the experimental results constitute a significant step forward toward advancing the state of the art of large-scale, collaborative multirobot systems operating with real communication, navigation, and perception constraints.
|
|
15:45-15:50, Paper WeDT19.7 | |
Visual-Inertial Localization Leveraging Skylight Polarization Pattern Constraints |
|
Wan, Zhenhua | Guangxi University |
Fu, Peng | Tsinghua University |
Wang, Kunfeng | Tsinghua University |
Zhao, Kaichun | Tsinghua University |
Keywords: Localization, Visual-Inertial SLAM, Sensor Fusion
Abstract: In this letter, we develop a tightly coupled polarization-visual-inertial localization system that utilizes naturally occurring polarized skylight to provide a global heading. We introduce a focal plane polarization camera with negligible instantaneous field-of-view error to collect polarized skylight. Then, we design a robust heading determination method from polarized skylight and construct a stable global heading constraint. In particular, this constraint compensates for the heading unobservability present in standard VINS. In addition to the standard sparse visual feature measurements used in VINS, polarization heading residuals are constructed and co-optimized in a tightly-coupled VINS update. An adaptive fusion strategy is designed to correct the cumulative drift. Outdoor real-world experiments show that the proposed method outperforms state-of-the-art VINS-Fusion in terms of localization accuracy, and improves accuracy by 22% over VINS-Fusion in a wooded campus environment.
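A minimal sketch of the kind of global heading constraint described here: a whitened, angle-wrapped residual between the estimator's yaw and the polarization-derived heading, which could be appended to the visual-inertial residuals. The function name and noise value are assumptions; the paper's tightly coupled formulation and adaptive fusion strategy are not reproduced.

```python
import numpy as np

def heading_residual(yaw_est, yaw_pol, sigma_pol=np.deg2rad(1.0)):
    """Whitened residual between estimated yaw and the polarization heading."""
    # Wrap the angle difference to (-pi, pi] before whitening.
    err = np.arctan2(np.sin(yaw_est - yaw_pol), np.cos(yaw_est - yaw_pol))
    return err / sigma_pol
```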
|
|
WeDT20 |
408 |
Human-Robot Interaction: Physiological Sensing |
Regular Session |
Co-Chair: Lagomarsino, Marta | Istituto Italiano Di Tecnologia |
|
15:15-15:20, Paper WeDT20.1 | |
Promoting Trust in Industrial Human-Robot Collaboration through Preference-Based Optimization |
|
Campagna, Giulio | Aalborg University |
Lagomarsino, Marta | Istituto Italiano Di Tecnologia |
Lorenzini, Marta | Istituto Italiano Di Tecnologia |
Chrysostomou, Dimitrios | Aalborg University |
Rehm, Matthias | Aalborg University |
Ajoudani, Arash | Istituto Italiano Di Tecnologia |
Keywords: Human Factors and Human-in-the-Loop, Acceptability and Trust, Human-Robot Collaboration
Abstract: This paper proposes a novel theoretical framework for promoting trust in human-robot collaboration (HRC). The framework exploits Preference-Based Optimization (PBO) and focuses on three key interaction parameters: robot velocity profile, human-robot separation distance, and vertical proximity to the user’s head. By iteratively refining these parameters based on qualitative feedback from human collaborators, the system dynamically adapts robot trajectories. This personalization aims to enhance users’ confidence in the robot’s actions and foster a more trusting collaborative environment. In our user study with fourteen participants, we simulated a chemical industrial scenario for the HRC task. Results suggest that the framework effectively promotes human operator confidence in the robot assistant, particularly for individuals with limited prior experience in robotics.
|
|
15:20-15:25, Paper WeDT20.2 | |
GazeHTA: End-To-End Gaze Target Detection with Head-Target Association |
|
Lin, Zhi-Yi | Delft University of Technology |
Chew, Jouh Yeong | Honda Research Institute Japan |
van Gemert, Jan C. | TU Delft |
Zhang, Xucong | Delft University of Technology |
Keywords: Intention Recognition, Gesture, Posture and Facial Expressions, Deep Learning for Visual Perception
Abstract: Precisely detecting which object a person is paying attention to is critical for human-robot interaction since it provides important cues for the next action from the human user. We propose an end-to-end approach for gaze target detection: predicting a head-target connection between individuals and the target image regions they are looking at. Most of the existing methods use independent components such as off-the-shelf head detectors or have problems in establishing associations between heads and gaze targets. In contrast, we investigate an end-to-end multi-person Gaze target detection framework with Heads and Targets Association (GazeHTA), which predicts multiple head-target instances based solely on input scene image. GazeHTA addresses challenges in gaze target detection by (1) leveraging a pre-trained diffusion model to extract scene features for rich semantic understanding, (2) re-injecting a head feature to enhance the head priors for improved head understanding, and (3) learning a connection map as the explicit visual associations between heads and gaze targets. Our extensive experimental results demonstrate that GazeHTA outperforms state-of-the-art gaze target detection methods and two adapted diffusion-based baselines on two standard datasets.
|
|
15:25-15:30, Paper WeDT20.3 | |
Gaze and Go: Harnessing Visual Attention Valence in Upper-Limb Robotic Rehabilitation with Tailored Gamification and Eye Tracking for Neuroplasticity |
|
Wang, Daomiao | Fudan University |
He, Peidong | University of Shanghai for Science and Technology |
Wang, Yixi | Shanghai ZD MedTech Co., Ltd |
Jian, Zhuo | Shanghai ZD MEDTECH |
Song, Zilong | Fudan University |
Hu, Qihan | Fudan University |
Fang, Fanfu | Changhai Hospital |
Yang, Cuiwei | Fudan University |
Wang, Daoyu | Fudan University |
Yu, Hongliu | University of Shanghai for Science and Technology |
Keywords: Human Detection and Tracking, Rehabilitation Robotics, Human-Robot Collaboration
Abstract: Therapeutic robotic systems have emerged as reliable tools for physical rehabilitation, providing variable-intensity movement assistance to patients with motor impairments. Robot-assisted rehabilitation facilitates the restoration of mobility and dexterity, promotes functional neuroplasticity, and potentially enables workforce reentry through training-induced cognitive and motor learning. To boost participant engagement and visuomotor coordination, we propose ArmGuider Pro, an advanced upper-limb training system that integrates hand-eye collaboration and gaze-triggered assistance within rehabilitation-tailored serious games. The system implements intuitive eye-tracking and visual-triggering strategies to align therapeutic interventions with participants' intentional focus, incorporating immersive gaming elements and adaptive control algorithms. Experimental validation demonstrates significant activation in motor and cognitive cerebral cortex regions, enhanced visual attention concentration in desired target areas (25.92% improvement), and improved trajectory adherence across sequential sessions (27.27% improvement). By harnessing visual attention valence, our proposed system could encourage neuroplasticity, supporting its viability for clinical application and widespread adoption in rehabilitation regimens.
|
|
15:30-15:35, Paper WeDT20.4 | |
Teleoperating a 6 DoF Robotic Manipulator from Head Movements |
|
Poignant, Alexis | Sorbonne Université, ISIR UMR 7222 CNRS |
Jarrassé, Nathanael | Sorbonne Université, ISIR UMR 7222 CNRS |
Morel, Guillaume | Sorbonne Université, CNRS, INSERM |
Keywords: Telerobotics and Teleoperation, Human Detection and Tracking
Abstract: This article presents an interactive control approach allowing a human user to teleoperate a robotic manipulator located nearby. With this approach, the user keeps his/her hands free, as only head movements are exploited to control the robot. The controller maps the 6 Degrees of Freedom (DoF) user's head position and orientation into the 6 DoF robot end-effector position and orientation. The robot can reach a large workspace thanks to the combination of two features. Firstly, a virtual wand between the user's head and the robot end-effector converts the user's head pan-tilt rotations into large displacements of the robot end-effector center perpendicularly to the wand axis (2 DoF). Secondly, for the remaining 4 DoF (robot end-effector center displacement along the wand axis and robot end-effector orientation), real-time deformation of the virtual wand is triggered when the user reaches uncomfortable configurations due to his/her head workspace limitations. Additionally, the user gets, through an Augmented Reality (AR) Headset, non-delayed visual feedback of the current virtual wand geometry and location. The paper includes a description of the setup and the proposed controller, detailing how the robot position/orientation is coupled to the user's head position/orientation. A set of elementary experiments with a constant-geometry wand is first presented, showing workspace limitations for some DoF. Then the wand reconfiguration is introduced in the experiments, leading to full control of 6 DoF manipulation tasks throughout a large workspace.
|
|
15:35-15:40, Paper WeDT20.5 | |
Wearable Soft Sensing Band with Stretchable Sensors for Torque Estimation and Hand Gesture Recognition |
|
Choi, Junhwan | Korea Advanced Institute of Science and Technology, (KAIST) |
Feng, Jirou | Korea Advanced Institute of Science and Technology |
Kim, Jung | KAIST |
Keywords: Human Detection and Tracking, Intention Recognition, Wearable Robotics
Abstract: This paper presents a wearable soft sensing band with stretchable sensors for monitoring muscle activity by estimating muscle volume changes. Unlike conventional surface electromyography (sEMG) sensing techniques, which require excessive pressure or adhesive electrodes, the proposed sensing method allows muscle volume variations to be detected simply by placing the device on the skin without additional pressure or adhesives. The band was evaluated in isometric-static and isometric-varying torque estimation tasks, demonstrating superior accuracy to sEMG, with a relative torque to maximum torque estimation error of less than 11.5%. In isometric-varying conditions, relative torque was estimated with an average error of 10.1% at frequencies of 0.1 Hz, 0.2 Hz, and 0.5 Hz. Furthermore, the band achieved a classification accuracy of 92.9% in recognizing ten distinct hand gestures, highlighting its capability to differentiate between multiple muscle activations. The lightweight and flexible design addresses limitations of sEMG, such as signal noise, skin irritation, and complex calibration. Experimental results validate the potential of the proposed sensing method for applications in muscle activity monitoring across healthcare, rehabilitation, and sports, and it also offers potential for use in robot teaching for reference motion generation.
|
|
15:40-15:45, Paper WeDT20.6 | |
Plug-And-Play Multi-Domain Fusion Adaptation for Cross-Subject EEG-Based Motor Imagery Classification |
|
Shi, Kecheng | School of Automation Engineering, University of Electronic Science and Technology of China |
Huang, Rui | University of Electronic Science and Technology of China |
Li, Zhe | University of Electronic Science and Technology of China |
Lyu, Jianzhi | University of Hamburg |
Zhao, Yang | University of Electronic Science and Technology of China |
Song, Guangkui | University of Electronic Science and Technology of China |
Cheng, Hong | University of Electronic Science and Technology |
Zhang, Jianwei | University of Hamburg |
Keywords: Brain-Machine Interfaces, Intention Recognition
Abstract: Motor imagery (MI) classification in rehabilitation brain-computer interfaces (RBCIs) faces significant challenges due to the variability of electroencephalography (EEG) signals across subjects. Existing methods typically require extensive EEG data collection from each new subject, which is time-consuming and results in poor user experience. To address this issue, this paper decomposes MI EEG into subject-specific private components and shared components common across all subjects, and proposes a plug-and-play domain fusion adaptive method (PPMDFA) to handle variability between subjects. In the training phase, PPMDFA introduces a Multi-Domain Fusion Graph Convolutional Network (MDFGCN) module to extract shared and private features from the MI processes of source domain subjects. In the calibration phase, the method constructs private classifiers for the new target subject using the extracted shared features combined with a small amount of labeled data. During testing, PPMDFA leverages the similarity of private components to utilize knowledge from source subjects, thereby enhancing classification accuracy for target subjects' MI. We validated the proposed method on the PhysioNet and LLMBCImotion datasets. Experimental results show that PPMDFA achieves state-of-the-art classification accuracy on both datasets, with rapid adaptation to new subjects using only 20% of the data, reaching accuracies of 73.33% and 61.62%, demonstrating strong generalization ability and robustness.
|
|
15:45-15:50, Paper WeDT20.7 | |
Learning to Communicate Functional States with Nonverbal Expressions for Improved Human-Robot Collaboration |
|
Roy, Liam | Monash University |
Croft, Elizabeth | University of Victoria |
Kulic, Dana | Monash University |
Keywords: Human-Robot Collaboration, Multi-Modal Perception for HRI, Social HRI
Abstract: Collaborative robots must effectively communicate their internal state to humans to enable a smooth interaction. Nonverbal communication is widely used to communicate information during human-robot interaction, however, such methods may also be misunderstood, leading to communication errors. In this work, we explore modulating the acoustic parameter values (pitch bend, beats per minute, beats per loop) of nonverbal auditory expressions to convey functional robot states (accomplished, progressing, stuck). We propose a reinforcement learning (RL) algorithm based on noisy human feedback to produce accurately interpreted nonverbal auditory expressions. The proposed approach was evaluated through a user study with 24 participants. The results demonstrate that: (i) Our proposed RL-based approach is able to learn suitable acoustic parameter values which improve the users’ ability to correctly identify the state of the robot. (ii) Algorithm initialization informed by previous user data can be used to significantly speed up the learning process. (iii) The method used for algorithm initialization strongly influences whether participants converge to similar sounds for each robot state. (iv) Modulation of pitch bend has the largest influence on user association between sounds and robotic states.
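The learning problem can be viewed, in simplified form, as a search over discrete acoustic parameter settings under noisy binary feedback. The epsilon-greedy bandit below is only a sketch of that view, not the authors' RL algorithm or its informed initialization; `candidates` would be tuples of (pitch bend, BPM, beats per loop), and `get_feedback` is a hypothetical callback returning 1 when a participant correctly identifies the intended robot state.

```python
import random

def epsilon_greedy_sound_search(candidates, get_feedback, rounds=100, eps=0.2):
    """Search for the acoustic parameter setting most often interpreted correctly."""
    counts = {c: 0 for c in candidates}
    wins = {c: 0 for c in candidates}
    for _ in range(rounds):
        if random.random() < eps:
            c = random.choice(candidates)                 # explore
        else:                                             # exploit best empirical rate
            c = max(candidates,
                    key=lambda k: wins[k] / counts[k] if counts[k] else 0.0)
        counts[c] += 1
        wins[c] += get_feedback(c)   # 1 if the participant identified the state
    return max(candidates, key=lambda k: wins[k] / max(counts[k], 1))
```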
|
|
WeDT21 |
410 |
Vision-Language-Action Models |
Regular Session |
|
15:15-15:20, Paper WeDT21.1 | |
SpatialBot: Precise Spatial Understanding with Vision Language Models |
|
Cai, Wenxiao | Stanford University |
Ponomarenko, Iaroslav | Peking University |
Yuan, Jianhao | University of Oxford |
Li, Xiaoqi | Peking University |
Yang, Wankou | Southeast University |
Dong, Hao | Peking University |
Zhao, Bo | Shanghai Jiao Tong University |
Keywords: RGB-D Perception, Deep Learning in Grasping and Manipulation, AI-Based Methods
Abstract: Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding; however, they still struggle with spatial understanding, which is fundamental to embodied AI. In this paper, we propose SpatialBot, a model designed to enhance spatial understanding by utilizing both RGB and depth images. To train VLMs for depth perception, we introduce the SpatialQA and SpatialQA-E datasets, which include multi-level depth-related questions spanning various scenarios and embodiment tasks. SpatialBench is also developed to comprehensively evaluate VLMs' spatial understanding capabilities across different levels. Extensive experiments on our spatial-understanding benchmark, general VLM benchmarks, and embodied AI tasks demonstrate the remarkable improvements offered by SpatialBot. The model, code, and datasets are available at https://github.com/BAAI-DCAI/SpatialBot.
|
|
15:20-15:25, Paper WeDT21.2 | |
Run-Time Observation Interventions Make Vision-Language-Action Models More Visually Robust |
|
Hancock, Asher | Princeton University |
Ren, Allen Z. | Princeton University |
Majumdar, Anirudha | Princeton University |
Keywords: Deep Learning Methods
Abstract: Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies. However, despite their large-scale training, VLAs are often brittle to task-irrelevant visual details such as distractor objects or background colors. We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that (1) dynamically identifies regions of the input image that the model is sensitive to, and (2) minimally alters task-irrelevant regions to reduce the model’s sensitivity using automated image editing tools. Our approach is compatible with any off-the-shelf VLA without model finetuning or access to the model’s weights. Hardware experiments on language-instructed manipulation tasks demonstrate that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of distractor objects and backgrounds, which otherwise degrade task success rates by up to 60%.
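A minimal sketch of the "identify sensitive regions" half of such a run-time intervention: occlude each image patch and measure how much the policy output moves. BYOVLA's actual sensitivity probe and the subsequent generative editing of task-irrelevant regions are not shown; `policy_fn` is a hypothetical black-box VLA returning an action vector as a NumPy array.

```python
import numpy as np

def sensitivity_map(image, policy_fn, patch=32):
    """Occlusion-style sensitivity of a policy to each image patch."""
    base = policy_fn(image)
    h, w = image.shape[:2]
    heat = np.zeros((h // patch, w // patch))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            masked = image.copy()
            ys = slice(i * patch, (i + 1) * patch)
            xs = slice(j * patch, (j + 1) * patch)
            masked[ys, xs] = image[ys, xs].mean(axis=(0, 1))   # flatten the patch
            heat[i, j] = np.linalg.norm(policy_fn(masked) - base)
    return heat   # high values = regions the model is sensitive to
```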
|
|
15:25-15:30, Paper WeDT21.3 | |
KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data |
|
Tang, Grace | University of California, Berkeley |
Rajkumar, Swetha | University of California, Berkeley |
Zhou, Yifei | University of California, Berkeley |
Walke, Homer | UC Berkeley |
Levine, Sergey | UC Berkeley |
Fang, Kuan | Cornell University |
Keywords: Deep Learning Methods, Big Data in Robotics and Automation, Deep Learning in Grasping and Manipulation
Abstract: Building generalist robotic systems involves effectively endowing robots with the capabilities to handle novel objects in an open-world setting. Inspired by the advances of large pre-trained models, we propose Keypoint Affordance Learning from Imagined Environments (KALIE), which adapts pre-trained Vision Language Models (VLMs) for robotic control in a scalable manner. Instead of directly producing motor commands, KALIE controls the robot by predicting point-based affordance representations based on natural language instructions and visual observations of the scene. The VLM is trained on 2D images with affordances labeled by humans, bypassing the need for training data collected on robotic systems. Through an affordance-aware data synthesis pipeline, KALIE automatically creates massive high-quality training data based on limited example data manually collected by humans. We demonstrate that KALIE can learn to robustly solve new manipulation tasks with unseen objects given only 50 example data points. Compared to baselines using pre-trained VLMs, our approach consistently achieves superior performance.
|
|
15:30-15:35, Paper WeDT21.4 | |
GHIL-Glue: Hierarchical Control with Filtered Subgoal Images |
|
Hatch, Kyle Beltran | Toyota Research Institute |
Balakrishna, Ashwin | Toyota Research Institute |
Mees, Oier | University of California, Berkeley |
Nair, Suraj | Stanford University |
Park, Seohong | Seohong@berkeley.edu |
Wulfe, Blake | Stanford University |
Itkina, Masha | Stanford University |
Eysenbach, Benjamin | CMU |
Levine, Sergey | UC Berkeley |
Kollar, Thomas | Toyota Research Institute |
Burchfiel, Benjamin | Toyota Research Institute |
Keywords: Machine Learning for Robot Control, Deep Learning Methods, Imitation Learning
Abstract: Image and video generative models that are pre-trained on Internet-scale data can greatly increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate subgoals for low-level goal-conditioned policies to reach. However, the performance of these systems can be greatly bottlenecked by the interface between generative models and low-level controllers. For example, generative models may predict photo-realistic yet physically infeasible frames that confuse low-level policies. Low-level policies may also be sensitive to subtle visual artifacts in generated goal images. This paper addresses these two facets of generalization, providing an interface to effectively “glue together” language-conditioned image or video prediction models with low-level goal-conditioned policies. Our method, Generative Hierarchical Imitation Learning-Glue (GHIL-Glue), filters out subgoals that do not lead to task progress and improves the robustness of goal-conditioned policies to generated subgoals with harmful visual artifacts. We find in extensive experiments in both simulated and real environments that GHIL-Glue achieves a 25% improvement across several hierarchical models that leverage generative subgoals, achieving a new state-of-the-art on the CALVIN simulation benchmark for policies using observations from a single RGB camera. GHIL-Glue also outperforms other generalist robot policies across 3/4 language-conditioned manipulation tasks testing zero-shot generalization in physical experiments.
|
|
15:35-15:40, Paper WeDT21.5 | |
Simultaneous Localization and Affordance Prediction of Tasks from Egocentric Video |
|
Chavis, Zachary | University of Minnesota |
Park, Hyun Soo | Carnegie Mellon University |
Guy, Stephen J. | University of Minnesota - Twin Cities |
Keywords: Vision-Based Navigation, Deep Learning for Visual Perception, Motion and Path Planning
Abstract: Vision-Language Models (VLMs) have shown great success as foundational models for downstream vision and natural language applications in a variety of domains. However, these models are limited to reasoning over objects and actions currently visible on the image plane. We present a spatial extension to the VLM, which leverages spatially-localized egocentric video demonstrations to augment VLMs in two ways --- through understanding spatial task-affordances, i.e. where an agent must be for the task to physically take place, and the localization of that task relative to the egocentric viewer. We show our approach outperforms the baseline of using a VLM to map similarity of a task's description over a set of location-tagged images. Our approach has less error both on predicting where a task may take place and on predicting what tasks are likely to happen at the current location. The resulting representation will enable robots to use egocentric sensing to navigate to, or around, physical regions of interest for novel tasks specified in natural language.
|
|
15:40-15:45, Paper WeDT21.6 | |
QUART-Online: Latency-Free Multimodal Large Language Model for Quadruped Robot Learning |
|
Tong, Xinyang | Westlake University |
Ding, Pengxiang | Westlake University |
Fan, Yiguo | Westlake University |
Wang, Donglin | Westlake University |
Zhang, Wenjie | Westlake University |
Cui, Can | Westlake University |
Sun, Mingyang | Westlake University |
Zhao, Han | Westlake University |
Zhang, Hongyin | Westlake University |
Dang, Yonghao | Beijing University of Posts and Telecommunications |
Huang, Siteng | Westlake University |
Lyu, Shangke | Westlake University |
Keywords: Perception-Action Coupling, Vision-Based Navigation, Imitation Learning
Abstract: This paper addresses the inherent inference latency challenges associated with deploying multimodal large language models (MLLM) in quadruped vision-language-action (QUAR-VLA) tasks. Our investigation reveals that conventional parameter reduction techniques ultimately impair the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this purpose. We introduce a novel latency-free quadruped MLLM model, dubbed QUART-Online, designed to enhance inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. Subsequently, we fine-tune the MLLM to integrate vision, language, and compressed actions into a unified semantic space. Experimental results demonstrate that QUART-Online operates in tandem with the existing MLLM system, achieving real-time inference at 50Hz in sync with the underlying controller frequency, significantly boosting the success rate across various tasks by 65%. Our project page is https://quart-online.github.io.
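A hedged sketch of what chunk-level action discretization could look like: cluster flattened action chunks into a small codebook, then encode and decode chunks as discrete tokens for the language model. The clustering choice (k-means), codebook size, and function names are assumptions for illustration, not the paper's exact ACD procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_action_codebook(action_chunks, num_codes=256):
    """Cluster (num_chunks, chunk_len, action_dim) chunks into representative vectors."""
    flat = action_chunks.reshape(len(action_chunks), -1)
    return KMeans(n_clusters=num_codes, n_init=10, random_state=0).fit(flat)

def encode(km, chunk):
    """Map a continuous action chunk to its discrete token."""
    return int(km.predict(chunk.reshape(1, -1))[0])

def decode(km, token, chunk_len, action_dim):
    """Recover the representative continuous chunk for a token."""
    return km.cluster_centers_[token].reshape(chunk_len, action_dim)
```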
|
|
15:45-15:50, Paper WeDT21.7 | |
IntelliRMS: A Robotic Manipulation System for Domain-Specific Tasks Using Vision and Language Foundational Models |
|
Singh, Chandan Kumar | Tata Consultancy Services |
Kumar, Devesh | Tata Consultancy Services Limited |
Sanap, Vipul | TCS |
Khandelwal, Mayank | Tata Consultancy Services Limited |
Sinha, Rajesh | TCS-Noida |
Keywords: Software Architecture for Robotic and Automation, Software-Hardware Integration for Robot Systems, AI-Enabled Robotics
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced machines’ ability to understand and follow human instructions. In many tasks, LLMs have demonstrated performance that rivals human-level common sense. However, directly applying LLMs to domain-specific use cases, such as robotic pick-and-place, remains a challenge. Tasks that are intuitive for humans, who rely on prior knowledge and skills, become complex for robots. Industrial robotic applications like pick-and-place require a high degree of accuracy, often exceeding 90%. In response to these challenges in domain-specific applications, we propose IntelliRMS, a novel system-oriented architecture for instruction-following robotic manipulation. The IntelliRMS synergizes the linguistic and open-vocabulary visual capabilities of foundational models to arrive at an accurate, robust and scalable system. Further, we demonstrate the effectiveness of IntelliRMS in a real-world industrial Bin-picking scenario within the retail sector, validating its performance with a comprehensive dataset.
|
|
WeDT22 |
411 |
Deep Learning for Visual Perception 2 |
Regular Session |
Chair: Ding, Mingyu | University of North Carolina at Chapel Hill |
|
15:15-15:20, Paper WeDT22.1 | |
SCA3D: Enhancing Cross-Modal 3D Retrieval Via 3D Shape and Caption Paired Data Augmentation |
|
Ren, Junlong | The Hong Kong University of Science and Technology (Guangzhou) |
Wu, Hao | HKUST |
Xiong, Hui | HKUST(GZ) |
Wang, Hao | HKUST(GZ) |
Keywords: Deep Learning for Visual Perception, Visual Learning, Recognition
Abstract: The cross-modal 3D retrieval task aims to achieve mutual matching between text descriptions and 3D shapes. This has the potential to enhance the interaction between natural language and the 3D environment, especially within the realms of robotics and embodied artificial intelligence (AI) applications. However, the scarcity and expensiveness of 3D data constrain the performance of existing cross-modal 3D retrieval methods. These methods heavily rely on features derived from the limited number of 3D shapes, resulting in poor generalization ability across diverse scenarios. To address this challenge, we introduce SCA3D, a novel 3D shape and caption online data augmentation method for cross-modal 3D retrieval. Our approach uses the LLaVA model to create a component library, captioning each segmented part of every 3D shape within the dataset. Notably, it facilitates the generation of extensive new 3D-text pairs containing new semantic features. We employ both inter and intra distances to align various components into a new 3D shape, ensuring that the components do not overlap and are closely fitted. Further, text templates are utilized to process the captions of each component and generate new text descriptions. Besides, we use unimodal encoders to extract embeddings for 3D shapes and texts based on the enriched dataset. We then calculate fine-grained cross-modal similarity using Earth Mover’s Distance (EMD) and enhance cross-modal matching with contrastive learning, enabling bidirectional retrieval between texts and 3D shapes. Extensive experiments show our SCA3D outperforms previous works on the Text2Shape dataset, raising the Shape-to-Text RR@1 score from 20.03 to 27.22 and the Text-to-Shape RR@1 score from 13.12 to 16.67. Codes can be found in https://github.com/3DAgentWorld/SCA3D.
|
|
15:20-15:25, Paper WeDT22.2 | |
TrajSSL: Trajectory-Enhanced Semi-Supervised 3D Object Detection |
|
Jacobson, Philip | University of California, Berkeley |
Xie, Yichen | University of California, Berkeley |
Ding, Mingyu | UC Berkeley |
Xu, Chenfeng | University of California, Berkeley |
Tomizuka, Masayoshi | University of California |
Zhan, Wei | University of California, Berkeley |
Wu, Ming | University of California, Berkeley |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization, AI-Based Methods
Abstract: Semi-supervised 3D object detection is a common strategy employed to circumvent the challenge of manually labeling large-scale autonomous driving perception datasets. Pseudo-labeling approaches to semi-supervised learning adopt a teacher-student framework in which machine-generated pseudo-labels on a large unlabeled dataset are used in combination with a small manually-labeled dataset for training. In this work, we address the problem of improving pseudo-label quality through leveraging long-term temporal information captured in driving scenes. More specifically, we leverage pre-trained motion-forecasting models to generate object trajectories on pseudo-labeled data to further enhance the student model training. Our approach improves pseudo-label quality in two distinct manners: first, we suppress false positive pseudo-labels through establishing consistency across multiple frames of motion forecasting outputs. Second, we compensate for false negative detections by directly inserting predicted object tracks into the pseudo-labeled scene. Experiments on the nuScenes dataset demonstrate the effectiveness of our approach, improving the performance of standard semi-supervised approaches in a variety of settings.
|
|
15:25-15:30, Paper WeDT22.3 | |
Single-Shot Metric Depth from Focused Plenoptic Cameras |
|
Lasheras-Hernandez, Blanca | German Aerospace Center (DLR) |
Strobl, Klaus H. | German Aerospace Center (DLR) |
Izquierdo, Sergio | University of Zaragoza |
Bodenmueller, Tim | German Aerospace Center (DLR) |
Triebel, Rudolph | German Aerospace Center (DLR) |
Civera, Javier | Universidad De Zaragoza |
Keywords: Deep Learning for Visual Perception, Data Sets for Robotic Vision
Abstract: Metric depth estimation from visual sensors is crucial for robots to perceive, navigate, and interact with their environment. Traditional range imaging setups, such as stereo or structured light cameras, face hassles including calibration, occlusions, and hardware demands, with accuracy limited by the baseline between cameras. Single- and multiview monocular depth offers a more compact alternative, but is constrained by the unobservability of the metric scale. Light field imaging provides a promising solution for estimating metric depth by using a unique lens configuration through a single device. However, its application to single-view dense metric depth is under-addressed mainly due to the technology’s high cost, the lack of public benchmarks, and proprietary geometrical models and software. Our work explores the potential of focused plenoptic cameras for dense metric depth. We propose a novel pipeline that predicts metric depth from a single plenoptic camera shot by first generating a sparse metric point cloud using a neural network, which is then used to scale and align a dense relative depth map regressed by a foundation depth model, resulting in a dense metric depth. To validate it, we curated the Light Field & Stereo Image Dataset (LFS) of real-world light field images with stereo depth labels, filling a current gap in existing resources. Experimental results show that our pipeline produces accurate metric depth predictions, laying a solid groundwork for future research in this field.
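The "scale and align" step lends itself to a compact illustration: fit a scale and shift by least squares so the relative depth map agrees with the sparse metric points, then apply the fit densely. This is a simplified linear version under assumed names; the actual pipeline may fit robustly or operate in inverse-depth space.

```python
import numpy as np

def align_relative_depth(rel_depth, sparse_uv, sparse_z):
    """Fit s, b so that s * rel_depth + b matches sparse metric measurements.

    rel_depth : (H, W) relative depth from a foundation model.
    sparse_uv : (N, 2) integer pixel coordinates (u, v) of the metric points.
    sparse_z  : (N,) metric depths at those pixels.
    """
    d = rel_depth[sparse_uv[:, 1], sparse_uv[:, 0]]      # sampled relative depths
    A = np.stack([d, np.ones_like(d)], axis=1)           # least-squares design matrix
    (s, b), *_ = np.linalg.lstsq(A, sparse_z, rcond=None)
    return s * rel_depth + b                             # dense metric depth
```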
|
|
15:30-15:35, Paper WeDT22.4 | |
TREND: Tri-Teaching for Robust Preference-Based Reinforcement Learning with Demonstrations |
|
Huang, Shuaiyi | University of Maryland, College Park |
Levy, Mara | University of Maryland, College Park |
Gupta, Anubhav | University of Maryland, College Park |
Ekpo, Daniel | University of Maryland, College Park |
Zheng, Ruijie | University of Maryland, College Park |
Shrivastava, Abhinav | University of Maryland, College Park |
Keywords: Deep Learning for Visual Perception, Deep Learning Methods
Abstract: Preference feedback collected by human or VLM annotators is often noisy, presenting a significant challenge for preference-based reinforcement learning that relies on accurate preference labels. To address this challenge, we propose TREND, a novel framework that integrates few-shot expert demonstrations with a tri-teaching strategy for effective noise mitigation. Our method trains three reward models simultaneously, where each model views its small-loss preference pairs as useful knowledge and teaches such useful pairs to its peer network for updating the parameters. Remarkably, our approach requires as few as one to three expert demonstrations to achieve high performance. We evaluate TREND on various robotic manipulation tasks, achieving up to 90% success rates even with noise levels as high as 40%, highlighting its effective robustness in handling noisy preference feedback.
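The small-loss selection at the heart of tri-teaching can be sketched in a few lines: each reward model keeps the fraction of preference pairs with the lowest loss and hands them to a peer model for its update. The fixed keep-ratio below is an assumption; co-teaching-style methods typically anneal it over training.

```python
import numpy as np

def small_loss_selection(losses, noise_rate=0.4):
    """Indices of the preference pairs a reward model trusts (smallest losses).

    losses     : (N,) per-pair losses under one of the three reward models.
    noise_rate : assumed fraction of noisy labels; keep the remaining 1 - noise_rate.
    """
    keep = int(round(len(losses) * (1.0 - noise_rate)))
    return np.argsort(losses)[:keep]   # pass these pairs to a peer model's update
```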
|
|
15:35-15:40, Paper WeDT22.5 | |
SYNERGUARD: A Robust Framework for Point Cloud Classification Via Local Geometry and Spatial Topology |
|
Zhong, Haonan | The University of New South Wales |
Song, Wei | UNSW |
Pagnucco, Maurice | University of New South Wales |
Song, Yang | University of New South Wales |
Keywords: Deep Learning for Visual Perception, Recognition, Acceptability and Trust
Abstract: Point cloud recognition models are known to be vulnerable to adversarial attacks. The state-of-the-art defense solutions either focus on partial features of the point cloud, limiting their effectiveness, or rely heavily on known adversarial examples, reducing their generalizability, while others, like point cloud reconstruction, will degrade the classifier’s accuracy on clean examples. To address this, we introduce SYNERGUARD, a novel robust point cloud classification framework mitigating adversarial attacks by considering comprehensive geometric and topological attributes of the point cloud, without relying on known adversarial examples, while maintaining classification accuracy on clean examples. We comprehensively test SYNERGUARD against seven attack types from three leading adversarial attack approaches on two widely used datasets, ModelNet40 and ShapeNetPart. The results demonstrate SYNERGUARD’s superiority over existing defenses in mitigating adversarial attacks as well as in handling clean examples.
|
|
15:40-15:45, Paper WeDT22.6 | |
Is Discretization Fusion All You Need for Collaborative Perception? |
|
Yang, Kang | Renmin University of China |
Bu, Tianci | National University of Defense and Technology |
Li, Lantao | Sony (China) Limited |
Li, Chunxu | School of Information, Renmin University of China |
Wang, Yongcai | Renmin University of China |
Li, Deying | Renmin University of China |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Intelligent Transportation Systems
Abstract: Collaborative perception in multi-agent systems enhances overall perceptual capabilities by facilitating the exchange of complementary information among agents. Current mainstream collaborative perception methods rely on discretized feature maps to conduct fusion, which, however, lacks flexibility in extracting and transmitting informative features and can hardly focus on those features during fusion. To address these problems, this paper proposes a novel Anchor-Centric paradigm for Collaborative Object detection (ACCO). It avoids grid precision issues and allows more flexible and efficient anchor-centric communication and fusion. ACCO is composed of three main components: (1) an Anchor Featuring Block (AFB) that generates anchor proposals and projects prepared anchor queries onto image features; (2) an Anchor Confidence Generator (ACG) designed to minimize communication by transmitting only the features of confident anchors; and (3) a local-global fusion module, in which local fusion is anchor alignment-based fusion (LAAF) and global fusion is conducted by spatial-aware cross-attention (SACA). LAAF and SACA run over multiple layers, so agents conduct anchor-centric fusion iteratively to adjust the anchor proposals. Comprehensive experiments evaluate ACCO on the OPV2V and DAIR-V2X datasets and demonstrate ACCO's superiority in reducing communication volume and in improving perception range and detection performance. We will make our code available.
|
|
15:45-15:50, Paper WeDT22.7 | |
Tri-AutoAug: Single Domain Generalization for Bird's-Eye-View 3D Object Detection through Pixel-2D-3D Features |
|
Zhao, Xue | SJTU |
Peng, Pai | Cowarobot |
Li, Xianfei | Cowarobot |
Wang, Xinbing | Shanghai Jiao Tong University |
Zhou, Chenghu | Shanghai Jiao Tong University |
Ye, Nanyang | Shanghai Jiao Tong University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Object Detection, Segmentation and Categorization
Abstract: With the increasing popularity of autonomous driving based on the Bird’s-Eye-View (BEV) representation, improving the generalization of such detection models is key for safe real-world applications. However, a realistic yet challenging scenario, Single Domain Generalization (SDG) for BEV, is still under-explored. A key ingredient for SDG is to first increase data diversity via common image augmentation or adversarial data generation. However, common image-level augmentation is not sufficient to ensure domain diversity in most parts of the latent space, and adversarial generation suffers from unstable training and mode collapse. To address these limitations, we present Tri-level Automatic Augmentation (Tri-AutoAug), a simple yet effective method to enlarge the diversity and quantity of data from image and 2D features and to facilitate learning more domain-invariant features in BEV space. Besides, Tri-AutoAug can automatically learn augmentation strategies to avoid spending too much time manually adjusting hyperparameters and to maximize the benefit of Tri-level Augmentation. To the best of our knowledge, this is the first study to explore automatic augmentation for SDG BEV. Extensive experiments on NuScenes-C including eight testing domains have demonstrated that our approach achieves the best performance across various domain generalization methods. More importantly, we evaluate the proposed method in real-world autonomous driving scenarios. Tri-AutoAug improves out-of-distribution (OOD) performance by 8.54% (mAP), which demonstrates that Tri-AutoAug provides a practical and feasible solution for the application of 3D detectors in the real world. The code is available at https://github.com/ClaireTun/Tri-AutoAug.
|
|
WeDT23 |
412 |
Learning Based Planning and Control |
Regular Session |
Chair: Faigl, Jan | Czech Technical University in Prague |
Co-Chair: Zarrouk, David | Ben Gurion University |
|
15:15-15:20, Paper WeDT23.1 | |
Motion Planning for Minimally-Actuated Serial Robots |
|
Cohen, Avi | Ben Gurion University of the Negev |
Sintov, Avishai | Tel-Aviv University |
Zarrouk, David | Ben Gurion University |
Keywords: Integrated Planning and Learning, Redundant Robots, Kinematics
Abstract: Modern manipulators are acclaimed for their precision but often struggle to operate in confined spaces. This limitation has driven the development of hyper-redundant and continuum robots. While these present unique advantages, they face challenges in, for instance, weight, mechanical complexity, modeling and costs. The Minimally Actuated Serial Robot (MASR) has been proposed as a light-weight, low-cost and simpler alternative where passive joints are actuated with a Mobile Actuator (MA) moving along the arm. Yet, Inverse Kinematics (IK) and a general motion planning algorithm for the MASR have not been addressed. In this letter, we propose the MASR-RRT* motion planning algorithm specifically developed for the unique kinematics of the MASR. The main component of the algorithm is a data-based model for solving the IK problem while considering minimal traverse of the MA. The model is trained solely using the forward kinematics of the MASR and does not require real data. With the model as a local-connection mechanism, MASR-RRT* minimizes a cost function expressing the action time. In a comprehensive analysis, we show that MASR-RRT* is superior in performance to the straightforward implementation of the standard RRT*. Experiments on a real robot in different environments with obstacles validate the proposed algorithm.
|
|
15:20-15:25, Paper WeDT23.2 | |
Using Implicit Behavior Cloning and Dynamic Movement Primitive to Facilitate Reinforcement Learning for Robot Motion Planning |
|
Zhang, Zengjie | Eindhoven University of Technology |
Hong, Jayden | Uvic ACIS Lab |
Soufi Enayati, Amir Mehdi | University of Victoria |
Najjaran, Homayoun | University of Victoria |
Keywords: Efficient Reinforcement Learning, Motion and Path Planning, Learning and Adaptive Systems, Learning from Demonstration
Abstract: Reinforcement learning (RL) for motion planning of multi-degree-of-freedom robots still suffers from low efficiency in terms of slow training speed and poor generalizability. In this paper, we propose a novel RL-based robot motion planning framework that uses implicit behavior cloning (IBC) and dynamic movement primitive (DMP) to improve the training speed and generalizability of an off-policy RL agent. IBC utilizes human demonstration data to leverage the training speed of RL, and DMP serves as a heuristic model that transfers motion planning into a simpler planning space. To support this, we also create a human demonstration dataset using a pick-and-place experiment that can be used for similar studies. Comparison studies reveal the advantage of the proposed method over the conventional RL agents with faster training speed and higher scores. A real-robot experiment indicates the applicability of the proposed method to a simple assembly task. Our work provides a novel perspective on using motion primitives and human demonstration to leverage the performance of RL for robot applications.
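For readers unfamiliar with the DMP component, the sketch below is a textbook single-DoF discrete DMP (Ijspeert-style), fit to one demonstration and rolled out at an arbitrary duration. It covers only the heuristic-model piece, not the IBC or off-policy RL parts of the framework, and the gains and basis layout are common defaults rather than the paper's settings.

```python
import numpy as np

class DMP1D:
    """Minimal single-DoF discrete Dynamic Movement Primitive."""

    def __init__(self, n_basis=20, alpha=25.0, beta=6.25, alpha_x=3.0):
        self.alpha, self.beta, self.alpha_x = alpha, beta, alpha_x
        self.c = np.exp(-alpha_x * np.linspace(0.0, 1.0, n_basis))   # basis centres
        self.h = 1.0 / (np.diff(self.c, append=self.c[-1] * 0.5) ** 2)
        self.w = np.zeros(n_basis)

    def fit(self, y, dt):
        """Learn the forcing-term weights from one demonstration y sampled at dt."""
        tau = (len(y) - 1) * dt
        self.y0, self.g = y[0], y[-1]
        yd = np.gradient(y, dt)
        ydd = np.gradient(yd, dt)
        x = np.exp(-self.alpha_x * np.arange(len(y)) * dt / tau)     # canonical system
        f_target = tau**2 * ydd - self.alpha * (self.beta * (self.g - y) - tau * yd)
        psi = np.exp(-self.h * (x[:, None] - self.c) ** 2)           # (T, n_basis)
        xi = x * (self.g - self.y0)
        self.w = (psi * xi[:, None] * f_target[:, None]).sum(0) / \
                 ((psi * xi[:, None] ** 2).sum(0) + 1e-10)           # local regression

    def rollout(self, tau, dt):
        """Reproduce the motion, possibly with a different duration tau."""
        y, v, x, out = self.y0, 0.0, 1.0, []
        for _ in range(int(tau / dt)):
            psi = np.exp(-self.h * (x - self.c) ** 2)
            f = (psi @ self.w) / (psi.sum() + 1e-10) * x * (self.g - self.y0)
            vd = (self.alpha * (self.beta * (self.g - y) - v) + f) / tau
            y, v = y + v * dt / tau, v + vd * dt
            x += -self.alpha_x * x * dt / tau
            out.append(y)
        return np.array(out)
```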
|
|
15:25-15:30, Paper WeDT23.3 | |
Interpretable Active Inference Gait Control Learning |
|
Szadkowski, Rudolf | Czech Technical University in Prague |
Faigl, Jan | Czech Technical University in Prague |
Keywords: Bioinspired Robot Learning, Probabilistic Inference, Learning from Experience
Abstract: Sustaining gait locomotion in an adversarial environment requires the robot to react adaptively to novel experiences. In the Free Energy Principle (FEP), the behavioral reaction is driven by the discrepancy between observation and prediction. Although predicting gait dynamics for legged robots is challenging, since the consequences of actions depend non-linearly on the activity history, animal gaits are robust, adapting to severe motion disruptions seemingly instantly. In biomimetic robotics, the Central Pattern Generator (CPG) relaxes the general dynamics of body-environment interaction to the stable and repetitive dynamics of gait. Based on these observations, we propose self-learning of the gait dynamics model and an FEP framework that infers state estimates and gait control. The proposed method is experimentally evaluated on a real hexapod walking robot with 18 controllable degrees of freedom. The robot learns the gait dynamics model indoors and then deploys it in outdoor navigation under various adversarial scenarios. Results show that the developed interpretable gait controller exhibits complex, real-time adaptive behavior when it encounters unknown situations.
|
|
15:30-15:35, Paper WeDT23.4 | |
DOPT: D-Learning with Off-Policy Target Toward Sample Efficiency and Fast Convergence Control |
|
Shen, Zhaolong | Beihang University |
Quan, Quan | Beihang University |
Keywords: Machine Learning for Robot Control, Deep Learning Methods, Learning Categories and Concepts
Abstract: In recent times, Lyapunov theory has been incorporated into learning-based control methods to provide a stability guarantee. However, merely satisfying the Lyapunov conditions does not fully leverage the capabilities of the Neural Network (NN) controller. Furthermore, training an effective Lyapunov candidate requires substantial data, which inherently results in sample inefficiency. To address these limitations, we propose an off-policy variant of the vanilla D-learning method that uses current and historical data to iteratively enhance the NN controller within the framework of Lyapunov theory. Our method outperforms the Deep Deterministic Policy Gradient (DDPG) and D-learning in terms of stability, sample efficiency, and the quality of the trained controllers and Lyapunov candidates.
|
|
15:35-15:40, Paper WeDT23.5 | |
DFM: Deep Fourier Mimic for Expressive Dance Motion Learning |
|
Watanabe, Ryo | SONY Group |
Li, Chenhao | ETH Zurich |
Hutter, Marco | ETH Zurich |
Keywords: Learning from Demonstration, Reinforcement Learning, Art and Entertainment Robotics
Abstract: As entertainment robots gain popularity, the demand for natural and expressive motion, particularly in dancing, continues to rise. Traditionally, dancing motions have been manually designed by artists, a process that is both labor-intensive and restricted to simple motion playback, lacking the flexibility to incorporate additional tasks such as locomotion or gaze control during dancing. To overcome these challenges, we introduce Deep Fourier Mimic (DFM), a novel method that combines advanced motion representation with Reinforcement Learning (RL) to enable smooth transitions between motions while concurrently managing auxiliary tasks during dance sequences. While previous frequency-domain motion representations have successfully encoded dance motions into latent parameters, they often impose overly rigid periodic assumptions at the local level, resulting in reduced tracking accuracy and motion expressiveness, which is a critical aspect for entertainment robots. By relaxing these locally periodic constraints, our approach not only enhances tracking precision but also facilitates smooth transitions between different motions. Furthermore, the learned RL policy that supports simultaneous base activities, such as locomotion and gaze control, allows entertainment robots to engage more dynamically and interactively with users rather than merely replaying static, pre-designed dance routines.
|
|
15:40-15:45, Paper WeDT23.6 | |
Uncertainty-Aware Deep Reinforcement Learning with Calibrated Quantile Regression and Evidential Learning |
|
Stutts, Alex Christopher | University of Illinois Chicago |
Erricolo, Danilo | University of Illinois at Chicago |
Tulabandhula, Theja | University of Illinois Chicago |
Mittal, Mohit | Meta Reality Labs |
Trivedi, Amit Ranjan | University of Illinois at Chicago (UIC), Chicago, USA |
Keywords: Deep Learning Methods, Reinforcement Learning, Planning under Uncertainty
Abstract: We present a novel statistical approach to incorporate uncertainty awareness in model-free distributional deep reinforcement learning for mission- and safety-critical robotics. Deep learning predictions are influenced by uncertainties in the data, termed aleatoric uncertainties, as well as uncertainties in the learning process and model structure, known as epistemic uncertainties. The proposed algorithm, called Calibrated Evidential Quantile Regression in Deep-Q Networks (CEQR-DQN), addresses key challenges associated with separately estimating aleatoric and epistemic uncertainty in stochastic robotic environments. It combines deep evidential learning with quantile calibration based on the principles of conformal inference to provide explicit, sample-free computations of global uncertainty as opposed to local estimates based on simple variance. Thereby, the proposed approach overcomes limitations of traditional methods in computational and statistical efficiency and handling of out-of-distribution (OOD) observations. Tested on a suite of representative miniaturized Atari games (i.e., MinAtar), CEQR-DQN is shown to surpass similar existing frameworks in scores and learning speed. Its ability to rigorously evaluate uncertainties improves exploration strategies and can serve as a blueprint for other uncertainty-aware robotic algorithms.
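As a reference point for the quantile-calibration ingredient, the split-conformal correction of an upper-quantile predictor is sketched below: compute nonconformity scores on a held-out calibration set and shift future quantile predictions by their finite-sample-adjusted quantile. CEQR-DQN combines such calibration with evidential learning inside distributional RL; this snippet covers only the conformal step, with illustrative names.

```python
import numpy as np

def calibrate_quantile(pred_quantiles, y_calib, alpha=0.1):
    """Additive correction for a (1 - alpha) upper-quantile predictor.

    pred_quantiles : (N,) predicted quantiles on calibration inputs.
    y_calib        : (N,) observed outcomes on the same inputs.
    """
    scores = y_calib - pred_quantiles                    # nonconformity scores
    n = len(scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n         # finite-sample adjustment
    return np.quantile(scores, min(q_level, 1.0), method="higher")
```

Adding the returned correction to future quantile predictions yields approximately valid (1 - alpha) coverage under the usual exchangeability assumption.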
|
|
15:45-15:50, Paper WeDT23.7 | |
Teaching Periodic Stable Robot Motions Generation Via Sketch |
|
Zhi, Weiming | Carnegie Mellon University |
Tang, Haozhan | Carnegie Mellon University |
Zhang, Tianyi | Carnegie Mellon University |
Johnson-Roberson, Matthew | Carnegie Mellon University |
Keywords: Machine Learning for Robot Control, Learning from Demonstration
Abstract: Contemporary robots are complex systems. Teaching novel motion patterns to robots requires specialised expertise, often entailing the careful specification of robot motion or the cumbersome design of optimisation problems. In this paper, we seek to simplify the process of generating periodic motions, by teaching robots with user sketches. In particular, we tackle the problem of teaching a robot to approach a surface and then follow cyclic motion on the surface. The limit cycle of the motion can be arbitrarily specified by a single user-provided sketch over an image from the robot’s camera, and the sketched limit cycle is then projected into the scene. To generate motion that converges to the limit cycle, we contribute the Stable Periodic Diagrammatic Teaching (SPDT) framework. SPDT models the robot’s motion as an Orbitally Asymptotically Stable (O.A.S.) dynamical system that learns to stabilise based on the diagrammatic sketch provided by the user. This is achieved by applying a differentiable and invertible function, known as a diffeomorphism, to shape a known O.A.S. system. The parameterised diffeomorphism is then optimised with respect to the Hausdorff distance between the limit cycle of our modelled system and the sketch, to produce the desired robot motion. We provide insight into the behaviour of the optimised system and empirically evaluate SPDT. Results show that we can diagrammatically teach complex cyclic motion patterns with accuracy.
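The objective named in the abstract, the Hausdorff distance between the modelled limit cycle and the user's sketch, can be evaluated with SciPy as below. Note that SPDT needs a differentiable (or smoothed) surrogate to optimise the diffeomorphism; SciPy's routine only evaluates the metric, and the function name here is illustrative.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(cycle_pts, sketch_pts):
    """Symmetric Hausdorff distance between a rolled-out limit cycle and the
    user's projected sketch, both given as (N, 3) point sets."""
    d_ab = directed_hausdorff(cycle_pts, sketch_pts)[0]
    d_ba = directed_hausdorff(sketch_pts, cycle_pts)[0]
    return max(d_ab, d_ba)
```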
|
|
WeET1 |
302 |
Autonomous Vehicles 2 |
Regular Session |
Chair: Ang Jr, Marcelo H | National University of Singapore |
Co-Chair: Shi, Weisong | University of Delaware |
|
16:35-16:40, Paper WeET1.1 | |
DriveSceneGen: Generating Diverse and Realistic Driving Scenarios from Scratch |
|
Sun, Shuo | National University of Singapore |
Gu, Zekai | National University of Singapore |
Sun, Tianchen | National University of Singapore |
Sun, Jiawei | National University of Singapore |
Yuan, Chengran | National University of Singapore |
Han, Yuhang | National University of Singapore |
Li, Dongen | National University of Singapore |
Ang Jr, Marcelo H | National University of Singapore |
Keywords: Big Data in Robotics and Automation, Simulation and Animation, Intelligent Transportation Systems
Abstract: Realistic and diverse traffic scenarios in large quantities are crucial for the development and validation of autonomous driving systems. However, owing to numerous difficulties in the data collection process and the reliance on intensive annotations, real-world datasets lack sufficient quantity and diversity to support the increasing demand for data. This work introduces DriveSceneGen, a data-driven driving scenario generation method that learns from the real-world driving dataset and generates entire dynamic driving scenarios from scratch. Experimental results on 5k generated scenarios highlight that DriveSceneGen is able to generate novel driving scenarios that align with real-world data distributions with high fidelity and diversity. To the best of our knowledge, DriveSceneGen is the first method that generates novel driving scenarios involving both static map elements and dynamic traffic participants from scratch. Extensive experiments demonstrate that our two-stage method outperforms existing state-of-the-art map generation methods and trajectory simulation methods on their respective tasks.
|
|
16:40-16:45, Paper WeET1.2 | |
AMVP: Adaptive Multi-Volume Primitives for Auto-Driving Novel View Synthesis |
|
Qi, Dexin | Xi'an Jiaotong University |
Tao, Tao | Xi'an Jiaotong University |
Zhang, Zhihong | Xi'an Jiaotong University |
Mei, Xuesong | Xi'an Jiaotong University |
Keywords: Deep Learning Methods, Visual Learning
Abstract: Synthesizing high-quality novel views is critical to extending training data for auto-driving scenes. However, existing novel view synthesis techniques rely on a single-volume radiance field with uniform spatial resolution, constraining their model capacity and resulting in artifacts in synthesized auto-driving views. This paper introduces AMVP, a novel neural representation that models auto-driving scenes using multiple local primitives with adaptive spatial resolution. AMVP addresses the lack of representation capability of detail-rich regions by adaptively subdividing the scene into multiple local volumes. Each local volume is assigned a tailored resolution based on its geometric complexity, as determined by a density prior. Subsequently, multi-volume primitives are introduced to enable sharing a global feature table among local volumes, addressing the GPU memory inefficiency caused by the duplicated allocation. In addition, the paper proposes resolution-aware confidence, a mechanism that suppresses artifacts arising from frequency ambiguity. This mechanism adaptively reduces high-frequency components based on the spatial resolution of each local volume and the distance of the sampling point from the optical center. Experimental results on benchmark auto-driving datasets demonstrate that the proposed AMVP achieves superior rendering quality while using a similar number of parameters compared to existing methods.
|
|
16:45-16:50, Paper WeET1.3 | |
EMATO: Energy-Model-Aware Trajectory Optimization for Autonomous Driving |
|
Tian, Zhaofeng | University of Delaware |
Xia, Lichen | University of Delaware |
Shi, Weisong | University of Delaware |
Keywords: Energy and Environment-Aware Automation, Autonomous Vehicle Navigation, Motion and Path Planning
Abstract: Autonomous driving currently lacks robust evidence of energy efficiency when using energy-model-agnostic trajectory planning. To address this, we explore how differential energy models can be effectively utilized under varying driving conditions to enhance energy efficiency. Furthermore, we propose an online nonlinear programming approach that optimizes polynomial trajectories generated by the Frenet polynomial method while incorporating traffic trajectory data and road slope predictions. Through case studies, quantitative analyses, and ablation studies conducted on both sedan and truck models, we demonstrate the effectiveness of the proposed method.
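For context on the trajectories being optimised, the Frenet polynomial method generates candidates such as the quintic below, whose six coefficients satisfy position, velocity, and acceleration boundary conditions at both ends of a segment. EMATO then refines such candidates with an energy-model-aware nonlinear program; this sketch covers only the polynomial-generation step and uses illustrative names.

```python
import numpy as np

def quintic_coeffs(s0, v0, a0, sT, vT, aT, T):
    """Coefficients c0..c5 of s(t) = sum c_k t^k meeting boundary conditions at 0 and T."""
    A = np.array([
        [T**3,     T**4,      T**5],
        [3 * T**2, 4 * T**3,  5 * T**4],
        [6 * T,    12 * T**2, 20 * T**3],
    ])
    b = np.array([
        sT - (s0 + v0 * T + 0.5 * a0 * T**2),   # position residual at T
        vT - (v0 + a0 * T),                     # velocity residual at T
        aT - a0,                                # acceleration residual at T
    ])
    c3, c4, c5 = np.linalg.solve(A, b)
    return np.array([s0, v0, 0.5 * a0, c3, c4, c5])
```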
|
|
16:50-16:55, Paper WeET1.4 | |
Task-Oriented Pre-Training for Drivable Area Detection |
|
Ma, Fulong | The Hong Kong University of Science and Technology |
Zhao, Guoyang | HKUST(GZ) |
Qi, Weiqing | HKUST |
Liu, Ming | Hong Kong University of Science and Technology (Guangzhou) |
Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Intelligent Transportation Systems, Object Detection, Segmentation and Categorization, Semantic Scene Understanding
Abstract: Pre-training techniques play a crucial role in deep learning, enhancing models' performance across a variety of tasks. By initially training on large datasets and subsequently fine-tuning on task-specific data, pre-training provides a solid foundation for models, improving generalization abilities and accelerating convergence rates. This approach has seen significant success in the fields of natural language processing and computer vision. However, traditional pre-training methods necessitate large datasets and substantial computational resources, and they can only learn shared features through prolonged training and struggle to capture deeper, task-specific features. In this paper, we propose a task-oriented pre-training method that begins with generating redundant segmentation proposals using the Segment Anything (SAM) model. We then introduce a Specific Category Enhancement Fine-tuning (SCEF) strategy for fine-tuning the Contrastive Language-Image Pre-training (CLIP) model to select proposals most closely related to the drivable area from those generated by SAM. This approach can generate a large amount of coarse training data for pre-training models, which are further fine-tuned using manually annotated data, thereby improving the model's performance. Comprehensive experiments conducted on the KITTI road dataset demonstrate that our task-oriented pre-training method achieves an all-around performance improvement compared to models without pre-training. Moreover, our pre-training method not only surpasses traditional pre-training approaches but also achieves the best performance compared to state-of-the-art self-training methods.
|
|
16:55-17:00, Paper WeET1.5 | |
UA-PnP: Uncertainty-Aware End-To-End Bird's Eye View Visual Perception and Prediction for Autonomous Driving |
|
Huang, Zijian | Southern University of Science and Technology |
Li, Dachuan | Southern University of Science and Technology |
Hao, Qi | Southern University of Science and Technology |
Keywords: Intelligent Transportation Systems, Computer Vision for Transportation
Abstract: Robust and accurate perception and prediction of the driving scenarios are crucial for autonomous driving vehicles (ADV). State-of-the-art ADV frameworks have evolved from conventional modular design to an end-to-end (E2E) pipeline that enables joint feature learning and optimization. However, the evaluation of uncertainties in the intermediate features propagated between perception and prediction units is missing in current E2E pipelines. Consequently, adverse and extreme environment factors may incur highly untrustworthy features that ultimately result in degraded perception and prediction. In this work, we propose a novel uncertainty-aware E2E visual perception and prediction framework that utilizes Bird's Eye View (BEV) representations. A feature distribution estimation network is introduced to explicitly quantify the uncertainties in the intermediate BEV features extracted from the images. To better exploit temporal information and generate more robust features for scene prediction, an uncertainty-aware transformer is designed to utilize the guidance of the quantified feature uncertainty via the attention mechanism. In addition, an evidential decoder generates accurate future instance segmentations along with the associated uncertainties. Comprehensive experiments conducted on a real-world dataset validate the superiority of our proposed framework over conventional pipelines. Codes are available at: https://github.com/Huang121381/UA-PnP.
|
|
17:00-17:05, Paper WeET1.6 | |
HGAT-CP: Heterogeneous Graph Attention Network for Collision Prediction in Autonomous Driving |
|
Jiang, Yongzhi | Beihang University |
Zhou, Bin | Beihang University |
Li, Yongwei | Beihang University |
Wu, Xinkai | Beihang University |
Xiong, Zhongxia | Beihang University |
Keywords: Intelligent Transportation Systems, Collision Avoidance, Autonomous Vehicle Navigation
Abstract: Predicting potential collision events is beneficial to ensure the driving safety of autonomous vehicles. Existing graph-based collision prediction methods rely heavily on domain knowledge and predefined semantic relations, limiting their flexibility and adaptability in complex driving scenarios. To overcome these challenges, this paper introduces a novel collision prediction framework named HGAT-CP, which integrates a Heterogeneous Graph Attention Network (HGAT) with a Long Short-Term Memory network (LSTM) to model the spatial-temporal interactions in scenes. First, the proposed method employs a data-driven scene graph embedding module to autonomously learn relationships between vehicles and lanes and construct flexible scene graphs. Then, the HGAT module utilizes a dual-level attention mechanism, operating at both the node level and type level, to capture spatial interactions without relying on predefined semantic rules. The LSTM module models temporal dependencies of the scene graph embeddings to improve the prediction of collision events over time. Experimental evaluations on public datasets demonstrate that our proposed method achieves state-of-the-art performance, outperforming existing methods across all metrics.
|
|
17:05-17:10, Paper WeET1.7 | |
SE-STDGNN: A Self-Evolving Spatial-Temporal Directed Graph Neural Network for Multi-Vehicle Trajectory Prediction |
|
Guo, Zixuan | The Chinese University of Hong Kong |
Han, Bingxin | The Chinese University of Hong Kong |
Huang, Yijun | The Chinese University of Hong Kong |
Chen, Xi | The Chinese University of Hong Kong |
Chen, Ben M. | Chinese University of Hong Kong |
Keywords: Intelligent Transportation Systems, Deep Learning Methods, Automation Technologies for Smart Cities
Abstract: Vehicle trajectory prediction (VTP) is essential for microscopic traffic risk assessment, autonomous vehicle navigation, and traffic behavior analysis. Related research leveraging learning-based methodologies has yielded notable success on various benchmark trajectory datasets. However, these models often experience performance degradation when faced with dynamic changes in traffic conditions such as vehicle density, road types, and weather conditions, as they have not been exposed to these variations during the training process. To effectively address the need for real-time adaptation in dynamic traffic scenarios, we propose a novel framework titled self-evolving spatial-temporal directed graph neural network (SE-STDGNN). This model utilizes evolving graph convolution networks (EvolveGCNs) to aggregate spatial-temporal features of vehicles and their neighbors, which are then utilized by a trajectory prediction module to forecast future trajectories. Further, a self-evolving mechanism is introduced to adjust model parameters dynamically during real-time operation. The efficacy of SE-STDGNN is validated using the public vehicle trajectory dataset AD4CHE.
|
|
17:10-17:15, Paper WeET1.8 | |
A Generalized Control Revision Method for Autonomous Driving Safety |
|
Zhu, Zehang | Tsinghua University |
Wang, Yuning | Tsinghua University |
Ke, Tianqi | School of Vehicle and Mobility, Tsinghua University |
Han, Zeyu | Tsinghua University |
Xu, Shaobing | Tsinghua University |
Xu, Qing | Tsinghua University |
Dolan, John M. | Carnegie Mellon University |
Wang, Jianqiang | Tsinghua University |
Keywords: Intelligent Transportation Systems, Robot Safety, Collision Avoidance
Abstract: Safety is one of the most crucial challenges of autonomous driving vehicles, and one solution to guarantee safety is to employ an additional control revision module after the planning backbone. The Control Barrier Function (CBF) has been widely used because of its strong mathematical foundation on safety. However, the incompatibility with heterogeneous perception data and incomplete consideration of traffic scene elements make existing systems difficult to apply in dynamic and complex real-world scenarios. In this study, we introduce a generalized control revision method for autonomous driving safety, which adopts both vectorized perception and occupancy grid maps as inputs and comprehensively models multiple types of traffic scene constraints based on a newly proposed barrier function. Traffic elements are integrated into one unified framework, decoupled from specific scenario settings or rules. Experiments on the CARLA, SUMO, and OnSite simulators show that the proposed algorithm can realize safe control revision under complicated scenes, adapting to various planning backbones, road topologies, and risk types. Physical platform validation also verifies its real-world application feasibility.
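As background on the control-revision idea, the sketch below shows the standard single-constraint control barrier function filter in closed form: keep the planner's command unless it violates the barrier condition, in which case project it onto the safe half-space. This is the textbook CBF quadratic program, not the paper's generalized barrier function; all names and numbers are illustrative.

import numpy as np

def cbf_revise(u_nom, lg_h, lf_h, h, alpha=1.0):
    # Enforce lf_h + lg_h @ u + alpha * h >= 0 while staying as close as
    # possible to the nominal command u_nom (one affine constraint, so the
    # QP reduces to a projection onto a half-space).
    a = np.asarray(lg_h, dtype=float)
    residual = lf_h + a @ u_nom + alpha * h
    if residual >= 0.0 or np.allclose(a, 0.0):
        return u_nom  # nominal command already satisfies the barrier condition
    return u_nom - (residual / (a @ a)) * a  # minimal revision onto the boundary

# Illustrative call: the nominal command violates the condition and gets revised.
u_safe = cbf_revise(u_nom=np.array([2.0, 0.3]), lg_h=np.array([1.0, 0.0]),
                    lf_h=-3.0, h=0.2)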
|
|
WeET2 |
301 |
Learning-Based SLAM 2 |
Regular Session |
Chair: Kim, Donghyun | University of Massachusetts Amherst |
|
16:35-16:40, Paper WeET2.1 | |
H3-Mapping: Quasi-Heterogeneous Feature Grids for Real-Time Dense Mapping Using Hierarchical Hybrid Representation |
|
Jiang, Chenxing | The Hong Kong University of Science and Technology |
Luo, Yiming | The University of Hong Kong |
Zhou, Boyu | Southern University of Science and Technology |
Shen, Shaojie | Hong Kong University of Science and Technology |
Keywords: Mapping, RGB-D Perception, Visual Learning
Abstract: In recent years, implicit online dense mapping methods have achieved high-quality reconstruction results, showcasing great potential in robotics, AR/VR, and digital twins applications. However, existing methods struggle with slow texture modeling which limits their real-time performance. To address these limitations, we propose a NeRF-based dense mapping method that enables faster and higher-quality reconstruction. To improve texture modeling, we introduce quasi-heterogeneous feature grids, which inherit the fast querying ability of uniform feature grids while adapting to varying levels of texture complexity. Besides, we present a gradient-aided coverage-maximizing strategy for keyframe selection that enables the selected keyframes to exhibit a closer focus on rich-textured regions and a broader scope for weak-textured areas. Experimental results demonstrate that our method surpasses existing NeRF-based approaches in texture fidelity, geometry accuracy, and time consumption. The code for our method will be available at: https://github.com/SYSU-STAR/H3-Mapping.
|
|
16:40-16:45, Paper WeET2.2 | |
CEAR: Comprehensive Event Camera Dataset for Rapid Perception of Agile Quadruped Robots |
|
Zhu, Shifan | University of Massachusetts Amherst |
Xiong, Zixun | University of Massachusetts Amherst |
Kim, Donghyun | University of Massachusetts Amherst |
Keywords: Data Sets for SLAM, Data Sets for Robotic Vision, Legged Robots
Abstract: When legged robots perform agile movements, traditional RGB cameras often produce blurred images, posing a challenge for rapid perception. Event cameras have emerged as a promising solution for capturing rapid perception and coping with challenging lighting conditions thanks to their low latency, high temporal resolution, and high dynamic range. However, integrating event cameras into agile-legged robots is still largely unexplored. Notably, no dataset including event cameras has yet been developed for the context of agile quadruped robots. To bridge this gap, we introduce CEAR, a dataset comprising data from an event camera, an RGB-D camera, an IMU, a LiDAR, and joint encoders, all mounted on a dynamic quadruped, Mini Cheetah robot. This comprehensive dataset features more than 100 sequences from real-world environments, encompassing various indoor and outdoor environments, different lighting conditions, a range of robot gaits (e.g., trotting, bounding, pronking), as well as acrobatic movements like backflip. To our knowledge, this is the first event camera dataset capturing the dynamic and diverse quadruped robot motions under various setups, developed to advance research in rapid perception for quadruped robots.
|
|
16:45-16:50, Paper WeET2.3 | |
DVLO4D: Deep Visual-Lidar Odometry with Sparse Spatial-Temporal Fusion |
|
Liu, Mengmeng | University of Twente |
Yang, Michael Ying | University of Bath |
Liu, Jiuming | Shanghai Jiao Tong University |
Zhang, Yunpeng | PhiGent Robotics |
Li, Jiangtao | Phigent Robotics |
Sander, Oude Elberink | University of Twente |
Vosselman, George | University of Twente |
Cheng, Hao | University of Twente |
Keywords: Localization, Autonomous Agents, SLAM
Abstract: Visual-LiDAR odometry is a critical component for autonomous system localization, yet achieving high accuracy and strong robustness remains a challenge. Traditional approaches commonly struggle with sensor misalignment, fail to fully leverage temporal information, and require extensive manual tuning to handle diverse sensor configurations. To address these problems, we introduce DVLO4D, a novel visual-LiDAR odometry framework that leverages sparse spatial-temporal fusion to enhance accuracy and robustness. Our approach proposes three key innovations: (1) Sparse Query Fusion, which utilizes sparse LiDAR queries for effective multi-modal data fusion; (2) a Temporal Interaction and Update module that integrates temporally predicted positions with current frame data, providing better initialization values for pose estimation and enhancing the model's robustness against accumulated errors; and (3) a Temporal Clip Training strategy combined with a Collective Average Loss mechanism that aggregates losses across multiple frames, enabling global optimization and reducing scale drift over long sequences. Extensive experiments on the KITTI and Argoverse odometry datasets demonstrate the superiority of the proposed DVLO4D, which achieves state-of-the-art performance in terms of both pose accuracy and robustness. Additionally, our method is highly efficient, with an inference time of 82 ms, making it well suited for real-time deployment.
|
|
16:50-16:55, Paper WeET2.4 | |
Hier-SLAM: Scaling-Up Semantics in SLAM with a Hierarchically Categorical Gaussian Splatting |
|
Li, Boying | Shanghai Jiao Tong University |
Cai, Zhixi | Monash University |
Li, Yuan-Fang | Monash University |
Reid, Ian | University of Adelaide |
Rezatofighi, Hamid | Monash University |
Keywords: SLAM, Semantic Scene Understanding, Deep Learning for Visual Perception
Abstract: We propose Hier-SLAM, a semantic 3D Gaussian Splatting SLAM method featuring a novel hierarchical categorical representation, which enables accurate global 3D semantic mapping, scaling-up capability, and explicit semantic label prediction in the 3D world. The parameter usage in semantic SLAM systems increases significantly with the growing complexity of the environment, making it particularly challenging and costly for scene understanding. To address this problem, we introduce a novel hierarchical representation that encodes semantic information in a compact form into 3D Gaussian Splatting, leveraging the capabilities of large language models (LLMs). We further introduce a novel semantic loss designed to optimize hierarchical semantic information through both inter-level and cross-level optimization. Furthermore, we enhance the whole SLAM system, resulting in improved tracking and mapping performance. Our Hier-SLAM outperforms existing dense SLAM methods in both mapping and tracking accuracy, while achieving a 2x operation speed-up. Additionally, it achieves on-par semantic rendering performance compared to existing methods while significantly reducing storage and training time requirements. Rendering FPS impressively reaches 2,000 with semantic information and 3,000 without it. Most notably, it showcases the capability of handling the complex real-world scene with more than 500 semantic classes, highlighting its valuable scaling-up capability. The open-source code is available at https://github.com/LeeBY68/Hier-SLAM.
|
|
16:55-17:00, Paper WeET2.5 | |
CLIP-Clique: Graph-Based Correspondence Matching Augmented by Vision Language Models for Object-Based Global Localization |
|
Matsuzaki, Shigemichi | Toyota Motor Corporation |
Tanaka, Kazuhito | Toyota Motor Corporation |
Shintani, Kazuhiro | Toyota Motor Corporation |
Keywords: Localization, Semantic Scene Understanding, RGB-D Perception
Abstract: This paper proposes a method of global localization on a map with semantic object landmarks. One of the most promising approaches for localization on object maps is to use semantic graph matching using landmark descriptors calculated from the distribution of surrounding objects. These descriptors are vulnerable to misclassification and partial observations. Moreover, many existing methods rely on inlier extraction using RANSAC, which is stochastic and prone to a high outlier rate. To address the former issue, we augment the correspondence matching using Vision Language Models (VLMs). Landmark discriminability is improved by VLM embeddings, which are independent of surrounding objects. In addition, inliers are estimated deterministically using a graph-theoretic approach. We also incorporate pose calculation using the weighted least squares considering correspondence similarity and observation completeness to improve the robustness. We confirmed improvements in matching and pose estimation accuracy through experiments on ScanNet and TUM datasets.
|
|
17:00-17:05, Paper WeET2.6 | |
CLOi-Mapper: Consistent, Lightweight, Robust, and Incremental Mapper with Embedded Systems for Commercial Robot Services |
|
Noh, DongKi | LG Electronics Inc |
Lim, Hyungtae | Massachusetts Institute of Technology |
Eoh, Gyuho | Tech University of Korea |
Choi, Duckyu | KAIST |
Choi, Jeong-Sik | Seoul National University |
Lim, Hyunjun | Electronics and Telecommunication Research Institute |
Baek, Seung-Min | LG Electronics |
Myung, Hyun | KAIST (Korea Advanced Institute of Science and Technology) |
Keywords: Service Robotics, Embedded Systems for Robotic and Automation, Mapping
Abstract: In commercial autonomous service robots with several form factors, simultaneous localization and mapping (SLAM) is an essential technology for providing proper services such as cleaning and guidance. Such robots require SLAM algorithms suitable for specific applications and environments. Hence, several SLAM frameworks have been proposed to address various requirements in the past decade. However, we have encountered challenges in implementing recent innovative frameworks when handling service robots with low-end processors and insufficient sensor data, such as low-resolution 2D LiDAR sensors. Specifically, regarding commercial robots, consistent performance in different hardware configurations and environments is more crucial than the performance dedicated to specific sensors or environments. Therefore, we propose a) a multi-stage approach for global pose estimation in embedded systems; b) a graph generation method with zero constraints for synchronized sensors; and c) a robust and memory-efficient method for long-term pose-graph optimization. As verified in in-home and large-scale indoor environments, the proposed method yields consistent global pose estimation for services in commercial fields. Furthermore, the proposed method exhibits potential commercial viability considering the consistent performance verified via mass production and long-term (> 5 years) operation.
|
|
17:05-17:10, Paper WeET2.7 | |
D2S: Representing Sparse Descriptors and 3D Coordinates for Camera Relocalization |
|
Bui, Bach-Thuan | Ritsumeikan University |
Bui, Huy Hoang | Ritsumeikan University |
Tran, Dinh Tuan | College of Information Science and Engineering, Ritsumeikan Univ |
Lee, Joo-Ho | Ritsumeikan University |
Keywords: Localization, Mapping, Vision-Based Navigation
Abstract: State-of-the-art visual localization methods mostly rely on complex procedures to match local descriptors and 3D point clouds. However, these procedures can incur significant costs in terms of inference, storage, and updates over time. In this study, we propose a direct learning-based approach that utilizes a simple network named D2S to represent complex local descriptors and their scene coordinates. Our method is characterized by its simplicity and cost-effectiveness. It solely leverages a single RGB image for localization during the testing phase and only requires a lightweight model to encode a complex sparse scene. The proposed D2S employs a combination of a simple loss function and graph attention to selectively focus on robust descriptors while disregarding areas such as clouds, trees, and several dynamic objects. This selective attention enables D2S to effectively perform a binary-semantic classification for sparse descriptors. Additionally, we propose a simple outdoor dataset to evaluate the capabilities of visual localization methods in scene-specific generalization and self-updating from unlabeled observations. Our approach outperforms the previous regression-based methods in both indoor and outdoor environments. It demonstrates the ability to generalize beyond training data, including scenarios involving transitions from day to night and adapting to domain shifts. The source code, trained models, dataset, and demo videos are available at the following link: https://thpjp.github.io/d2s
|
|
WeET3 |
303 |
Offroad Navigation |
Regular Session |
|
16:35-16:40, Paper WeET3.1 | |
CAHSOR: Competence-Aware High-Speed Off-Road Ground Navigation in SE(3) |
|
Pokhrel, Anuj | George Mason University |
Nazeri, Mohammad | George Mason University |
Datar, Aniket | George Mason University |
Xiao, Xuesu | George Mason University |
Keywords: Autonomous Vehicle Navigation, Representation Learning, Field Robots
Abstract: While the workspace of traditional ground vehicles is usually assumed to be in a 2D plane, i.e., SE(2), such an assumption may not hold when they drive at high speeds on unstructured off-road terrain: High-speed sharp turns on high-friction surfaces may lead to vehicle rollover; Turning aggressively on loose gravel or grass may violate the non-holonomic constraint and cause significant lateral sliding; Driving quickly on rugged terrain will produce extensive vibration along the vertical axis. Therefore, most off-road vehicles are currently limited to driving only at low speeds to assure vehicle stability and safety. In this work, we aim at empowering high-speed off-road vehicles with competence awareness in SE(3) so that they can reason about the consequences of taking aggressive maneuvers on different terrain with a 6-DoF forward kinodynamic model. The kinodynamic model is learned from visual, speed, and inertial Terrain Representation for Off-road Navigation (TRON) using multimodal, self-supervised vehicle-terrain interactions. We demonstrate the efficacy of our Competence-Aware High-Speed Off-Road (CAHSOR) navigation approach on a physical ground robot in both autonomous navigation and a human shared-control setup and show that CAHSOR can efficiently reduce vehicle instability by 62% while only compromising 8.6% average speed with the help of TRON.
|
|
16:40-16:45, Paper WeET3.2 | |
ROD: RGB-Only Fast and Efficient Off-Road Freespace Detection |
|
Sun, Tong | University of Chinese Academy of Sciences |
Ye, Hongliang | Zhejiang Lab |
Mei, Jilin | Institute of Computing Technology, Chinese Academy of Sciences |
Chen, Liang | Institute of Computing Technology: Beijing, CN |
Zhao, Fangzhou | Institute of Computing Technology, Chinese Academy of Sciences |
Zong, Leiqiang | Beijing Special Vehicle Academy |
Hu, Yu | Institute of Computing Technology Chinese Academy of Sciences |
Keywords: Intelligent Transportation Systems, Deep Learning for Visual Perception
Abstract: Off-road freespace detection is more challenging than on-road scenarios because of the blurred boundaries of traversable areas. Previous state-of-the-art (SOTA) methods employ multi-modal fusion of RGB images and LiDAR data. However, due to the significant increase in inference time when calculating surface normal maps from LiDAR data, multi-modal methods are not suitable for real-time applications, particularly in real-world scenarios that require a higher FPS than slow navigation. This paper presents a novel RGB-only approach for off-road freespace detection, named ROD, eliminating the reliance on LiDAR data and its computational demands. Specifically, we utilize a pre-trained Vision Transformer (ViT) to extract rich features from RGB images. Additionally, we design a lightweight yet efficient decoder; together these components improve both precision and inference speed. ROD establishes a new SOTA on the ORFD and RELLIS-3D datasets while achieving an inference speed of 50 FPS, significantly outperforming prior models. Our code will be available at https://github.com/STLIFE97/offroad_roadseg.
|
|
16:45-16:50, Paper WeET3.3 | |
JORD: A Benchmark Dataset for Off-Road LiDAR Place Recognition and SLAM |
|
Zhou, Wei | Jilin University |
Zhang, Tongzhou | Jilin University |
Xu, Qian | China North Vehicle Research Institute |
Chen, Yu | Jilin University |
Hou, Minghui | Jilin University |
Wang, Gang | Jilin University |
Keywords: Data Sets for SLAM, SLAM, Mapping
Abstract: Simultaneous localization and mapping (SLAM) is a crucial component of unmanned systems, playing a key role in autonomous navigation. Currently, most LiDAR SLAM methods are focused on structured environments. However, highly irregular off-road terrain poses more challenges for LiDAR SLAM tasks, but these environments are not fully represented in existing datasets. To address this issue, we introduce the first dedicated LiDAR SLAM benchmark dataset for off-road environments, named the Jlurobot Off-Road Dataset (JORD). This dataset is collected using a custom avenger data collection platform in large-scale forest off-road scenes, consisting of 8 LiDAR sequences with a total length of approximately 6.07 kilometers, containing 49,144 point cloud frames along with accurate 6DoF ground truth. The dataset includes multiple revisits within the sequences, making it suitable for LiDAR place recognition and SLAM tasks. Furthermore, we employ several state-of-the-art methods for benchmarking to validate the dataset's challenges. The release of JORD aims to provide researchers with valuable resources to develop new approaches and explore novel directions for unmanned systems in off-road environments. The complete dataset and code are available at https://github.com/jiurobots/JORD.
|
|
16:50-16:55, Paper WeET3.4 | |
Self-Reflective Perceptual Adaptation for Robust Ground Navigation in Unstructured Off-Road Environments |
|
Siva, Sriram | US Army DEVCOM Army Research Laboratory |
Youngquist, Oscar | University of Massachusetts Amherst |
Wigness, Maggie | U.S. Army Research Laboratory |
Rogers III, John G. | US Army Research Laboratory |
Zhang, Hao | University of Massachusetts Amherst |
Keywords: Vision-Based Navigation, Field Robots, Deep Learning Methods
Abstract: Autonomous ground robots navigating unstructured off-road environments face perceptual challenges, such as sensor obscuration or failure, which can lead to inaccurate perception or navigation failures. While robot adaptation has recently gained increasing attention, self-reflective robot adaptation, where robots understand and adjust to their own sensor limitations, remains under-explored. This paper proposes a novel approach for self-reflective perceptual adaptation in order to enhance robust off-road navigation. Our approach enables a robot to identify its own perceptual difficulties and dynamically adapt in challenging environments. The key novelty is learning a modality-invariant perceptual representation that encodes shared sensor data into a compact feature space. Within this representation space, the robot's dynamics model is also learned, which enables accurate prediction of future navigation paths. Extensive experiments in off-road environments with sensor obstructions and failures demonstrate that our method significantly improves adaptive capabilities and outperforms baseline and state-of-the-art approaches.
|
|
16:55-17:00, Paper WeET3.5 | |
Dynamics Modeling Using Visual Terrain Features for High-Speed Autonomous Off-Road Driving |
|
Gibson, Jason | Georgia Institute of Technology |
Alavilli, Anoushka | Carnegie Mellon University |
Tevere, Erica | Jet Propulsion Laboratory, California Institute of Technology |
Theodorou, Evangelos | Georgia Institute of Technology |
Spieler, Patrick | JPL |
Keywords: Integrated Planning and Learning, Machine Learning for Robot Control, Motion and Path Planning
Abstract: Rapid autonomous traversal of unstructured terrain is essential for scenarios such as disaster response, search and rescue, or planetary exploration. As a vehicle navigates at the limit of its capabilities over extreme terrain, its dynamics can change suddenly and dramatically. For example, high-speed and varying terrain can affect parameters such as traction, tire slip, and rolling resistance. To achieve effective planning in such environments, it is crucial to have a dynamics model that can accurately anticipate these conditions. In this work, we present a hybrid model that predicts the changing dynamics induced by the terrain as a function of visual inputs. We leverage a pre-trained visual foundation model (VFM) such as DINOv2, which provides rich features that encode fine-grained semantic information. To use this dynamics model for planning, we propose an end-to-end training architecture for a projection-distance-independent feature encoder that compresses the information from the VFM, enabling the creation of a lightweight map of the environment at runtime. We validate our architecture on an extensive dataset (hundreds of kilometers of aggressive off-road driving) collected across multiple locations as part of the DARPA Robotic Autonomy in Complex Environments with Resiliency (RACER) program.
|
|
17:00-17:05, Paper WeET3.6 | |
Digital Twins Meet the Koopman Operator: Data-Driven Learning for Robust Autonomy |
|
Samak, Chinmay | Clemson University International Center for Automotive Research |
Samak, Tanmay | Clemson University International Center for Automotive Research |
Joglekar, Ajinkya | Clemson University |
Vaidya, Umesh | Clemson University |
Krovi, Venkat | Clemson University |
Keywords: Autonomous Vehicle Navigation, Model Learning for Control, Simulation and Animation
Abstract: Contrary to on-road autonomous navigation, off-road autonomy is complicated by various factors ranging from sensing challenges to terrain variability. In such a milieu, data-driven approaches have been commonly employed to capture intricate vehicle-environment interactions effectively. However, the success of data-driven methods depends crucially on the quality and quantity of data, which can be compromised by large variability in off-road environments. To address these concerns, we present a novel methodology to recreate the exact vehicle and its target operating conditions digitally for domain-specific data generation. This enables us to effectively model off-road vehicle dynamics from simulation data using the Koopman operator theory, and employ the obtained models for local motion planning and optimal vehicle control. The capabilities of the proposed methodology are demonstrated through an autonomous navigation problem of a 1:5 scale vehicle, where a terrain-informed planner is employed for global mission planning. Results indicate a substantial improvement in off-road navigation performance with the proposed algorithm (5.84x) and underscore the efficacy of digital twinning in terms of improving the sample efficiency (3.2x) and reducing the sim2real gap (5.2%).
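For context on the Koopman operator modeling referenced in the abstract, a minimal extended dynamic mode decomposition (EDMD) sketch is given below: lift state snapshots with a dictionary and solve a least-squares problem for the linear Koopman matrix. The dictionary, toy system, and names are assumptions for illustration, not taken from the paper.

import numpy as np

def edmd_koopman(X, Y, lift):
    # Given snapshot pairs (X[k], Y[k]) with Y[k] = F(X[k]), fit K so that
    # lift(Y[k]) is approximately K @ lift(X[k]) in a least-squares sense.
    PX = np.column_stack([lift(x) for x in X])  # lifted states, shape (d, N)
    PY = np.column_stack([lift(y) for y in Y])
    A, *_ = np.linalg.lstsq(PX.T, PY.T, rcond=None)
    return A.T

# Toy example: scalar system x_{k+1} = 0.9 x_k with a quadratic dictionary.
lift = lambda x: np.array([1.0, x[0], x[0]**2])
X = [np.array([v]) for v in np.linspace(-1.0, 1.0, 20)]
Y = [0.9 * x for x in X]
K = edmd_koopman(X, Y, lift)  # 3x3 linear predictor in the lifted space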
|
|
17:05-17:10, Paper WeET3.7 | |
Off-Road Freespace Detection with LiDAR-Camera Fusion and Self-Distillation |
|
Gu, Shuo | Nanjing University of Science and Technology |
Gao, Ming | Nanjing University of Science and Technology |
Keywords: Intelligent Transportation Systems, Semantic Scene Understanding, Sensor Fusion
Abstract: LiDAR-camera fusion has gradually become the mainstream for freespace detection in unstructured off-road environments. However, existing methods mainly use traditional techniques to densify the sparse LiDAR data in the perspective view, which introduces noise and limits the representation ability. In this paper, we propose a lightweight end-to-end freespace detection network with cascaded LiDAR-camera fusion and multi-scale self-distillation. It first performs sparse freespace detection in the range view, and then projects the range-view features onto the perspective view and densifies them. The dense features obtained are fused with camera images to get the final freespace detection results. In our method, the cascaded fusion strategy reduces the impact of resolution differences between LiDAR point clouds and camera images, as well as the introduction of noise during the data densification process. The multi-scale self-distillation strategy distills knowledge from the LiDAR-camera fusion module to the perspective-view module to further improve the freespace detection performance using LiDAR data only. Experiments on the off-road ORFD dataset demonstrate the effectiveness of the proposed cascaded fusion and multi-scale self-distillation strategies; our method obtains 93.4% IoU at speeds of more than 50 Hz. It also achieves state-of-the-art performance among all LiDAR-based freespace detection methods.
|
|
17:10-17:15, Paper WeET3.8 | |
Learning to Model and Plan for Wheeled Mobility on Vertically Challenging Terrain |
|
Datar, Aniket | George Mason University |
Pan, Chenhui | George Mason University |
Xiao, Xuesu | George Mason University |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Model Learning for Control
Abstract: Most autonomous navigation systems assume wheeled robots are rigid bodies and their 2D planar workspaces can be divided into free spaces and obstacles. However, recent wheeled mobility research, showing that wheeled platforms have the potential of moving over vertically challenging terrain (e.g., rocky outcroppings, rugged boulders, and fallen tree trunks), invalidate both assumptions. Navigating off-road vehicle chassis with long suspension travel and low tire pressure in places where the boundary between obstacles and free spaces is blurry requires precise 3D modeling of the interaction between the chassis and the terrain, which is complicated by suspension and tire deformation, varying tire-terrain friction, vehicle weight distribution and momentum, etc. In this paper, we present a learning approach to model wheeled mobility, i.e., in terms of vehicle-terrain forward dynamics, and plan feasible, stable, and efficient motion to drive over vertically challenging terrain without rolling over or getting stuck. We present physical experiments on two wheeled robots and show that planning using our learned model can achieve up to 60% improvement in navigation success rate and 46% reduction in unstable chassis roll and pitch angles.
|
|
WeET4 |
304 |
Sensor Fusion 4 |
Regular Session |
Chair: Choi, Hyouk Ryeol | Sungkyunkwan University |
Co-Chair: Huang, Guoquan (Paul) | University of Delaware |
|
16:35-16:40, Paper WeET4.1 | |
Dynamic Importance-Weighted Fusion Network Based on Dynamic Convolutions for Hand Posture Recognition: A Technique Based on Red, Green, Blue Plus Depth Cameras |
|
Qi, Jing | Beihang University |
Ma, Li | Hebei University |
Yu, Yushu | Beijing Institute of Technology |
Keywords: RGB-D Perception, Human-Robot Collaboration, Object Detection, Segmentation and Categorization
Abstract: Hand posture recognition enhances human-computer interaction, with existing algorithms mainly using RGB images or depth data. However, RGB images are affected by lighting and background, while depth data struggles to capture details, reducing accuracy. To address these issues, fusing RGB images and depth data has gained attention. Traditional fusion methods use fixed modal weights, which struggle to adapt to complex modal relationships, causing performance degradation. To resolve this, we propose a Fusion module incorporating Multi-Scale Gated Extraction (MSGE) for multi-scale feature extraction and gating, Context Sensitive Dynamic Filtering (CSDF) for dynamic weight adjustment based on modal importance, and Importance Weighted Fusion (IWF) for adaptive weighting. Building on this module, this paper proposes a network that fuses RGB information and depth data, named the Dynamic Importance-Weighted Fusion Network (DIWFNet). This network utilizes a dual-branch YOLOv5 framework integrated with four Fusion modules, fully leveraging the complementary nature of RGB images and depth data. Through dynamic weight distribution and adaptive feature convolution, it precisely captures and models the complex interactions between different modalities, enhancing the accuracy and robustness of hand posture recognition. Our method has shown excellent performance on the CUG dataset, the NTU dataset, and a self-built dataset, and has been successfully applied to robots in real operational environments.
|
|
16:40-16:45, Paper WeET4.2 | |
Robust 4D Radar-Aided Inertial Navigation for Aerial Vehicles |
|
Zhu, Jinwen | Meituan Inc |
Hu, Jun | Meituan Inc |
Zhao, Xudong | Meituan Inc |
Lang, Xiaoming | Meituan |
Mao, Yinian | Meituan-Dianping Group |
Huang, Guoquan (Paul) | University of Delaware |
Keywords: SLAM, Localization
Abstract: While LiDAR and cameras are becoming ubiquitous on unmanned aerial vehicles (UAVs), they can be ineffective in challenging environments; 4D millimeter-wave (MMW) radars, which can provide robust 3D ranging and Doppler velocity measurements, remain less exploited for aerial navigation. In this paper, we develop an efficient and robust error-state Kalman filter (ESKF)-based radar-inertial navigation system for UAVs. The key idea of the proposed approach is point-to-distribution radar scan matching, which provides motion constraints with proper uncertainty quantification; these constraints are used to update the navigation states in a tightly coupled manner, along with the Doppler velocity measurements. Moreover, we propose a robust keyframe-based matching scheme against the prior map to bound the cumulative navigation errors and provide a radar-based global localization solution with high accuracy. Extensive real-world experimental validations have demonstrated that the proposed radar-aided inertial navigation outperforms state-of-the-art methods in both accuracy and robustness.
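To illustrate how a single Doppler (radial velocity) return can constrain a filter's velocity state, here is a generic scalar Kalman-style update in Python. The sign convention, noise level, and function names are assumptions for illustration and do not reproduce the authors' ESKF formulation.

import numpy as np

def doppler_update(v_est, P, p_point, v_radial, sigma=0.2):
    # A radar return at position p_point (sensor frame) measures the radial
    # speed, modeled here as the projection of the sensor velocity onto the
    # negative unit direction of the point (sign convention assumed).
    d = -p_point / np.linalg.norm(p_point)
    H = d.reshape(1, 3)                      # 1x3 measurement Jacobian
    S = float(H @ P @ H.T) + sigma**2        # innovation variance
    K = (P @ H.T) / S                        # 3x1 Kalman gain
    v_new = v_est + (K * (v_radial - float(H @ v_est))).ravel()
    P_new = (np.eye(3) - K @ H) @ P
    return v_new, P_new

# Illustrative call with assumed numbers.
v, P = doppler_update(np.array([1.0, 0.0, 0.0]), 0.5 * np.eye(3),
                      np.array([5.0, 1.0, 0.2]), v_radial=-0.9)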
|
|
16:45-16:50, Paper WeET4.3 | |
Semi-Elastic LiDAR-Inertial Odometry |
|
Yuan, Zikang | Huazhong University, Wuhan, 430073, China |
Lang, Fengtian | Huazhong University of Science and Technology |
Xu, Tianle | Huazhong University of Science and Technology |
Ming, Ruiye | Huazhong University of Science and Technology |
Zhao, Chengwei | Hangzhou Guochen Robot Technology Company Limited |
Yang, Xin | Huazhong University of Science and Technology |
Keywords: SLAM, Localization, Sensor Fusion
Abstract: This work proposes a semi-elastic optimization-based LiDAR-inertial state estimation method, which balances the constraints from LiDAR, the IMU, and consistency according to their unique characteristics, thereby imparting appropriate elasticity that allows the current state to be optimized toward the correct value and ensuring the accuracy, consistency, and robustness of state estimation. We incorporate the proposed LiDAR-inertial state estimation method into a self-developed optimization-based LiDAR-inertial odometry (LIO) framework. Experimental results on four public datasets demonstrate that the proposed method enhances the performance of optimization-based LiDAR-inertial state estimation. We have released the source code of this work for the benefit of the community.
|
|
16:50-16:55, Paper WeET4.4 | |
DOGE: An Extrinsic Orientation and Gyroscope Bias Estimation for Visual-Inertial Odometry Initialization |
|
Xu, Zewen | Institute of Automation, Chinese Academy of Science |
He, Yijia | TCL RayNeo |
Wei, Hao | University of Chinese Academy of Sciences |
Wu, Yihong | National Laboratory of Pattern Recognition, Institute of Automation |
Keywords: Visual-Inertial SLAM
Abstract: Most existing visual-inertial odometry (VIO) initialization methods rely on accurate pre-calibrated extrinsic parameters. However, during long-term use, irreversible structural deformation caused by temperature changes, mechanical squeezing, etc. will cause changes in extrinsic parameters, especially in the rotational part. Existing initialization methods that simultaneously estimate extrinsic parameters suffer from poor robustness, low precision, and long initialization latency due to the need for sufficient translational motion. To address these problems, we propose a novel VIO initialization method, which jointly considers extrinsic orientation and gyroscope bias within the normal epipolar constraints, achieving higher precision and better robustness without delayed rotational calibration. First, a rotation-only constraint is designed for extrinsic orientation and gyroscope bias estimation, which tightly couples gyroscope measurements and visual observations and can be solved in pure-rotation cases. Second, we propose a weighting strategy together with a failure detection strategy to enhance the precision and robustness of the estimator. Finally, we leverage Maximum A Posteriori to refine the results before enough translation parallax comes. Extensive experiments have demonstrated that our method outperforms the state-of-the-art methods in both accuracy and robustness while maintaining competitive efficiency.
|
|
16:55-17:00, Paper WeET4.5 | |
GaRLIO: Gravity Enhanced Radar-LiDAR-Inertial Odometry |
|
Noh, Chiyun | Seoul National University |
Yang, Wooseong | Seoul National University |
Jung, Minwoo | Seoul National University |
Jung, Sangwoo | Seoul National University |
Kim, Ayoung | Seoul National University |
Keywords: SLAM, Localization, Range Sensing
Abstract: Recently, gravity has been highlighted as a crucial constraint for state estimation to alleviate potential vertical drift. Existing online gravity estimation methods rely on pose estimation combined with IMU measurements, which is considered best practice when direct velocity measurements are unavailable. However, with radar sensors providing direct velocity data—a measurement not yet utilized for gravity estimation—we found a significant opportunity to improve gravity estimation accuracy substantially. GaRLIO, the proposed gravity-enhanced Radar-LiDAR-Inertial Odometry, can robustly predict gravity to reduce vertical drift while simultaneously enhancing state estimation performance using pointwise velocity measurements. Furthermore, GaRLIO ensures robustness in dynamic environments by utilizing radar to remove dynamic objects from LiDAR point clouds. Our method is validated through experiments in various environments prone to vertical drift, demonstrating superior performance compared to traditional LiDAR-Inertial Odometry methods. We make our source code publicly available to encourage further research and development. https://github.com/ChiyunNoh/GaRLIO
|
|
17:00-17:05, Paper WeET4.6 | |
AF-RLIO: Adaptive Fusion of Radar-LiDAR-Inertial Information for Robust Odometry in Challenging Environments |
|
Qian, Chenglong | Zhejiang University of Technology |
Xu, Yang | Zhejiang University |
Shi, Xiufang | Zhejiang University of Technology |
Chen, Jiming | Zhejiang University |
Li, Liang | Zhejiang Univerisity |
Keywords: SLAM, Sensor Fusion, Localization
Abstract: In robotic navigation, maintaining precise pose estimation and navigation in complex and dynamic environments is crucial. However, environmental challenges such as smoke, tunnels, and adverse weather can significantly degrade the performance of single-sensor systems like LiDAR or GPS, compromising the overall stability and safety of autonomous robots. To address these challenges, we propose AF-RLIO: an adaptive fusion approach that integrates 4D millimeter-wave radar, LiDAR, inertial measurement unit (IMU), and GPS to leverage the complementary strengths of these sensors for robust odometry estimation in complex environments. Our method consists of three key modules. Firstly, the pre-processing module utilizes radar data to assist LiDAR in removing dynamic points and determining when environmental conditions are degraded for LiDAR. Secondly, the dynamic-aware multimodal odometry selects appropriate point cloud data for scan-to-map matching and tightly couples it with the IMU using the Iterative Error State Kalman Filter. Lastly, the factor graph optimization module balances weights between odometry and GPS data, constructing a pose graph for optimization. The proposed approach has been evaluated on datasets and tested in real-world robotic environments, demonstrating its effectiveness and advantages over existing methods in challenging conditions such as smoke and tunnels. Furthermore, we open source our code at https://github.com/NeSC-IV/AF-RLIO.git to benefit the research community.
|
|
17:05-17:10, Paper WeET4.7 | |
Adaptive Measurement Model-Based Fusion of Capacitive Proximity Sensor and LiDAR for Improved Mobile Robot Perception |
|
Kang, Hyunchang | Sungkyunkwan University |
Yim, Hongsik | Sungkyunkwan University |
Sung, HyukJae | SUNGKYUNKWAN UNIVERSITY |
Choi, Hyouk Ryeol | Sungkyunkwan University |
Keywords: Sensor Fusion, Human-Robot Collaboration, Robot Safety
Abstract: This study introduces a novel algorithm that combines a custom-developed capacitive proximity sensor with LiDAR. This integration targets the limitations of using single-sensor systems for mobile robot perception. Our approach deals with the non-Gaussian distribution that arises during the nonlinear transformation of capacitive sensor data into distance measurements. The non-Gaussian distribution resulting from this nonlinear transformation is linearized using a first-order Taylor approximation, creating a measurement model unique to our sensor. This method helps establish a linear relationship between capacitance values and their corresponding distance measurements. Assuming that the capacitance’s standard deviation remains constant, it is modeled as a distance function. By linearizing the capacitance data and synthesizing it with LiDAR data using Gaussian methods, we fuse the sensor information to enhance integration. This results in more precise and robust distance measurements than those obtained through traditional Extended Kalman Filter (EKF) and Adaptive Extended Kalman Filter (AEKF) methods. The proposed algorithm is designed for real-time data processing, significantly improving the robot’s state estimation accuracy and stability in various environments. This study offers a reliable method for positional estimation of mobile robots, showcasing outstanding fusion performance in complex settings.
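As a simplified illustration of the first-order Taylor linearization and Gaussian fusion described above, the sketch below converts a capacitance reading to distance, propagates its noise through the linearized model, and fuses it with a LiDAR range by inverse-variance weighting. The inverse model d = k / c and every constant are assumptions, not the authors' calibrated measurement model.

import numpy as np

def fuse_capacitive_lidar(c_meas, sigma_c, d_lidar, sigma_lidar, k=1.0):
    d_cap = k / c_meas                  # assumed inverse capacitance-to-distance model
    J = -k / c_meas**2                  # first-order Taylor derivative d(d)/d(c)
    sigma_cap = abs(J) * sigma_c        # linearized standard deviation in distance
    w_cap, w_lid = 1.0 / sigma_cap**2, 1.0 / sigma_lidar**2
    d_fused = (w_cap * d_cap + w_lid * d_lidar) / (w_cap + w_lid)
    return d_fused, np.sqrt(1.0 / (w_cap + w_lid))

# Illustrative call with assumed readings.
d, sigma = fuse_capacitive_lidar(c_meas=4.0, sigma_c=0.05, d_lidar=0.27, sigma_lidar=0.02)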
|
|
WeET5 |
305 |
Aerial Robots 3 |
Regular Session |
Chair: Schoellig, Angela P. | TU Munich |
Co-Chair: Jagannatha Sanket, Nitin | Worcester Polytechnic Institute |
|
16:35-16:40, Paper WeET5.1 | |
Robust Attitude Control with Fixed Exponential Rate of Convergence and Consideration of Motor Dynamics for Tilt Quadrotor Using Quaternions (I) |
|
Seshasayanan, Sathyanarayanan | Indian Institute of Technology Kanpur |
De, Souradip | Assistant Professor, Mnnit Allahabad |
Sahoo, Soumya Ranjan | Indian Institute of Technology Kanpur |
Keywords: Aerial Systems: Mechanics and Control, Robust/Adaptive Control
Abstract: In the existing literature on the robust control design of UAV systems, the controllers are designed without considering motor dynamics. Hence, if these controller gains are not correctly tuned, the system undergoes oscillation and may even go unstable. We have demonstrated this through an experiment in this work. Here, we propose a novel control strategy that considers actuator parameter uncertainties, including motor dynamics for a tilt quadrotor. This strategy is based on the traditional two-loop control scheme where the inner loop controls the angular velocity, and the outer loop controls the vehicle’s attitude based on quaternions. In the quaternion-based controller, usually, the convergence rate increases when the quaternion starts closer to its equilibrium point, thus making it challenging to design a linear controller for the inner loop. To overcome this, we propose a nonlinear control with a varying gain for the outer loop that ensures the quaternion has a fixed convergence rate. We propose the control design of the inner loop, which consists of a disturbance observer (DOB) and a linear controller. The DOB is optimally designed to minimize external disturbances in the presence of model uncertainties. With the DOB, a linear controller is designed for the inner loop, guaranteeing robust stability and performance against the model and actuator parameter uncertainties. The results of experimental flights are reported in this paper.
|
|
16:40-16:45, Paper WeET5.2 | |
Flying through Moving Gates without Full State Estimation |
|
Römer, Ralf | Technical University of Munich |
Emmert, Tim | TU Munich |
Schoellig, Angela P. | TU Munich |
Keywords: Aerial Systems: Mechanics and Control, Vision-Based Navigation
Abstract: Autonomous drone racing requires powerful perception, planning, and control and has become a benchmark and test field for autonomous, agile flight. Existing work usually assumes static race tracks with known maps, which enables offline planning of time-optimal trajectories, performing localization to the gates to reduce the drift in visual-inertial odometry (VIO) for state estimation or training learning-based methods for the particular race track and operating environment. In contrast, many real-world tasks like disaster response or delivery need to be performed in unknown and dynamic environments. To make drone racing more robust against unseen environments and moving gates, we propose a control algorithm that operates without a race track map or VIO, relying solely on monocular measurements of the line of sight to the gates. For this purpose, we adopt the law of proportional navigation (PN) to accurately fly through the gates despite gate motions or wind. We formulate the PN-informed vision-based control problem for drone racing as a constrained optimization problem and derive a closed-form optimal solution. Through simulations and real-world experiments, we demonstrate that our algorithm can navigate through moving gates at high speeds while being robust to different gate movements, model errors, wind, and delays.
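For reference, the classical proportional navigation law adopted in the abstract can be sketched in a few lines: the commanded lateral acceleration is proportional to the line-of-sight rotation rate. The finite-difference rate estimate and all numbers below are illustrative assumptions, not the authors' closed-form optimal controller.

def los_rate(bearing_prev, bearing_now, dt):
    # Finite-difference estimate of the line-of-sight angular rate from two
    # successive monocular bearing measurements to the gate (rad/s).
    return (bearing_now - bearing_prev) / dt

def pn_lateral_accel(N, closing_speed, lam_dot):
    # Planar proportional navigation: a_cmd = N * V_c * lambda_dot.
    return N * closing_speed * lam_dot

lam_dot = los_rate(bearing_prev=0.10, bearing_now=0.12, dt=0.05)
a_cmd = pn_lateral_accel(N=3.0, closing_speed=6.0, lam_dot=lam_dot)  # m/s^2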
|
|
16:45-16:50, Paper WeET5.3 | |
Collapsible Airfoil Single Actuator ROtor-Craft (CASARO) - Construction and Analysis of a Soft Rotary Wing Robot |
|
Ang, Wei Jun | Singapore University of Technology & Design |
Tang, Emmanuel | Singapore University of Technology & Design |
Ng, Matthew | Singapore University of Technology and Design |
Foong, Shaohui | Singapore University of Technology and Design |
Keywords: Aerial Systems: Applications, Biologically-Inspired Robots, Soft Robot Materials and Design
Abstract: In this paper, a soft rotary wing robot capable of flight and control is presented. The Collapsible Airfoil Single Actuator ROtor-craft (CASARO) is a single actuator monocopter that derives its geometric properties from the Samara seed. CASARO achieves better flight efficiency, lift, and handling ergonomics by reducing its overall volume by 91.7% when collapsed and stowed. Unlike conventional rotorcraft, CASARO uses a non-rigid fabric wing to produce lift in flight. It utilizes the robot’s rotational velocity to maintain tension within its fabric and airframe, providing adequate lift during its hover state. The conception, design, construction, and control of the soft monowing are demonstrated, including its capability to reduce its footprint with its soft fabric construction. To analyze the flight dynamics of CASARO, the craft is flown indoors autonomously, tracking its wing surface, craft body attitude, and position with various step inputs to observe different wing dynamics. CASARO is also capable of being deployed outdoors for real-life human-operated flight.
|
|
16:50-16:55, Paper WeET5.4 | |
VizFlyt: Perception-Centric Pedagogical Framework for Autonomous Aerial Robots |
|
Srivastava, Kushagra | Worcester Polytechnic Institute |
Kulkarni, Rutwik Sudhakar | Worcester Polytechnic Institute |
Velmurugan, Manoj | Worcester Polytechnic Institute |
Jagannatha Sanket, Nitin | Worcester Polytechnic Institute |
Keywords: Aerial Systems: Perception and Autonomy, Education Robotics, Aerial Systems: Applications
Abstract: Autonomous aerial robots are becoming commonplace in our lives. Hands-on aerial robotics courses are pivotal in training the next-generation workforce to meet the growing market demands. Such an efficient and compelling course depends on a reliable testbed. In this paper, we present VizFlyt, an open-source perception-centric Hardware-In-The-Loop (HITL) photorealistic testing framework for aerial robotics courses. We utilize pose from an external localization system to hallucinate real-time and photorealistic visual sensors using 3D Gaussian Splatting. This enables stress-free testing of autonomy algorithms on aerial robots without the risk of crashing into obstacles. We achieve over 100Hz of system update rate. Lastly, we build upon our past experiences of offering hands-on aerial robotics courses and propose a new open-source and open-hardware curriculum based on VizFlyt for the future. We test our framework on various course projects in real-world HITL experiments and present the results showing the efficacy of such a system and its large potential use cases. Code, datasets, hardware guides and demo videos are available at https://pear.wpi.edu/research/vizflyt.html
|
|
16:55-17:00, Paper WeET5.5 | |
Distributed Loitering Synchronization with Fixed-Wing UAVs |
|
AlKatheeri, Ahmed | NA |
Barcis, Agata | Technology Innovation Institute |
Ferrante, Eliseo | Vrije Universiteit Amsterdam |
Keywords: Distributed Robot Systems, Swarm Robotics, Multi-Robot Systems
Abstract: Distributed loitering synchronization is the process whereby a group of fixed-wing Unmanned Aerial Vehicles (UAVs) align with each other while they follow a circular path in the air. This process is essential to establish proper initial conditions for missions in the real world. We evaluate the performance of three synchronization algorithms using a setup of continuously moving fixed-wing drones randomly placed around a loitering circle. We consider the algorithm based on distributed consensus as a baseline. We propose two methods: the Minimum Of Shortest Arc (MOSA) algorithm that outperforms the baseline in this setup and Firefly multi-Pulse Synchronization (FPS), which is inspired by firefly synchronization. The latter method requires 10 times less communication while maintaining a performance comparable to the baseline. These algorithms were first tested in a simple simulation, then a more realistic simulation environment using Gazebo in which fixed-wing dynamics are considered. The proposed algorithms are rigorously tested in simulation through multiple trials involving a group of 10 UAVs, confirming the effectiveness of our approaches. The results were then validated in real flights using 3 fixed-wing drones. Index Terms— Fixed-Wing UAVs, Distributed Synchronization, Multi-Robot Systems, Pulse-Coupled Oscillators
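As background on the firefly-inspired mechanism mentioned above, here is a toy pulse-coupled oscillator simulation in the classic Mirollo-Strogatz spirit: each agent's phase grows linearly, and whenever one fires the others jump their phases toward firing. This is only an illustration of the underlying principle, not the FPS or MOSA algorithm.

import numpy as np

def simulate_pulse_coupled(phases, coupling=0.05, steps=2000, dt=0.01):
    phases = np.array(phases, dtype=float)
    for _ in range(steps):
        phases += dt                          # free-running phase advance
        fired = phases >= 1.0
        if fired.any():
            phases[fired] = 0.0               # firing agents reset
            phases[~fired] = np.minimum(phases[~fired] + coupling, 1.0)  # others jump
    return phases

final_phases = simulate_pulse_coupled(np.random.default_rng(0).random(10))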
|
|
17:00-17:05, Paper WeET5.6 | |
A Map-Free Deep Learning-Based Framework for Gate-To-Gate Monocular Visual Navigation Aboard Miniaturized Aerial Vehicles |
|
Scarciglia, Lorenzo | SUPSI, IDSIA |
Paolillo, Antonio | IDSIA USI-SUPSI |
Palossi, Daniele | ETH Zurich |
Keywords: Aerial Systems: Applications, Micro/Nano Robots, Deep Learning for Visual Perception
Abstract: Palm-sized autonomous nano-drones, i.e., sub-50 g in weight, recently entered the drone racing scenario, where they are tasked to avoid obstacles and navigate as fast as possible through gates. However, in contrast with their bigger counterparts, i.e., kg-scale drones, nano-drones expose three orders of magnitude less onboard memory and compute power, demanding more efficient and lightweight vision-based pipelines to win the race. This work presents a map-free vision-based (using only a monocular camera) autonomous nano-drone that combines a real-time deep learning gate detection front-end with a classic yet elegant and effective visual servoing control back-end, relying only on onboard resources. Starting from two state-of-the-art tiny deep learning models, we adapt them for our specific task, and after a mixed simulator-real-world training, we integrate and deploy them aboard our nano-drone. Our best-performing pipeline costs only 24 M multiply-accumulate operations per frame, resulting in a closed-loop control performance of 30 Hz, while achieving a gate detection root mean square error of 1.4 pixels on our ∼20 k real-world image dataset. In-field experiments highlight the capability of our nano-drone to successfully navigate through 15 gates in 4 min, never crashing and covering a total travel distance of ∼100 m, with a peak flight speed of 1.9 m/s. Finally, to stress the generalization capability of our system, we also test it in a never-seen-before environment, where it navigates through gates for more than 4 min.
|
|
17:05-17:10, Paper WeET5.7 | |
Agile Fixed-Wing UAVs for Urban Swarm Operations (I) |
|
Basescu, Max | Johns Hopkins University Applied Physics Lab |
Polevoy, Adam | Johns Hopkins University Applied Physics Lab |
Yeh, Bryanna | The Johns Hopkins University Applied Physics Laboratory |
Scheuer, Luca | Johns Hopkins University Applied Physics Lab |
Sutton, Erin | Johns Hopkins University Applied Physics Laboratory |
Moore, Joseph | Johns Hopkins University |
Keywords: Aerial Systems: Perception and Autonomy, Aerial Systems: Mechanics and Control, Aerial Systems: Applications
Abstract: Fixed-wing uncrewed aerial vehicles (UAVs) offer significant performance advantages over rotary-wing UAVs in terms of speed, endurance, and efficiency. Such attributes make these vehicles ideally suited for long-range or high-speed reconnaissance operations and position them as valuable complementary members of a heterogeneous multi-robot team. However, these vehicles have traditionally been severely limited with regard to both vertical take-off and landing (VTOL) and maneuverability, which greatly restricts their utility in environments characterized by complex obstacle fields (e.g., forests or urban centers). This paper describes a set of algorithms and hardware advancements that enable agile fixed-wing UAVs to operate as members of a swarm in complex urban environments. At the core of our approach is a direct nonlinear model predictive control (NMPC) algorithm that is capable of controlling fixed-wing UAVs through aggressive post-stall maneuvers. We demonstrate in hardware how our online planning and control technique can enable navigation through tight corridors and in close proximity to obstacles. We also demonstrate how our approach can be combined with onboard stereo vision to enable high-speed flight in unknown environments. Finally, we describe our method for achieving swarm system integration; this includes a gimballed propeller design to facilitate automatic take-off, a precision deep-stall landing capability, and multi-vehicle collision avoidance.
|
|
WeET6 |
307 |
Learning for Legged Locomotion 1 |
Regular Session |
Chair: Atanasov, Nikolay | University of California, San Diego |
Co-Chair: Wang, Xiaolong | UC San Diego |
|
16:35-16:40, Paper WeET6.1 | |
Offline Adaptation of Quadrupeds Using Diffusion Models |
|
O'Mahoney, Reece | University of Oxford |
Mitchell, Alexander Luis | University of Oxford |
Yu, Wanming | University of Oxford |
Posner, Ingmar | Oxford University |
Havoutis, Ioannis | University of Oxford |
Keywords: Legged Robots, Imitation Learning, Machine Learning for Robot Control
Abstract: We present a diffusion-based approach to quadrupedal locomotion that simultaneously addresses the limitations of learning and interpolating between multiple skills (modes) and of offline adapting to new locomotion behaviours after training. This is the first framework to apply classifier-guided diffusion to quadruped locomotion and demonstrate its efficacy by extracting goal-conditioned behaviour from an originally unlabelled dataset. We show that these capabilities are compatible with a multi-skill policy and can be applied with little modification. We verify the validity of our approach with hardware experiments on the ANYmal quadruped platform.
|
|
16:40-16:45, Paper WeET6.2 | |
High-Performance Reinforcement Learning on Spot: Optimizing Simulation Parameters with Distributional Measures |
|
Miller, A.J. | Massachusetts Institute of Technology |
Yu, Fangzhou | Robotics and AI Institute |
Brauckmann, Michael | AI Institute |
Farshidian, Farbod | Robotics and AI Institute |
Keywords: Reinforcement Learning, Legged Robots, Deep Learning Methods
Abstract: This work presents an overview of the technical details behind a high-performance reinforcement learning policy deployment with the Spot RL Researcher Development Kit for low-level motor access on Boston Dynamics' Spot. This represents the first public demonstration of an end-to-end reinforcement learning policy deployed on Spot hardware, with training code publicly available through Nvidia IsaacLab and deployment code available through Boston Dynamics. We utilize the Wasserstein Distance and Maximum Mean Discrepancy to quantify the distributional dissimilarity of data collected on hardware and in simulation to measure our sim-to-real gap. We use these measures as a scoring function for the Covariance Matrix Adaptation Evolution Strategy to optimize simulated parameters that are unknown or difficult to measure from Spot. Our procedure for modeling and training produces high-quality reinforcement learning policies capable of multiple gaits, including a flight phase. We deploy policies capable of over 5.2 m/s locomotion, more than triple Spot's default controller maximum speed, robustness to slippery surfaces, disturbance rejection, and overall agility previously unseen on Spot. We detail our method and release our code to support future work on Spot with the low-level API.
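The sim-to-real score mentioned in the abstract can be illustrated with a small, self-contained example. The snippet below is only a sketch of the general idea, an RBF-kernel Maximum Mean Discrepancy between hardware and simulation state samples with toy data and an arbitrary bandwidth, and is not the released training or deployment code; the Wasserstein term and the CMA-ES loop that tunes simulation parameters against this score are omitted:

```python
import numpy as np

def rbf_mmd2(x, y, sigma=1.0):
    """Squared Maximum Mean Discrepancy with an RBF kernel.

    x: (n, d) array of states logged on hardware
    y: (m, d) array of states logged in simulation
    """
    def k(a, b):
        d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
        return np.exp(-d2 / (2 * sigma**2))
    n, m = len(x), len(y)
    return k(x, x).sum() / n**2 + k(y, y).sum() / m**2 - 2 * k(x, y).sum() / (n * m)

# Toy usage: joint-velocity snapshots from hardware vs. two candidate simulators.
rng = np.random.default_rng(1)
hardware = rng.normal(0.0, 1.0, size=(500, 12))
sim_good = rng.normal(0.05, 1.0, size=(500, 12))
sim_bad = rng.normal(0.8, 1.5, size=(500, 12))
print(rbf_mmd2(hardware, sim_good))   # small gap
print(rbf_mmd2(hardware, sim_bad))    # larger gap -> worse simulation parameters
```

A black-box optimizer such as CMA-ES would then search the simulator's unknown parameters to minimize this kind of score.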
|
|
16:45-16:50, Paper WeET6.3 | |
HOVER: Versatile Neural Whole-Body Controller for Humanoid Robots |
|
He, Tairan | Carnegie Mellon University |
Xiao, Wenli | Carnegie Mellon University |
Lin, Toru | University of California, Berkeley |
Luo, Zhengyi | Carnegie Mellon University |
Xu, Zhenjia | Columbia University |
Jiang, Zhenyu | The Unversity of Texas at Austin |
Kautz, Jan | NVIDIA |
Liu, Changliu | Carnegie Mellon University |
Shi, Guanya | Carnegie Mellon University |
Wang, Xiaolong | UC San Diego |
Fan, Linxi | Stanford University |
Zhu, Yuke | The University of Texas at Austin |
Keywords: Reinforcement Learning, Legged Robots, Whole-Body Motion Planning and Control
Abstract: Humanoid whole-body control requires adapting to diverse tasks such as navigation, loco-manipulation, and tabletop manipulation, each demanding a different mode of control. For example, navigation relies on root velocity or position tracking, while tabletop manipulation prioritizes upper-body joint angle tracking. Existing approaches typically train individual policies tailored to a specific command space, limiting their transferability across modes. We present the key insight that full-body kinematic motion imitation can serve as a common abstraction for all these tasks and provide general-purpose motor skills for learning multiple modes of whole-body control. Building on this, we propose HOVER (Humanoid Versatile Controller), a multi-mode policy distillation framework that consolidates diverse control modes into a unified policy. HOVER enables seamless transitions between control modes while preserving the distinct advantages of each, offering a robust and scalable solution for humanoid control across a wide range of modes. By eliminating the need for policy retraining for each control mode, our approach improves efficiency and flexibility for future humanoid applications.
|
|
16:50-16:55, Paper WeET6.4 | |
Learning Humanoid Locomotion with Perceptive Internal Model |
|
Long, Junfeng | Shanghai AI Laboratory |
Ren, Junli | Hong Kong University |
Shi, Moji | Delft University of Technology |
Wong, Ziseoi | Zhejiang University |
Huang, Tao | The Chinese University of Hong Kong |
Luo, Ping | The University of Hong Kong |
Pang, Jiangmiao | Shanghai AI Laboratory |
Keywords: Humanoid and Bipedal Locomotion, Humanoid Robot Systems, Reinforcement Learning
Abstract: In contrast to quadruped robots that can navigate diverse terrains using a "blind" policy, humanoid robots require accurate perception for stable locomotion due to their high degrees of freedom and inherently unstable morphology. However, incorporating perceptual signals often introduces additional disturbances to the system, potentially reducing its robustness, generalizability, and efficiency. This paper presents the Perceptive Internal Model (PIM), which relies on onboard, continuously updated elevation maps centered around the robot to perceive its surroundings. We train the policy using ground-truth obstacle heights surrounding the robot in simulation, optimizing it based on the Hybrid Internal Model (HIM), and perform inference with heights sampled from the constructed elevation map. Unlike previous methods that directly encode depth maps or raw point clouds, our approach allows the robot to perceive the terrain beneath its feet clearly and is less affected by camera movement or noise. Furthermore, since depth map rendering is not required in simulation, our method introduces minimal additional computational costs and can train the policy in 3 hours on an RTX 4090 GPU. We verify the effectiveness of our method across various humanoid robots, various indoor and outdoor terrains, stairs, and various sensor configurations. Our method can enable a humanoid robot to continuously climb stairs and has the potential to serve as a foundational algorithm for the development of future humanoid control methods.
|
|
16:55-17:00, Paper WeET6.5 | |
A Learning Framework for Diverse Legged Robot Locomotion Using Barrier-Based Style Rewards |
|
Kim, Gijeong | Korea Advanced Institute of Science and Technology, KAIST |
Lee, Yonghoon | Korea Advanced Institute of Science and Technology, KAIST |
Park, Hae-Won | Korea Advanced Institute of Science and Technology |
Keywords: Legged Robots, Humanoid and Bipedal Locomotion, Reinforcement Learning
Abstract: This work introduces a model-free reinforcement learning framework that enables various modes of motion (quadruped, tripod, or biped) and diverse tasks for legged robot locomotion. We employ a motion-style reward based on a relaxed logarithmic barrier function as a soft constraint, to bias the learning process toward the desired motion style, such as gait, foot clearance, joint position, or body height. The predefined gait cycle is encoded in a flexible manner, facilitating gait adjustments throughout the learning process. Extensive experiments demonstrate that KAIST HOUND, a 45 kg robotic system, can achieve biped, tripod, and quadruped locomotion using the proposed framework; quadrupedal capabilities include traversing uneven terrain, galloping at 4.67 m/s, and overcoming obstacles up to 58 cm (67 cm for HOUND2); bipedal capabilities include running at 3.6 m/s, carrying a 7.5 kg object, and ascending stairs, all performed without exteroceptive input.
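The style reward is built from a relaxed logarithmic barrier. As a minimal sketch of that construction only, where the band limits, weight, and relaxation parameter are made-up values rather than the paper's settings, one soft-constraint reward term could look like:

```python
import numpy as np

# Sketch of a relaxed logarithmic barrier used as a soft-constraint "style"
# reward (illustrative only; numbers are placeholders).
def relaxed_log_barrier(z, delta=0.02):
    """-log(z) for z >= delta, quadratically extended below delta so the
    penalty stays finite and differentiable even when the constraint z > 0
    is violated."""
    z = np.asarray(z, dtype=float)
    quad = 0.5 * (((z - 2.0 * delta) / delta) ** 2 - 1.0) - np.log(delta)
    return np.where(z >= delta, -np.log(np.maximum(z, 1e-12)), quad)

def style_reward(value, lower, upper, weight=0.1, delta=0.02):
    """Reward that gently biases `value` to stay inside [lower, upper]."""
    return -weight * (relaxed_log_barrier(value - lower, delta)
                      + relaxed_log_barrier(upper - value, delta))

# Example: keep the body height near a desired band of 0.45 m to 0.55 m.
for h in (0.50, 0.46, 0.42):
    print(f"height {h:.2f} m -> style reward {float(style_reward(h, 0.45, 0.55)):+.3f}")
```

The quadratic extension is what makes this a soft constraint: violations are penalized increasingly strongly but never produce an infinite or undefined reward, which keeps the RL objective well behaved.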
|
|
17:00-17:05, Paper WeET6.6 | |
Full-Order Sampling-Based MPC for Torque-Level Locomotion Control Via Diffusion-Style Annealing |
|
Xue, Haoru | University of California Berkeley |
Pan, Chaoyi | Carnegie Mellon University |
Yi, Zeji | Carnegie Mellon University |
Qu, Guannan | Carnegie Mellon University |
Shi, Guanya | Carnegie Mellon University |
Keywords: Legged Robots, Optimization and Optimal Control, Machine Learning for Robot Control
Abstract: Due to high dimensionality and non-convexity, real-time optimal control using full-order dynamics models for legged robots is challenging. Therefore, Nonlinear Model Predictive Control (NMPC) approaches are often limited to reduced-order models. Sampling-based MPC has shown potential in nonconvex and even discontinuous problems, but often yields suboptimal solutions with high variance, which limits its application to high-dimensional locomotion. This work introduces DIAL-MPC (Diffusion-Inspired Annealing for Legged MPC), a sampling-based MPC framework with a novel diffusion-style annealing process. Such an annealing process is supported by the theoretical landscape analysis of Model Predictive Path Integral Control (MPPI) and the connection between MPPI and single-step diffusion. Algorithmically, DIAL-MPC iteratively refines solutions online and achieves both global coverage and local convergence. In quadrupedal torque-level control tasks, DIAL-MPC reduces the tracking error of standard MPPI by 13.4 times and outperforms reinforcement learning (RL) policies by 50% in challenging climbing tasks without any training. In particular, DIAL-MPC enables precise real-world quadrupedal jumping with payload. To the best of our knowledge, DIAL-MPC is the first training-free method that optimizes over full-order quadruped dynamics in real-time.
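The core mechanism, an MPPI-style update whose sampling noise is annealed across iterations, can be sketched on a toy point-mass task. This is only an illustration of the annealing idea, not DIAL-MPC itself; the dynamics, cost, and schedule below are placeholders:

```python
import numpy as np

# Sampling-based MPC with a diffusion-style annealing schedule: the
# exploration noise shrinks over successive MPPI iterations, trading
# global coverage for local refinement (illustrative sketch only).
rng = np.random.default_rng(0)
horizon, n_samples, n_iters = 20, 256, 4
dt, temperature = 0.05, 1.0
target = np.array([1.0, 0.0])          # reach x = 1 m with zero velocity

def rollout_cost(u_seq):
    x = np.zeros(2)                     # [position, velocity]
    cost = 0.0
    for u in u_seq:
        x = x + dt * np.array([x[1], u])   # toy point-mass dynamics
        cost += 10.0 * (x[0] - target[0])**2 + 0.1 * x[1]**2 + 1e-3 * u**2
    return cost

u_mean = np.zeros(horizon)
for it in range(n_iters):
    sigma = 2.0 * (0.3 ** it)           # annealed noise scale (diffusion-style)
    noise = rng.normal(0.0, sigma, size=(n_samples, horizon))
    candidates = u_mean[None, :] + noise
    costs = np.array([rollout_cost(u) for u in candidates])
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    u_mean = weights @ candidates       # MPPI-style weighted update
    print(f"iter {it}: sigma={sigma:.2f}, best cost={costs.min():.3f}")
```

Early iterations with large sigma explore broadly (global coverage); later iterations with small sigma refine the current best control sequence (local convergence), mirroring the coarse-to-fine behaviour of diffusion sampling.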
|
|
17:05-17:10, Paper WeET6.7 | |
WildLMa: Long Horizon Loco-Manipulation in the Wild |
|
Qiu, Ri-Zhao | University of California, San Diego |
Song, Yuchen | UC San Diego |
Peng, Xuanbin | University of California, San Diego |
Suryadevara, Sai Aneesh | University of California San Diego |
Yang, Ge | Massachusetts Institute of Technology |
Liu, Minghuan | Shanghai Jiao Tong University |
Ji, Mazeyu | UCSD |
Jia, Chengzhe | University of California San Diego
Yang, Ruihan | UC San Diego |
Zou, Xueyan | University of California, San Diego
Wang, Xiaolong | UC San Diego |
Keywords: Imitation Learning, Mobile Manipulation, Legged Robots
Abstract: 'In-the-wild' mobile manipulation aims at deploying robots in diverse real-world environments, which requires the robot to (1) have skills that generalize across object configurations; (2) be capable of long-horizon task execution in diverse environments; and (3) perform complex manipulation beyond pick-and-place. Quadruped robots with manipulators hold promise for such capabilities thanks to their extended workspace and robust locomotion, but existing results do not investigate them. This paper proposes WildLMa, with three components to address these issues: (1) a learned low-level controller for VR-enabled whole-body tele-operation and traversability; (2) WildLMa-Skill, a library of generalizable visuomotor skills acquired via imitation learning or an analytical planner; and (3) WildLMa-Planner, an LLM planner that interfaces with and coordinates these skills. WildLMa exploits CLIP for language-conditioned imitation learning that empirically generalizes to objects unseen in training demonstrations. We then show that these skills can be effectively interfaced with an LLM planner for autonomous long-horizon execution. Besides extensive quantitative evaluation, we qualitatively demonstrate practical robot applications, such as cleaning up trash in university hallways or outdoor terrains, operating articulated objects, and rearranging items on a bookshelf.
|
|
17:10-17:15, Paper WeET6.8 | |
Variable-Frequency Model Learning and Predictive Control for Jumping Maneuvers on Legged Robots |
|
Nguyen, Chuong | University of Southern California |
Altawaitan, Abdullah | University of California San Diego |
Duong, Thai | Rice University |
Atanasov, Nikolay | University of California, San Diego |
Nguyen, Quan | University of Southern California |
Keywords: Legged Robots, Model Learning for Control
Abstract: Achieving both target accuracy and robustness in dynamic maneuvers with long flight phases, such as high or long jumps, has been a significant challenge for legged robots. To address this challenge, we propose a novel learning-based control approach consisting of model learning and model predictive control (MPC) utilizing a variable-frequency scheme. Compared to existing MPC techniques, we learn a model directly from experiments, accounting not only for leg dynamics but also for modeling errors and unknown dynamics mismatch in hardware and during contact. Additionally, learning the model with variable frequency allows us to cover the entire flight phase and final jumping target, enhancing the prediction accuracy of the jumping trajectory. Using the learned model, we also design a variable-frequency MPC to effectively leverage different jumping phases and track the target accurately. In a total of 92 jumps on Unitree A1 robot hardware, we verify that our approach outperforms other MPCs using a fixed-frequency scheme or a nominal model, reducing the jumping distance error by 2 to 8 times. We also achieve jumping distance errors of less than 3 percent during continuous jumping on uneven terrain with randomly placed perturbations of random heights (up to 4 cm, or 27 percent of the robot's standing height). Our approach obtains distance errors of 1 cm to 2 cm on 34 single and continuous jumps with different jumping targets and model uncertainties. Code is available at https://github.com/DRCL-USC/Learning_MPC_Jumping.
|
|
WeET7 |
309 |
Perception 3 |
Regular Session |
Co-Chair: Araujo, Helder | University of Coimbra |
|
16:35-16:40, Paper WeET7.1 | |
Drive with the Flow |
|
Mannocci, Enrico | University of Bologna |
Poggi, Matteo | University of Bologna |
Mattoccia, Stefano | University of Bologna |
Keywords: Computer Vision for Transportation, RGB-D Perception, Imitation Learning
Abstract: End-to-end autonomous driving systems have recently made rapid progress, thanks to simulators such as CARLA. They can drive without infractions of common driving rules on uncongested roads but still struggle with dense traffic scenarios. We conjecture that this occurs because such systems lack an understanding of the dynamics of the surrounding vehicles, owing to the absence of explicit short-term memory within the perception path of end-to-end models. To address this challenge, we revise the perception module to explicitly model temporal information by extending it with an auxiliary task that is well known in computer vision research: optical flow. We generate a novel benchmark using the CARLA simulator to train our model, FlowFuser, and prove its superior ability to avoid collisions with other agents on the road.
|
|
16:40-16:45, Paper WeET7.2 | |
Potential Fields As Scene Affordance for Behavior Change-Based Visual Risk Object Identification |
|
Pao, Pang-Yuan | National Yang Ming Chiao Tung University |
Lu, Shu-Wei | National Yang Ming Chiao Tung University |
Lu, Zeyan | National Yang Ming Chiao Tung University |
Chen, Yi-Ting | National Yang Ming Chiao Tung University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Visual Learning
Abstract: We study behavior change-based visual risk object identification (Visual-ROI), a crucial formulation that aims to detect potential hazards for intelligent driving systems. Existing methods often show significant limitations in spatial accuracy and temporal consistency, stemming from an incomplete understanding of scene affordance. For example, these methods frequently misidentify vehicles that do not impact the ego vehicle as risk objects. Furthermore, existing behavior change-based methods are inefficient because they implement causal inference in the perspective image space. We propose a new framework with a Bird's Eye View (BEV) representation to overcome the above challenges. Specifically, we utilize potential fields as scene affordance, involving repulsive forces derived from road infrastructure and traffic participants, along with attractive forces sourced from target destinations. In this work, we compute potential fields from perspective images by assigning different energy levels based on the semantic labels acquired through BEV semantic segmentation. We conduct comprehensive experiments and ablation studies, comparing the proposed method with various state-of-the-art algorithms on both synthetic and real-world datasets. Our results show a notable increase in spatial accuracy and temporal consistency, with enhancements of 20.3% and 11.6% on the RiskBench dataset, respectively. Additionally, we improve computational efficiency by 88%. Similarly, on the nuScenes dataset, we achieve improvements of 5.4% and 7.2% in spatial and temporal consistency.
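The scene-affordance potential field can be illustrated with a small BEV example. This is an assumption-laden sketch of the general idea only, using a uniform grid, Gaussian repulsive bumps, and a quadratic attractive term with invented energy levels, and does not reproduce the paper's semantic-segmentation-driven construction:

```python
import numpy as np

# Toy BEV potential field: repulsive terms from occupied cells plus an
# attractive term toward a goal cell. In the paper, different semantic
# classes would assign different energy levels; here two classes are made up.
H, W = 100, 100                                # 100 x 100 BEV grid
ys, xs = np.mgrid[0:H, 0:W]
goal = np.array([90, 50])                      # goal cell (row, col)

def repulsive(center, energy, sigma):
    d2 = (ys - center[0])**2 + (xs - center[1])**2
    return energy * np.exp(-d2 / (2 * sigma**2))

potential = 0.01 * ((ys - goal[0])**2 + (xs - goal[1])**2)   # attractive
potential += repulsive((55, 48), energy=50.0, sigma=4.0)     # e.g. a vehicle
potential += repulsive((70, 60), energy=30.0, sigma=6.0)     # e.g. road edge

# The ego vehicle would descend the field: pick the lowest-potential neighbour.
ego = np.array([10, 50])
neighbours = [ego + d for d in ([1, 0], [0, 1], [0, -1], [1, 1], [1, -1])]
best = min(neighbours, key=lambda c: potential[c[0], c[1]])
print("next BEV cell:", best)
```

Working in the BEV grid rather than the perspective image is also what allows the risk reasoning to be carried out with simple, cheap array operations.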
|
|
16:45-16:50, Paper WeET7.3 | |
SCAM-P: Spatial Channel Attention Module for Panoptic Driving Perception |
|
Erabati, Gopi Krishna | University of Coimbra |
Araujo, Helder | University of Coimbra |
Keywords: Intelligent Transportation Systems, Deep Learning for Visual Perception, Visual Learning
Abstract: A high-precision, high-efficiency, and lightweight panoptic driving perception system is an essential part of autonomous driving for optimal maneuver planning of the autonomous vehicle. We propose a simple, lightweight, and efficient SCAM-P multi-task learning network that accomplishes three crucial tasks simultaneously for panoptic driving: vehicle detection, drivable area segmentation, and lane segmentation. To increase the representation power of the shared backbone of our multi-task network, we designed a novel SCAM module with spatially localized channel attention and channel localized spatial attention blocks. SCAM is a lightweight module that can be plugged into any CNN architecture to enhance the semantic features with negligible computational overhead. We integrate our SCAM module and design the SCAM-P network, which has a shared backbone for feature extraction and three independent heads to handle the three tasks at the same time. We also designed a nano variant of our SCAM-P network to make it deployment-friendly on edge devices. Our SCAM-P network obtains competitive results on the BDD100K dataset with 81.1% mAP50 for object detection, 91.6% mIoU for drivable area segmentation, and 28.8% IoU for lane segmentation. Our model is robust in various adverse conditions, such as rain, snow, and nighttime. Our SCAM-P network not only achieves improved performance but also runs efficiently in real time at 230.5 FPS on the RTX 4090 GPU and 112.1 FPS on the Jetson Orin edge device.
|
|
16:50-16:55, Paper WeET7.4 | |
IROAM: Improving Roadside Monocular 3D Object Detection Learning from Autonomous Vehicle Data Domain |
|
Wang, Zhe | Institute for AI Industry Research, Tsinghua University |
Huo, Xiaoliang | Beihang University |
Fan, Siqi | Tsinghua University |
Wang, Yan | Tsinghua University |
Liu, Jingjing | Institute for AI Industry Research (AIR), Tsinghua University |
Zhang, Ya-Qin | Institute for AI Industry Research(AIR), Tsinghua University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Visual Learning
Abstract: In autonomous driving, the perception capabilities of the ego-vehicle can be improved with roadside sensors, which provide a holistic view of the environment. However, existing monocular detection methods designed for vehicle cameras are not suitable for roadside cameras due to viewpoint domain gaps. To bridge this gap and Improve ROAdside Monocular 3D object detection, we propose IROAM, a semantic-geometry decoupled contrastive learning framework, which takes vehicle-side and roadside data as input simultaneously. IROAM has two significant modules. The In-Domain Query Interaction module utilizes a transformer to learn content and depth information for each domain and outputs object queries. To learn better feature representations from the two domains, the Cross-Domain Query Enhancement module decouples queries into semantic and geometry parts, and only the former is used for contrastive learning. Experiments demonstrate the effectiveness of IROAM in improving the roadside detector's performance. The results validate that IROAM has the capability to learn cross-domain information.
|
|
16:55-17:00, Paper WeET7.5 | |
Fast LiDAR Data Generation with Rectified Flows |
|
Nakashima, Kazuto | Kyushu University |
Liu, Xiaowen | Kyushu University |
Miyawaki, Tomoya | Kyushu University |
Iwashita, Yumi | NASA / Caltech Jet Propulsion Laboratory |
Kurazume, Ryo | Kyushu University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Representation Learning
Abstract: Building LiDAR generative models holds promise as powerful data priors for restoration, scene manipulation, and scalable simulation in autonomous mobile robots. In recent years, approaches using diffusion models have emerged, significantly improving training stability and generation quality. Despite their success, diffusion models require numerous iterations of running neural networks to generate high-quality samples, making the increasing computational cost a potential barrier for robotics applications. To address this challenge, this paper presents R2Flow, a fast and high-fidelity generative model for LiDAR data. Our method is based on rectified flows that learn straight trajectories, simulating data generation with significantly fewer sampling steps compared to diffusion models. We also propose an efficient Transformer-based model architecture for processing the image representation of LiDAR range and reflectance measurements. Our experiments on unconditional LiDAR data generation using the KITTI-360 dataset demonstrate the effectiveness of our approach in terms of both efficiency and quality.
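The speed-up rectified flows offer comes from learning a velocity field with nearly straight sample trajectories, which can then be integrated with very few Euler steps. The sketch below illustrates only that sampling mechanics on a 1-D toy target, with a closed-form field standing in for a trained network; it is not R2Flow's Transformer or its LiDAR range/reflectance representation:

```python
import numpy as np

# Rectified-flow sampling sketch (illustrative only). The "learned" field is
# the exact straight-line field for a toy 1-D target N(mu, sigma^2).
mu, sigma = 2.0, 0.5

def velocity(x, t):
    # Straight trajectories x_t = (1-t)*x0 + t*(mu + sigma*x0) imply this field.
    return mu + (sigma - 1.0) * (x - t * mu) / (1.0 + t * (sigma - 1.0))

def sample(n, n_steps, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)            # start from Gaussian noise at t = 0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt)  # Euler step along the flow
    return x

for steps in (1, 2, 8):
    s = sample(100_000, steps)
    print(f"{steps} step(s): mean={s.mean():.3f}, std={s.std():.3f}")
# Because the trajectories are straight, even a single step lands on the target.
```

In contrast, a standard diffusion model would need many more network evaluations to traverse its curved probability-flow trajectories, which is the computational cost the paper aims to remove for onboard robotics use.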
|
|
17:00-17:05, Paper WeET7.6 | |
AmodalSynthDrive: A Synthetic Amodal Perception Dataset for Autonomous Driving |
|
Sekkat, Ahmed Rida | IAV GmbH |
Mohan, Rohit | University of Freiburg |
Sawade, Oliver | IAV GmbH |
Matthes, Elmar | IAV GmbH |
Valada, Abhinav | University of Freiburg |
Keywords: Computer Vision for Transportation, Data Sets for Robotic Vision, Deep Learning for Visual Perception
Abstract: Unlike humans, who can effortlessly estimate the entirety of objects even when partially occluded, modern computer vision algorithms still find this aspect extremely challenging. Leveraging this amodal perception for autonomous driving remains largely untapped due to the lack of suitable datasets. The curation of these datasets is primarily hindered by significant annotation costs and mitigating annotator subjectivity in accurately labeling occluded regions. To address these limitations, we introduce AmodalSynthDrive, a synthetic multi-task multi-modal amodal perception dataset. The dataset provides multi-view camera images, 3D bounding boxes, LiDAR data, and odometry for 150 driving sequences with over 1M object annotations in diverse traffic, weather, and lighting conditions. AmodalSynthDrive supports multiple amodal scene understanding tasks including the introduced amodal depth estimation for enhanced spatial understanding. We evaluate several baselines for each of these tasks to illustrate the challenges and set up public benchmarking servers. The dataset is available at http://amodalsynthdrive.cs.uni-freiburg.de.
|
|
WeET8 |
311 |
Representation Learning 4 |
Regular Session |
Chair: Ben Amor, Heni | Arizona State University |
Co-Chair: Gan, Lu | Georgia Institute of Technology |
|
16:35-16:40, Paper WeET8.1 | |
FedEFM: Federated Endovascular Foundation Model with Unseen Data |
|
Do, Tuong | AIOZ |
Vu Huu, Nghia | AIOZ |
Jianu, Tudor | University of Liverpool |
Huang, Baoru | Imperial College London |
Vu, Minh Nhat | TU Wien, Austria |
Su, Jionglong | Xi'an Jiaotong-Liverpool University |
Tjiputra, Erman | AIOZ |
Tran, Quang | AIOZ |
Chiu, Te-Chuan | National Tsing Hua University |
Nguyen, Anh | University of Liverpool |
Keywords: Computer Vision for Medical Robotics, Deep Learning Methods
Abstract: In endovascular surgery, the precise identification of catheters and guidewires in X-ray images is essential for reducing intervention risks. However, accurately segmenting catheter and guidewire structures is challenging due to the limited availability of labeled data. Foundation models offer a promising solution by enabling the collection of similar-domain data to train models whose weights can be fine-tuned for downstream tasks. Nonetheless, large-scale data collection for training is constrained by the necessity of maintaining patient privacy. This paper proposes a new method to train a foundation model in a decentralized federated learning setting for endovascular intervention. To ensure the feasibility of the training, we tackle the unseen data issue using differentiable Earth Mover's Distance within a knowledge distillation framework. Once trained, our foundation model's weights provide valuable initialization for downstream tasks, thereby enhancing task-specific performance. Intensive experiments show that our approach achieves new state-of-the-art results, contributing to advancements in endovascular intervention and robotic-assisted endovascular surgery, while addressing the critical issue of data sharing in the medical domain.
|
|
16:40-16:45, Paper WeET8.2 | |
LamPro: Multi-Prototype Representation Learning for Enhanced Visual Pattern Recognition |
|
Qi, Ji | China Mobile (Suzhou) Software Technology Co., Ltd, China |
Sun, Wei | University of Science and Technology of China |
Huang, Qihe | University of Science and Technology of China |
Zhou, Zhengyang | University of Science and Technology of China |
Wang, Yang | University of Science and Technology of China |
Keywords: Recognition, Computer Vision for Automation, Visual Learning
Abstract: Visual pattern recognition plays an important role in robotics and automation, where recognition performance relies on representation learning. Existing representation learning often neglects two important issues: the diversity of intra-class representations, and under-exploited label utilization, especially negative feedback during the training process. Fortunately, prototype learning potentially raises label utilization and encourages intra-class diversity. In this paper, we investigate intra-class diversity and effective updates in prototype learning for enhanced visual pattern recognition. Specifically, we propose Label-aware multi-Prototype learning (LamPro), which incorporates label awareness into both prototype formation and prototype updates to improve representation quality. Firstly, we design a supervised contrastive learning scheme to achieve class-discriminative representations. Secondly, we randomly initialize multiple prototypes and update the nearest prototype upon the arrival of an instance, to preserve intra-class diversity. Thirdly, we propose a novel Label-guided Adaptive Updating. We separate the prototype updates from the representation optimization and exploit the label indexes to directly implement the prediction feedback. To correct the model optimization directions, we identify negative feedback and correct the prototype updates via label queries. Finally, we design a memory-based counter to alternately update these deviated prototypes. Experiments verify the effectiveness of our label-aware and joint multi-prototype updating strategies.
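The nearest-prototype update with label-guided correction can be sketched compactly. This is only an illustrative toy, not the LamPro code: the random embeddings, the EMA update rule, and the doubled corrective step on a wrong prediction are assumptions, and the paper's memory-based counter is omitted:

```python
import numpy as np

# Multi-prototype learning sketch: each class keeps several prototypes; an
# incoming embedding updates only the nearest prototype of its own class, and
# a wrong prediction triggers an extra label-guided corrective update.
rng = np.random.default_rng(0)
n_classes, n_protos, dim = 3, 4, 16
protos = rng.normal(size=(n_classes, n_protos, dim))
momentum = 0.9

def predict(z):
    d = np.linalg.norm(protos - z, axis=-1)        # (classes, protos)
    c, p = np.unravel_index(np.argmin(d), d.shape)
    return c, p

def update(z, label):
    pred_class, _ = predict(z)
    # Nearest prototype within the true class preserves intra-class diversity.
    j = int(np.argmin(np.linalg.norm(protos[label] - z, axis=-1)))
    protos[label, j] = momentum * protos[label, j] + (1 - momentum) * z
    if pred_class != label:
        # Negative feedback: pull the true-class prototype further toward z.
        protos[label, j] = momentum * protos[label, j] + (1 - momentum) * z
    return pred_class

# Toy stream: embeddings drawn around class-specific centres.
centres = rng.normal(scale=4.0, size=(n_classes, dim))
correct = 0
for _ in range(2000):
    y = int(rng.integers(n_classes))
    z = centres[y] + rng.normal(scale=0.5, size=dim)
    correct += int(update(z, y) == y)
print(f"streaming accuracy: {correct / 2000:.2f}")
```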
|
|
16:45-16:50, Paper WeET8.3 | |
SAS-Prompt: Large Language Models As Numerical Optimizers for Robot Self-Improvement |
|
Ben Amor, Heni | Arizona State University |
Graesser, Laura | Google |
Iscen, Atil | Google |
D'Ambrosio, David | Google |
Abeyruwan, Saminda Wishwajith | Google Inc |
Bewley, Alex | Google |
Zhou, Yifan | Arizona State University |
Kalirathinam, Kamalesh | Arizona State University |
Mishra, Swaroop | Google DeepMind |
Sanketi, Pannag | Google |
Keywords: Learning from Experience, Incremental Learning
Abstract: We demonstrate the ability of large language models (LLMs) to perform iterative self-improvement of robot policies. An important insight of this paper is that LLMs have a built-in ability to perform (stochastic) numerical optimization and that this property can be leveraged for explainable robot policy search. Based on this insight, we introduce the SAS Prompt (Summarize, Analyze, Synthesize) – a single prompt that enables iterative learning and adaptation of robot behavior by combining the LLM’s ability to retrieve, reason and optimize over previous robot traces in order to synthesize new, unseen behavior. Our approach can be regarded as an early example of a new family of explainable policy search methods that are entirely implemented within an LLM. We evaluate our approach both in simulation and on a real-robot table tennis task. Project website: sites.google.com/asu.edu/sas-llm/
|
|
16:50-16:55, Paper WeET8.4 | |
Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation Learning of Vision-Based Autonomous Driving |
|
Xie, Yichen | University of California, Berkeley |
Chen, Hongge | Waymo |
Meyer, Gregory P. | Motional |
Lee, Yong Jae | UW-Madison |
Wolff, Eric | Cruise |
Tomizuka, Masayoshi | University of California |
Zhan, Wei | Univeristy of California, Berkeley |
Chai, Yuning | Waymo |
Huang, Xin | MIT |
Keywords: Computer Vision for Automation, Motion and Path Planning, Representation Learning
Abstract: Multi-frame temporal inputs are important for vision-based autonomous driving. Observations from different angles enable the recovery of 3D object states from 2D images as long as we can identify the same instance from different input frames. However, the dynamic nature of driving scenes leads to significant variance in the instance appearance and shape captured by the cameras at different time steps. To this end, we propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations robust to the changes of distance and perspective in a long-term temporal sequence without any human annotations. In the pretraining stage, raw point clouds from LiDAR sensors are utilized to construct the instance-wise long-term temporal correspondence, which serves as guidance for the extraction of instance-level representation from the vision-based bird's-eye-view (BEV) feature map. Cohere3D encourages consistent representation for the same instance at different frames but distinguishes between different instances. We validate the effectiveness and generalizability of our algorithm by finetuning the pretrained model across key downstream autonomous driving tasks: perception, mapping, prediction, and planning. Results show a notable improvement in both data efficiency and final performance in all these tasks.
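The instance-level objective resembles a contrastive loss whose positives are the same instance observed at different times. The snippet below is a generic numpy sketch of such a temporal InfoNCE term with made-up features; the paper's LiDAR-derived correspondences and BEV feature extraction are not reproduced:

```python
import numpy as np

def temporal_infonce(z_t, z_tp1, temperature=0.07):
    """Contrastive loss over instance features from two frames (sketch only).

    z_t, z_tp1: (n, d) L2-normalised features of the same n instances,
    matched across time (row i in both arrays is the same object). Row i of
    the other frame is the positive; all other rows are negatives.
    """
    logits = (z_t @ z_tp1.T) / temperature        # (n, n) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on the diagonal

# Toy usage with made-up features.
rng = np.random.default_rng(0)
base = rng.normal(size=(32, 64))
z_t = base / np.linalg.norm(base, axis=1, keepdims=True)
z_tp1 = base + 0.1 * rng.normal(size=base.shape)  # same instances, next frame
z_tp1 /= np.linalg.norm(z_tp1, axis=1, keepdims=True)
print(temporal_infonce(z_t, z_tp1))
```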
|
|
16:55-17:00, Paper WeET8.5 | |
Towards Open-Ended Robotic Exploration Using Vision-Inspired Similarity and Foundation Models |
|
Filntisis, Panagiotis Paraskevas | National Technical University of Athens |
Tsaprazlis, Efthymios | Athena Research and Innovation Center |
Oikonomou, Paris | National Technical University of Athens (NTUA) |
Mattioli, Francesco | AI2Life |
Santucci, Vieri Giuliano | Consiglio Nazionale Delle Ricerche |
Retsinas, George | National Technical University of Athens |
Maragos, Petros | National Technical University of Athens |
Keywords: Deep Learning for Visual Perception, Continual Learning, Incremental Learning
Abstract: In the domain of robotics, achieving Lifelong Open-ended Learning Autonomy (LOLA) represents a significant milestone, especially in contexts where autonomous agents must adapt to unforeseen environmental variations and evolving objectives. This paper introduces VISOR (Vision-Inspired Similarity for Open-ended Robotic exploration), a vision-based framework designed to assist robotic agents in autonomously exploring and learning from new environments and objects, whether through guided or random exploration, without reliance on predefined design considerations. In that direction, VISOR acts as a perception mediator, classifying everything a robot encounters in a scene as either known or unknown. It further identifies potential distractors (e.g., background elements), known categories, or objects specified through text seeds. By leveraging recent advancements in vision foundation models, VISOR operates in a training-free manner. It begins by segmenting a scene into its constituent entities, regardless of familiarity, and then extracts robust visual representations for each one. These representations are compared against an adaptive memory system that evolves over time; unknown objects are assigned unique IDs and added to this memory as new classes, enriching the robot's understanding of its environment. We argue that this evolving memory can facilitate guided exploration through prior knowledge, enhancing the efficiency of robotic exploration, and validate this by designing two exploration scenarios and running both simulated and real-world experiments.
|
|
17:00-17:05, Paper WeET8.6 | |
MI-HGNN: Morphology-Informed Heterogeneous Graph Neural Network for Legged Robot Contact Perception |
|
Butterfield, Daniel Chase | Georgia Institute of Technology
Garimella, Sandilya Sai | Georgia Institute of Technology |
Cheng, NaiJen | Georgia Institute of Technology |
Gan, Lu | Georgia Institute of Technology |
Keywords: Deep Learning Methods, Force and Tactile Sensing, Legged Robots
Abstract: We present a Morphology-Informed Heterogeneous Graph Neural Network (MI-HGNN) for learning-based contact perception. The architecture and connectivity of the MI-HGNN are constructed from the robot morphology, in which nodes and edges are robot joints and links, respectively. By incorporating the morphology-informed constraints into a neural network, we improve a learning-based approach using model-based knowledge. We apply the proposed MI-HGNN to two contact perception problems, and conduct extensive experiments using both real-world and simulated data collected using two quadruped robots. Our experiments demonstrate the superiority of our method in terms of effectiveness, generalization ability, model efficiency, and sample efficiency. Our MI-HGNN improved the performance of a state-of-the-art model that leverages robot morphological symmetry by 8.4% with only 0.21% of its parameters. Although MI-HGNN is applied to contact perception problems for legged robots in this work, it can be seamlessly applied to other types of multi-body dynamical systems and has the potential to improve other robot learning frameworks. Our code is made publicly available at https://github.com/lunarlab-gatech/Morphology-Informed-HGNN.
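The morphology-informed graph can be illustrated by deriving nodes and edges from a generic 12-DoF quadruped's kinematic tree. The joint naming and typing below are assumptions for illustration, not the MI-HGNN code or a specific robot description:

```python
# Build a heterogeneous graph from a quadruped's kinematic tree: joints are
# nodes, links are edges. Shared node/edge "types" let a heterogeneous GNN
# reuse the same weights across the four morphologically identical legs.
legs = ["FL", "FR", "RL", "RR"]
joints = ["hip_abduction", "hip_flexion", "knee"]

nodes = {"base": {"type": "base"}}
edges = []  # (parent, child, edge_type)
for leg in legs:
    parent = "base"
    for j in joints:
        name = f"{leg}_{j}"
        nodes[name] = {"type": j, "leg": leg}          # same type across legs
        edges.append((parent, name, f"link_to_{j}"))
        parent = name
    nodes[f"{leg}_foot"] = {"type": "foot", "leg": leg}
    edges.append((parent, f"{leg}_foot", "shank_link"))

print(len(nodes), "nodes,", len(edges), "edges")
# A heterogeneous message-passing layer would keep one weight matrix per
# node/edge type, so the same parameters process all four legs; this weight
# sharing is one way the morphological structure acts as an inductive bias.
```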
|
|
17:05-17:10, Paper WeET8.7 | |
Data-Driven Dynamics Modeling of Miniature Robotic Blimps Using Neural ODEs with Parameter Auto-Tuning |
|
Zhu, Yongjian | Peking University |
Cheng, Hao | Peking University |
Zhang, Feitian | Peking University |
Keywords: Dynamics, Calibration and Identification, Machine Learning for Robot Control
Abstract: Miniature robotic blimps, as one type of lighter-than-air aerial vehicles, have attracted increasing attention in the science and engineering community for their enhanced safety, extended endurance, and quieter operation compared to quadrotors. Accurately modeling the dynamics of these robotic blimps poses a significant challenge due to the complex aerodynamics stemming from their large lifting bodies. Traditional first-principle models have difficulty obtaining accurate aerodynamic parameters and often overlook high-order nonlinearities, thus coming to their limit in modeling the motion dynamics of miniature robotic blimps. To tackle this challenge, this letter proposes the Auto-tuning Blimp-oriented Neural Ordinary Differential Equation method (ABNODE), a data-driven approach that integrates first-principle and neural network modeling. Spiraling motion experiments of robotic blimps are conducted, comparing the ABNODE with first-principle and other data-driven benchmark models, the results of which demonstrate the effectiveness of the proposed method.
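The modeling idea, a first-principle term augmented by a neural residual inside an ODE integrator, can be sketched in a few lines. The snippet below is only an illustration with a toy planar model and untrained random residual weights; it is not ABNODE and omits the parameter auto-tuning:

```python
import numpy as np

# Hybrid ODE: dx/dt = physics(x, u) + neural_residual(x, u), integrated with
# RK4. The residual weights are random placeholders standing in for a network
# trained on recorded blimp trajectories.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(16, 4)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(2, 16)), np.zeros(2)

def residual(x, u):
    h = np.tanh(W1 @ np.concatenate([x, u]) + b1)
    return W2 @ h + b2

def physics(x, u, mass=0.3, drag=0.15):
    # Toy 1-D first-principle model: [velocity, thrust/mass - quadratic drag].
    return np.array([x[1], u[0] / mass - drag * x[1] * abs(x[1])])

def f(x, u):
    return physics(x, u) + residual(x, u)

def rk4_step(x, u, dt=0.02):
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

x = np.zeros(2)                       # [position, velocity]
for _ in range(200):
    x = rk4_step(x, u=np.array([0.05, 0.0]))
print("state after 4 s:", x)
```

In a Neural-ODE setting, gradients would be propagated through the integrator so the residual (and, in the paper, the first-principle parameters) can be fitted to logged flight data.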
|
|
WeET9 |
312 |
Motion Planning and Control |
Regular Session |
Chair: Geng, Junyi | Pennsylvania State University |
Co-Chair: Brock, Oliver | Technische Universität Berlin |
|
16:35-16:40, Paper WeET9.1 | |
Improving the Performance of Learned Controllers in Behavior Trees Using Value Function Estimates at Switching Boundaries |
|
Kartašev, Mart | KTH Royal Institute of Technology |
Ogren, Petter | Royal Institute of Technology (KTH) |
Keywords: Behavior-Based Systems, Control Architectures and Programming, Integrated Planning and Learning
Abstract: Behavior trees represent a modular way to create an overall controller from a set of sub-controllers solving different sub-problems. These sub-controllers can be created using various methods, such as classical model based control or reinforcement learning (RL). If each sub-controller satisfies the preconditions of the next sub-controller, the overall controller will achieve the overall goal. However, even if all sub-controllers are locally optimal in achieving the preconditions of the next, with respect to some performance metric such as completion time, the overall controller might still be far from optimal with respect to the same performance metric. In this paper we show how the performance of the overall controller can be improved if we use approximations of value functions to inform the design of a sub-controller of the needs of the next one. We also show how, under certain assumptions, this leads to a globally optimal controller when the process is executed on all sub-controllers. Finally, this result also holds when some of the sub-controllers are already given, i.e., if we are constrained to use some existing sub-controllers the overall controller will be globally optimal given this constraint.
|
|
16:40-16:45, Paper WeET9.2 | |
Deliberative Control-Aware Motion Planning for Kinematic-Constrained UAVs in a Dynamic Environment |
|
Freitas, Elias José de Rezende | Universidade Federal De Minas Gerais |
Vangasse, Arthur | Universidade Federal De Minas Gerais |
Cohen, Miri Weiss | Braude College of Engineering |
Guimarães, Frederico Gadelha | UFMG |
Pimenta, Luciano | Universidade Federal De Minas Gerais |
Keywords: Constrained Motion Planning, Collision Avoidance, Motion and Path Planning
Abstract: This paper introduces a motion planning approach for navigating in a dynamic environment. The path is represented using a Non-Uniform Rational B-Spline (NURBS) to ensure smoothness, curvature continuity, and proper orientation by adjusting its parameters. A Differential Evolution algorithm optimizes the curve parameters and traversal speed at each re-planning interval, taking into account speed limits, maximum curvature, and obstacles in the environment. A constraint-based on Velocity Obstacle (VO) ensures collision-free motion, considering bounds provided by lower-level controllers. The feasibility of the approach is validated through simulations and real-world experiments with the Crazyflie 2.1 micro quadcopter.
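The Velocity-Obstacle-style constraint can be illustrated with a simple finite-horizon feasibility check: a candidate velocity is rejected if the predicted relative motion brings the robot within a combined safety radius of an obstacle. This is a generic sketch with constant velocities and invented numbers, not the paper's NURBS/Differential-Evolution formulation:

```python
import numpy as np

def in_velocity_obstacle(v_candidate, p_rel, v_obstacle, combined_radius, horizon=5.0):
    """Return True if the candidate velocity leads to a collision within `horizon`.

    p_rel: obstacle position relative to the robot (2,)
    v_obstacle: obstacle velocity (2,)
    combined_radius: robot radius + obstacle radius (+ safety margin)
    """
    v_rel = np.asarray(v_candidate, dtype=float) - np.asarray(v_obstacle, dtype=float)
    p_rel = np.asarray(p_rel, dtype=float)
    speed2 = float(v_rel @ v_rel)
    if speed2 < 1e-12:
        return float(np.linalg.norm(p_rel)) < combined_radius
    # Closest approach of the relative motion p(t) = p_rel - v_rel * t.
    t_star = float(np.clip((p_rel @ v_rel) / speed2, 0.0, horizon))
    closest = float(np.linalg.norm(p_rel - v_rel * t_star))
    return closest < combined_radius

# Toy check: obstacle 4 m ahead drifting slowly sideways; heading straight at
# it collides, a laterally offset velocity does not.
p_rel = np.array([4.0, 0.0])
v_obs = np.array([0.0, 0.05])
print(in_velocity_obstacle([1.0, 0.0], p_rel, v_obs, combined_radius=0.6))   # True
print(in_velocity_obstacle([1.0, 0.8], p_rel, v_obs, combined_radius=0.6))   # False
```

In the optimization, such a check (or an equivalent analytic cone condition) can be imposed as a constraint on the candidate speed profile produced at each re-planning interval.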
|
|
16:45-16:50, Paper WeET9.3 | |
Robot Navigation in Unknown and Cluttered Workspace with Dynamical System Modulation in Starshaped Roadmap |
|
Chen, Kai | The Hong Kong University of Science and Technology |
Liu, Haichao | The Hong Kong University of Science and Technology |
Li, Yulin | Hong Kong University of Science and Technology(HKUST) |
Duan, Jianghua | Hong Kong University of Science and Technology |
Zhu, Lei | The Hong Kong University of Science and Technology (Guangzhou) |
Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Integrated Planning and Control, Autonomous Vehicle Navigation, Sensor-based Control
Abstract: Compared to conventional decomposition methods that use ellipses or polygons to represent free space, a starshaped representation can better capture the natural distribution of sensor data, thereby exploiting a larger portion of traversable space. This paper introduces a novel motion planning and control framework for navigating robots in unknown and cluttered environments using a dynamically constructed starshaped roadmap. Our approach generates a starshaped representation of the surrounding free space from real-time sensor data using piece-wise polynomials. Additionally, an incremental roadmap maintaining the connectivity information is constructed, and a searching algorithm efficiently selects short-term goals on this roadmap. Importantly, this framework addresses dead-end situations with a graph updating mechanism. To ensure safe and efficient movement within the starshaped roadmap, we propose a reactive controller based on Dynamic System Modulation (DSM). This controller facilitates smooth motion within starshaped regions and their intersections, avoiding conservative and short-sighted behaviors and allowing the system to handle intricate obstacle configurations in unknown and cluttered environments. Comprehensive evaluations in both simulations and real-world experiments show that the proposed method achieves higher success rates and reduced travel times compared to other methods, while effectively managing intricate obstacle configurations.
|
|
16:50-16:55, Paper WeET9.4 | |
Robust Planning for Autonomous Driving Via Mixed Adversarial Diffusion Predictions |
|
Zhao, Albert | University of California Los Angeles |
Soatto, Stefano | UCLA |
Keywords: Planning under Uncertainty, Robot Safety, Autonomous Vehicle Navigation
Abstract: We describe a robust planning method for autonomous driving that mixes normal and adversarial agent predictions output by a diffusion model trained for motion prediction. We first train a diffusion model to learn an unbiased distribution of normal agent behaviors. We then generate a distribution of adversarial predictions by biasing the diffusion model at test time to generate predictions that are likely to collide with a candidate plan. We score plans using expected cost with respect to a mixture distribution of normal and adversarial predictions, leading to a planner that is robust against adversarial behaviors but not overly conservative when agents behave normally. Unlike current approaches, we do not use risk measures that over-weight adversarial behaviors while placing little to no weight on low-cost normal behaviors or use hard safety constraints that may not be appropriate for all driving scenarios. We show the effectiveness of our method on single-agent and multi-agent jaywalking scenarios as well as a red light violation scenario.
|
|
16:55-17:00, Paper WeET9.5 | |
No Plan but Everything under Control: Robustly Solving Sequential Tasks with Dynamically Composed Gradient Descent |
|
Mengers, Vito | Technische Universität Berlin |
Brock, Oliver | Technische Universität Berlin |
Keywords: Integrated Planning and Control, Reactive and Sensor-Based Planning, Optimization and Optimal Control
Abstract: We introduce a novel gradient-based approach for solving sequential tasks by dynamically adjusting the underlying myopic potential field in response to feedback and the world's regularities. This adjustment implicitly considers subgoals encoded in these regularities, enabling the solution of long sequential tasks, as demonstrated by solving the traditional planning domain of Blocks World, without any planning. Unlike conventional planning methods, our feedback-driven approach adapts to uncertain and dynamic environments, as demonstrated by one hundred real-world trials involving drawer manipulation. These experiments highlight the robustness of our method compared to planning and show how interactive perception and error recovery naturally emerge from gradient descent without explicitly implementing them. This offers a computationally efficient alternative to planning for a variety of sequential tasks, while aligning with observations on biological problem-solving strategies.
|
|
17:00-17:05, Paper WeET9.6 | |
Autonomous Navigation in Ice-Covered Waters with Learned Predictions on Ship-Ice Interactions |
|
Zhong, Ninghan | University of Illinois at Urbana-Champaign |
Potenza, Alessandro | University of Manitoba |
Smith, Stephen L. | University of Waterloo |
Keywords: Integrated Planning and Learning, Marine Robotics, Motion and Path Planning
Abstract: Autonomous navigation in ice-covered waters poses significant challenges due to the frequent lack of viable collision-free trajectories. When complete obstacle avoidance is infeasible, it becomes imperative for the navigation strategy to minimize collisions. Additionally, the dynamic nature of ice, which moves in response to ship maneuvers, complicates the path planning process. To address these challenges, we propose a novel deep learning model to estimate the coarse dynamics of ice movements triggered by ship actions through occupancy estimation. To ensure real-time applicability, we propose a novel approach that caches intermediate prediction results and seamlessly integrates the predictive model into a graph search planner. We evaluate the proposed planner in both simulation and in a physical testbed against existing approaches and show that our planner significantly reduces collisions with ice when compared to the state-of-the-art. Codes and demos of this work are available at https://github.com/IvanIZ/predictive-asv-planner.
|
|
17:05-17:10, Paper WeET9.7 | |
IKap: Kinematics-Aware Planning with Imperative Learning |
|
Li, Qihang | University at Buffalo |
Chen, Zhuoqun | University of California San Diego |
Zheng, Haoze | University at Buffalo |
He, Haonan | Department of Mechanical Engineering, College of Engineering, Ca |
Zhan, Zitong | University at Buffalo, SUNY |
Su, Shaoshu | State University of New York at Buffalo |
Geng, Junyi | Pennsylvania State University |
Wang, Chen | University at Buffalo |
Keywords: Integrated Planning and Learning, Collision Avoidance, Motion and Path Planning
Abstract: Trajectory planning in robotics aims to generate collision-free pose sequences that can be reliably executed. Recently, vision-to-planning systems have gained increasing attention for their efficiency and ability to interpret and adapt to surrounding environments. However, traditional modular systems suffer from increased latency and error propagation, while purely data-driven approaches often overlook the robot's kinematic constraints. This oversight leads to discrepancies between planned trajectories and those that are executable. To address these challenges, we propose iKap, a novel vision-to-planning system that integrates the robot's kinematic model directly into the learning pipeline. iKap employs a self-supervised learning approach and incorporates the state transition model within a differentiable bi-level optimization framework. This integration ensures the network learns collision-free waypoints while satisfying kinematic constraints, enabling gradient back-propagation for end-to-end training. Our experimental results demonstrate that iKap achieves higher success rates and reduced latency compared to the state-of-the-art methods. Besides the complete system, iKap offers a visual-to-planning network that seamlessly works with various controllers, providing a robust solution for robots navigating complex environments.
|
|
17:10-17:15, Paper WeET9.8 | |
Differentiable-Optimization Based Neural Policy for Occlusion-Aware Target Tracking |
|
Masnavi, Houman | Toronto Metropolitan University |
Singh, Arun Kumar | University of Tartu |
Janabi-Sharifi, Farrokh | Ryerson University |
Keywords: Aerial Systems: Applications, Motion and Path Planning, Integrated Planning and Learning
Abstract: We propose a learned probabilistic neural policy for safe, occlusion-free target tracking. The core novelty of our work stems from the structure of our policy network that combines generative modeling based on Conditional Variational Autoencoder (CVAE) with differentiable optimization layers. The weights of the CVAE network and the parameters of the differentiable optimization can be learned in an end-to-end fashion through demonstration trajectories. We improve the state-of-the-art (SOTA) in the following respects. We show that our learned policy outperforms existing SOTA in terms of occlusion/collision avoidance capabilities and computation time. Second, we present an extensive ablation showing how different components of our learning pipeline contribute to the overall tracking task. We also demonstrate the real-time performance of our approach on resource-constrained hardware such as NVIDIA Jetson TX2. Finally, our learned policy can also be viewed as a reactive planner for navigation in highly cluttered environments.
|
|
WeET10 |
313 |
Multi-Robot Planning |
Regular Session |
Chair: Li, Jiaoyang | Carnegie Mellon University |
|
16:35-16:40, Paper WeET10.1 | |
Multi-Horizon Multi-Agent Planning Using Decentralised Monte Carlo Tree Search |
|
Seiler, Konstantin M | University of Technology Sydney |
Kong, Felix Honglim | The University of Technology Sydney |
Fitch, Robert | University of Technology Sydney |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents
Abstract: We propose multi-horizon Monte Carlo tree search (MH-MCTS), the first framework for integrated hierarchical multi-horizon, multi-agent planning based on Monte Carlo tree search (MCTS). The method employs multiple simultaneous MCTS optimisations for each planning level within each agent, which are designed to optimise a joint objective function. Using concepts from decentralised Monte Carlo tree search (Dec-MCTS), the individual optimisations continuously exchange information about their current plans. This breaks the common top-down-only information flow within the planning hierarchy and allows higher-level optimisers to consider progress made by lower-level planners. The method is implemented for survey missions using a fleet of ground robots. Simulation results with different mission profiles show substantial performance improvements for the new method, of up to 59%, compared to traditional MCTS and Dec-MCTS.
|
|
16:40-16:45, Paper WeET10.2 | |
Generalized Mission Planning for Heterogeneous Multi-Robot Teams Via LLM-Constructed Hierarchical Trees |
|
Gupta, Piyush | Honda Research Institute, US |
Isele, David | University of Pennsylvania, Honda Research Institute USA |
Sachdeva, Enna | Honda Research Institute |
Huang, Pin-Hao | Honda Research Institute |
Dariush, Behzad | Honda Research Institute USA |
Lee, Kwonjoon | Honda Research Institute USA |
Bae, Sangjae | Honda Research Institute, USA |
Keywords: Multi-Robot Systems, Task Planning, AI-Enabled Robotics
Abstract: We present a novel mission-planning strategy for heterogeneous multi-robot teams, taking into account the specific constraints and capabilities of each robot. Our approach employs hierarchical trees to systematically break down complex missions into manageable sub-tasks. We develop specialized APIs and tools, which are utilized by Large Language Models (LLMs) to efficiently construct these hierarchical trees. Once the hierarchical tree is generated, it is further decomposed to create optimized schedules for each robot, ensuring adherence to their individual constraints and capabilities. We demonstrate the effectiveness of our framework through detailed examples covering a wide range of missions, showcasing its flexibility and scalability.
|
|
16:45-16:50, Paper WeET10.3 | |
Efficient Coordination and Synchronization of Multi-Robot Systems under Recurring Linear Temporal Logic |
|
Peron, Davide | Università Degli Studi Di Padova |
Nan Fernandez-Ayala, Victor | KTH Royal Institute of Technology |
Vlahakis, Eleftherios E. | KTH Royal Institute of Technology |
Dimarogonas, Dimos V. | KTH Royal Institute of Technology |
Keywords: Cooperating Robots, Multi-Robot Systems, Task and Motion Planning
Abstract: We consider multi-robot systems under recurring tasks formalized as linear temporal logic (LTL) specifications. To solve the planning problem efficiently, we propose a bottom-up approach combining offline plan synthesis with online coordination, dynamically adjusting plans via real-time communication. To address action delays, we introduce a synchronization mechanism ensuring coordinated task execution, leading to a multi-agent coordination and synchronization framework that is adaptable to a wide range of multi-robot applications. The software package is developed in Python and ROS2 for broad deployment. We validate our findings through lab experiments involving nine robots showing enhanced adaptability compared to previous methods. Additionally, we conduct simulations with up to ninety agents to demonstrate the reduced computational complexity and the scalability features of our work.
|
|
16:50-16:55, Paper WeET10.4 | |
HULK: Large-Scale Hierarchical Coordination under Continual and Uncertain Temporal Tasks |
|
Luo, Qingyuan | Peking University |
Li, Jie | National University of Defense Technology |
Guo, Meng | Peking University |
Keywords: Multi-Robot Systems, Task and Motion Planning, Formal Methods in Robotics and Automation
Abstract: Multi-agent systems can be extremely efficient when working concurrently and collaboratively, e.g., for delivery, surveillance, search and rescue. Coordination of such teams often involves two aspects: (i) selecting appropriate subteams for different tasks in various areas; (ii) coordinating agents in the subteams to execute the associated subtasks. Existing work often assumes that the tasks are static and known beforehand, where an integer program can be formulated and solved offline. However, in many applications, the team-wise tasks are generated online continually by external requests; and the amount of subtasks within each task is uncertain (e.g., the number of packages to deliver, and victims to rescue). The aforementioned offline solution becomes inadequate as it would require constant re-computation for the whole team and global communication to broadcast the results. Thus, this work tackles the large-scale coordination problem under continual and uncertain temporal tasks, specified as temporal logic formulas over collaborative actions. The proposed hierarchical framework (HULK) consists of two interleaved layers: the rolling assignment of currently-known tasks to sub-teams within a certain horizon, and the dynamic coordination within a sub-team given the detected subtasks during online execution. Thus, the coordination is performed hierarchically at different granularities and triggering conditions, to improve the computational efficiency and robustness. It is validated rigorously over large-scale heterogeneous systems under various temporal tasks and environment uncertainties.
|
|
16:55-17:00, Paper WeET10.5 | |
COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models |
|
Liu, Kehui | Northwestern Polytechnical University |
Tang, Zixin | National University of Defense Technology |
Wang, Dong | Shanghai Artificial Intelligence Laboratory |
Wang, Zhigang | Shanghai AI Laboratory |
Li, Xuelong | Northwestern Polytechnical University |
Zhao, Bin | Northwestern Polytechnical University |
Keywords: Multi-Robot Systems, Cooperating Robots
Abstract: Leveraging the powerful reasoning capabilities of large language models (LLMs), recent LLM-based robot task planning methods yield promising results. However, they mainly focus on single or multiple homogeneous robots on simple tasks. Practically, complex long-horizon tasks always require collaboration among multiple heterogeneous robots especially with more complex action spaces, which makes these tasks more challenging. To this end, we propose COHERENT, a novel LLM-based task planning framework for collaboration of heterogeneous multi-robot systems including quadrotors, robotic dogs, and robotic arms. Specifically, a Proposal-Execution-Feedback-Adjustment (PEFA) mechanism is designed to decompose and assign actions for individual robots, where a centralized task assigner makes a task planning proposal to decompose the complex task into subtasks, and then assigns subtasks to robot executors. Each robot executor selects a feasible action to implement the assigned subtask and reports self-reflection feedback to the task assigner for plan adjustment. The PEFA loops until the task is completed. Moreover, we create a challenging heterogeneous multi-robot task planning benchmark encompassing 100 complex long-horizon tasks. The experimental results show that our work surpasses the previous methods by a large margin in terms of success rate and execution efficiency. The experimental videos, code, and benchmark are released at https://github.com/MrKeee/COHERENT.
|
|
17:00-17:05, Paper WeET10.6 | |
LaMMA-P: Generalizable Multi-Agent Long-Horizon Task Allocation and Planning with LM-Driven PDDL Planner |
|
Zhang, Xiaopan | University of California - Riverside |
Qin, Hao | Pennsylvania State University |
Wang, Fuquan | University of California Riverside |
Dong, Yue | University of California Riverside |
Li, Jiachen | University of California, Riverside |
Keywords: Multi-Robot Systems, Cooperating Robots, AI-Enabled Robotics
Abstract: Language models (LMs) possess a strong capability to comprehend natural language, making them effective in translating human instructions into detailed plans for simple robot tasks. Nevertheless, it remains a significant challenge to handle long-horizon tasks, especially in subtask identification and allocation for cooperative heterogeneous robot teams. To address this issue, we propose a Language Model-Driven Multi-Agent PDDL Planner (LaMMA-P), a novel multi-agent task planning framework that achieves state-of-the-art performance on long-horizon tasks. LaMMA-P integrates the strengths of the LMs’ reasoning capability and the traditional heuristic search planner to achieve a high success rate and efficiency while demonstrating strong generalization across tasks. Additionally, we create MAT-THOR, a comprehensive benchmark that features household tasks with two different levels of complexity based on the AI2-THOR environment. The experimental results demonstrate that LaMMA-P achieves a 105% higher success rate and 36% higher efficiency than existing LM-based multi-agent planners. The experimental videos, code, datasets, and detailed prompts used in each module can be found on the project website: https://lamma-p.github.io.
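As a rough illustration of the planner-facing glue such a pipeline needs (the domain name, predicates, and objects below are invented for illustration and are not LaMMA-P's actual domain files or prompts), an LM-allocated subtask for a single robot can be serialized into a PDDL problem with a few lines of Python and then handed to an off-the-shelf heuristic planner:

# Hypothetical example: turn an LM-allocated subtask into a PDDL problem string.
PROBLEM_TEMPLATE = """(define (problem {name})
  (:domain household)
  (:objects {objects})
  (:init {init})
  (:goal (and {goals})))"""

def make_problem(name, objects, init_facts, goal_facts):
    # Assemble a PDDL problem file from the facts and goals the LM produced.
    return PROBLEM_TEMPLATE.format(
        name=name,
        objects=" ".join(objects),
        init=" ".join(init_facts),
        goals=" ".join(goal_facts),
    )

print(make_problem(
    "robot1_subtask",
    ["apple - object", "fridge - receptacle", "robot1 - agent"],
    ["(at robot1 kitchen)", "(in apple kitchen)"],
    ["(inside apple fridge)"],
))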
|
|
17:05-17:10, Paper WeET10.7 | |
FlyKites: Human-Centric Interactive Exploration and Assistance under Limited Communication |
|
Zhang, Yuyang | Peking University |
Tian, Zhuoli | Peking University |
Wei, Jinsheng | Peking University |
Guo, Meng | Peking University |
Keywords: Multi-Robot Systems, Task and Motion Planning, Human-Robot Teaming
Abstract: Fleets of autonomous robots have been deployed for exploration of unknown scenes for features of interest, e.g., subterranean exploration, reconnaissance, and search and rescue missions. During exploration, the robots may encounter unidentified targets, blocked passages, interactive objects, temporary failure, or other unexpected events, all of which require consistent human assistance with reliable communication for a period of time. This, however, can be particularly challenging if the communication among the robots is severely restricted to only close-range exchange via ad-hoc networks, especially in extreme environments like caves and underground tunnels. This paper presents a novel human-centric interactive exploration and assistance framework called FlyKites, for multi-robot systems under limited communication. It consists of three interleaved components: (I) the distributed exploration and intermittent communication (called the "spread mode"), where the robots collaboratively explore the environment and exchange local data among the fleet and with the operator; (II) the simultaneous optimization of the relay topology, the operator path, and the assignment of robots to relay roles (called the "relay mode"), such that all requested assistance can be provided with minimum delay; (III) the human-in-the-loop online execution, where the robots switch between different roles and interact with the operator adaptively. Extensive human-in-the-loop simulations and hardware experiments are performed over numerous challenging scenes.
|
|
17:10-17:15, Paper WeET10.8 | |
Work Smarter Not Harder: Simple Imitation Learning with CS-PIBT Outperforms Large-Scale Imitation Learning for MAPF |
|
Veerapaneni, Rishi | Carnegie Mellon University |
Jakobsson, Arthur | Carnegie Mellon University |
Ren, Kevin | Carnegie Mellon University |
Kim, Samuel | Solon High School |
Li, Jiaoyang | Carnegie Mellon University |
Likhachev, Maxim | Carnegie Mellon University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning, Imitation Learning
Abstract: Multi-Agent Path Finding (MAPF) is the problem of effectively finding efficient collision-free paths for a group of agents in a shared workspace. The MAPF community has largely focused on developing high-performance heuristic search methods. Recently, several works have applied various machine learning (ML) techniques to solve MAPF, usually involving sophisticated architectures, reinforcement learning techniques, and set-ups, but none using large amounts of high-quality supervised data. Our initial objective in this work was to show how simple large-scale imitation learning of high-quality heuristic search methods can lead to state-of-the-art ML MAPF performance. However, we find that, at least with our model architecture, simple large-scale (700k examples with hundreds of agents per example) imitation learning does not produce impressive results. Instead, we find that by using prior work that post-processes MAPF model predictions to resolve 1-step collisions (CS-PIBT), we can train a simple ML MAPF policy in minutes that dramatically outperforms existing ML MAPF policies. This has serious implications for all future ML MAPF policies (with local communication) which currently struggle to scale. In particular, this finding implies that future learnt policies should always (1) use smart 1-step collision shields (e.g. CS-PIBT) and (2) include the collision shield with greedy actions as a baseline (e.g. PIBT), as well as (3) motivates future models to focus on longer horizon / more complex planning as 1-step collisions can be efficiently resolved.
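For readers unfamiliar with 1-step collision shields, the following Python sketch (a simplification under assumed grid-world conventions, not the authors' code, and without PIBT's recursive priority inheritance) shows the core idea: each agent proposes policy-ranked next cells, and a single priority-ordered pass greedily picks the best candidate that causes no vertex or edge collision, falling back to waiting.

def collision_shield(current, proposals, priorities):
    """current: dict agent -> (r, c); proposals: dict agent -> candidate next cells
    ranked by the learnt policy (first = preferred); priorities: agents, highest
    priority first. Returns dict agent -> chosen cell for the next timestep."""
    chosen, occupied, vacated = {}, {}, {}
    for agent in priorities:
        placed = False
        for cell in proposals[agent] + [current[agent]]:   # waiting is the last resort
            if cell in occupied:                           # vertex collision
                continue
            other = vacated.get(cell)
            if other is not None and chosen[other] == current[agent]:
                continue                                   # edge (swap) collision
            chosen[agent] = cell
            occupied[cell] = agent
            if cell != current[agent]:
                vacated[current[agent]] = agent
            placed = True
            break
        if not placed:
            # Simplified fallback; full PIBT resolves this case via priority inheritance.
            chosen[agent] = current[agent]
    return chosen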
|
|
WeET11 |
314 |
Agile Legged Locomotion |
Regular Session |
Chair: Clark, Jonathan | Florida State University |
Co-Chair: Duong, Thai | Rice University |
|
16:35-16:40, Paper WeET11.1 | |
Mastering Agile Jumping Skills from Simple Practices with Iterative Learning Control |
|
Nguyen, Chuong | University of Southern California |
Bao, Lingfan | University College London |
Nguyen, Quan | University of Southern California |
Keywords: Legged Robots, Learning from Experience, Model Learning for Control
Abstract: Achieving precise target jumping with legged robots poses a significant challenge due to the long flight phase and the uncertainties inherent in contact dynamics and hardware. Forcefully attempting these agile motions on hardware could result in severe failures and potential damage. Motivated by this challenge, we propose an Iterative Learning Control (ILC) approach to learn and refine jumping skills from easy to difficult, instead of directly learning these challenging tasks. We verify that learning from simplicity can enhance safety and target jumping accuracy over trials. Compared to other ILC approaches for legged locomotion, our method can tackle the problem of a long flight phase where control input is not available. In addition, our approach allows the robot to apply what it learns from a simple jumping task to accomplish more challenging tasks within a few trials directly in hardware, instead of learning from scratch. We validate the method through extensive experiments on the A1 model and hardware for various tasks. Starting from a small jump (e.g., a 40cm forward jump), our learning approach empowers the robot to accomplish a variety of challenging targets, including jumping onto a 20cm high box, leaping to a greater distance of up to 60cm, as well as performing jumps while carrying an unknown payload of 2kg. Our framework allows the robot to reach the desired position and orientation targets with approximate errors of 1cm and 1 degree within a few trials.
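To make the iterative-learning idea concrete, here is a minimal, generic P-type ILC loop on a toy first-order system (illustrative only; the paper's controller, system model, and learning gains differ): the feedforward input is refined trial by trial using the tracking error recorded on the previous trial.

import numpy as np

def run_trial(u, a=0.9, b=0.5):
    # Simulate y[t+1] = a*y[t] + b*u[t] for one trial.
    y = np.zeros(len(u) + 1)
    for t in range(len(u)):
        y[t + 1] = a * y[t] + b * u[t]
    return y

T = 50
ref = np.linspace(0.0, 1.0, T + 1)     # desired trajectory (e.g., body height)
u = np.zeros(T)                        # start from a simple/easy input
L_gain = 0.8                           # learning gain (converges since |1 - L*b| < 1)

for trial in range(30):
    y = run_trial(u)
    e = ref[1:] - y[1:]                # tracking error of this trial
    u = u + L_gain * e                 # ILC update: reuse the error on the next trial

print("final max tracking error:", np.abs(ref[1:] - run_trial(u)[1:]).max())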
|
|
16:40-16:45, Paper WeET11.2 | |
Agile Continuous Jumping in Discontinuous Terrains |
|
Yang, Yuxiang | Google Deepmind |
Shi, Guanya | Carnegie Mellon University |
Lin, Changyi | Carnegie Mellon University |
Meng, Xiangyun | University of Washington |
Scalise, Rosario | University of Washington |
Guaman Castro, Mateo | University of Washington |
Yu, Wenhao | Google |
Zhang, Tingnan | Google |
Zhao, Ding | Carnegie Mellon University |
Tan, Jie | Google |
Boots, Byron | University of Washington |
Keywords: Machine Learning for Robot Control, Reinforcement Learning, Legged Robots
Abstract: We focus on advancing the agility of quadrupedal robots with continuous, precise, and terrain-adaptive jumping in discontinuous terrains such as stairs and stepping stones. To accomplish this task, we design a hierarchical learning and control framework, which consists of a learned heightmap predictor for robust terrain perception, a reinforcement-learning-based centroidal-level motion policy for versatile and terrain-adaptive planning, and a low-level model-based leg controller for accurate motion tracking. In addition, we minimize the sim-to-real gap by accurately modeling the hardware characteristics. Such a hierarchical and hybrid framework effectively combines the advantages of model-free learning and model-based control, therefore enabling a Unitree Go1 robot to perform agile and continuous jumps on human-sized stairs and sparse stepping stones, for the first time to the best of our knowledge. In particular, the robot can cross two stair steps in each jump and complete a 3.5m long, 2.8m high, 14-step staircase in 4.5 seconds. Moreover, the same policy outperforms baselines in various other parkour tasks, such as jumping over single horizontal or vertical discontinuities.
|
|
16:45-16:50, Paper WeET11.3 | |
High Accuracy Aerial Maneuvers on Legged Robots Using Variational Integrator Discretized Trajectory Optimization |
|
Beck, Scott | University of Southern California |
Nguyen, Chuong | University of Southern California |
Duong, Thai | Rice University |
Atanasov, Nikolay | University of California, San Diego |
Nguyen, Quan | University of Southern California |
Keywords: Legged Robots, Optimization and Optimal Control
Abstract: Performing acrobatic maneuvers involving long aerial phases, such as precise dives or multiple backflips from significant heights, remains an open challenge in legged robot autonomy. Such aggressive motions often require accurate state predictions over long horizons with multiple contacts and extended flight phases. Most existing trajectory optimization (TO) methods rely on Euler or Runge-Kutta integration, which can accumulate significant prediction errors over long planning horizons. In this work, we propose a novel whole-body TO method using variational integration (VI) and full-body nonlinear dynamics for long-flight aggressive maneuvers. Compared to traditional Euler-based TO, our approach using VI preserves energy and momentum properties of the continuous-time system and reduces error between predicted and executed trajectories by factors of 2 to 10 while achieving similar planning time. We successfully demonstrate long-flight triple backflips on a quadruped A1 robot model and backflips on a bipedal HECTOR robot model for various heights and distances, achieving landing angle errors of only a few degrees. In contrast, TO with Euler integration fails to achieve accurate landings in equivalent circumstances, e.g., with landing angle errors greater than 90° for triple backflips. We provide an open-source implementation of our VI-discretized TO to support further research on accurate dynamic maneuvers for multi-rigid-body robot systems with contact: https://github.com/DRCL-USC/VI_discretized_TO
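The following toy sketch shows what a variational-integrator discretization looks like for a single pendulum with a midpoint discrete Lagrangian (an illustration of the discretization idea only, under assumed parameters; it is not the paper's whole-body trajectory optimization): each step solves the discrete Euler-Lagrange equation for the next configuration.

import numpy as np
from scipy.optimize import fsolve

m, l, g, h = 1.0, 1.0, 9.81, 0.02      # pendulum mass, length, gravity, timestep

def Ld_grad(qa, qb):
    """Gradients (D1, D2) of the discrete Lagrangian
       L_d(qa, qb) = h * [0.5*m*l^2*((qb-qa)/h)^2 + m*g*l*cos((qa+qb)/2)]."""
    v = (qb - qa) / h
    qm = 0.5 * (qa + qb)
    D1 = -m * l**2 * v - 0.5 * h * m * g * l * np.sin(qm)
    D2 =  m * l**2 * v - 0.5 * h * m * g * l * np.sin(qm)
    return D1, D2

# Discrete Euler-Lagrange equation: D2 L_d(q_{k-1}, q_k) + D1 L_d(q_k, q_{k+1}) = 0
q = [0.5, 0.5]                          # two initial configurations (zero initial velocity)
for k in range(1, 500):
    res = lambda qn: Ld_grad(q[k - 1], q[k])[1] + Ld_grad(q[k], qn)[0]
    q.append(float(fsolve(res, 2 * q[k] - q[k - 1])[0]))

print("pendulum angle after ~10 s:", q[-1])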
|
|
16:50-16:55, Paper WeET11.4 | |
Learn to Swim: Data-Driven LSTM Hydrodynamic Model for Quadruped Robot Gait Optimization |
|
Han, Fei | Westlake University |
Guo, Pengming | Westlake University |
Chen, Hao | Westlake University |
Li, Weikun | Westlake University |
Ren, Jingbo | Xinyang Normal University |
Liu, Naijun | Institute of Automation Chinese Academy of Sciences |
Yang, Ning | Institute of Automation, Chinese Academy of Sciences |
Fan, Dixia | Westlake University |
Keywords: Legged Robots, Model Learning for Control, Whole-Body Motion Planning and Control
Abstract: This paper presents a Long Short-Term Memory network-based Fluid Experiment Data-Driven model (FED-LSTM) for predicting unsteady, nonlinear hydrodynamic forces on the underwater quadruped robot we constructed. Trained on experimental data from leg force and body drag tests conducted in both a recirculating water tank and a towing tank, FED-LSTM outperforms traditional empirical formulas (EF) commonly used for flow prediction over flat surfaces. The model demonstrates superior accuracy and adaptability in capturing complex fluid dynamics, particularly in straight-line and turning-gait optimizations via the NSGA-II algorithm. FED-LSTM reduces deflection errors during straight-line swimming and improves turn times without increasing the turning radius. Hardware experiments further validate the model's precision and stability over EF. This approach provides a robust framework for enhancing the swimming performance of legged robots, laying the groundwork for future advances in underwater robotic locomotion.
|
|
16:55-17:00, Paper WeET11.5 | |
Stage-Wise Reward Shaping for Acrobatic Robots: A Constrained Multi-Objective Reinforcement Learning Approach |
|
Kim, Dohyeong | Seoul National University |
Kwon, Hyeokjin | Seoul National University |
Kim, Junseok | Seoul National University |
Lee, Gunmin | Seoul National University |
Oh, Songhwai | Seoul National University |
Keywords: Reinforcement Learning, Legged Robots, Robot Safety
Abstract: As the complexity of tasks addressed through reinforcement learning (RL) increases, the definition of reward functions has also become highly complicated. We introduce an RL method aimed at simplifying the reward-shaping process through intuitive strategies. Initially, instead of a single reward function composed of various terms, we define multiple reward and cost functions within a constrained multi-objective RL (CMORL) framework. For tasks involving sequential complex movements, we segment the task into distinct stages and define multiple rewards and costs for each stage. Finally, we introduce a practical CMORL algorithm that maximizes objectives based on these rewards while satisfying constraints defined by the costs. The proposed method has been successfully demonstrated across a variety of acrobatic tasks in both simulation and real-world environments. Additionally, it has been shown to perform these tasks successfully in comparison with existing RL and constrained RL algorithms. Our code is available at https://github.com/rllab-snu/Stage-Wise-CMORL.
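To illustrate the stage-wise reward/cost structure described above (the stage names, stage detector, reward terms, and thresholds below are assumptions for a backflip-like task, not the released implementation), one might organize the shaping as follows; a CMORL algorithm would then maximize the active stage's reward subject to expected-cost constraints.

import numpy as np

STAGES = ["crouch", "takeoff", "flight", "landing"]

def stage_of(state):
    # Hypothetical stage detector based on contact and base height/velocity.
    if state["contact"] and state["base_height"] < 0.25:
        return "crouch" if state["base_vel_z"] <= 0 else "takeoff"
    return "flight" if not state["contact"] else "landing"

def rewards_and_costs(state, action):
    s = stage_of(state)
    rewards = {
        "crouch":  -abs(state["base_height"] - 0.20),      # settle into a crouch
        "takeoff":  state["base_vel_z"],                    # push upward
        "flight":   state["pitch_rate"],                    # encourage rotation mid-air
        "landing": -np.linalg.norm(state["base_vel"]),      # come to rest
    }
    costs = {                                               # constraints, not rewards
        "joint_limit":  float(np.any(np.abs(action) > 1.0)),
        "body_contact": float(state["body_contact"]),
    }
    return s, rewards[s], costs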
|
|
17:00-17:05, Paper WeET11.6 | |
Design and Implementation of a Swimming and Walking Quadruped for Seafloor Exploration |
|
Chase, Ashley | Florida State University |
Labiner, Benjamin | North Carolina State University |
Boylan, Jonathan | FAMU-FSU College of Engineering |
Ryals, Cameron | Florida State University |
Vranicar, Jack | Florida State University |
Dina, Michael | Florida State University |
Vasquez, Derek A. | Florida State University |
Seal, Dane | Florida State University |
Young, Charles | Florida State University |
St Laurent, Louis | University of Washington |
Ordonez, Camilo | Florida State University |
Clark, Jonathan | Florida State University |
Keywords: Legged Robots, Biologically-Inspired Robots, Marine Robotics
Abstract: The seafloor is a complex environment and it is challenging to conduct detailed mapping, soil composition sampling, and habitat characterization missions in this benthic region. As a step toward overcoming these challenges, we present a quadruped robot capable of walking on the seafloor and maneuvering via midfluid swimming. SELQIE, the Seafloor Environment Legged Quadruped Intelligent Explorer, is capable of walking underwater at speeds up to 0.2 m/s, swimming at over 0.16 m/s, and transitioning between modes. We also introduce a path planning algorithm that can account for both swimming and walking gaits to efficiently navigate around or over obstacles, and demonstrate the robot executing such a multi-modal trajectory.
|
|
17:05-17:10, Paper WeET11.7 | |
Beyond Robustness: Learning Unknown Dynamic Load Adaptation for Quadruped Locomotion on Rough Terrain |
|
Chang, Leixin | Zhejiang University |
Nai, Yuxuan | Zhejiang University |
Chen, Hua | Zhejiang University |
Yang, Liangjing | Zhejiang University |
Keywords: Reinforcement Learning, Legged Robots
Abstract: Unknown dynamic load carrying is an important practical application for quadruped robots. Such a problem is non-trivial, posing three major challenges in quadruped locomotion control. First, how to model or represent the dynamics of the load in a generic manner. Second, how to make the robot capture the dynamics without any external sensing. Third, how to enable the robot to interact with the load, handling the mutual effect and stabilizing the load. In this work, we propose a general load modeling approach called load characteristics modeling to capture the dynamics of the load. We integrate this proposed modeling technique and leverage recent advances in Reinforcement Learning (RL) based locomotion control to enable the robot to infer the dynamics of load movement and interact with the load indirectly to stabilize it, and we realize sim-to-real deployment to verify its effectiveness in real scenarios. We conduct extensive comparative simulation experiments to validate the effectiveness and superiority of our proposed method. Results show that our method outperforms other methods in sudden load resistance, load stabilization, and locomotion with heavy loads on rough terrain.
|
|
17:10-17:15, Paper WeET11.8 | |
PIE: Parkour with Implicit-Explicit Learning Framework for Legged Robots |
|
Luo, Shixin | Zhejiang University |
Li, Songbo | Zhejiang University |
Yu, Ruiqi | Zhejiang University |
Wang, Zhicheng | Zhejiang University |
Wu, Jun | Zhejiang University |
Zhu, Qiuguo | Zhejiang University |
Keywords: Legged Robots, Reinforcement Learning, Deep Learning for Visual Perception
Abstract: Parkour presents a highly challenging task for legged robots, requiring them to traverse various terrains with agile and smooth locomotion. This necessitates comprehensive understanding of both the robot's own state and the surrounding terrain, despite the inherent unreliability of robot perception and actuation. Current state-of-the-art methods either rely on complex pre-trained high-level terrain reconstruction modules or limit the maximum potential of robot parkour to avoid failure due to inaccurate perception. In this paper, we propose a one-stage end-to-end learning-based parkour framework: Parkour with Implicit-Explicit learning framework for legged robots (PIE) that leverages dual-level implicit-explicit estimation. With this mechanism, even a low-cost quadruped robot equipped with an unreliable egocentric depth camera can achieve exceptional performance on challenging parkour terrains using a relatively simple training process and reward function. While the training process is conducted entirely in simulation, our real-world validation demonstrates successful zero-shot deployment of our framework, showcasing superior parkour performance on harsh terrains.
|
|
WeET12 |
315 |
Visual Servoing and Tracking |
Regular Session |
Chair: Chaumette, Francois | Inria Center at University of Rennes |
Co-Chair: Cheng, Sheng | University of Illinois Urbana-Champaign |
|
16:35-16:40, Paper WeET12.1 | |
Determination of All Stable and Unstable Equilibria for Image-Point-Based Visual Servoing |
|
Colotti, Alessandro | Centre Inria De l'Université De Rennes |
García Fontán, Jorge | Sorbonne Université |
Goldsztejn, Alexandre | CNRS IRCCyN |
Briot, Sébastien | LS2N |
Chaumette, Francois | Inria Center at University of Rennes |
Kermorgant, Olivier | École Centrale Nantes, Laboratoire Des Sciences Du Numérique De |
Safey El Din, Mohab | Sorbonne Univ |
Keywords: Visual Servoing, Formal Methods in Robotics and Automation, Sensor-based Control, Stability Analysis
Abstract: Local minima are a well-known drawback of image-based visual servoing systems. Up to now, there have been no formal guarantees on their number, or even their existence, depending on the considered configuration. In this work, a formal approach is presented for the exhaustive computation of all minima and unstable equilibria for a class of six well-known image-based visual servoing controllers. This approach relies on a new polynomial formulation of the equilibrium condition that avoids using the camera pose. By using modern computational algebraic geometry methods and an ad hoc symmetry-breaking strategy, the formal resolution of this new equilibrium condition is rendered computationally feasible. The proposed methodology is applied to compute the equilibria of several classical visual servoing tasks, with planar and non-planar configurations of four and five points. The effects of local minima and saddle points on the dynamics of the system are finally illustrated through intensive simulation results, as well as the effects of image noise and uncertainties on depths.
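For context, the class of controllers whose equilibria are analysed follows the standard image-point IBVS law (textbook formulation in the style of the Chaumette-Hutchinson tutorials; this is not the paper's algebraic-geometry machinery): the camera twist is v = -lambda * pinv(L) * (s - s*), and spurious equilibria are configurations where the stacked error is non-zero yet annihilated by pinv(L).

import numpy as np

def interaction_matrix(x, y, Z):
    """2x6 interaction matrix of a normalized image point (x, y) at depth Z."""
    return np.array([
        [-1.0 / Z, 0.0,  x / Z, x * y,       -(1 + x * x),  y],
        [0.0, -1.0 / Z,  y / Z, 1 + y * y,   -x * y,       -x],
    ])

def ibvs_velocity(points, desired, depths, lam=0.5):
    """Camera twist (vx, vy, vz, wx, wy, wz) for a set of image-point features."""
    L = np.vstack([interaction_matrix(x, y, Z) for (x, y), Z in zip(points, depths)])
    e = (np.asarray(points) - np.asarray(desired)).reshape(-1)
    return -lam * np.linalg.pinv(L) @ e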
|
|
16:40-16:45, Paper WeET12.2 | |
DiffTune: Auto-Tuning through Auto-Differentiation |
|
Cheng, Sheng | University of Illinois Urbana-Champaign |
Kim, Minkyung | University of Illinois Urbana-Champaign |
Song, Lin | UIUC |
Yang, Chengyu | University of Illinois Urbana-Champaign |
Jin, Yiquan | Zhejiang University |
Wang, Shenlong | University of Illinois at Urbana-Champaign |
Hovakimyan, Naira | University of Illinois at Urbana-Champaign |
Keywords: Control Architectures and Programming, Learning and Adaptive Systems, Aerial Systems: Mechanics and Control, auto-tuning
Abstract: The performance of robots in high-level tasks depends on the quality of their lower-level controller, which requires fine-tuning. However, the intrinsically nonlinear dynamics and controllers make tuning a challenging task when it is done by hand. We present DiffTune, a novel, gradient-based automatic tuning framework. We formulate the controller tuning as a parameter optimization problem and update the controller parameters through gradient-based optimization. The gradient is obtained using sensitivity propagation, which is the only method for gradient computation when tuning for a physical system instead of its simulated counterpart. Furthermore, we use L1 adaptive control to compensate for the uncertainties so that the gradient is not biased by the unmodelled uncertainties. We validate DiffTune in simulation and compare it with state-of-the-art auto-tuning methods, where DiffTune achieves the best performance in a more efficient manner. Experiments on auto-tuning a nonlinear controller for a quadrotor show promising results, where DiffTune achieves a 3.5x tracking error reduction on an aggressive trajectory in only 10 trials over a 12-dimensional controller parameter space.
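A minimal sketch of tuning via sensitivity propagation (illustrative of the mechanism only: a 1D double integrator with a PD controller and plain projected gradient descent, not DiffTune itself and without its L1-adaptive de-biasing): the state sensitivity dx/dtheta is propagated alongside the rollout, so the tracking-loss gradient with respect to the gains is available without back-propagating through a simulator.

import numpy as np

dt, T = 0.02, 200
ref = np.ones(T)                           # step reference for the position

def rollout_with_sensitivity(theta):
    kp, kd = theta
    A = np.array([[1.0, dt], [0.0, 1.0]])  # double-integrator dynamics
    B = np.array([0.0, dt])
    x = np.zeros(2)                        # state: [position, velocity]
    dx_dth = np.zeros((2, 2))              # sensitivity of the state w.r.t. (kp, kd)
    loss, grad = 0.0, np.zeros(2)
    for k in range(T):
        e, edot = ref[k] - x[0], -x[1]
        u = kp * e + kd * edot
        du_dx = np.array([-kp, -kd])       # partials of u w.r.t. the state
        du_dth = np.array([e, edot])       # partials of u w.r.t. the gains
        loss += (x[0] - ref[k]) ** 2       # accumulate tracking loss and its gradient
        grad += 2.0 * (x[0] - ref[k]) * dx_dth[0]
        # propagate the sensitivity, then the state (chain rule through the rollout)
        dx_dth = A @ dx_dth + np.outer(B, du_dx @ dx_dth + du_dth)
        x = A @ x + B * u
    return loss, grad

theta = np.array([1.0, 0.1])               # initial (kp, kd)
for _ in range(100):
    loss, grad = rollout_with_sensitivity(theta)
    # normalized descent step, projected to keep the gains positive
    theta = np.maximum(theta - 0.05 * grad / (np.linalg.norm(grad) + 1e-9), 0.01)
print("tuned gains (kp, kd):", theta, "final loss:", loss)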
|
|
16:45-16:50, Paper WeET12.3 | |
Output Feedback with Feedforward Robust Control for Motion Systems Driven by Nonlinear Position-Dependent Actuators (I) |
|
Al Saaideh, Mohammad | Memorial University of Newfoundland |
Boker, Almuatazbellah | Virginia Tech |
Al Janaideh, Mohammad | University of Guelph |
Keywords: Actuation and Joint Mechanisms, Motion Control
Abstract: This paper introduces a control approach for a motion system driven by a class of actuators with multiple nonlinearities. The proposed approach presents a combination of a feedforward controller and an output feedback controller to enhance the tracking performance of the motion system. The feedforward controller is mainly proposed to address the actuator dynamics and provide a linearization of the actuator without requiring measurements from the actuator. Subsequently, the output feedback controller is designed using the measured position to achieve a tracking objective for a desired reference signal, considering the unknown nonlinearities in the system and the error due to the open-loop compensation using feedforward control. The efficacy of the proposed control approach is validated through three applications: reluctance actuator, electrostatic microactuator, and magnetic levitation system. Both simulation and experimental results demonstrate the effectiveness of the proposed control approach in achieving the desired reference signal with minimal tracking error, considering that the actuator and system nonlinearities are unknown.
|
|
16:50-16:55, Paper WeET12.4 | |
QP-Based Visual Servoing under Motion Blur-Free Constraint |
|
Robic, Maxime | University of Picardy Jules Verne |
Fraisse, Renaud | Airbus Defence & Space |
Marchand, Eric | Univ Rennes, Inria, CNRS, IRISA |
Chaumette, Francois | Inria Center at University of Rennes |
Keywords: Visual Servoing, Space Robotics and Automation, Visual Tracking
Abstract: This work proposes a QP-based visual servoing scheme for limiting motion blur during the achievement of a visual task. Unlike traditional image restoration approaches, we want to avoid any deconvolution step by keeping the image sequence acquired by the camera as sharp as possible. To do so, we select the norm of the image gradient as sharpness metric, from which we design a velocity constraint that is injected in a QP controller. Our system is evaluated for an Earth observation satellite. Simulation and experimental results show the effectiveness of our approach.
|
|
16:55-17:00, Paper WeET12.5 | |
FACET: Fast and Accurate Event-Based Eye Tracking Using Ellipse Modeling for Extended Reality |
|
Ding, Junyuan | Beihang University |
Wang, Ziteng | DVSense (Beijing) Technology Co., Ltd., China |
Gao, Chang | Delft University of Technology |
Liu, Min | DVSense |
Chen, Qinyu | Leiden University |
Keywords: Deep Learning for Visual Perception, Gesture, Posture and Facial Expressions, Sensor-based Control
Abstract: Eye tracking is a key technology for gaze-based interactions in Extended Reality (XR), but traditional frame-based systems struggle to meet XR's demands for high accuracy, low latency, and power efficiency. Event cameras offer a promising alternative due to their high temporal resolution and low power consumption. In this paper, we present FACET (Fast and Accurate Event-based Eye Tracking), an end-to-end neural network that directly outputs pupil ellipse parameters from event data, optimized for real-time XR applications. The ellipse output can be directly used in subsequent ellipse-based pupil trackers. We enhance the EV-Eye dataset by expanding annotated data and converting original mask labels to ellipse-based annotations to train the model. In addition, a novel trigonometric loss is adopted to address angle discontinuities, and a fast causal event volume event representation method is put forward. On the enhanced EV-Eye test set, FACET achieves an average pupil center error of 0.20 pixels and an inference time of 0.53 ms, reducing pixel error and inference time by 1.6x and 1.8x compared to the prior art, EV-Eye, with 4.4x and 11.7x fewer parameters and arithmetic operations. The code is available at https://github.com/DeanJY/FACET.
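A common way to obtain a smooth, wrap-around-safe angle loss (shown here for context; the exact trigonometric loss used in FACET may differ) is to penalise 1 - cos of the angular difference, so that predictions near +pi and -pi are treated as nearly identical.

import torch

def angular_loss(pred_angle, gt_angle):
    """Both tensors in radians; smooth, periodic, minimal when pred == gt (mod 2*pi)."""
    return (1.0 - torch.cos(pred_angle - gt_angle)).mean()

pred = torch.tensor([3.13, -3.13])       # close to +pi and -pi
gt   = torch.tensor([-3.13, 3.13])
print(angular_loss(pred, gt))            # small, unlike a plain L2 on the raw angles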
|
|
17:00-17:05, Paper WeET12.6 | |
EMoE-Tracker: Environmental MoE-Based Transformer for Robust Event-Guided Object Tracking |
|
Chen, Yucheng | Hong Kong University of Science and Technology (GZ) |
Wang, Lin | Nanyang Technological University (NTU) |
Keywords: Visual Tracking, Sensor Fusion, Deep Learning for Visual Perception
Abstract: The unique complementarity of frame-based and event cameras for high frame rate object tracking has recently inspired some research attempts to develop multi-modal fusion approaches. However, these methods directly fuse both modalities and thus ignore the environmental attributes, e.g., motion blur, illumination variance, occlusion, scale variation, etc. Meanwhile, no interaction between search and template features makes distinguishing target objects and backgrounds difficult. As a result, performance degradation is induced, especially in challenging conditions. This paper proposes a novel and effective Transformer-based event-guided tracking framework, called eMoE-Tracker, which achieves new SOTA performance under various conditions. Our key idea is to disentangle the environment into several learnable attributes to dynamically learn the attribute-specific features for better interaction and discriminability between the target information and background. To achieve the goal, we first propose an environmental Mixture-of-Experts (eMoE) module that is built upon the environmental Attributes Disentanglement to learn attribute-specific features and environmental Attributes Gating to assemble the attribute-specific features by the learnable attribute scores dynamically. The eMoE module is a subtle router that fine-tunes the transformer backbone more efficiently. We then introduce a contrastive relation modeling (CRM) module to improve interaction and discriminability between the target information and background. Extensive experiments on diverse event-based benchmark datasets showcase the superior performance of our eMoE-Tracker compared to prior art.
|
|
WeET13 |
316 |
Manipulating Challenging Objects |
Regular Session |
Chair: Khan, Shiraz | University of Delaware |
|
16:35-16:40, Paper WeET13.1 | |
Learning Keypoints for Robotic Cloth Manipulation Using Synthetic Data |
|
Lips, Thomas | Ghent University |
De Gusseme, Victor-Louis | Ghent University |
Wyffels, Francis | Ghent University |
Keywords: Deep Learning for Visual Perception, Data Sets for Robotic Vision, Simulation and Animation
Abstract: Assistive robots should be able to wash, fold or iron clothes. However, due to the variety, deformability and self-occlusions of clothes, creating robot systems for cloth manipulation is challenging. Synthetic data is a promising direction to improve generalization, but the sim-to-real gap limits its effectiveness. To advance the use of synthetic data for cloth manipulation tasks such as robotic folding, we present a synthetic data pipeline to train keypoint detectors for almost-flattened cloth items. To evaluate its performance, we have also collected a real-world dataset. We train detectors for T-shirts, towels, and shorts and obtain an average precision of 64% and an average keypoint distance of 18 pixels. Fine-tuning on real-world data improves performance to 74% mAP and an average distance of only 9 pixels. Furthermore, we describe failure modes of the keypoint detectors and compare different approaches to obtain cloth meshes and materials. We also quantify the remaining sim-to-real gap and argue that further improvements to the fidelity of cloth assets will be required to further reduce this gap. The code, dataset and trained models are available online.
|
|
16:40-16:45, Paper WeET13.2 | |
RaggeDi: Diffusion-Based State Estimation of Disordered Rags, Sheets, Towels and Blankets |
|
Ye, Jikai | National University of Singapore |
Li, Wanze | National University of Singapore |
Khan, Shiraz | University of Delaware |
Chirikjian, Gregory | University of Delaware |
Keywords: Deep Learning for Visual Perception, RGB-D Perception, Visual Tracking
Abstract: Cloth state estimation is an important problem in robotics. It is essential for the robot to know the accurate state to manipulate cloth and execute tasks such as robotic dressing, stitching, and covering/uncovering human beings. However, estimating cloth state accurately remains challenging due to its high flexibility and self-occlusion. This paper proposes a diffusion model-based pipeline that formulates the cloth state estimation as an image generation problem by representing the cloth state as an RGB image that describes the point-wise translation (translation map) between a pre-defined flattened mesh and the deformed mesh in a canonical space. Then we train a conditional diffusion-based image generation model to predict the translation map based on an observation. Experiments are conducted in both simulation and the real world to validate the performance of our method. Results indicate that our method outperforms two recent methods in both accuracy and speed.
|
|
16:45-16:50, Paper WeET13.3 | |
Excavating in the Wild: The GOOSE-Ex Dataset for Semantic Segmentation |
|
Hagmanns, Raphael | Karlsruhe Institute of Technology |
Mortimer, Peter | Universität Der Bundeswehr München |
Granero, Miguel | Fraunhofer IOSB |
Luettel, Thorsten | Universität Der Bundeswehr München |
Petereit, Janko | Fraunhofer IOSB |
Keywords: Data Sets for Robotic Vision, Field Robots, Deep Learning for Visual Perception
Abstract: The successful deployment of deep learning-based techniques for autonomous systems is highly dependent on the data availability for the respective system in its deployment environment. Especially for unstructured outdoor environments, very few datasets exist for even fewer robotic platforms and scenarios. In an earlier work, we presented the German Outdoor and Offroad Dataset (GOOSE) framework along with 10000 multimodal frames from an offroad vehicle to enhance the perception capabilities in unstructured environments. In this work, we address the generalizability of the GOOSE framework. To accomplish this, we open-source the GOOSE-Ex dataset, which contains additional 5000 labeled multimodal frames from various completely different environments, recorded on a robotic excavator and a quadruped platform. We perform a comprehensive analysis of the semantic segmentation performance on different platforms and sensor modalities in unseen environments. In addition, we demonstrate how the combined datasets can be utilized for different downstream applications or competitions such as offroad navigation, object manipulation or scene completion. The dataset, its platform documentation and pre-trained state-of-the-art models for offroad perception will be made available on https://goose-dataset.de/.
|
|
16:50-16:55, Paper WeET13.4 | |
Robotic Framework for Iterative and Adaptive Profile Grading of Sand |
|
Hanut, Louis | KU Leuven |
Du, Yurui | KU Leuven |
Vande Moere, Andrew | KU Leuven |
Detry, Renaud | KU Leuven |
Bruyninckx, Herman | KU Leuven |
Keywords: Robotics and Automation in Construction, Robust/Adaptive Control
Abstract: This paper studies sand profile grading, a manipulation task to obtain a desired geometric curve in sand. Manipulating sand is challenging because like other amorphous materials, its properties are difficult to estimate and emergent effects such as collapses may occur which both influence the manipulation outcome. To tackle these challenges, humans iterate and adapt their manual actions to the observed material states. In this paper, we propose to replicate this adaptive and iterative approach on a robotic profile grading task. Our results demonstrate that (1) tool insertion adaptation reduces force limit violations during tool-material interactions, (2) grading angle adaptation ensures no undercutting or collisions while allowing for cutting or smoothing the sand profile, and (3) adapting progress speed to task evolution provides a balance between grading precision and execution time. This paper’s findings pave the way for generalized and transferable robotic systems manipulating various amorphous materials and automating a larger set of construction tasks and beyond.
|
|
16:55-17:00, Paper WeET13.5 | |
Autonomous Excavation of Challenging Terrain Using Oscillatory Primitives and Adaptive Impedance Control |
|
Franceschini, Noah | University of Illinois Urbana-Champaign |
Thangeda, Pranay | University of Illinois Urbana-Champaign |
Ornik, Melkior | University of Illinois Urbana-Champaign |
Hauser, Kris | University of Illinois at Urbana-Champaign |
Keywords: Robotics and Automation in Construction, Compliance and Impedance Control, Mining Robotics
Abstract: This paper addresses the challenge of autonomous excavation of challenging terrains, in particular those that are prone to jamming and inter-particle adhesion when tackled by a standard penetrate-drag-scoop motion pattern. Inspired by human excavation strategies, our approach incorporates oscillatory rotation elements -- including swivel, twist, and dive motions -- to break up compacted, tangled grains and reduce jamming. We also present an adaptive impedance control method, the Reactive Attractor Impedance Controller (RAIC), that adapts a motion trajectory to unexpected forces during loading in a manner that tracks a trajectory closely when loads are low, but avoids excessive loads when significant resistance is met. Our method is evaluated on four terrains using a robotic arm, demonstrating improved excavation performance across multiple metrics, including volume scooped, protective stop rate, and trajectory completion percentage.
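The following sketch conveys the reactive-attractor idea described in the abstract (the gains, thresholds, and retreat rule are assumptions, not the RAIC implementation): the impedance attractor advances along the nominal trajectory while measured loads are low, holds when loads are moderate, and backs off along the measured force when resistance becomes excessive.

import numpy as np

K, D = 800.0, 60.0                 # Cartesian stiffness and damping (assumed values)
F_SOFT, F_HARD = 40.0, 80.0        # load thresholds in newtons (assumed values)

def raic_step(x, v, x_attr, traj_next, f_meas, dt=0.002):
    """One control step: update the attractor, then return the commanded wrench."""
    f_norm = np.linalg.norm(f_meas)
    if f_norm < F_SOFT:                        # low load: follow the nominal trajectory
        x_attr = traj_next
    elif f_norm < F_HARD:                      # moderate load: hold the attractor in place
        pass
    else:                                      # excessive load: retreat along the force
        x_attr = x_attr + 0.001 * f_meas / (f_norm + 1e-9)
    wrench = K * (x_attr - x) + D * (0.0 - v)  # impedance command toward the attractor
    return wrench, x_attr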
|
|
17:00-17:05, Paper WeET13.6 | |
Diffusion-Based Self-Supervised Imitation Learning from Imperfect Visual Servoing Demonstrations for Robotic Glass Installation |
|
Xiao, Canran | Central South University |
Hou, Liwei | Central South University |
Fu, Ling | Zoomlion |
Chen, Wenrui | Hunan University |
Keywords: Robotics and Automation in Construction, AI-Based Methods, Imitation Learning
Abstract: Heavy-duty glass installation is a high-risk, precision-critical task in modern construction, traditionally performed through labor-intensive and error-prone manual methods. This paper presents a novel robotic framework that leverages diffusion-based self-supervised imitation learning from imperfect visual servoing demonstrations to achieve safe and precise glass installation. Specifically, our approach employs noisy and suboptimal demonstration data obtained via visual servoing to train a Denoising Diffusion Probabilistic Model (DDPM). This model iteratively refines installation trajectories, transforming them into smooth, precise, and collision-free movements. Extensive experiments demonstrate that our method significantly surpasses conventional visual servoing and standard imitation learning baselines in terms of success rate, precision, and installation efficiency, while markedly improving operational safety. Our results establish a new benchmark for automating complex, high-risk tasks in construction robotics.
|
|
17:05-17:10, Paper WeET13.7 | |
A Global-Local Graph Attention Network for Deformable Linear Objects Dynamic Interaction with Environment |
|
Chu, Jian | Hefei University of Technology |
Zhang, Wenkang | Anhui Agricultural University |
Ouyang, Bo | Hefei University of Technology |
Tian, Kunmiao | Hefei University of Technology |
Zhang, Shuai | Hefei University of Technology |
Zhai, Kai | Hefei University of Technology |
Keywords: Dynamics, Collision Avoidance, Deep Learning Methods
Abstract: Accurately modeling the interactions between deformable linear objects (DLOs) and their environments is crucial for active deformation control by robot manipulators. Graph Neural Networks (GNNs) have shown immense potential in particle-based simulation of DLOs. However, most existing studies propagate particle information in sequence, ignoring that particle motions, including that of the distal particle, correlate strongly with each other and the interaction state. In this paper, a global and local attention dynamic simulation model named GladSim is designed based on GNNs and the attention mechanism to aggregate information among particles and focus on the interaction particles for DLO interaction with the environment. Specifically, a global virtual node is proposed to deliver particle information and shorten the propagation path for the first time, which connects all the particles and aggregates global information. When the DLOs and the obstacle boundary particles are close, an edge is established between them to capture the interaction state. Moreover, we group all the particles by k-hop neighbors and design a HopSA module that combines hop attention and self-attention to discover the correlations among adjacent particles. Experimental results on simulation and real-world data show that the proposed GladSim network significantly outperforms baseline models in predictive accuracy, especially in long-term prediction.
|
|
WeET14 |
402 |
Social Navigation 2 |
Regular Session |
Chair: Kosecka, Jana | George Mason University |
|
16:35-16:40, Paper WeET14.1 | |
Generating Causal Explanations of Vehicular Agent Behavioural Interactions with Learnt Reward Profiles |
|
Howard, Rhys Peter Matthew | University of Oxford |
Hawes, Nick | University of Oxford |
Kunze, Lars | University of Oxford |
Keywords: Intelligent Transportation Systems, AI-Enabled Robotics, Agent-Based Systems
Abstract: Transparency and explainability are important features that responsible autonomous vehicles should possess, particularly when interacting with humans, and causal reasoning offers a strong basis to provide these qualities. However, even if one assumes agents act to maximise some concept of reward, it is difficult to make accurate causal inferences of agent planning without capturing what is of importance to the agent. Thus our work aims to learn a weighting of reward metrics for agents such that explanations for agent interactions can be causally inferred. We validate our approach quantitatively and qualitatively across three real-world driving datasets, demonstrating a functional improvement over previous methods and competitive performance across evaluation metrics.
|
|
16:40-16:45, Paper WeET14.2 | |
Fast Online Learning of CLiFF-Maps in Changing Environments |
|
Zhu, Yufei | Örebro University |
Rudenko, Andrey | Robert Bosch GmbH |
Palmieri, Luigi | Robert Bosch GmbH |
Heuer, Lukas | Örebro University, Robert Bosch GmbH |
Lilienthal, Achim J. | Orebro University |
Magnusson, Martin | Örebro University |
Keywords: Human Detection and Tracking
Abstract: Maps of dynamics are effective representations of motion patterns learned from prior observations, with recent research demonstrating their ability to enhance various downstream tasks such as human-aware robot navigation, long-term human motion prediction, and robot localization. Current advancements have primarily concentrated on methods for learning maps of human flow in environments where the flow is static, i.e., not assumed to change over time. In this paper, we propose an online update method for the CLiFF-map (an advanced map of dynamics type that models motion patterns as velocity and orientation mixtures) to actively detect and adapt to human flow changes. As new observations are collected, our goal is to update a CLiFF-map to effectively and accurately integrate them, while retaining relevant historic motion patterns. The proposed online update method maintains a probabilistic representation in each observed location, updating parameters by continuously tracking sufficient statistics. In experiments using both synthetic and real-world datasets, we show that our method is able to maintain accurate representations of human motion dynamics, contributing to high performance in downstream flow-compliant planning tasks, while being orders of magnitude faster than comparable baselines.
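To illustrate what "continuously tracking sufficient statistics" can look like per map cell, here is a deliberately simplified sketch: a single Gaussian over 2D velocity maintained with a Welford-style online update, whereas CLiFF-maps proper model semi-wrapped mixtures over speed and orientation.

import numpy as np

class CellStats:
    def __init__(self):
        self.n = 0
        self.mean = np.zeros(2)
        self.M2 = np.zeros((2, 2))      # running sum of outer products of deviations

    def update(self, v):                # Welford-style online update with one new velocity
        self.n += 1
        delta = v - self.mean
        self.mean += delta / self.n
        self.M2 += np.outer(delta, v - self.mean)

    def covariance(self):
        return self.M2 / max(self.n - 1, 1)

cell = CellStats()
for v in np.random.default_rng(0).normal([1.0, 0.2], 0.1, size=(500, 2)):
    cell.update(v)
print(cell.mean, cell.covariance())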
|
|
16:45-16:50, Paper WeET14.3 | |
A Hybrid Approach to Indoor Social Navigation: Integrating Reactive Local Planning and Proactive Global Planning |
|
Debnath, Arnab | George Mason University |
Stein, Gregory | George Mason University |
Kosecka, Jana | George Mason University |
Keywords: Human-Aware Motion Planning, Collision Avoidance
Abstract: We consider the problem of indoor building-scale social navigation, where the robot must reach a point goal as quickly as possible without colliding with humans who are freely moving around. Factors such as varying crowd densities, unpredictable human behavior, and the constraints of indoor spaces add significant complexity to the navigation task, necessitating a more advanced approach. We propose a modular navigation framework that leverages the strengths of both classical methods and deep reinforcement learning (DRL). Our approach employs a global planner to generate waypoints, assigning soft costs around anticipated pedestrian locations, encouraging caution around potential future positions of humans. Simultaneously, the local planner, powered by DRL, follows these waypoints while avoiding collisions. The combination of these planners enables the agent to perform complex maneuvers and effectively navigate crowded and constrained environments while improving reliability. Many existing studies on social navigation are conducted in simplistic or open environments, limiting the ability of trained models to perform well in complex, real-world settings. To advance research in this area, we introduce a new 2D benchmark designed to facilitate development and testing of social navigation strategies in indoor environments. We benchmark our method against traditional and RL-based navigation strategies, demonstrating that our approach outperforms both.
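A small sketch of the "soft cost around anticipated pedestrian locations" idea (the grid resolution, Gaussian width, and peak cost below are assumed values, not the paper's parameters): Gaussian penalties centred on the predicted human positions are added to the global cost map before waypoints are planned on it.

import numpy as np

def add_pedestrian_costs(costmap, predicted_positions, resolution=0.1,
                         sigma=0.6, peak=50.0):
    """costmap: HxW float array; predicted_positions: list of (x, y) in metres."""
    H, W = costmap.shape
    ys, xs = np.mgrid[0:H, 0:W]
    for (px, py) in predicted_positions:
        cj, ci = px / resolution, py / resolution      # predicted position in cell units
        d2 = (xs - cj) ** 2 + (ys - ci) ** 2
        costmap += peak * np.exp(-0.5 * d2 * resolution ** 2 / sigma ** 2)
    return costmap

grid = np.zeros((100, 100))
grid = add_pedestrian_costs(grid, [(4.0, 5.0), (6.5, 2.0)])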
|
|
16:50-16:55, Paper WeET14.4 | |
Overlapping Social Navigation Principles: A Framework for Social Robot Navigation |
|
Ikeda, Bryce | University of North Carolina Chapel Hill |
Higger, Mark | Colorado School of Mines |
Song, Christina Soyoung | Illinois State University |
Trafton, Greg | Naval Research Laboratory |
Keywords: Social HRI, Human-Aware Motion Planning
Abstract: As autonomous robots become integrated into society, they must socially navigate around humans. We propose that effective social robot navigation (SRN) relies on three key principles: social norms, perceived safety, and legibility. Our framework, Overlapping Social Navigation Principles, suggests that the strength of each principle is influenced by the presence of other principles. To test our framework, we implemented SRN behaviors on an autonomous robot in a passing scenario and conducted an online study where participants ranked videos of different SRN behavior combinations. Our findings show that incorporating all three principles enhances SRN, with social norms having the greatest impact.
|
|
16:55-17:00, Paper WeET14.5 | |
Relative Velocity-Based Reward Model for Socially-Aware Navigation with Deep Reinforcement Learning |
|
Maddumage, Vinu Vihan | University of Technology Sydney |
Kodagoda, Sarath | University of Technology, Sydney |
Carmichael, Marc | Centre for Autonomous Systems |
Gunatilake, Amal | University of Technology Sydney |
Thiyagarajan, Karthick | University of Technology Sydney |
Martin, Jodi | Guide Dogs NSW/ACT |
Keywords: Human-Aware Motion Planning, Social HRI, Collision Avoidance
Abstract: Mobile robots are increasingly deployed in shared environments where they must learn to navigate alongside humans. Deep Reinforcement Learning (DRL) techniques have shown promise in developing navigation policies that account for interactions within crowds, fostering socially acceptable movement. However, these techniques often depend heavily on collision avoidance rewards to ensure safe navigation. In this study, we introduce a novel reward component based on relative velocity for collision avoidance, which integrates both the robot’s and humans’ kinematics within personal distance constraints. We conducted a thorough evaluation comparing this new reward model against a conventional one in simulated environments using advanced DRL methods. Our findings indicate that the proposed reward model improves the robots’ ability to avoid collisions and navigate towards their goals while being socially acceptable.
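One plausible form of such a reward component (an illustration consistent with the abstract, not the paper's exact definition) penalises the closing speed between robot and human only inside the personal-distance band.

import numpy as np

def relative_velocity_penalty(p_robot, v_robot, p_human, v_human,
                              personal_dist=1.2, weight=0.5):
    """Negative reward term: active only when the robot is inside the personal distance
    and is moving toward the human (positive closing speed)."""
    d_vec = p_human - p_robot
    dist = np.linalg.norm(d_vec)
    if dist >= personal_dist or dist < 1e-6:
        return 0.0
    closing_speed = np.dot(v_robot - v_human, d_vec / dist)   # > 0 means approaching
    return -weight * max(closing_speed, 0.0) * (personal_dist - dist)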
|
|
17:00-17:05, Paper WeET14.6 | |
SICNav: Safe and Interactive Crowd Navigation Using Model Predictive Control and Bilevel Optimization |
|
Samavi, Sepehr | University of Toronto |
Han, James | University of Toronto |
Shkurti, Florian | University of Toronto |
Schoellig, Angela P. | TU Munich |
Keywords: Social Navigation, Collision Avoidance, Autonomous Vehicle Navigation, Optimization and Optimal Control
Abstract: Robots need to predict and react to human motions to navigate through a crowd without collisions. Many existing methods decouple prediction from planning, which does not account for the interaction between robot and human motions and can lead to the robot getting stuck. We propose SICNav, a Model Predictive Control (MPC) method that jointly solves for robot motion and predicted crowd motion in closed-loop. We model each human in the crowd to be following an Optimal Reciprocal Collision Avoidance (ORCA) scheme and embed that model as a constraint in the robot’s local planner, resulting in a bilevel nonlinear MPC optimization problem. We use a KKT reformulation to cast the bilevel problem as a single level and use a nonlinear solver to optimize. Our MPC method can influence pedestrian motion while explicitly satisfying safety constraints in a single-robot multi-human environment. We analyze the performance of SICNav in two simulation environments and indoor experiments with a real robot to demonstrate safe robot motion that can influence the surrounding humans. We also validate the trajectory forecasting performance of ORCA on a human trajectory dataset.
|
|
WeET15 |
403 |
Surgical Robotics: Systems |
Regular Session |
Chair: Arai, Fumihito | The University of Tokyo |
Co-Chair: Zefran, Milos | University of Illinois at Chicago |
|
16:35-16:40, Paper WeET15.1 | |
Autonomous Continuous Capsulorhexis Based on a Force-Vision-Guided Robot System |
|
Liang, Hongli | Sun Yat-Sen University |
Liu, Jiali | Zhongshan Ophthalmic Center, Sun Yat-Sen University |
Nasseri, M. Ali | Technische Universitaet Muenchen |
Lin, Haotian | Sun Yat-Sen University, Zhongshan Ophthalmic Center |
Huang, Kai | Sun Yat-Sen University |
Keywords: Medical Robots and Systems
Abstract: Capsulorhexis is challenging in cataract surgery, since the size, centering, and circularity of the capsule are important. Those indicators are closely related to the subsequent step of phacoemulsification and the postoperative position of the intraocular lens. It takes a resident 3-5 years of practice, and even then deficient capsulorhexis cannot be entirely avoided. This paper proposes a robotic system to automate Continuous Curvilinear Capsulorhexis (CCC) in cataract surgery. A typical ophthalmic microscope system and a triaxial force sensor are utilized to guide the robot system with a force-vision method. The constraint of a Remote Center of Motion (RCM) is designed to perform the surgery route. The experimental results on ex-vivo porcine eyes show that our autonomous method can achieve a satisfactory 6mm capsule. With an average centering deviation below 76% and circularity of 0.993, the consistency of the capsulorhexis is comparable to a surgeon-made one.
|
|
16:40-16:45, Paper WeET15.2 | |
Ultrasound-Guided Robotic Blood Drawing and in Vivo Studies on Submillimetre Vessels of Rats |
|
Jing, Shuaiqi | Chengdu Aixam Medical Technology Co., Ltd |
Yao, Tianliang | Tongji University |
Zhang, Ke | Chengdu Aixam Medical Technology Co. Ltd |
Wu, Di | University of Southern Denmark |
Wang, Qiulin | Chengdu Aixam Medical Technology Co., Ltd |
Chen, Zixi | Scuola Superiore Sant'Anna |
Chen, Ke | Chengdu Aixam Medical Technology Co., Ltd |
Qi, Peng | Tongji University |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Service Robotics
Abstract: Billions of vascular access procedures are performed annually worldwide, serving as a crucial first step in various clinical diagnostic and therapeutic procedures. For pediatric or elderly individuals, whose vessels are small in size (typically 2 to 3 mm in diameter for adults and <1 mm in children), vascular access can be highly challenging. This study presents an image-guided robotic system aimed at enhancing the accuracy of difficult vascular access procedures. The system integrates a 6-DoF (Degrees of Freedom) robotic arm with a 3-DoF end-effector, ensuring precise navigation and needle insertion. Multi-modal imaging and sensing technologies have been utilized to endow the medical robot with precision and safety, while ultrasound (US) imaging guidance is specifically evaluated in this study. To evaluate in vivo vascular access in submillimeter vessels, we conducted ultrasound-guided robotic blood drawing on the tail veins (with a diameter of 0.7 ± 0.2 mm) of 40 rats. The results demonstrate that the system achieved a first-attempt success rate of 95%. The high first-attempt success rate in intravenous vascular access, even with small blood vessels, demonstrates the system’s effectiveness in performing these procedures. This capability reduces the risk of failed attempts, minimizes patient discomfort, and enhances clinical efficiency.
|
|
16:45-16:50, Paper WeET15.3 | |
Sensory Glove-Based Surgical Robot User Interface |
|
Borgioli, Leonardo | University of Illinois Chicago |
Oh, Ki-Hwan | University of Illinois at Chicago |
Valle, Valentina | Surgical Innovation and Training Lab, Department of Surgery, Col |
Ducas, Alvaro | Surgical Innovation and Training Lab, Department of Surgery, Col |
Halloum, Mohammad | Surgical Innovation and Training Lab, Department of Surgery, Col |
Mendoza Medina, Diego Federico | Surgical Innovation and Training Lab, Department of Surgery, Col |
Lopez, Paula | Surgical Innovation and Training Lab, Department of Surgery, Col |
Sharifi, Arman | Surgical Innovation and Training Lab, Department of Surgery, Col |
Cassiani, Jessica | Surgical Innovation and Training Lab, Department of Surgery, Col |
Zefran, Milos | University of Illinois at Chicago |
Chen, Liaohai | Surgical Innovation and Training Lab, Department of Surgery, Col |
Giulianotti, Pier Cristoforo | Surgical Innovation and Training Lab, Department of Surgery, Col |
Keywords: Surgical Robotics: Laparoscopy, Medical Robots and Systems, Telerobotics and Teleoperation
Abstract: Robotic surgery has reached a high level of maturity and has become an integral part of standard surgical care. However, existing surgeon consoles are bulky and take up valuable space in the operating room, present challenges for surgical team coordination, and their proprietary nature makes it difficult to take advantage of recent technological advances, especially in virtual and augmented reality. One potential area for further improvement is the integration of modern sensory gloves into robotic platforms, allowing surgeons to control robotic arms intuitively with their hand movements. We propose one such system that combines an HTC Vive tracker, a Manus Meta Prime 3 XR sensory glove, and SCOPEYE wireless smart glasses. The system controls one arm of a da Vinci surgical robot. In addition to moving the arm, the surgeon can use fingers to control the end-effector of the surgical instrument. Hand gestures are used to implement clutching and similar functions. In particular, we introduce clutching of the instrument orientation, a functionality unavailable in the da Vinci system. The vibrotactile elements of the glove are used to provide feedback to the user when gesture commands are invoked. A qualitative and quantitative evaluation has been conducted comparing the current device to the dVRK console; the system shows that it has excellent tracking accuracy and allows surgeons to efficiently perform common surgical training tasks with minimal practice with the new interface.
|
|
16:50-16:55, Paper WeET15.4 | |
Self-Deformable Magnetic Miniature Robot for Traction Assistance in Endoscopic Submucosal Dissection |
|
Zhang, Bolan | The University of Tokyo |
Yamanaka, Toshiro | The University of Tokyo |
Shu, Tengo | The University of Tokyo |
Liu, Yuxuan | The University of Tokyo |
Arai, Fumihito | The University of Tokyo |
Keywords: Medical Robots and Systems, Soft Robot Applications
Abstract: Between 1999 and 2020, gastrointestinal cancers were responsible for over three million deaths, emphasizing the critical role of minimally invasive surgical techniques like Endoscopic Submucosal Dissection (ESD) in managing such life-threatening conditions. ESD, which dissects the connective tissue between the mucosal and muscular layers using an electrosurgical knife connected to an endoscope, requires a constant traction force to stabilize tissues and expose underlying anatomical structures. This paper introduces a miniature magnetic flexible robot, actuated by a permanent magnet on a robotic manipulator, designed to enhance ESD by providing traction forces consistently on lesions. The robot was fabricated by casting magnetic silicone composites, and its safe deployment through the endoscope instrument channel was successfully demonstrated, avoiding tissue contact. Experiments in a rubber intestine model validated the feasibility of providing constant traction and 2 DOF orientation control via the robot, allowing real-time fine-tuning of the force direction. This reduces the difficulty and improves the precision and safety of ESD. This research presents a practical method for achieving stable force output in medical miniature robots, particularly in gastrointestinal procedures.
|
|
16:55-17:00, Paper WeET15.5 | |
Variable-Stiffness Nasotracheal Intubation Robot with Passive Buffering: A Modular Platform in Mannequin Studies |
|
Hao, Ruoyi | The Chinese University of Hong Kong |
Lai, Jiewen | The Chinese University of Hong Kong |
Zhong, Wenqi | The Chinese University of Hong Kong |
Xie, Dihong | The Chinese University of Hong Kong |
Tian, Yu | The Chinese University of Hong Kong |
Zhang, Tao | Chinese University of Hong Kong |
Zhang, Yang | Hubei University of Technology |
Chan, Catherine Po Ling | The Chinese University of Hong Kong |
Chan, Jason Ying-Kuen | The Chinese University of Hong Kong |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: Medical Robots and Systems, Telerobotics and Teleoperation, Mechanism Design
Abstract: Intubation is a critical medical procedure for securing airway patency in patients, but the inconsistent skill levels among medical practitioners necessitate the advancement of better robotic solutions. While orotracheal intubation robots have been widely developed, nasotracheal intubation remains essential in specific clinical scenarios. However, nasotracheal intubation robots are still underdeveloped and lack buffer protection mechanisms to ensure safety. This study presents a novel variable-stiffness nasotracheal intubation robot (NIR) with passive buffering. The proposed NIR is a modular platform capable of performing the main steps of nasotracheal intubation, validated through mannequin studies via teleoperation. We proposed a variable-stiffness fiberoptic bronchoscope (FOB) control module for the FOB distal end control, and validated its dual functionality in experiments: low-stiffness mode provides passive buffering during nasal cavity navigation, with a frontal peak force of 2.8 N and a lateral peak force of 0.12 N; high-stiffness mode enhances load-bearing capacity for near-glottis navigation, with a frontal bearing force of 4.9 N and a lateral bearing force of 0.42 N. Additionally, a compact (74 × 64 × 53 mm, 150 g) FOB feeding module with passive failure protection was designed to limit the max frontal impact force to 2.3 N.
|
|
17:00-17:05, Paper WeET15.6 | |
SurgPose: A Dataset for Articulated Robotic Surgical Tool Pose Estimation and Tracking |
|
Wu, Zijian | The University of British Columbia |
Schmidt, Adam | Intuitive Surgical |
Moore, Randy | The University of British Columbia |
Zhou, Haoying | Worcester Polytechnic Institute |
Banks, Alexandre | University of New Brunswick |
Kazanzides, Peter | Johns Hopkins University |
Salcudean, Septimiu E. | University of British Columbia |
Keywords: Data Sets for Robotic Vision, Surgical Robotics: Laparoscopy, Computer Vision for Medical Robotics
Abstract: Accurate and efficient surgical robotic tool pose estimation is of fundamental significance to downstream applications such as augmented reality (AR) in surgical training and learning-based autonomous manipulation. While significant advancements have been made in pose estimation for humans and animals, it is still a challenge in surgical robotics due to the scarcity of published data. The relatively large absolute error of the da Vinci end effector kinematics and arduous calibration procedure make calibrated kinematics data collection expensive. Driven by this limitation, we collected a dataset, dubbed SurgPose, providing instance-aware semantic keypoints for visual surgical tool pose estimation and tracking. By marking keypoints using ultraviolet (UV) reactive paint, which is invisible under white light and fluorescent under UV light, we execute the same trajectory under different lighting conditions to collect raw videos and keypoint annotations, respectively. The SurgPose dataset consists of approximately 120K surgical instrument instances of 6 categories as shown in Fig. 1. Since the videos are collected in stereo pairs, the 2D pose can be lifted to 3D based on stereo-matching depth. In addition to releasing the dataset, we tested a few baseline approaches to surgical instrument tracking to demonstrate the utility of SurgPose. More details can be found at surgpose.github.io.
|
|
17:05-17:10, Paper WeET15.7 | |
On High Performance Control of Concentric Tube Continuum Robots through Parsimonious Calibration |
|
Boyer, Quentin | UBFC |
Voros, Sandrine | TIMC-IMAG Laboratory |
Roux, Pierre | FEMTO-ST Institute |
Marionnet, François | FEMTO-ST Institute |
Rabenorosoa, Kanty | Univ. Bourgogne Franche-Comté, CNRS |
Chikhaoui, M. Taha | CNRS - Univ. Grenoble Alpes |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Calibration and Identification
Abstract: Continuum robots deform continuously, compared to conventional robots composed of rigid links and joints, and require dedicated calibration methods. Indeed, calibration is an essential step to obtain high performance control, as it directly influences robot accuracy. In this paper, we investigate how model parameters influence both model accuracy and model-based closed-loop control accuracy of Concentric Tube Continuum Robots (CTCR). A fast, robust, and real-time implementation of the Cosserat rod model is first introduced. Then, a model-based Jacobian control scheme is presented. A parsimonious calibration procedure focused on control accuracy is finally proposed to achieve submillimetric tracking errors along a 3D trajectory at velocity reaching 5 mm/s in complex scenarios including actuation constraints, obstacle avoidance, and external forces. Results are demonstrated both in simulation and on an experimental setup of a 3-tube CTCR.
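To make the Jacobian-based control scheme above concrete, the following Python sketch (not the authors' implementation) shows a damped least-squares resolved-rate update toward a desired tip position; tip_position and jacobian are illustrative placeholders standing in for a Cosserat-rod-based kinematic model and its linearization.

    import numpy as np

    def tip_position(q):
        # Placeholder forward kinematics: maps actuation values to a 3D tip position.
        # A real controller would evaluate the calibrated Cosserat rod model here.
        return np.array([np.sin(q[0]), np.sin(q[1]), q[2]])

    def jacobian(q, eps=1e-6):
        # Finite-difference Jacobian of the tip position with respect to the actuation values.
        J = np.zeros((3, len(q)))
        f0 = tip_position(q)
        for i in range(len(q)):
            dq = np.zeros_like(q)
            dq[i] = eps
            J[:, i] = (tip_position(q + dq) - f0) / eps
        return J

    def control_step(q, x_des, gain=2.0, damping=1e-3, dt=0.01):
        # One damped least-squares (resolved-rate) update toward the desired tip position.
        e = x_des - tip_position(q)                   # task-space tracking error
        J = jacobian(q)
        JJt = J @ J.T + damping * np.eye(3)           # damping guards against singular configurations
        q_dot = J.T @ np.linalg.solve(JJt, gain * e)  # damped pseudo-inverse velocity command
        return q + dt * q_dot

    q = np.array([0.1, 0.2, 0.05])
    for _ in range(500):
        q = control_step(q, x_des=np.array([0.3, 0.4, 0.1]))
    print(np.round(tip_position(q), 3))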
|
|
WeET16 |
404 |
Deformable Objects |
Regular Session |
Chair: Li, Yunzhu | Columbia University |
Co-Chair: Iordachita, Ioan Iulian | Johns Hopkins University |
|
16:35-16:40, Paper WeET16.1 | |
Deformation Control of a 3D Soft Object Using RGB-D Visual Servoing and FEM-Based Dynamic Model |
|
Ouafo Fonkoua, Mandela | Inria Centre at Rennes University |
Chaumette, Francois | Inria Center at University of Rennes |
Krupa, Alexandre | Centre Inria De l'Université De Rennes |
Keywords: Visual Servoing, Dexterous Manipulation
Abstract: In this letter, we present a visual control framework for accurately positioning feature points belonging to the surface of a 3D deformable object to desired 3D positions, by acting on a set of manipulated points using a robotic manipulator. Notably, our framework considers the dynamic behavior of the object deformation, that is, we do not assume that the object is in static equilibrium during the manipulation. By relying on a coarse dynamic Finite Element Model (FEM), we have successfully formulated the analytical relationship relating the motion of the feature points to the six degrees of freedom (6-DOF) motion of a robot gripper. From this modeling step, a novel closed-loop deformation controller is designed. To be robust against model approximations, the whole shape of the object is tracked in real time using an RGB-D camera, thus allowing any drift between the object and its model to be corrected on the fly. Our model-based and vision-based controller has been validated in real experiments. The results highlight the effectiveness of the proposed methodology.
|
|
16:40-16:45, Paper WeET16.2 | |
Real-Time Deformation-Aware Control for Autonomous Robotic Subretinal Injection Based on OCT Guidance |
|
Arikan, Demir | Technical University Munich |
Zhang, Peiyao | Johns Hopkins University |
Sommersperger, Michael | Technical University of Munich |
Dehghani, Shervin | TUM |
Esfandiari, Mojtaba | Johns Hopkins University |
Taylor, Russell H. | The Johns Hopkins University |
Nasseri, M. Ali | Technische Universitaet Muenchen |
Gehlbach, Peter | Johns Hopkins Medical Institute |
Navab, Nassir | TU Munich |
Iordachita, Ioan Iulian | Johns Hopkins University |
Keywords: Vision-Based Navigation, Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: Robotic platforms provide consistent and precise tool positioning that significantly enhances retinal microsurgery. Integrating such systems with intraoperative optical coherence tomography (iOCT) enables image-guided robotic interventions, allowing autonomous performance of advanced treatments, such as injecting therapeutic agents into the subretinal space. However, tissue deformations due to tool-tissue interactions constitute a significant challenge in autonomous iOCT-guided robotic subretinal injections. Such interactions impact correct needle positioning and procedure outcomes. This paper presents a novel method for autonomous subretinal injection under iOCT guidance that considers tissue deformations during the insertion procedure. The technique is achieved through real-time segmentation and 3D reconstruction of the surgical scene from densely sampled iOCT B-scans, which we refer to as B5-scans. Using B5-scans, we monitor the position of the instrument relative to a virtual target layer between the ILM and RPE. Our experiments on ex-vivo porcine eyes demonstrate dynamic adjustment of the insertion depth and overall improved accuracy in needle positioning compared to prior autonomous insertion approaches. Compared to a 35% success rate in subretinal bleb generation with previous approaches, our method reliably created subretinal blebs in 90% of our experiments. The source code and data used in this study are publicly available on GitHub.
|
|
16:45-16:50, Paper WeET16.3 | |
6-DoF Shape Servoing of Deformable Objects in Co-Rotated Space of Modal Graph |
|
Yang, Bohan | The Chinese University of Hong Kong |
Huang, Tianyu | The Chinese University of Hong Kong |
Zhong, Fangxun | The Chinese University of Hong Kong, Shenzhen |
Liu, Yunhui | Chinese University of Hong Kong |
Keywords: Visual Servoing, Dexterous Manipulation, Robust/Adaptive Control
Abstract: Shape control of deformable objects under both rotational and translational deformations is important for versatile robotic applications. However, deformation control with full 6-degree-of-freedom (DoF) manipulation is an open problem, since modeling and describing rotational deformations lead to significant challenges. To tackle the problem, this paper proposes a novel method by introducing a co-rotated space for the modal graph representation of objects with unknown physical and geometric models. In this space, we design new deformation features that can encode local rotations while preserving a compact and low-frequency shape representation. Moreover, these features can be mapped analytically to the robot manipulation, enabling the design of adaptive control laws with guaranteed stability for unmodeled objects. Experiments on complex volumetric objects demonstrate the effectiveness and advantages of our method with raw, noisy, and unregistered point clouds. The results highlight the importance of integrating the co-rotated features to address rotational deformations.
|
|
16:50-16:55, Paper WeET16.4 | |
Deformable Gaussian Splatting for Efficient and High-Fidelity Reconstruction of Surgical Scenes |
|
Shan, Jiwei | The Chinese University of Hong Kong |
Cai, Zeyu | Shanghai Jiao Tong University |
Hsieh, Cheng-Tai | Shanghai Jiao Tong University |
Han, Lijun | Shanghai Jiao Tong University |
Cheng, Shing Shin | The Chinese University of Hong Kong |
Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: Computer Vision for Medical Robotics, Surgical Robotics: Laparoscopy
Abstract: Efficient and high-fidelity reconstruction of deformable surgical scenes is a critical yet challenging task. Building on recent advancements in 3D Gaussian splatting, current methods have seen significant improvements in both reconstruction quality and rendering speed. However, two major limitations remain: (1) difficulty in handling irreversible dynamic changes, such as tissue shearing, which are common in surgical scenes; and (2) the lack of hierarchical modeling for surgical scene deformation, which reduces rendering speed. To address these challenges, we introduce EH-SurGS, an efficient and high-fidelity reconstruction algorithm for deformable surgical scenes. We propose a deformation modeling approach that incorporates the life cycle of 3D Gaussians, effectively capturing both regular and irreversible deformations, thus enhancing reconstruction quality. Additionally, we present an adaptive motion hierarchy strategy that distinguishes between static and deformable regions within the surgical scene. This strategy reduces the number of 3D Gaussians passing through the deformation field, thereby improving rendering speed. Extensive experiments on public datasets captured with static endoscopes demonstrate that our method surpasses existing state-of-the-art approaches in both reconstruction quality and rendering speed. Ablation studies further validate the effectiveness and necessity of our proposed components. We will open-source our code upon acceptance of the paper.
|
|
16:55-17:00, Paper WeET16.5 | |
One-Shot Video Imitation Via Parameterized Symbolic Abstraction Graphs |
|
Wang, Jianren | Carnegie Mellon University |
Liu, Kangni | Carnegie Mellon University |
Guo, Dingkun | Carnegie Mellon University |
Xian, Zhou | Carnegie Mellon University |
Atkeson, Christopher | CMU |
Keywords: Learning from Demonstration, Simulation and Animation
Abstract: Learning to manipulate dynamic and deformable objects from a single demonstration video holds great promise in terms of scalability. Previous approaches have predominantly focused on either replaying object relationships or actor trajectories. The former often struggles to generalize across diverse tasks, while the latter suffers from data inefficiency. Moreover, both methodologies encounter challenges in capturing invisible physical attributes, such as forces. In this paper, we propose to interpret video demonstrations through a series of Parameterized Symbolic Abstraction Graphs (PSAGs), where nodes represent objects and edges denote relationships between objects. We further ground geometric constraints through simulation to estimate non-geometric, visually imperceptible attributes. The augmented PSAGs are then applied in real robot experiments. Our approach has been validated across a range of tasks, such as Cutting Avocado, Cutting Vegetable, Pouring Liquid, Rolling Dough, and Slicing Pizza. We demonstrate successful generalization to novel objects with distinct visual and physical properties. For visualizations of the learned policies please check: https://jianrenw.com/PSAG/
|
|
17:00-17:05, Paper WeET16.6 | |
KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation |
|
Liu, Zixian | Tsinghua University |
Zhang, Mingtong | UIUC |
Li, Yunzhu | Columbia University |
Keywords: Machine Learning for Robot Control, Model Learning for Control, Deep Learning in Grasping and Manipulation
Abstract: With the rapid advancement of large language models (LLMs) and vision-language models (VLMs), significant progress has been made in developing open-vocabulary robotic manipulation systems. However, many existing approaches overlook the importance of object dynamics, limiting their applicability to more complex, dynamic tasks. In this work, we introduce KUDA, an open-vocabulary manipulation system that integrates dynamics learning and visual prompting through keypoints, leveraging both VLMs and learning-based neural dynamics models. Our key insight is that a keypoint-based target specification is simultaneously interpretable by VLMs and can be efficiently translated into cost functions for model-based planning. Given language instructions and visual observations, KUDA first assigns keypoints to the RGB image and queries the VLM to generate target specifications. These abstract keypoint-based representations are then converted into cost functions, which are optimized using a learned dynamics model to produce robotic trajectories. We evaluate KUDA on a range of manipulation tasks, including free-form language instructions across diverse object categories, multi-object interactions, and deformable or granular objects, demonstrating the effectiveness of our framework. The project page is available at http://kuda-dynamics.github.io/.
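As a rough illustration of how a keypoint-based target specification can become a cost for model-based planning (a simplified sketch, not KUDA's actual dynamics model or optimizer), the snippet below rolls out candidate action sequences through a toy dynamics function and keeps the one minimizing the keypoint-to-target cost; dynamics, plan, and the action parameterization are all illustrative assumptions.

    import numpy as np

    def dynamics(keypoints, action):
        # Stand-in for a learned neural dynamics model: here the action simply
        # translates all keypoints. The real model would predict object motion.
        return keypoints + action

    def keypoint_cost(keypoints, targets):
        # Cost from a keypoint-based target specification: sum of squared
        # distances between current keypoints and their assigned targets.
        return np.sum((keypoints - targets) ** 2)

    def plan(keypoints, targets, horizon=5, samples=256, rng=np.random.default_rng(0)):
        # Random-shooting planner: sample action sequences, roll them out with the
        # dynamics model, and keep the sequence minimizing the keypoint cost.
        best_cost, best_seq = np.inf, None
        for _ in range(samples):
            seq = rng.uniform(-0.05, 0.05, size=(horizon, keypoints.shape[1]))
            kp = keypoints.copy()
            for a in seq:
                kp = dynamics(kp, a)
            c = keypoint_cost(kp, targets)
            if c < best_cost:
                best_cost, best_seq = c, seq
        return best_seq, best_cost

    kp0 = np.zeros((4, 2))                      # four keypoints on the object
    tgt = np.full((4, 2), 0.2)                  # targets derived from the VLM's specification
    actions, cost = plan(kp0, tgt)
    print(actions[0], round(cost, 4))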
|
|
17:05-17:10, Paper WeET16.7 | |
DLO Perceiver: Grounding Large Language Model for Deformable Linear Objects Perception |
|
Caporali, Alessio | University of Bologna |
Galassi, Kevin | Università Di Bologna |
Palli, Gianluca | University of Bologna |
Keywords: Computer Vision for Manufacturing, Deep Learning for Visual Perception, Recognition
Abstract: The perception of Deformable Linear Objects (DLOs) is a challenging task due to their complex and ambiguous appearance, lack of discernible features, typically small sizes, and deformability. Despite these challenges, achieving a robust and effective segmentation of DLOs is crucial to introduce robots into environments where they are currently underrepresented, such as domestic and complex industrial settings. In this context, the integration of language-based inputs can simplify the perception task while also enabling the possibility of introducing robots as human companions. Therefore, this paper proposes a novel architecture for the perception of DLOs, wherein the input image is augmented with a text-based prompt guiding the segmentation of the target DLO. After encoding the image and text separately, a Perceiver-inspired structure is exploited to compress the concatenated data into transformer layers and generate the output mask from a latent vector representation. The method is experimentally evaluated on real-world images of DLOs like electrical cables and ropes, validating its efficacy and efficiency in real practical scenarios.
|
|
WeET17 |
405 |
Large Models for Autonomous Vehicles |
Regular Session |
Chair: Billah, Syed | Pennsylvania State University |
|
16:35-16:40, Paper WeET17.1 | |
Label Anything: An Interpretable, High-Fidelity and Prompt-Free Annotator |
|
Kou, Wei-Bin | The University of Hong Kong |
Zhu, Guangxu | Shenzhen Research Institute of Big Data |
Ye, Rongguang | Southern University of Science and Technology |
Wang, Shuai | Shenzhen Institute of Advanced Technology, Chinese Academy of Sc |
Tang, Ming | Southern University of Science and Technology |
Wu, Yik-Chung | The University of Hong Kong |
Keywords: Semantic Scene Understanding, Object Detection, Segmentation and Categorization, Intelligent Transportation Systems
Abstract: Learning-based street scene semantic understanding in autonomous driving (AD) has advanced significantly recently, but the performance of the AD model is heavily dependent on the quantity and quality of the annotated training data. However, traditional manual labeling involves a high cost to annotate the vast amount of data required to train a robust model. To mitigate this cost of manual labeling, we propose a Label Anything Model (denoted as LAM), serving as an interpretable, high-fidelity, and prompt-free data annotator. Specifically, we first incorporate a pretrained Vision Transformer (ViT) to extract the latent features. On top of the ViT, we propose a semantic class adapter (SCA) and an optimization-oriented unrolling algorithm (OptOU), both with a small number of trainable parameters. SCA fuses the ViT-extracted features to form the basis of the subsequent automatic annotation. OptOU consists of multiple cascading layers, each containing an optimization formulation that aligns its output with the ground truth as closely as possible, through which OptOU remains interpretable rather than a learning-based black box. In addition, training SCA and OptOU requires only a single pre-annotated RGB seed image, owing to their small number of learnable parameters. Extensive experiments clearly demonstrate that the proposed LAM can generate high-fidelity annotations (almost 100% in mIoU) for multiple real-world datasets (i.e., Camvid, Cityscapes, and Apolloscapes) and the CARLA simulation dataset.
|
|
16:40-16:45, Paper WeET17.2 | |
Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding |
|
Kabir, Imran | Pennsylvania State University |
Reza, Md Alimoor | Drake University |
Billah, Syed | Pennsylvania State University |
Keywords: Semantic Scene Understanding, Multi-Modal Perception for HRI, Formal Methods in Robotics and Automation
Abstract: Large multimodal models (LMMs) are increasingly integrated into autonomous driving systems for user interaction. However, their limitations in fine-grained spatial reasoning pose challenges for system interpretability and user trust. We introduce Logic-RAG, a novel Retrieval-Augmented Generation (RAG) framework that improves LMMs' spatial understanding in driving scenarios. Logic-RAG constructs a dynamic knowledge base (KB) about object-object relationships in first-order logic (FOL) using a perception module, a query-to-logic embedder, and a logical inference engine. We evaluated Logic-RAG on visual-spatial queries using both synthetic and real-world driving videos. When using popular LMMs (GPT-4V, Claude 3.5) as proxies for an autonomous driving system, these models achieved only 50% accuracy on synthetic driving scenes and under 75% on real-world driving scenes. Augmenting them with Logic-RAG increased their accuracies to over 80% and 90%, respectively. An ablation study showed that even without logical inference, the fact-based context constructed by Logic-RAG alone improved accuracy by 15%. Logic-RAG is extensible: it allows seamless replacement of individual components with improved versions and enables domain experts to compose new knowledge in both FOL and natural language. In sum, Logic-RAG addresses critical spatial reasoning deficiencies in LMMs for autonomous driving applications. Code and data are available at: https://github.com/Imran2205/LogicRAG.
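For intuition, a minimal sketch of the kind of first-order-logic fact base Logic-RAG maintains is given below; the predicates, object names, and the single inference rule are hypothetical examples, not the framework's actual knowledge base or inference engine.

    # Object-object relations stored as first-order-logic style facts (illustrative names).
    facts = {
        ("in_front_of", "pedestrian_1", "ego_vehicle"),
        ("left_of", "cyclist_2", "ego_vehicle"),
        ("moving_toward", "pedestrian_1", "crosswalk_3"),
    }

    def holds(predicate, a, b):
        # Check whether a relation is currently asserted in the knowledge base.
        return (predicate, a, b) in facts

    def infer_symmetric_pairs():
        # One simple inference rule: left_of(X, Y) implies right_of(Y, X).
        derived = set()
        for (p, x, y) in facts:
            if p == "left_of":
                derived.add(("right_of", y, x))
        return derived

    facts |= infer_symmetric_pairs()
    print(holds("right_of", "ego_vehicle", "cyclist_2"))    # True
    print(holds("in_front_of", "cyclist_2", "ego_vehicle")) # False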
|
|
16:45-16:50, Paper WeET17.3 | |
Discrete Contrastive Learning for Diffusion Policies in Autonomous Driving |
|
Kujanpää, Kalle | Aalto University |
Baimukashev, Daulet | Aalto University |
Munir, Farzeen | Aalto University, Finnish Center for Artificial Intelligence |
Azam, Shoaib | Aalto University, Finnish Center for Artificial Intelligence (FC |
Kucner, Tomasz Piotr | Aalto University |
Pajarinen, Joni | Aalto University |
Kyrki, Ville | Aalto University |
Keywords: Intelligent Transportation Systems, Modeling and Simulating Humans, Learning from Demonstration
Abstract: Learning to perform accurate and rich simulations of human driving behaviors from data for autonomous vehicle testing remains challenging due to human driving styles' high diversity and variance. We address this challenge by proposing a novel approach that leverages contrastive learning to extract a dictionary of driving styles from pre-existing human driving data. We discretize these styles with quantization, and the styles are used to learn a conditional diffusion policy for simulating human drivers. Our empirical evaluation confirms that the behaviors generated by our approach are both safer and more human-like than those of the machine-learning-based baseline methods. We believe this has the potential to enable higher realism and more effective techniques for evaluating and improving the performance of autonomous vehicles.
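A minimal sketch of the quantization step described above, assuming style embeddings come from a contrastive encoder: each embedding is snapped to its nearest codebook entry, and the resulting discrete index can condition the diffusion policy. The codebook size, dimensionality, and names are illustrative.

    import numpy as np

    def quantize(style_embedding, codebook):
        # Assign a continuous driving-style embedding to its nearest codebook entry,
        # yielding a discrete style index (simple vector quantization).
        distances = np.linalg.norm(codebook - style_embedding, axis=1)
        idx = int(np.argmin(distances))
        return idx, codebook[idx]

    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(8, 16))        # 8 discrete driving styles, 16-d embeddings
    embedding = rng.normal(size=16)            # embedding of one observed driver
    style_id, style_vector = quantize(embedding, codebook)
    print(style_id)                            # discrete style used to condition the policy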
|
|
16:50-16:55, Paper WeET17.4 | |
Intelligence Evaluation Methods for Autonomous Vehicles |
|
Zhou, Junjie | Shanghai Jiao Tong University |
Wang, Lin | Shanghai Jiao Tong University |
Meng, Qiang | National University of Singapore |
Wang, Xiaofan | Shanghai University |
Keywords: Intelligent Transportation Systems, Performance Evaluation and Benchmarking, Autonomous Agents
Abstract: The rapid advancement of artificial intelligence has significantly enhanced the intelligence of autonomous vehicles (AVs). However, owing to the complexity of AV behavior and the high dimensionality of driving environments, the objective and practical quantitative evaluation of AV intelligence remains a significant and unresolved challenge. This paper proposes a robust training-based comprehensive evaluation (RTCE) system specifically designed to assess the intelligence of AVs in the time dimension. Beginning with a foundation model, the first generation of AVs is developed by training in the initial naturalistic traffic scenarios. To effectively test the intelligence of the AVs, we propose an adversarial trajectory optimization technique to generate challenging, critical test scenarios that evaluate the learning capabilities of AVs in complex environments. Through robust training in these complex scenarios, the second generation of AVs is obtained. To objectively and effectively quantify the intelligence of AVs, we further propose a comprehensive evaluation metric system encompassing five dimensions and 14 evaluation metrics. The intelligence score of each AV is computed using the objective multi-criteria decision-making approach. The proposed intelligence evaluation method is validated using various self-evolution autonomous driving algorithms. The results demonstrate that the RTCE method can quantitatively and effectively test the intelligence of AVs in a multi-dimensional and automated manner. Furthermore, the proposed method is flexible and generalizable, making it adaptable to different testing platforms and autonomous driving algorithms.
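The abstract does not specify the exact multi-criteria decision-making scheme, so the sketch below only illustrates the general idea of aggregating normalized metrics into a single intelligence score with a weighted sum; the metric names, weights, and normalization are assumptions.

    import numpy as np

    def intelligence_score(metrics, weights, lower_is_better=None):
        # metrics: (n_vehicles, n_metrics) raw values; weights: per-metric importance.
        # Min-max normalize each metric, flip metrics where lower is better,
        # then aggregate with a weighted sum (one simple multi-criteria scheme).
        metrics = np.asarray(metrics, dtype=float)
        lo, hi = metrics.min(axis=0), metrics.max(axis=0)
        norm = (metrics - lo) / np.where(hi > lo, hi - lo, 1.0)
        if lower_is_better is not None:
            norm[:, lower_is_better] = 1.0 - norm[:, lower_is_better]
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()
        return norm @ weights

    raw = [[0.92, 1.4, 0.85],     # AV 1: success rate, incidents per 100 km, comfort (hypothetical)
           [0.88, 0.6, 0.90]]     # AV 2
    scores = intelligence_score(raw, weights=[0.5, 0.3, 0.2], lower_is_better=[1])
    print(np.round(scores, 3))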
|
|
16:55-17:00, Paper WeET17.5 | |
NaVid-4D: Unleashing Spatial Intelligence in Egocentric RGB-D Videos for Vision-And-Language Navigation |
|
Liu, Haoran | Peking University |
Wan, Weikang | Peking University |
Yu, Xiqian | University of Science and Technology of China |
Li, Minghan | Galbot |
Zhang, Jiazhao | Peking University |
Zhao, Bo | Shanghai Jiao Tong University |
Chen, Zhibo | University of Science and Technology of China |
Wang, Zhongyuan | BAAI |
Zhang, Zhizheng | University of Science and Technology of China |
Wang, He | Peking University |
Keywords: AI-Based Methods, Autonomous Agents, Vision-Based Navigation
Abstract: Understanding and reasoning about 4D space-time is crucial for Vision-and-Language Navigation (VLN). However, previous works lack in-depth exploration in this aspect, resulting in bottlenecked spatial perception and action precision of VLN agents. In this work, we introduce NaVid-4D, a Vision Language Model (VLM) based navigation agent taking the lead in explicitly showcasing the capabilities of spatial intelligence in the real world. Given natural language instructions, NaVid-4D requires only egocentric RGB-D video streams as observations to perform spatial understanding and reasoning for generating precise instruction-following robotic actions. NaVid-4D learns navigation policies using data from simulation environments and is endowed with precise spatial understanding and reasoning capabilities using web data. Without the need to pre-train an RGB-D foundation model, we propose a method capable of directly injecting the depth features into the visual encoder of a VLM. We further compare the use of actually captured depth information with monocularly estimated depth and find that NaVid-4D works well with both, while using estimated depth offers greater generalization capability and better mitigates the sim-to-real gap. Extensive experiments demonstrate that NaVid-4D achieves state-of-the-art performance in simulation environments and delivers impressive VLN performance with spatial intelligence in the real world.
|
|
17:00-17:05, Paper WeET17.6 | |
Generating Out-Of-Distribution Scenarios Using Language Models |
|
Aasi, Erfan | Massachusetts Institute of Technology |
Nguyen, Phat | University of Massachusetts Amherst |
Sreeram, Shiva | MIT |
Rosman, Guy | Massachusetts Institute of Technology |
Karaman, Sertac | Massachusetts Institute of Technology |
Rus, Daniela | MIT |
Keywords: AI-Based Methods
Abstract: The deployment of autonomous vehicles controlled by machine learning techniques requires extensive testing in diverse real-world environments, robust handling of edge cases and out-of-distribution scenarios, and comprehensive safety validation to ensure that these systems can navigate safely and effectively under unpredictable conditions. Addressing Out-Of-Distribution (OOD) driving scenarios is essential for enhancing safety, as OOD scenarios help validate the reliability of the models within the vehicle’s autonomy stack. However, generating OOD scenarios is challenging due to their long-tailed distribution and rarity in urban driving datasets. Recently, Large Language Models (LLMs) have shown promise in autonomous driving, particularly for their zero-shot generalization and common-sense reasoning capabilities. In this paper, we leverage these LLM strengths to introduce a framework for generating diverse OOD driving scenarios. Our approach uses LLMs to construct a branching tree, where each branch represents a unique OOD scenario. These scenarios are then simulated in the CARLA simulator using an automated framework that aligns scene augmentation with the corresponding textual descriptions. We evaluate our framework through extensive simulations, and assess its performance via a diversity metric that measures the richness of the scenarios. Additionally, we introduce a new "OOD-ness" metric, which quantifies how much the generated scenarios deviate from typical urban driving conditions. Furthermore, we explore the capacity of modern Vision-Language Models (VLMs) to interpret and safely navigate through the simulated OOD scenarios. Our findings offer valuable insights into the reliability of language models in addressing OOD scenarios within the context of urban driving.
|
|
17:05-17:10, Paper WeET17.7 | |
MAGIC-VFM - Meta-Learning Adaptation for Ground Interaction Control with Visual Foundation Models |
|
Lupu, Elena-Sorina | California Institute of Technology |
Xie, Fengze | California Institute of Technology |
Preiss, James | Caltech |
Alindogan, Jedidiah | California Institute of Technology |
Anderson, Matthew | Caltech |
Chung, Soon-Jo | Caltech |
Keywords: Model Learning for Control, Learning and Adaptive Systems, Field Robots, Visual Foundation Models
Abstract: Control of off-road vehicles is challenging due to the complex dynamic interactions with the terrain. Accurate modeling of these interactions is important to optimize driving performance, but the relevant physical phenomena are too complex to model from first principles. Therefore, we present an offline meta-learning algorithm to construct a rapidly-tunable model of residual dynamics and disturbances. Our model processes terrain images into features using a visual foundation model (VFM), then maps these features and the vehicle state to an estimate of the current actuation matrix using a deep neural network (DNN). We then combine this model with composite adaptive control to modify the last layer of the DNN in real time, accounting for the remaining terrain interactions not captured during offline training. We provide mathematical guarantees of stability and robustness for our controller, and demonstrate the effectiveness of our method through simulations and hardware experiments with a tracked vehicle and a car-like robot. We evaluate our method outdoors on different slopes with varying slippage and actuator degradation disturbances, and compare against an adaptive controller that
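A hedged sketch of the composite-adaptation idea (adjusting only the last layer of the learned residual-dynamics model online) is shown below; the feature dimension, gains, and the absence of measurement noise are illustrative simplifications, not the paper's actual update law.

    import numpy as np

    def adapt_last_layer(W, phi, residual, tracking_error, gamma=0.5, dt=0.01):
        # Composite-style adaptation sketch: update the last-layer weights W using
        # both the prediction residual (observed minus predicted disturbance) and
        # the tracking error, as in composite adaptive control. Gains are illustrative.
        prediction_term = np.outer(residual, phi)        # drives the prediction error down
        tracking_term = np.outer(tracking_error, phi)    # drives the tracking error down
        return W + dt * gamma * (prediction_term + tracking_term)

    rng = np.random.default_rng(1)
    W = np.zeros((3, 8))                  # maps 8 terrain/state features to a 3-d residual force
    true_W = rng.normal(size=(3, 8))      # unknown "true" residual model for this toy example
    for _ in range(2000):
        phi = rng.normal(size=8)                      # features from the frozen VFM + DNN backbone
        residual = true_W @ phi - W @ phi             # disturbance prediction error
        W = adapt_last_layer(W, phi, residual, tracking_error=np.zeros(3))
    print(np.round(np.linalg.norm(true_W - W), 3))    # adaptation error after online updates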
|
|
17:10-17:15, Paper WeET17.8 | |
DINO-MOT: 3D Multi-Object Tracking with Visual Foundation Model for Pedestrian Re-Identification Using Visual Memory Mechanism |
|
Lee, Min Young | National University of Singapore |
Lee, Christina Dao Wen | National University of Singapore |
Jianghao, Li | National University of Singapore |
Ang Jr, Marcelo H | National University of Singapore |
Keywords: Intelligent Transportation Systems, Human Detection and Tracking, Deep Learning for Visual Perception
Abstract: In the advancing domain of autonomous driving, this research focuses on enhancing 3D Multi-Object Tracking (3D-MOT). Pedestrians are particularly vulnerable in urban environments, and robust tracking methodologies are required to understand their movements. Prevalent Tracking-By-Detection (TBD) frameworks often underutilize the rich visual data from sensors such as cameras. This study leverages the advanced visual foundation model DINOv2 to refine the TBD framework by incorporating the camera modality, thereby improving pedestrian tracking consistency and overall 3D-MOT performance. The proposed DINO-MOT framework is the first application of DINOv2 for enhancing 3D-MOT through pedestrian Re-Identification (Re-ID), and a Score Filter Ceiling is implemented to prevent premature exclusion of low-confidence 3D detections during tracking association. Furthermore, utilization of DINOv2 as a feature extractor within the DINO-MOT framework reduces pedestrian ID switches by up to 12.3%. Achieving an AMOTA of 76.3% on the nuScenes test dataset, DINO-MOT has set a new benchmark in the 3D-MOT literature with an improvement of 0.5%, securing the top rank on the leaderboard. More broadly, this research paves the way for applying visual foundation models to improve the existing TBD framework and enhance 3D-MOT in autonomous driving.
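To illustrate the appearance-based re-identification step, the following sketch matches a new detection against per-track features held in a visual memory by cosine similarity and refreshes the memory with an exponential moving average; the feature dimension, threshold, and update rule are assumptions rather than the paper's exact design.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def reid_associate(detection_feat, memory, threshold=0.7):
        # Match a new pedestrian detection against per-track appearance features
        # kept in a visual memory, using cosine similarity of embeddings
        # (e.g., features from a frozen DINOv2 backbone).
        best_id, best_sim = None, threshold
        for track_id, feat in memory.items():
            sim = cosine(detection_feat, feat)
            if sim > best_sim:
                best_id, best_sim = track_id, sim
        return best_id

    rng = np.random.default_rng(0)
    memory = {7: rng.normal(size=384), 12: rng.normal(size=384)}   # track_id -> stored feature
    det = memory[12] + 0.1 * rng.normal(size=384)                  # new detection, close to track 12
    matched = reid_associate(det, memory)
    if matched is not None:
        memory[matched] = 0.9 * memory[matched] + 0.1 * det        # exponential-moving-average update
    print(matched)   # 12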
|
|
WeET18 |
406 |
Surgical Robotics: Steerable Catheters/Needles 2 |
Regular Session |
Chair: Hoelscher, Janine | Clemson |
Co-Chair: Krieger, Axel | Johns Hopkins University |
|
16:35-16:40, Paper WeET18.1 | |
Hysteresis Compensation of Tendon-Sheath Mechanism Using Nonlinear Programming Based on Preisach Model |
|
Kim, Hongmin | Massachusetts Institute of Technology |
Kim, Dongchan | KAIST (Korea Advanced Institute of Science and Technology) |
Park, Su Hyeon | Pusan National University |
Jin, Sangrok | Pusan National University |
Keywords: Tendon/Wire Mechanism, Medical Robots and Systems, Surgical Robotics: Laparoscopy
Abstract: The tendon-sheath mechanism (TSM) is an essential mechanical element for the implementation of flexible endoscopic systems owing to its small volume and simple structure. However, nonlinear characteristics such as backlash, hysteresis, and friction occur when employing such a component. In this study, we formulate a Preisach hysteresis model consisting of elementary hysteresis operators. Subsequently, we propose a compensation algorithm that repeatedly and sequentially solves a nonlinear optimization problem online, producing an inverse control signal for the desired output at every time step and compensating for the nonlinear effects of the TSM. The results indicate that the presented model and control scheme are promising for motion control in any application utilizing a TSM.
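A compact sketch of the ideas above, assuming uniform Preisach weights and a simple grid search in place of the paper's nonlinear program: the model sums relay operators with thresholds beta <= alpha, and the compensator searches for the input whose predicted output best matches the desired one.

    import numpy as np

    class Preisach:
        # Discrete Preisach hysteresis model: a weighted sum of relay operators
        # with thresholds beta <= alpha; weights here are uniform for illustration.
        def __init__(self, n=20, lo=-1.0, hi=1.0):
            th = np.linspace(lo, hi, n)
            self.pairs = [(a, b) for a in th for b in th if b <= a]
            self.weights = np.ones(len(self.pairs)) / len(self.pairs)
            self.states = -np.ones(len(self.pairs))      # each relay starts "off"

        def step(self, u, states=None):
            s = self.states if states is None else states
            for i, (a, b) in enumerate(self.pairs):
                if u >= a:
                    s[i] = 1.0
                elif u <= b:
                    s[i] = -1.0                          # otherwise the relay keeps its state
            return float(self.weights @ s)

    def compensate(model, y_des, candidates=np.linspace(-1, 1, 201)):
        # One step of inverse control: search for the input whose predicted output is
        # closest to the desired output, evaluated on copies of the relay states so
        # the search does not disturb the model's memory.
        errors = [abs(model.step(u, states=model.states.copy()) - y_des) for u in candidates]
        return float(candidates[int(np.argmin(errors))])

    tsm = Preisach()
    u = compensate(tsm, y_des=0.4)
    print(round(u, 3), round(tsm.step(u), 3))   # applied input and resulting model output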
|
|
16:40-16:45, Paper WeET18.2 | |
Resolution Optimal Motion Planning for Medical Needle Steering from Airway Walls in the Lung |
|
Hoelscher, Janine | Clemson |
Fried, Inbar | University of North Carolina at Chapel Hill |
Salzman, Oren | Technion |
Alterovitz, Ron | University of North Carolina at Chapel Hill |
Keywords: Surgical Robotics: Planning, Surgical Robotics: Steerable Catheters/Needles, Nonholonomic Motion Planning
Abstract: Steerable needles are novel medical devices capable of following curved paths through tissue, enabling them to avoid anatomical obstacles and steer to hard-to-reach sites in tissue, including targets in the lung for lung cancer diagnosis. Steerable needles are typically deployed into tissue from an insertion surface, and selecting the insertion site is critical for procedure success as it determines which paths the needle can take to its target. Prior motion planners for steerable needles typically only plan from a specific start pose to the target. We introduce a new resolution-optimal steerable needle motion planner that efficiently finds plans from an insertion surface to a target position, handling additional degrees of freedom at both the start and the target. Our algorithm systematically builds a search tree consisting of needle motion primitives backward from the target towards the insertion surface, which allows it to provide an optimality guarantee up to the resolution of the primitives. The algorithm finds higher-quality plans faster than prior state-of-the-art motion planners, as demonstrated in anatomical scenario simulations in the lung.
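The toy 2D sketch below conveys the flavor of building a search tree of motion primitives backward from the target until the insertion surface is reached (here the plane y = 0), using uniform-cost search; the primitive set, step length, and cost are illustrative and this is not the authors' resolution-optimal planner.

    import heapq
    import numpy as np

    # Each primitive moves one step length while turning by a bounded angle
    # (a stand-in for the needle's curvature limit).
    STEP, MAX_TURN = 1.0, np.deg2rad(15)
    PRIMITIVES = [-MAX_TURN, 0.0, MAX_TURN]

    def backward_plan(target_xy, target_heading, max_depth=30):
        # Best-first search over (x, y, heading) states, expanding backward from the
        # target until a state reaches the insertion surface; cost is path length.
        heap = [(0.0, (*target_xy, target_heading), [])]
        while heap:
            cost, (x, y, h), path = heapq.heappop(heap)
            if y <= 0.0:                       # reached the insertion surface
                return cost, path
            if len(path) >= max_depth:
                continue
            for turn in PRIMITIVES:
                h2 = h + turn
                x2, y2 = x - STEP * np.cos(h2), y - STEP * np.sin(h2)   # step backward
                heapq.heappush(heap, (cost + STEP, (x2, y2, h2), path + [turn]))
        return None

    cost, turns = backward_plan(target_xy=(3.0, 6.0), target_heading=np.deg2rad(80))
    print(round(cost, 1), len(turns))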
|
|
16:45-16:50, Paper WeET18.3 | |
Self-Sufficient 5-DoF Discrete Global Localization for Magnetically-Actuated Endoscope in Bronchoscopy |
|
Tan, Jiewen | The Chinese University of Hong Kong |
Zhao, Da | The Chinese University of Hong Kong |
Zhou, Rui | The Chinese University of Hong Kong |
Xie, Wenxuan | The Chinese University of Hong Kong |
Cheng, Shing Shin | The Chinese University of Hong Kong |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles
Abstract: Existing sensor-based global localization methods limit the miniaturization potential of magnetically-actuated endoscopes (MAE) while localization based on external medical imaging demands accurate registration and imposes a variety of modality-specific challenges during continuous image acquisition. This work proposes a novel self-sufficient method for discrete (one-time) global localization of an MAE based solely on inherent endoscopic images without any prior MAE pose information. More specifically, it adopts a model-free control approach to determine five different external magnet (EM) poses (corresponding to five independent nonlinear equations) that can align the MAE image center with the lumen center while the MAE maintains the same pose. The five degree-of-freedom (DoF) global pose of the MAE can then be estimated by minimizing the root mean square of the MAE's torque balance residuals under these EM poses. Our proposed method achieves similar accuracy as other sensor-based methods for permanent magnet-driven MAE with 6.7 ± 2.1 mm position error and 9.5 ± 2.9° orientation error in the experiments. Compared to existing methods, our approach does not require physical sensor integration, enabling a more compact endoscope design for exploration in narrower respiratory tracts. It also offers a critical step toward achieving sensorless and continuous global localization of the permanent magnet-driven MAE during its autonomous navigation.
|
|
16:50-16:55, Paper WeET18.4 | |
Intraoperative 3D Shape Estimation of Magnetic Soft Guidewire |
|
Zhao, Yiting | Beijing Institute of Technology |
Shi, Liwei | Beijing Institute of Technology |
Xiao, Nan | Beijing Institute of Technology |
Keywords: Surgical Robotics: Planning, Soft Robot Applications, Sensor Fusion
Abstract: This paper introduces a 3D shape reconstruction technique for interventional devices in endovascular surgery, using a flexible magnetic-tipped guidewire that retains the essential properties of a standard guidewire. We develop a model that relates the shape of the magnetic tip to the surrounding magnetic field distribution, so that the shape can be estimated from magnetic field measurements. The relationship between the magnetic field distribution and the shape of the magnetic guidewire makes direct shape estimation challenging. To address this, we incorporate image and physical constraints to simplify the estimation process. The method shows high accuracy and stability in shape estimation, with both root mean square error (RMSE) and Hausdorff distance (HD) below 1 mm, outperforming other existing estimation methods. Notably, the interventional guidewire requires no embedded sensors or wiring, and the fluoroscopic images used are standard in clinical practice.
|
|
16:55-17:00, Paper WeET18.5 | |
Semi-Autonomous 2.5D Control of Untethered Magnetic Suture Needle |
|
Wang, Qinhan | Johns Hopkins University |
Bhattacharjee, Anuruddha | The Johns Hopkins University |
Chen, Xinhao | Johns Hopkins University |
Mair, Lamar | Weinberg Medical Physics, Inc |
Diaz-Mercado, Yancy | University of Maryland |
Krieger, Axel | Johns Hopkins University |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Manipulation Planning
Abstract: Untethered miniature surgical tools could significantly reduce invasiveness and enhance patient outcomes in robot-assisted laparoscopic surgical procedures. This paper demonstrates the feasibility of performing semi-autonomous suturing tasks using an untethered magnetic needle controlled by an external electromagnetic manipulator. The electromagnetic manipulator can generate magnetic torques and gradient-based pulling forces to actuate the magnetic needle. Here, we develop and implement a semi-autonomous 2.5D control method for controlling the in-plane position and both in-plane and out-of-plane orientations of a magnetic needle for suturing on tissue-mimicking agar gel phantoms. The method includes recognizing needles and incisions, planning trajectory, and performing suturing with visual feedback control. We conduct two mock suturing tasks using both continuous and interrupted techniques on 1% agar gel phantoms with 2 cm and 3 cm incision sizes. The results demonstrate precise needle control, with an average root-mean-square position error of 1.01 mm and 1.12 mm across tasks. The system also achieved submillimeter-level suture spacing accuracy, comparable to surgeons using state-of-the-art surgical robots. These findings highlight the feasibility of using untethered magnetic suture needles for minimally invasive suturing procedures.
|
|
17:00-17:05, Paper WeET18.6 | |
Steerable Tape-Spring Needle for Autonomous Sharp Turns through Tissue |
|
Abdoun, Omar | University of Pennsylvania |
Tjandra, Davin | University of Pennsylvania |
Yin, Katie | University of California, Riverside |
Kurzan, Pablo | University of Pennsylvania |
Yin, Jessica | University of Pennsylvania |
Yim, Mark | University of Pennsylvania |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Surgical Robotics: Planning, Surgical Robotics: Laparoscopy
Abstract: Steerable needles offer a minimally invasive method to deliver treatment to hard-to-reach tissue regions. We introduce a new class of tape-spring steerable needles, capable of sharp turns ranging from 15 to 150 degrees with a turn radius as low as 3 mm, which minimize surrounding tissue damage. In this work, we derive and experimentally validate a geometric model for our steerable needle design. We evaluate both manual and robotic steering of the needle along a Dubins path in 7 kPa and 13 kPa tissue phantoms, simulating our target clinical application in healthy and unhealthy liver tissue. We conduct experiments to measure needle robustness to stiffness transitions between non-homogeneous tissues. We demonstrate progress towards clinical use with needle tip tracking via ultrasound imaging, navigation around anatomical obstacles, and integration with a robotic autonomous steering system.
|
|
17:05-17:10, Paper WeET18.7 | |
Shape Control of Concentric Tube Robots Via Approximate Follow-The-Leader Motion |
|
Xu, Yunti | University of California, San Diego |
Watson, Connor | Morimoto Lab, UCSD |
Lin, Jui-Te | University of California, San Diego |
Hwang, John T. | University of California, San Diego |
Morimoto, Tania K. | University of California San Diego |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Modeling, Control, and Learning for Soft Robots, Medical Robots and Systems
Abstract: Concentric tube robots (CTRs) are miniaturized continuum robots that are promising for robotic minimally invasive surgeries. Control methods to date have primarily focused on controlling the robot tip. However, small changes in the tip position can result in large deviations in the shape of the robot body, motivating the need for shape control to ensure safe navigation in constrained environments. One proposed method for shape control, known as follow-the-leader (FTL) motion, allows the robot to deploy while occupying minimal volume but is limited to specific CTR designs and deployment sequences. In this paper, we propose a shape control method that approximates FTL motion and is applicable to arbitrary tip navigation tasks without requiring a predefined trajectory or specific tube design. This shape control method is framed as a nonlinear optimization problem, and through linearization of the CTR's kinematics, we turn it into a quadratic program solved by a shape controller that requires minimal knowledge of the robot's shape. Simulation results show that our proposed shape control method enables better approximate FTL motion compared to a state-of-the-art Jacobian-based tip controller across different tube sets and tip paths while remaining computationally comparable. Furthermore, a hardware demonstration validates the effectiveness of the shape controller on a physical system during teleoperation.
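As a simplified stand-in for the quadratic program described above, the sketch below trades off tip tracking against keeping body points near the previously swept (follow-the-leader) curve, solved as a regularized least-squares problem over a linearized kinematic model; the Jacobians here are random placeholders rather than a real CTR model.

    import numpy as np

    def shape_control_step(J_tip, J_shape, dx_tip, ds_shape, w=0.1, reg=1e-6):
        # One linearized shape-control update: minimize a weighted sum of the tip
        # tracking error and the body-point deviation from the swept curve.
        A = J_tip.T @ J_tip + w * J_shape.T @ J_shape + reg * np.eye(J_tip.shape[1])
        b = J_tip.T @ dx_tip + w * J_shape.T @ ds_shape
        return np.linalg.solve(A, b)              # joint increment dq

    rng = np.random.default_rng(0)
    J_tip = rng.normal(size=(3, 6))      # tip position w.r.t. 6 tube joint values (placeholder)
    J_shape = rng.normal(size=(12, 6))   # 4 body points (stacked xyz) w.r.t. joint values (placeholder)
    dq = shape_control_step(J_tip, J_shape,
                            dx_tip=np.array([0.001, 0.0, 0.002]),
                            ds_shape=np.zeros(12))
    print(np.round(dq, 5))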
|
|
17:10-17:15, Paper WeET18.8 | |
Model-Based Parameter Selection for a Steerable Continuum Robot — Applications to Bronchoalveolar Lavage (BAL) |
|
Rothe, Amber K. | Georgia Institute of Technology |
Brumfiel, Timothy A. | Georgia Institute of Technology |
Konda, Revanth | Georgia Institute of Technology |
Williams, Kirsten | Emory University |
Desai, Jaydev P. | Georgia Institute of Technology |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Tendon/Wire Mechanism, Medical Robots and Systems
Abstract: Bronchoalveolar lavage (BAL) is a minimally invasive procedure for diagnosing lung infections and diseases. However, navigating tortuous lung anatomy to the distal branches of the bronchoalveolar tree for adequate sampling using BAL remains challenging. Continuum robots have been used to improve the navigation of guidewires, catheters, and endoscopes and could be applied to the BAL procedure as well. One class of continuum robots is constructed from a notched tube and actuated using a tendon. Many tendon-driven notched continuum robots use uniform machining parameters to achieve approximately constant-curvature configurations, which may be unsuitable for traversing the tortuous anatomy of the lungs. This letter presents a model that predicts the curvature of a robot with arbitrary notch shapes subjected to tendon tension. The model predicted the deflection of rectangular, elliptical, and sinusoidal notches in a 0.89 mm diameter nitinol tube with 2.32%, 3.65%, and 6.32% error, respectively. Furthermore, an algorithm is developed to determine the optimal pattern of notches to achieve a desired nonuniform robot curvature. A simulated robot designed using the algorithm achieved the desired shape with a root mean square error (RMSE) of 1.52°. Additionally, we present a model for predicting the shape of nonuniformly notched continuum robots which incorporates friction and pre-curvature. This model predicted the shape of a continuum robot with nonuniform rectangular notches with an average RMSE of 5.20° with respect to the actual robot. We also demonstrated navigating the continuum robot through a pulmonary phantom.
|
|
WeET19 |
407 |
Logistics and Task Planning |
Regular Session |
Co-Chair: Arras, Kai Oliver | University of Stuttgart |
|
16:35-16:40, Paper WeET19.1 | |
A New Clustering-Based View Planning Method for Building Inspection with Drone |
|
Zheng, Yongshuai | Shandong University |
Liu, Guoliang | Shandong University |
Ding, Yan | Shandong University |
Tian, Guohui | Shandong University |
Keywords: Task Planning, Surveillance Robotic Systems, Computational Geometry
Abstract: With the rapid development of drone technology, the application of drones equipped with visual sensors for building inspection and surveillance has attracted much attention. View planning aims to find a set of near-optimal viewpoints for vision-related tasks to achieve the vision coverage goal. This paper proposes a new clustering-based two-step computational method using spectral clustering, local potential field method, and hyper-heuristic algorithm to find near-optimal views to cover the target building surface. In the first step, the proposed method generates candidate viewpoints based on spectral clustering and corrects the positions of candidate viewpoints based on our newly proposed local potential field method. In the second step, the optimization problem is converted into a Set Covering Problem (SCP), and the optimal viewpoint subset is solved using our proposed hyper-heuristic algorithm. Experimental results show that the proposed method is able to obtain better solutions with fewer viewpoints and higher coverage.
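The second step casts viewpoint selection as a Set Covering Problem; as a baseline illustration (the paper itself uses a hyper-heuristic solver), the classic greedy set-cover approximation is sketched below with hypothetical viewpoints and surface patches.

    def greedy_set_cover(universe, candidate_views):
        # Greedy approximation for the Set Covering Problem: repeatedly pick the
        # candidate viewpoint that covers the most still-uncovered surface patches.
        uncovered = set(universe)
        chosen = []
        while uncovered:
            best = max(candidate_views, key=lambda v: len(candidate_views[v] & uncovered))
            if not candidate_views[best] & uncovered:
                break                                   # remaining patches are not coverable
            chosen.append(best)
            uncovered -= candidate_views[best]
        return chosen, uncovered

    patches = range(8)                                  # discretized building-surface patches
    views = {                                           # viewpoint -> patches visible from it
        "v0": {0, 1, 2}, "v1": {2, 3, 4}, "v2": {4, 5, 6, 7}, "v3": {0, 3, 6},
    }
    selected, missed = greedy_set_cover(patches, views)
    print(selected, missed)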
|
|
16:40-16:45, Paper WeET19.2 | |
Towards the Deployment of an Autonomous Last-Mile Delivery Robot in Urban Areas |
|
Santamaria-Navarro, Angel | Universitat Politècnica De Catalunya |
Hernandez Juan, Sergi | CSIC-UPC (IRI) |
Herrero Cotarelo, Fernando | IRI, CSIC-UPC |
López Gestoso, Alejandro | Institut De Robòtica I Informàtica Industrial |
Del Pino, Ivan | Instituto Universitario De Investigación Informática (IUII). Uni |
Rodriguez Linares, Nicolás Adrián | Universidad Politécnica De Cataluña |
Fernandez, Carlos | Urbiotica |
Baldó i Canut, Albert | CARNET Future Mobility Research Hub |
Lemardelé, Clément | Universitat Politècnica De Catalunya |
Garrell, Anais | UPC-CSIC |
Vallvé, Joan | CSIC-UPC |
Taher, Hafsa | Institut De Robòtica I Informàtica Industrial, CSIC-UPC |
Puig-Pey, Ana | Universitat Politecnica De Catalunya |
Pagès, Laia | CARNET |
Sanfeliu, Alberto | Universitat Politècnica De Cataluyna |
Keywords: Intelligent Transportation Systems, Logistics, Field Robots
Abstract: Nowadays, the skyrocketing last-mile freight transportation in urban areas is leading to very negative effects (e.g., pollution, noise or traffic congestion), which could be minimized by using autonomous electric vehicles. In this sense, this paper presents the first prototype of Ona, an autonomous last-mile delivery robot that, in contrast to existing platforms, has a medium-sized storage capacity with the capability of navigating in both street and pedestrian areas. Here, we describe the platform, its main software modules, and the validation experiments, carried out in the Barcelona Robot Lab (Universitat Politècnica de Catalunya); Esplugues de Llobregat (next to Barcelona); and Debrecen (Hungary), which are representative urban scenarios. Apart from robotic technical details, we also include the results of the technology acceptance by the public present in the Esplugues de Llobregat test, collected in situ through a survey.
|
|
16:45-16:50, Paper WeET19.3 | |
Multi-Heuristic Robotic Bin Packing of Regular and Irregular Objects |
|
Nickel, Tim | Fraunhofer IPA |
Bormann, Richard | Fraunhofer IPA |
Arras, Kai Oliver | University of Stuttgart |
Keywords: Logistics, Manipulation Planning, Factory Automation
Abstract: The increasing demand in e-commerce, combined with labor shortages and rising wages, is driving the rapid automation of warehouse operations. A critical aspect of this shift is bin packing, where diverse unknown items of varying sizes and shapes must be optimally arranged within a bin or container. Robot bin packing is receiving growing attention and presents unique challenges due to the broad range of objects, packing rules, and task-specific requirements. In response, we propose So-Pack, a generalist packing heuristic for irregularly shaped objects integrated into a flexible, weighted multi-heuristic planning system. The system demonstrates robust performance across general packing scenarios and exhibits the flexibility to adapt to changing packing rules and specific end-user requirements. Experimental results show that the system outperforms state-of-the-art approaches in key metrics in a new challenging dataset of retail objects in real-world applications.
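A minimal sketch of a weighted multi-heuristic scoring scheme in the spirit of the planning system described above; the individual heuristics, the placement fields, and the weights are hypothetical and only illustrate how several packing heuristics can be combined into one placement decision.

    def score_placement(placement, heuristics, weights):
        # Weighted multi-heuristic scoring: each heuristic rates a candidate placement,
        # and the weighted sum decides which placement the planner executes.
        return sum(w * h(placement) for h, w in zip(heuristics, weights))

    # Illustrative heuristics over a placement dict (names and fields are hypothetical).
    height_low = lambda p: -p["z"]                       # prefer placing items low in the bin
    wall_contact = lambda p: p["contact_faces"]          # prefer placements touching walls/items
    wasted_space = lambda p: -p["trapped_volume"]        # penalize unreachable gaps underneath

    candidates = [
        {"z": 0.00, "contact_faces": 2, "trapped_volume": 0.0},
        {"z": 0.12, "contact_faces": 3, "trapped_volume": 0.4},
    ]
    weights = [1.0, 0.5, 2.0]
    best = max(candidates,
               key=lambda p: score_placement(p, [height_low, wall_contact, wasted_space], weights))
    print(best)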
|
|
16:50-16:55, Paper WeET19.4 | |
MultiTalk: Introspective and Extrospective Dialogue for Human-Environment-LLM Alignment |
|
Devarakonda, Venkata Naren | New York University |
Kaypak, Ali Umut | New York University |
Yuan, Shuaihang | New York University |
Krishnamurthy, Prashanth | New York University Tandon School of Engineering |
Fang, Yi | New York University |
Khorrami, Farshad | New York University Tandon School of Engineering |
Keywords: Task Planning, AI-Enabled Robotics, Manipulation Planning
Abstract: LLMs have shown promising results in task planning due to their strong natural language understanding and reasoning capabilities. However, issues such as hallucinations, ambiguities in human instructions, environmental constraints, and limitations in the executing agent’s capabilities often lead to flawed or incomplete plans. This paper proposes MultiTalk, an LLM-based task planning methodology that addresses these issues through a framework of introspective and extrospective dialogue loops. This approach helps ground generated plans in the context of the environment and the agent's capabilities, while also resolving uncertainties and ambiguities in the given task. These loops are enabled by specialized systems designed to extract and predict task-specific states, and flag mismatches or misalignments among the human user, the LLM agent, and the environment. Effective feedback pathways between these systems and the LLM planner foster meaningful dialogue. The efficacy of this methodology is demonstrated through its application to robotic manipulation tasks. Experiments and ablations highlight the robustness and reliability of our method, and comparisons with baselines further illustrate the superiority of MultiTalk in task planning for embodied agents. Project Website: https://llm-multitalk.github.io/
|
|
16:55-17:00, Paper WeET19.5 | |
Goal-Guided Reinforcement Learning: Leveraging Large Language Models for Long-Horizon Task Decomposition |
|
Zhang, Ceng | National University of Singapore |
Sun, Zhanhong | National University of Singapore |
Chirikjian, Gregory | University of Delaware |
Keywords: Task Planning, Reinforcement Learning
Abstract: Reinforcement learning (RL) has long struggled with exploration in vast state-action spaces, particularly for intricate tasks that necessitate a series of well-coordinated actions. Meanwhile, large language models (LLMs) equipped with fundamental knowledge have been utilized for task planning across various domains. However, using them to plan for long-term objectives can be demanding, as they function independently from task environments where their knowledge might not be perfectly aligned, hence often overlooking possible physical limitations. To this end, we propose a goal-based RL framework that leverages prior knowledge of LLMs to benefit the training process. We introduce a hierarchical module that features a goal generator to segment a long-horizon task into reachable subgoals and a policy planner to generate action sequences based on the current goal. Subsequently, the policies derived from LLMs guide the RL to achieve each subgoal sequentially. We validate the effectiveness of the proposed framework across different simulation environments and long-horizon tasks with complex state and action spaces.
|
|
17:00-17:05, Paper WeET19.6 | |
Trustworthy Robot Behavior Tree Generation Based on Multi-Source Heterogeneous Knowledge Graph |
|
Yuan, Jianchao | National University of Defense Technology |
Yang, Shuo | National University of Defense Technology |
Zhang, Qi | National University of Defense Technology |
Li, Ge | National University of Defense Technology |
Tang, Jianping | National University of Defense Technology |
Keywords: Task Planning, Software Architecture for Robotic and Automation
Abstract: In robotics, the design of robot behavior trees (BTs) generally requires roboticists to comprehensively and flexibly consider all the relevant factors, including the robot hardware capabilities, task descriptions, etc., posing great challenges for design quality and efficiency. The mainstream BT design practice has been to manually develop task-specific BT structures using a BT component framework. In contrast, the latest advances in Generative Pretrained Transformers (GPTs) have also opened up the possibility of BT design automation. However, these approaches generally show low efficiency or are less trustworthy for complex robot task goals due to time-consuming manual design and unreliable GPT reasoning. To address these limitations, this paper proposes a novel knowledge-driven approach that develops a specialized knowledge graph from multi-source, heterogeneous, high-quality robot knowledge to reason out a trustworthy robot plan for achieving complex task goals. Then we present plan transformation and BT merging algorithms to automatically generate the plan-level BT structure. The comparative experiment results show that our approach can generate high-quality and trustworthy BT structures in terms of task plan accuracy and consistency, as well as BT generation time, compared with manual design and GPT-based approaches.
|
|
17:05-17:10, Paper WeET19.7 | |
Physics-Aware Robotic Palletization with Online Masking Inference |
|
Zhang, Tianqi | Tsinghua University |
Wu, Zheng | University of California, Berkeley |
Chen, Yuxin | University of California, Berkeley |
Wang, Yixiao | University of California, Berkeley |
Liang, Boyuan | University of California, Berkeley |
Moura, Scott | UC Berkeley |
Tomizuka, Masayoshi | University of California |
Ding, Mingyu | UC Berkeley |
Zhan, Wei | University of California, Berkeley |
Keywords: Task Planning, Reinforcement Learning
Abstract: The efficient planning of stacking boxes, especially in the online setting where the sequence of item arrivals is unpredictable, remains a critical challenge in modern warehouse and logistics management. Existing solutions often address box size variations, but overlook their intrinsic and physical properties, such as density and rigidity, which are crucial for real-world applications. We use reinforcement learning (RL) to solve this problem by employing action space masking to direct the RL policy towards valid actions. Unlike previous methods that rely on heuristic stability assessments that are difficult to verify in physical scenarios, our framework utilizes online learning to dynamically train the action space mask, eliminating the need for manual heuristic design. Extensive experiments demonstrate that our proposed method outperforms existing state-of-the-art methods. Furthermore, we deploy our learned task planner in a real-world robotic palletizer, validating its practical applicability in operational settings.
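The online action-space masking described above can be pictured as suppressing placements that a learned feasibility model rejects before the policy samples an action. The sketch below is illustrative only; the mask network, threshold, and tensor shapes are assumptions, not the authors' implementation.

import torch

def masked_placement_logits(policy_logits, mask_net, observation, threshold=0.5):
    # policy_logits: (batch, num_placements) raw scores over discrete placements.
    # mask_net: learned model returning per-placement feasibility scores.
    feasibility = torch.sigmoid(mask_net(observation))        # (batch, num_placements)
    invalid = feasibility < threshold                          # placements judged unstable
    return policy_logits.masked_fill(invalid, float("-inf"))  # these are never sampled

# Example rollout usage (probabilities renormalize over the valid placements):
# probs = torch.softmax(masked_placement_logits(logits, mask_net, obs), dim=-1)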
|
|
17:10-17:15, Paper WeET19.8 | |
Enabling In-Flight Metamorphosis in Multirotors with a Center-Driven Scissor Extendable Airframe for Adaptive Navigation |
|
Yang, Tao | Harbin Institute of Technology, Shenzhen |
Li, Peng | Harbin Institute of Technology, Shenzhen |
Wang, Gang | University of Shanghai for Science and Technology |
Shen, Yantao | University of Nevada, Reno |
Keywords: Foundations of Automation, Autonomous Vehicle Navigation
Abstract: To address complex mission tasks, multirotors benefit from in-flight reconfiguration that enhances their morphological adaptability. This paper presents the Center-Driven Scissor Extendable Airframe (CDSEA), a novel one-degree-of-freedom (DOF) morphing airframe designed to replace traditional fixed-size airframes. The CDSEA allows a quadrotor to achieve significant morphological changes during flight, with rotors deploying radially from a central point. This capability facilitates substantial variations in footprint radius and ensures smooth transitions. The paper details the mechanical design, as well as kinematic and dynamic analyses, and discusses the actuator selection strategy for the CDSEA. Experimental results with a prototype demonstrate that the CDSEA achieves a footprint-radius deformation ratio of 2.5 and a morphing time of 0.3 seconds, surpassing existing solutions. Additionally, the design improves obstacle avoidance and wind resistance. These results underscore the CDSEA's potential as an advanced solution for enhancing UAV adaptive navigation performance in complex environments.
|
|
WeET20 |
408 |
Planning Around People for Social Navigation |
Regular Session |
Chair: Mendez, Oscar | University of Surrey |
Co-Chair: Mavrogiannis, Christoforos | University of Michigan |
|
16:35-16:40, Paper WeET20.1 | |
SafePCA: Enhancing Autonomous Robot Navigation in Dynamic Crowds Using Proximal Policy Optimization and Cellular Automata |
|
Farouq, Ardiansyah | Telkom University |
Tran, Dinh Tuan | College of Information Science and Engineering, Ritsumeikan Univ |
Lee, Joo-Ho | Ritsumeikan University |
Keywords: Motion and Path Planning, Machine Learning for Robot Control, Localization
Abstract: Navigating robots in dynamic environments, such as human crowds, is a major challenge due to the trade-off between performance and robustness. Traditional reinforcement learning methods, such as Proximal Policy Optimization (PPO), have shown strong adaptation capabilities but require extensive training and lack explicit mechanisms for collision avoidance. On the other hand, rule-based approaches, such as the Dynamic Window Approach (DWA), offer computational efficiency but struggle with generalization to unseen crowd behaviors. The proposed SafePCA framework aims to address this trade-off by integrating Cellular Automata (CA) into PPO-based navigation. CA enhances robustness by predicting high-risk areas based on pedestrian movement patterns, reducing unnecessary collisions. However, this approach may lead to conservative behavior, potentially affecting navigation performance in reaching the goal efficiently. The core research question addressed in this work is whether SafePCA can balance these trade-offs to ensure safe yet efficient robot navigation in dynamic crowds. Experiments demonstrate that SafePCA outperforms traditional PPO by providing superior risk assessment and avoidance strategies, achieving optimal performance with fewer training episodes. SafePCA’s real-time adaptability ensures robust navigation in dynamic environments. By leveraging PPO’s adaptive learning and CA’s risk analysis, SafePCA offers an efficient solution for autonomous robot navigation in crowded environments, advancing the field and broadening application possibilities.
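A toy version of the cellular-automaton risk map described above: pedestrian-occupied cells repeatedly spread a decayed risk value to their four neighbors, producing the high-risk regions the policy should avoid. Grid resolution, number of steps, and the decay factor are illustrative assumptions, not the SafePCA implementation.

import numpy as np

def propagate_risk(occupancy, steps=3, decay=0.5):
    # occupancy: 2D grid with 1.0 where a pedestrian is observed, 0.0 elsewhere.
    risk = occupancy.astype(float)
    for _ in range(steps):
        padded = np.pad(risk, 1)
        neighbor_max = np.maximum.reduce([
            padded[:-2, 1:-1], padded[2:, 1:-1],   # up, down
            padded[1:-1, :-2], padded[1:-1, 2:],   # left, right
        ])
        risk = np.maximum(risk, decay * neighbor_max)
    return risk  # cells with high values are flagged as high-risk for the planner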
|
|
16:40-16:45, Paper WeET20.2 | |
Robot Local Planner: A Periodic Sampling-Based Motion Planner with Minimal Waypoints for Home Environments |
|
Takeshita, Keisuke | Toyota Motor Corporation |
Yamazaki, Takahiro | Toyota Motor Corporation |
Ono, Tomohiro | Toyota Motor Corporation |
Yamamoto, Takashi | Aichi Institute of Technology |
Keywords: Motion and Path Planning, Mobile Manipulation, Manipulation Planning
Abstract: The objective of this study is to enable fast and safe manipulation tasks in home environments. Specifically, we aim to develop a system that can recognize its surroundings and identify target objects while in motion, enabling it to plan and execute actions accordingly. We propose a periodic sampling-based whole-body trajectory planning method, called the “Robot Local Planner (RLP).” This method leverages unique features of home environments to enhance computational efficiency, motion optimality, and robustness against recognition and control errors, all while ensuring safety. The RLP minimizes computation time by planning with minimal waypoints and generating safe trajectories. Furthermore, overall motion optimality is improved by periodically executing trajectory planning to select more optimal motions. This approach incorporates inverse kinematics that are robust to base position errors, further enhancing robustness. Evaluation experiments demonstrated that the RLP outperformed existing methods in terms of motion planning time, motion duration, and robustness, confirming its effectiveness in home environments. Moreover, application experiments using a tidy-up task achieved high success rates and short operation times, thereby underscoring its practical feasibility.
|
|
16:45-16:50, Paper WeET20.3 | |
Diff-Refiner: Enhancing Multi-Agent Trajectory Prediction with a Plug-And-Play Diffusion Refiner |
|
Zhou, Xiangzheng | Nanjing University of Science and Technology |
Chen, Xiaobo | Shandong Technology and Business University |
Yang, Jian | Nanjing University of Science & Technology |
Keywords: Motion and Path Planning, Planning under Uncertainty
Abstract: The inherent stochasticity of agents’ behavior presents a challenge to trajectory prediction models, which are required to generate multiple plausible future trajectories. Recently, diffusion models have been applied to multimodal trajectory prediction. Existing approaches typically employ a standard diffusion process, denoising from a sample drawn from a Gaussian distribution. However, we identify that most agents exhibit an obvious movement trend, rendering many initial denoising steps redundant—they primarily transition from pure noise to an initial coarse trajectory. To address this challenge, this paper proposes a diffusion refiner that can be used along with existing multi-agent trajectory prediction models to improve their performance. Specifically, we first leverage a baseline model to predict a coarse future trajectory. Then, the diffusion model is applied as a refiner to reduce the prediction error. Moreover, our method is naturally plug-and-play, allowing convenient integration with existing models. To achieve this, we modify the traditional diffusion process so that it converges not only towards noise but also towards the coarse predictions from the baseline model. In this case, standard step-skipping sampling techniques are inapplicable, and we further propose an ordinary differential equation (ODE)-based fast sampling method. Extensive experiments with selected baseline models demonstrate the effectiveness of our approach.
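One way to picture the modified forward process mentioned above is a schedule that drifts the clean trajectory toward the baseline's coarse prediction while adding noise, so sampling can start near the coarse output instead of pure noise. The interpolation below is a hedged sketch under that reading, not the paper's exact formulation.

import torch

def forward_diffuse_toward_coarse(x0, coarse, t, T, noise_scale=1.0):
    # x0:     ground-truth future trajectory tensor.
    # coarse: baseline model's coarse prediction of the same trajectory.
    gamma = t / T                                   # 0 keeps x0, 1 reaches the coarse prediction
    mean = (1.0 - gamma) * x0 + gamma * coarse
    sigma = noise_scale * gamma
    return mean + sigma * torch.randn_like(x0)

# A refiner trained to invert this process only needs the last few (low-noise)
# steps at inference time, starting from the coarse prediction.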
|
|
16:50-16:55, Paper WeET20.4 | |
Scene-Aware Explainable Multimodal Trajectory Prediction |
|
Liu, Pei | The Hong Kong University of Science and Technology (Guangzhou) |
Liu, Haipeng | Shanghai Li Auto Co., Ltd |
Liu, Xingyu | Shenyang Agricultural University |
Li, Yiqun | Southeast University |
Chen, Junlan | Monash University |
He, Yangfan | University of Minnesota - Twin Cities |
Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Motion and Path Planning, Computer Vision for Transportation, Robust/Adaptive Control
Abstract: Advancements in intelligent technologies have significantly improved navigation in complex traffic environments by enhancing environment perception and trajectory prediction for automated vehicles. However, current research often overlooks the joint reasoning of scenario agents and lacks explainability in trajectory prediction models, limiting their practical use in real-world situations. To address this, we introduce the Explainable Conditional Diffusion-based Multimodal Trajectory Prediction (DMTP) model, which is designed to elucidate the environmental factors influencing predictions and reveal the underlying mechanisms. Our model integrates a modified conditional diffusion approach to capture multimodal trajectory patterns and employs a revised Shapley Value model to assess the significance of global and scenario-specific features. Experiments using the Waymo Open Motion Dataset demonstrate that our explainable model excels in identifying critical inputs and significantly outperforms baseline models in accuracy. Moreover, the factors identified align with the human driving experience, underscoring the model’s effectiveness in learning accurate predictions. Code is available in our open-source repository: https://github.com/ocean-luna/Explainable-Prediction.
|
|
16:55-17:00, Paper WeET20.5 | |
Safety-Critical Traffic Simulation with Adversarial Transfer of Driving Intentions |
|
Huang, Zherui | Shanghai Jiao Tong University |
Gao, Xing | Shanghai AI Lab |
Zheng, Guanjie | Shanghai Jiaotong University |
Wen, Licheng | Shanghai AI Laboratory |
Yang, Xuemeng | Shanghai Artificial Intelligence Laboratory |
Sun, Xiao | Shanghai AI Laboratory, China |
Keywords: Collision Avoidance, Intelligent Transportation Systems, Deep Learning Methods
Abstract: Traffic simulation, complementing real-world data with a long-tail distribution, allows for effective evaluation and enhancement of the ability of autonomous vehicles to handle accident-prone scenarios. However, simulating such safety-critical scenarios from log data, which mostly contain regular scenarios, is nontrivial, especially when considering the dynamic adversarial interactions between the future motions of autonomous vehicles and surrounding traffic participants. To address this, this paper proposes an innovative and efficient strategy, termed IntSim, that explicitly decouples the driving intentions of surrounding actors from their motion planning for realistic and efficient safety-critical simulation. We formulate the adversarial transfer of driving intention as an optimization problem, facilitating extensive exploration of diverse attack behaviors and efficient solution convergence. Simultaneously, intention-conditioned motion planning benefits from powerful deep models and large-scale real-world data, permitting the simulation of realistic motion behaviors for actors. Specifically, by adapting driving intentions to the environment, IntSim facilitates the flexible realization of dynamic adversarial interactions with autonomous vehicles. Finally, extensive open-loop and closed-loop experiments on real-world datasets, including nuScenes and Waymo, demonstrate that the proposed IntSim achieves state-of-the-art performance in simulating realistic safety-critical scenarios and further improves planners in handling such scenarios.
|
|
17:00-17:05, Paper WeET20.6 | |
The Radiance of Neural Fields: Democratizing Photorealistic and Dynamic Robotic Simulation |
|
Alcolado Nuthall, Georgina E | University of Surrey |
Bowden, Richard | University of Surrey |
Mendez, Oscar | University of Surrey |
Keywords: Simulation and Animation, Human-Centered Robotics, Software Tools for Robot Programming
Abstract: As robots increasingly coexist with humans, they must navigate complex, dynamic environments rich in visual information and implicit social dynamics, like when to yield or move through crowds. Addressing these challenges requires significant advances in vision-based sensing and a deeper understanding of socio-dynamic factors, particularly in tasks like navigation. To facilitate this, robotics researchers need advanced simulation platforms offering dynamic, photorealistic environments with realistic actors. Unfortunately, most existing simulators fall short, prioritizing geometric accuracy over visual fidelity, and employing unrealistic agents with fixed trajectories and low-quality visuals. To overcome these limitations, we developed a simulator that incorporates three essential elements: (1) photorealistic neural rendering of environments, (2) neurally animated human entities with behavior management, and (3) an ego-centric robotic agent providing multi-sensor output. By utilizing advanced neural rendering techniques in a dual-NeRF simulator, our system produces high-fidelity, photorealistic renderings of both environments and human entities. Additionally, it integrates a state-of-the-art Social Force Model to model dynamic human-human and human-robot interactions, creating the first photorealistic and accessible human-robot simulation system powered by neural rendering.
|
|
17:05-17:10, Paper WeET20.7 | |
Human-Robot Cooperative Distribution Coupling for Hamiltonian-Constrained Social Navigation |
|
Wang, Weizheng | Purdue University |
Yu, Chao | Tsinghua University |
Wang, Yu | Tsinghua University |
Min, Byung-Cheol | Purdue University |
Keywords: Motion and Path Planning, Acceptability and Trust, Deep Learning Methods
Abstract: Navigating in human-filled public spaces is a critical challenge for deploying autonomous robots in real-world environments. This paper introduces NaviDIFF, a novel Hamiltonian-constrained socially-aware navigation framework designed to address the complexities of human-robot interaction and socially-aware path planning. NaviDIFF integrates a port-Hamiltonian framework to model dynamic physical interactions and a diffusion model to manage uncertainty in human-robot cooperation. The framework leverages a spatial-temporal transformer to capture social and temporal dependencies, enabling more accurate spatial-temporal environmental dynamics understanding and port-Hamiltonian physical interactive process construction. Additionally, reinforcement learning from human feedback is employed to fine-tune robot policies, ensuring adaptation to human preferences and social norms. Extensive experiments demonstrate that NaviDIFF outperforms state-of-the-art methods in social navigation tasks, offering improved stability, efficiency, and adaptability.
|
|
17:10-17:15, Paper WeET20.8 | |
Crowd Perception Communication-Based Multi-Agent Path Finding with Imitation Learning |
|
Xie, Jing | National Innovation Institute of Defense Technology |
Zhang, Yongjun | National Innovation Institute of Defense Technology |
Yang, Huanhuan | National University of Defense Technology |
Ouyang, Qianying | Intelligent Game and Decision Lab; Tianjin Artificial Intelligence |
Dong, Fang | College of Computer, National University of Defense Technology |
Guo, Xinyu | Beijing Institute of Technology |
Jin, Songchang | Defense Innovation Institute |
Shi, Dianxi | Defense Innovation Institute |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Reinforcement Learning
Abstract: Deep reinforcement learning-based Multi-Agent Path Finding (MAPF) has gained significant attention due to its remarkable adaptability to environments. Existing methods primarily leverage multi-agent communication in a fully-decentralized framework to maintain scalability while enhancing information exchange among agents. However, as the number of agents and obstacles increases, the environment becomes more complex, making cooperation between agents more difficult and leading to frequent crowding. To address these issues, we propose a decentralized planner, C3PIL, which integrates a Controlled Communication mechanism for Crowd Perception and uses Imitation Learning to improve policy learning. C3PIL first introduces a crowd perception communication module that perceives environmental crowd information and incorporates it into the controlled communication, effectively preventing and mitigating crowded situations. Furthermore, we employ generative adversarial imitation learning to learn a reward function from expert experience. This reduces the potential misguidance of a fixed reward function, improves the flexibility and diversity of agent behaviors, and ultimately enables agents to cooperate effectively. Finally, experimental results show that C3PIL not only outperforms previous learning-based MAPF methods, but also further enhances the cooperation of agents and significantly reduces crowding in complex environments. The code is available at https://github.com/JJingXie/C3PIL.
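The imitation reward learned from expert experience, as described above, is commonly obtained from a discriminator that separates expert transitions from agent transitions; the snippet below shows that generic pattern only (the discriminator architecture and any reward shaping used in C3PIL are not reproduced here).

import torch

def imitation_reward(discriminator, state, action, eps=1e-8):
    # discriminator(state, action) -> probability that the pair came from expert data.
    d = torch.clamp(discriminator(state, action), eps, 1.0 - eps)
    return -torch.log(1.0 - d)   # larger when the agent's behavior looks expert-like

# This learned reward replaces a fixed hand-crafted reward during policy updates.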
|
|
WeET21 |
410 |
Integrating Motion Planning and Learning 2 |
Regular Session |
Co-Chair: Soh, Harold | National University of Singapore |
|
16:35-16:40, Paper WeET21.1 | |
TSPDiffuser: Diffusion Models As Learned Samplers for Traveling Salesperson Path Planning Problems |
|
Yonetani, Ryo | CyberAgent |
Keywords: Integrated Planning and Learning, Motion and Path Planning, Autonomous Vehicle Navigation
Abstract: This paper presents TSPDiffuser, a novel data-driven path planner for traveling salesperson path planning problems (TSPPPs) in environments rich with obstacles. Given a set of destinations within obstacle maps, our objective is to efficiently find the shortest possible collision-free path that visits all the destinations. In TSPDiffuser, we train a diffusion model on a large collection of TSPPP instances and their respective solutions to generate plausible paths for unseen problem instances. The model can then be employed as a learned sampler to construct a roadmap that contains potential solutions with a small number of nodes and edges. This approach enables efficient and accurate estimation of travel costs between destinations, effectively addressing the primary computational challenge in solving TSPPPs. Experimental evaluations with diverse synthetic and real-world indoor/outdoor environments demonstrate the effectiveness of TSPDiffuser over existing methods in terms of the trade-off between solution quality and computational time requirements.
|
|
16:40-16:45, Paper WeET21.2 | |
Anticipatory Planning for Performant Long-Lived Robot in Large-Scale Home-Like Environments |
|
Talukder, Md Ridwan Hossain | George Mason University |
Arnob, Raihan Islam | George Mason University |
Stein, Gregory | George Mason University |
Keywords: Integrated Planning and Learning, Task Planning
Abstract: We consider the setting where a robot must complete a sequence of tasks in a persistent large-scale environment, given one at a time. Existing task planners often operate myopically, focusing solely on immediate goals without considering the impact of current actions on future tasks. Anticipatory planning, which minimizes the joint objective of the immediate planning cost of the current task and the expected cost associated with future subsequent tasks, offers an approach for improving long-lived task planning. However, applying anticipatory planning in large-scale environments presents significant challenges due to the sheer number of assets involved, which strains the scalability of learning and planning. In this research, we introduce a model-based anticipatory task planning framework designed to scale to large-scale realistic environments. Our framework uses a graph neural network (GNN), via a representation inspired by a 3D scene graph, to learn the essential properties of the environment crucial to estimating a state's expected cost, together with a sampling-based procedure for practical large-scale anticipatory planning. Our experimental results show that our planner reduces task-sequence cost by 5.38% in home settings and 31.5% in restaurant settings. When given time to prepare in advance, using our model reduces task-sequence costs by 40.6% and 42.5%, respectively.
|
|
16:45-16:50, Paper WeET21.3 | |
Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotics Manipulation |
|
Zhu, MinJie | East China Normal University |
Zhu, Yichen | Midea Group |
Li, Jinming | Shanghai University |
Wen, Junjie | East China Normal University |
Xu, Zhiyuan | Midea Group |
Liu, Ning | Midea Group |
Cheng, Ran | Midea Robozone |
Shen, Chaomin | East China Normal University |
Peng, Yaxin | Shanghai University |
Feng, Feifei | Midea Group |
Tang, Jian | Midea Group (Shanghai) Co., Ltd |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation
Abstract: Diffusion Policy is a powerful technique for learning end-to-end visuomotor robot control. It is expected that Diffusion Policy possesses scalability, a key attribute for deep neural networks, typically suggesting that increasing model size would lead to enhanced performance. However, our observations indicate that Diffusion Policy in transformer architecture (DP) struggles to scale effectively; even minor additions of layers can deteriorate training outcomes. To address this issue, we introduce the Scalable Diffusion Transformer Policy (ScaleDP) for visuomotor learning. Our proposed method introduces two modules that improve the training dynamics of Diffusion Policy and allow the network to better handle multimodal action distributions. First, we identify that DP suffers from large-gradient issues, making the optimization of Diffusion Policy unstable. To resolve this, we factorize the feature embedding of the observation into multiple affine layers and integrate it into the transformer blocks. Additionally, our unmasking strategy allows the policy network to "see" future actions during prediction, helping to reduce compounding errors. We demonstrate that our proposed method successfully scales Diffusion Policy from 10 million to 1 billion parameters, with improved performance and generalization. We benchmark ScaleDP across 50 different tasks from MetaWorld and find that our largest ScaleDP outperforms DP with an average improvement of 21.6%. Across 7 real-world robot tasks, ScaleDP demonstrates an average improvement of 22.5% over DP-T on four single-arm tasks and 66.7% on three bimanual tasks. We believe our work paves the way for scaling up models for visuomotor learning. The project page is available at https://scaling-diffusion-policy.github.io/.
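The factorized affine conditioning described above resembles adaptive-normalization style modulation, where the observation embedding produces per-block scale and shift parameters; the block below is a hedged sketch of that general pattern and is not the released ScaleDP code (layer names and dimensions are assumptions).

import torch
import torch.nn as nn

class AffineConditionedBlock(nn.Module):
    """Transformer-style block whose normalization is modulated by the
    observation embedding through a small affine (scale/shift) layer."""

    def __init__(self, dim, cond_dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, obs_embedding):
        # x: (batch, tokens, dim); obs_embedding: (batch, cond_dim)
        scale, shift = self.to_scale_shift(obs_embedding).chunk(2, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h)
        return x + attn_out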
|
|
16:50-16:55, Paper WeET21.4 | |
Implicit Contact Diffuser: Sequential Contact Reasoning with Latent Point Cloud Diffusion |
|
Huang, Zixuan | University of Michigan |
He, Yinong | University of Michigan |
Lin, Yating | University of Michigan |
Berenson, Dmitry | University of Michigan |
Keywords: Deep Learning in Grasping and Manipulation, Integrated Planning and Learning
Abstract: Long-horizon contact-rich manipulation has long been a challenging problem, as it requires reasoning over both discrete contact modes and continuous object motion. We introduce Implicit Contact Diffuser (ICD), a diffusion-based model that generates a sequence of neural descriptors that specify a series of contact relationships between the object and the environment. This sequence is then used as guidance for an MPC method to accomplish a given task. The key advantage of this approach is that the latent descriptors provide more task-relevant guidance to MPC, helping to avoid local minima for contact-rich manipulation tasks. Our experiments demonstrate that ICD outperforms baselines on complex, long-horizon, contact-rich manipulation tasks, such as cable routing and notebook folding. Additionally, our experiments also indicate that ICD can generalize a target contact relationship to a different environment.
|
|
16:55-17:00, Paper WeET21.5 | |
Diffusion Meets Options: Hierarchical Generative Skill Composition for Temporally-Extended Tasks |
|
Feng, Zeyu | National University of Singapore |
Luan, Hao | National University of Singapore |
Ma, Kevin Yuchen | National University of Singapore |
Soh, Harold | National University of Singapore |
Keywords: Reinforcement Learning, Learning from Demonstration, Hybrid Logical/Dynamical Planning and Verification
Abstract: Safe and successful deployment of robots requires not only the ability to generate complex plans but also the capacity to frequently replan and correct execution errors. This paper addresses the challenge of long-horizon trajectory planning under temporally extended objectives in a receding horizon manner. To this end, we propose DOPPLER, a data-driven hierarchical framework that generates and updates plans based on instructions specified in linear temporal logic (LTL). Our method decomposes temporal tasks into a chain of options with hierarchical reinforcement learning from offline non-expert datasets. It leverages diffusion models to generate options with low-level actions. We devise a determinantal-guided posterior sampling technique during batch generation, which improves the speed and diversity of diffusion-generated options, leading to more efficient querying. Experiments on robot navigation and manipulation tasks demonstrate that DOPPLER can generate sequences of trajectories that progressively satisfy the specified formulae for obstacle avoidance and sequential visitation.
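The determinantal guidance mentioned above is in the spirit of DPP-style selection, which favors batches of options that are mutually dissimilar; a simple greedy variant over candidate option embeddings is sketched below (kernel choice, seed, and selection size are illustrative assumptions, not the DOPPLER sampler).

import numpy as np

def greedy_diverse_subset(embeddings, k, length_scale=1.0):
    # embeddings: (num_candidates, dim) features of diffusion-generated options.
    dists = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    kernel = np.exp(-(dists ** 2) / (2 * length_scale ** 2))
    selected = [0]                                   # arbitrary seed candidate
    while len(selected) < k:
        best_idx, best_logdet = None, -np.inf
        for i in range(len(embeddings)):
            if i in selected:
                continue
            idx = selected + [i]
            sub = kernel[np.ix_(idx, idx)] + 1e-6 * np.eye(len(idx))
            _, logdet = np.linalg.slogdet(sub)       # larger determinant = more diverse set
            if logdet > best_logdet:
                best_idx, best_logdet = i, logdet
        selected.append(best_idx)
    return selected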
|
|
17:00-17:05, Paper WeET21.6 | |
PRESTO: Fast Motion Planning Using Diffusion Models Based on Key-Configuration Environment Representation |
|
Seo, Mingyo | The University of Texas at Austin |
Cho, Yoonyoung | KAIST |
Sung, Yoonchang | The University of Texas at Austin |
Stone, Peter | University of Texas at Austin |
Zhu, Yuke | The University of Texas at Austin |
Kim, Beomjoon | Korea Advanced Institute of Science and Technology |
Keywords: Motion and Path Planning, Collision Avoidance, Integrated Planning and Learning
Abstract: We introduce a learning-guided motion planning framework that generates seed trajectories using a diffusion model for trajectory optimization. Given a workspace, our method approximates the configuration space (C-space) obstacles through an environment representation consisting of a sparse set of task-related key configurations, which is then used as a conditioning input to the diffusion model. The diffusion model integrates regularization terms that encourage smooth, collision-free trajectories during training, and trajectory optimization refines the generated seed trajectories to correct any colliding segments. Our experimental results demonstrate that high-quality trajectory priors, learned through our C-space-grounded diffusion model, enable the efficient generation of collision-free trajectories in narrow-passage environments, outperforming previous learning- and planning-based baselines. Videos and additional materials can be found on the project page: https://kiwi-sherbet.github.io/PRESTO.
|
|
17:05-17:10, Paper WeET21.7 | |
Demonstration Data-Driven Parameter Adjustment for Trajectory Planning in Highly Constrained Environments |
|
Lu, Wangtao | Zhejiang University |
Chen, Lei | Beijing Institute of Spacecraft System Engineering |
Wang, Yunkai | Zhejiang University |
Wei, Yufei | Zhejiang University |
Wu, Zifei | Zhejiang University |
Xiong, Rong | Zhejiang University |
Wang, Yue | Zhejiang University |
Keywords: Motion and Path Planning, Learning from Demonstration
Abstract: Trajectory planning in highly constrained environments is crucial for robotic navigation. Classical algorithms are widely used for their interpretability, generalization, and system robustness. However, these algorithms often require parameter retuning when adapting to new scenarios. To address this issue, we propose a demonstration data-driven reinforcement learning (RL) method for automatic parameter adjustment. Our approach includes two main components: a front-end policy network and a back-end asynchronous controller. The policy network selects appropriate parameters for the trajectory planner, while a discriminator in a Conditional Generative Adversarial Network (CGAN) evaluates the planned trajectory, using this evaluation as an imitation reward in RL. The asynchronous controller is employed for high-frequency trajectory tracking. Experiments conducted in both simulation and the real world demonstrate that our proposed method significantly enhances the performance of classical algorithms.
|
|
17:10-17:15, Paper WeET21.8 | |
VLM-Social-Nav: Socially Aware Robot Navigation through Scoring Using Vision-Language Models |
|
Song, Daeun | George Mason University |
Liang, Jing | University of Maryland |
Payandeh, Amirreza | George Mason University |
Raj, Amir Hossain | George Mason University |
Xiao, Xuesu | George Mason University |
Manocha, Dinesh | University of Maryland |
Keywords: Motion and Path Planning, Task and Motion Planning, Integrated Planning and Control
Abstract: We propose VLM-Social-Nav, a novel Vision-Language Model (VLM) based navigation approach to compute a robot's motion in human-centered environments. Our goal is to make real-time decisions on robot actions that are socially compliant with human expectations. We utilize a perception model to detect important social entities and prompt a VLM to generate guidance for socially compliant robot behavior. VLM-Social-Nav uses a VLM-based scoring module that computes a cost term that ensures socially appropriate and effective robot actions generated by the underlying planner. Our overall approach reduces reliance on large training datasets and enhances adaptability in decision-making. In practice, it results in improved socially compliant navigation in human-shared environments. We demonstrate and evaluate our system in four different real-world social navigation scenarios with a Turtlebot robot. We observe at least 27.38% improvement in the average success rate and 19.05% improvement in the average collision rate in the four social navigation scenarios. Our user study score shows that VLM-Social-Nav generates the most socially compliant navigation behavior.
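The VLM-based scoring module described above contributes an extra cost term to the underlying planner; a minimal sketch of that composition is given below, where vlm_score in [0, 1] and the weight lambda_social are illustrative placeholders rather than the paper's exact formulation.

def total_action_cost(base_cost, vlm_score, lambda_social=1.0):
    # base_cost:  geometric/kinodynamic cost of a candidate robot action.
    # vlm_score:  social-compliance score in [0, 1] parsed from the VLM's response.
    social_penalty = 1.0 - vlm_score
    return base_cost + lambda_social * social_penalty

# The planner executes the candidate with the lowest combined cost, so socially
# non-compliant actions are penalized without retraining the planner itself.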
|
|
WeET22 |
411 |
Deep Learning for Visual Perception 3 |
Regular Session |
Co-Chair: Hsu, Winston | National Taiwan University |
|
16:35-16:40, Paper WeET22.1 | |
Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension |
|
Guan, Runwei | University of Liverpool |
Zhang, Ruixiao | University of Southampton |
Ouyang, Ningwei | University of Liverpool |
Liu, Jianan | Momoni AI |
Man, Ka Lok | Xi'an Jiaotong-Liverpool University |
Cai, Xiaohao | University of Southampton |
Xu, Ming | Xi'an Jiaotong-Liverpool University |
Smith, Jeremy S. | University of Liverpool |
Lim, Eng Gee | Xi'an Jiaotong-Liverpool University |
Yue, Yutao | Hong Kong University of Science and Technology (Guangzhou) |
Xiong, Hui | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Deep Learning for Visual Perception, Data Sets for Robotic Vision, Intelligent Transportation Systems
Abstract: Embodied perception is essential for intelligent vehicles and robots in interactive environmental understanding. However, these advancements primarily focus on vision, with limited attention given to using 3D modeling sensors, restricting a comprehensive understanding of objects in response to prompts containing qualitative and quantitative queries. Recently, as a promising automotive sensor with affordable cost, 4D millimeter-wave radars provide denser point clouds than conventional radars and perceive both semantic and physical characteristics of objects, thereby enhancing the reliability of perception systems. To foster the development of natural language-driven context understanding in radar scenes for 3D visual grounding, we construct the first dataset, Talk2Radar, which bridges these two modalities for 3D Referring Expression Comprehension (REC). Talk2Radar contains 8,682 referring prompt samples with 20,558 referred objects. Moreover, we propose a novel model, T-RadarNet, for 3D REC on point clouds, achieving State-Of-The-Art (SOTA) performance on the Talk2Radar dataset compared to counterparts. Deformable-FPN and Gated Graph Fusion are meticulously designed for efficient point cloud feature modeling and cross-modal fusion between radar and text features, respectively. Comprehensive experiments provide deep insights into radar-based 3D REC. We release our project at https://github.com/GuanRunwei/Talk2Radar.
|
|
16:40-16:45, Paper WeET22.2 | |
Improving Generalization Ability for 3D Object Detection by Learning Sparsity-Invariant Features |
|
Lu, Hsin-Cheng | National Taiwan University |
Lin, Chungyi | National Taiwan University |
Hsu, Winston | National Taiwan University |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization, Visual Learning
Abstract: In autonomous driving, 3D object detection is essential for accurately identifying and tracking objects. Despite the continuous development of various technologies for this task, a significant drawback is observed in most of them—they experience substantial performance degradation when detecting objects in unseen domains. In this paper, we propose a method to improve the generalization ability for 3D object detection on a single domain. We primarily focus on generalizing from a single source domain to target domains with distinct sensor configurations and scene distributions. To learn sparsity-invariant features from a single source domain, we selectively subsample the source data to a specific beam, using confidence scores determined by the current detector to identify the density that holds utmost importance for the detector. Subsequently, we employ the teacher-student framework to align the Bird's Eye View (BEV) features for different point clouds densities. We also utilize feature content alignment (FCA) and graph-based embedding relationship alignment (GERA) to instruct the detector to be domain-agnostic. Extensive experiments demonstrate that our method exhibits superior generalization capabilities compared to other baselines. Furthermore, our approach even outperforms certain domain adaptation methods that can access to the target domain data. The code is available at https://github.com/Tiffamy/3DOD-LSF.
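The beam subsampling step described above can be approximated by binning LiDAR points by elevation angle and keeping a subset of beams; the helper below is a simplified illustration (the confidence-score-driven choice of target density is omitted, and bin handling is an assumption).

import numpy as np

def subsample_beams(points, source_beams=64, target_beams=32):
    # points: (N, 3+) LiDAR points with x, y, z in the first three columns.
    elevation = np.arctan2(points[:, 2], np.linalg.norm(points[:, :2], axis=1))
    edges = np.linspace(elevation.min(), elevation.max(), source_beams + 1)
    beam_id = np.clip(np.digitize(elevation, edges) - 1, 0, source_beams - 1)
    keep_every = source_beams // target_beams
    return points[beam_id % keep_every == 0]   # e.g. keep every other beam for 64 -> 32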
|
|
16:45-16:50, Paper WeET22.3 | |
Camera-Lidar Consistent Neural Radiance Fields |
|
Hou, Chao | The University of Hong Kong |
Zhang, Fu | University of Hong Kong |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Sensor Fusion
Abstract: Neural Radiance Fields (NeRFs) have become a leading technique for novel view synthesis, with promising applications in robotics. However, due to shape-radiance ambiguity, NeRFs often require additional depth inputs for regularization in outdoor scenarios. LiDAR provides accurate depth measurements, but current methods typically combine only a few frames, resulting in sparse depth maps and discrepancies with camera images. The asynchronous nature of LiDAR, where each point is captured at a different timestamp, introduces depth inaccuracies when the points are treated as simultaneous. These errors, along with inherent LiDAR noise, create inconsistencies that hinder reconstruction accuracy. To address these challenges, we propose a continuous-time framework for joint Camera-LiDAR optimization, enabling more consistent radiance field reconstruction and improving both view synthesis and geometric accuracy.
|
|
16:50-16:55, Paper WeET22.4 | |
Iterative Volume Fusion for Asymmetric Stereo Matching |
|
Gao, Yuanting | Tsinghua University |
Shen, Linghao | Sony (China) Ltd |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, AI-Based Methods
Abstract: Stereo matching is vital in 3D computer vision, with most algorithms assuming symmetric visual properties between binocular visions. However, the rise of asymmetric multi-camera systems (e.g., tele-wide cameras) challenges this assumption and complicates stereo matching. Visual asymmetry disrupts stereo matching by affecting the crucial cost volume computation. To address this, we explore the matching cost distribution of two established cost volume construction methods in asymmetric stereo. We find that each cost volume experiences distinct information distortion, indicating that both should be comprehensively utilized to solve the issue. Based on this, we propose the two-phase Iterative Volume Fusion network for Asymmetric Stereo matching (IVF-AStereo). Initially, the aggregated concatenation volume refines the correlation volume. Subsequently, both volumes are fused to enhance fine details. Our method excels in asymmetric scenarios and shows robust performance against significant visual asymmetry. Extensive comparative experiments on benchmark datasets, along with ablation studies, confirm the effectiveness of our approach in asymmetric stereo with resolution and color degradation.
|
|
16:55-17:00, Paper WeET22.5 | |
OccRWKV: Rethinking Efficient 3D Semantic Occupancy Prediction with Linear Complexity |
|
Wang, Junming | The University of Hong Kong |
Yin, Wei | University of Adelaide |
Long, Xiaoxiao | The University of Hong Kong |
Zhang, Xingyu | Horizon Robotics |
Xing, Zebin | UCAS |
Guo, Xiaoyang | Horizon Robotics |
Zhang, Qian | Horizon Robotics |
Keywords: Deep Learning for Visual Perception, Visual Learning, Computer Vision for Automation
Abstract: 3D semantic occupancy prediction networks have demonstrated remarkable capabilities in reconstructing the geometric and semantic structure of 3D scenes, providing crucial information for robot navigation and autonomous driving systems. However, due to their large overhead from dense network structure designs, existing networks face challenges balancing accuracy and latency. In this paper, we introduce OccRWKV, an efficient semantic occupancy network inspired by Receptance Weighted Key Value (RWKV). OccRWKV separates semantics, occupancy prediction, and feature fusion into distinct branches, each incorporating Sem-RWKV and Geo-RWKV blocks. These blocks are designed to capture long-range dependencies, enabling the network to learn domain-specific representation (i.e., semantics and geometry), which enhances prediction accuracy. Leveraging the sparse nature of real-world 3D occupancy, we reduce computational overhead by projecting features into the bird's-eye view (BEV) space and propose a BEV-RWKV block for efficient feature enhancement and fusion. This enables real-time inference at 22.2 FPS without compromising performance. Experiments demonstrate that OccRWKV outperforms the state-of-the-art methods on the SemanticKITTI dataset, achieving a mIoU of 25.1 while being 20 times faster than the best baseline, Co-Occ, making it suitable for real-time deployment on robots to enhance autonomous navigation efficiency. Code and video are available on our project page: https://jmwang0117.github.io/OccRWKV/.
|
|
17:00-17:05, Paper WeET22.6 | |
ZSORN: Language-Driven Object-Centric Zero-Shot Object Retrieval and Navigation |
|
Guan, Tianrui | University of Maryland |
Yang, Yurou | Amazon |
Cheng, Harry | Amazon |
Lin, Muyuan | Amazon.com LLC |
Kim, Richard | Amazon, Lab126 |
Madhivanan, Rajasimman | Amazon.com |
Sen, Arnab | Amazon |
Manocha, Dinesh | University of Maryland |
Keywords: Deep Learning for Visual Perception, Vision-Based Navigation
Abstract: In this paper, we present ZSORN, a novel language-driven object-centric image representation for object retrieval and navigation tasks within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method can achieve an improvement of 1.38 - 13.38% in terms of text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and the real world, with 5% and 16.67% improvements in navigation success rate, respectively.
|
|
17:05-17:10, Paper WeET22.7 | |
PRIDEV: A Plug-And-Play Refinement for Improved Depth Estimation in Videos |
|
Xu, Jing | Peking University |
Liu, Hong | Peking University |
Wu, Jianbing | Peking University |
Xu, Xinhua | Peking University |
Keywords: RGB-D Perception, Deep Learning for Visual Perception
Abstract: Monocular video depth estimation is a key challenge in computer vision, highlighting its importance in visual understanding. Monocular depth estimation models trained on single images achieve impressive results on individual frames but often lack temporal consistency when applied to videos, leading to flickering and artifacts. Current video depth estimation methods often rely on additional optical flow or camera poses, which are limited in accuracy, require careful design, and lack robustness. To address this, we propose a plug-and-play method that seamlessly transfers the robustness of image depth estimation to video depth estimation. By leveraging powerful priors from image depth estimation, our method enhances the performance of video depth estimation without requiring additional conditional inputs or extensive pretraining on large and expensive video datasets. We introduce the Temporal Depth Stabilization Module (TDSM), which can seamlessly inflate an image monocular depth estimation model into a video depth estimation model, enabling unified modeling of depth across video sequences and capturing the temporal cues in video. We validate the effectiveness and efficiency of our method across various datasets (e.g., normal and challenging conditions) and different backbones. Extensive experiments demonstrate that our simple and effective method significantly improves monocular depth estimation networks, achieving new state-of-the-art accuracy in both spatial and temporal dimensions.
|
|
WeET23 |
412 |
Learning for Control |
Regular Session |
Chair: Manchester, Zachary | Carnegie Mellon University |
Co-Chair: Beckers, Thomas | Vanderbilt University |
|
16:35-16:40, Paper WeET23.1 | |
Unsupervised Meta-Testing with Conditional Neural Processes for Hybrid Meta-Reinforcement Learning |
|
Ada, Suzan Ece | Bogazici University |
Ugur, Emre | Bogazici University |
Keywords: Reinforcement Learning, Deep Learning Methods, Machine Learning for Robot Control
Abstract: We introduce Unsupervised Meta-Testing with Conditional Neural Processes (UMCNP), a novel hybrid few-shot meta-reinforcement learning (meta-RL) method that uniquely combines, yet distinctly separates, parameterized policy gradient-based (PPG) and task inference-based few-shot meta-RL. Tailored for settings where the reward signal is missing during meta-testing, our method increases sample efficiency without requiring additional samples in meta-training. UMCNP leverages the efficiency and scalability of Conditional Neural Processes (CNPs) to reduce the number of online interactions required in meta-testing. During meta-training, samples previously collected through PPG meta-RL are efficiently reused for learning task inference in an offline manner. UMCNP infers the latent representation of the transition dynamics model from a single test task rollout with unknown parameters. This approach allows us to generate rollouts for self-adaptation by interacting with the learned dynamics model. We demonstrate our method can adapt to an unseen test task using significantly fewer samples during meta-testing than the baselines in 2D-Point Agent and continuous control meta-RL benchmarks, namely, cartpole with unknown angle sensor bias, walker agent with randomized dynamics parameters.
|
|
16:40-16:45, Paper WeET23.2 | |
Efficient Online Learning of Contact Force Models for Connector Insertion |
|
Tracy, Kevin | Carnegie Mellon University |
Manchester, Zachary | Carnegie Mellon University |
Jain, Ajinkya | Intrinsic Innovation LLC |
Go, Keegan | Intrinsic Innovation LLC |
Schaal, Stefan | Google X |
Erez, Tom | Google |
Tassa, Yuval | University of Washington |
Keywords: Model Learning for Control, Calibration and Identification, Dexterous Manipulation
Abstract: Contact-rich manipulation tasks with stiff frictional elements, like connector insertion, are difficult to model with rigid-body simulators. In this work, we propose a new approach for modeling these environments by learning a quasi-static contact force model instead of a full simulator. Using a feature vector that contains information about the configuration and control, we find a linear mapping adequately captures the relationship between this feature vector and the sensed contact forces. A novel Linear Model Learning (LML) algorithm is used to solve for the globally optimal mapping in real time without any matrix inversions, resulting in an algorithm that runs in nearly constant time on a GPU as the model size increases. We validate the proposed approach for connector insertion in both simulation and hardware experiments, where the learned model is combined with an optimization-based impedance controller to achieve smooth insertions in the presence of misalignments and uncertainty. Our website featuring videos, code, and more materials is available at https://model-based-plugging.github.io/.
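As a rough picture of the quasi-static force model described above, the snippet below fits a linear map from a configuration/control feature vector to the sensed contact wrench with an inversion-free online update; it uses plain stochastic gradient descent for illustration and is not the paper's LML algorithm, which solves for the globally optimal mapping.

import numpy as np

class OnlineLinearForceModel:
    # Predicts contact wrench w ~ A @ phi and refines A from streamed measurements.

    def __init__(self, feature_dim, force_dim, lr=1e-2):
        self.A = np.zeros((force_dim, feature_dim))
        self.lr = lr

    def predict(self, features):
        return self.A @ features

    def update(self, features, measured_force):
        # Single gradient step on the squared prediction error; no matrix inversion.
        error = self.predict(features) - measured_force
        self.A -= self.lr * np.outer(error, features)
        return float(error @ error)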
|
|
16:45-16:50, Paper WeET23.3 | |
Flying Quadrotors in Tight Formations Using Learning-Based Model Predictive Control |
|
Chee, Kong Yao | University of Pennsylvania |
Hsieh, Pei-An | University of Pennsylvania |
Pappas, George J. | University of Pennsylvania |
Hsieh, M. Ani | University of Pennsylvania |
Keywords: Model Learning for Control, Machine Learning for Robot Control, Aerial Systems: Mechanics and Control
Abstract: Flying quadrotors in tight formations is a challenging problem. It is known that in the near-field airflow of a quadrotor, the aerodynamic effects induced by the propellers are complex and difficult to characterize. Although machine learning tools can potentially be used to derive models that capture these effects, these data-driven approaches can be sample inefficient and the resulting models often do not generalize as well as their first-principles counterparts. In this work, we propose a framework that combines the benefits of first-principles modeling and data-driven approaches to construct an accurate and sample efficient representation of the complex aerodynamic effects resulting from quadrotors flying in formation. The data-driven component within our model is lightweight, making it amenable for optimization-based control design. Through simulations and physical experiments, we show that incorporating the model into a novel learning-based nonlinear model predictive control (MPC) framework results in substantial performance improvements in terms of trajectory tracking and disturbance rejection. In particular, our framework significantly outperforms nominal MPC in physical experiments, achieving a 40.1% improvement in the average trajectory tracking errors and a 57.5% reduction in the maximum vertical separation errors. Our framework also achieves exceptional sample efficiency, using only a total of 46 seconds of flight data for training across both simulations and physical experiments. Furthermore, with our proposed framework, the quadrotors achieve an exceptionally tight formation, flying with an average separation of less than 1.5 body lengths throughout the flight.
|
|
16:50-16:55, Paper WeET23.4 | |
Learning Based MPC for Autonomous Driving Using a Low Dimensional Residual Model |
|
Li, Yaoyu | Tsinghua University |
Huang, Chaosheng | Tsinghua University |
Yang, Dongsheng | BYD Automotive New Technology Research Institute |
Liu, Wenbo | Tsinghua University |
Li, Jun | Tsinghua University |
Keywords: Model Learning for Control, Machine Learning for Robot Control, Motion Control
Abstract: In this paper, a learning-based Model Predictive Control (MPC) approach using a low-dimensional residual model is proposed for autonomous driving. One of the critical challenges in autonomous driving is the complexity of vehicle dynamics, which impedes the formulation of an accurate vehicle model. An inaccurate vehicle model can significantly impact the performance of the MPC controller. To address this issue, this paper decomposes the nominal vehicle model into invariable and variable elements. The accuracy of the invariable elements is ensured by calibration, while the deviations in the variable elements are learned by a low-dimensional residual model. The features of the residual model are selected as the physical variables most correlated with nominal model errors. Physical constraints among these features are formulated to explicitly define the valid region within the feature space. The formulated model and constraints are incorporated into the MPC framework and validated through both simulation and real vehicle experiments. The results indicate that the proposed method significantly enhances model accuracy and controller performance.
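The decomposition described above amounts to a calibrated nominal prediction plus a small learned correction computed from a few physically meaningful features; the sketch below is schematic, with placeholder feature selection and constraint handling rather than the paper's actual model.

import numpy as np

def predict_next_state(nominal_model, residual_model, state, control):
    # Calibrated nominal dynamics plus a low-dimensional learned residual correction.
    x_nominal = nominal_model(state, control)
    features = residual_features(state, control)
    return x_nominal + residual_model(features)

def residual_features(state, control):
    # Placeholder: the physical variables most correlated with nominal-model error,
    # clipped to the region of the feature space where the residual model is valid.
    feats = np.array([state[2], state[3], control[0]])
    return np.clip(feats, -1.0, 1.0)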
|
|
16:55-17:00, Paper WeET23.5 | |
Modeling of Deformable Linear Objects under Incomplete State Information |
|
Klankers, Marc Kilian | Technische Universität Braunschweig |
Steil, Jochen J. | Technische Universität Braunschweig |
Keywords: Model Learning for Control, Machine Learning for Robot Control, Modeling, Control, and Learning for Soft Robots
Abstract: The robot-based tracking of highly dynamic end-point motions of deformable linear objects (DLOs) remains challenging due to their non-linear behavior. Since simple feedback control is infeasible, model-based control offers the potential to account for the non-linear effects, but requires computationally efficient and accurate models. Promising results have been achieved with data-driven models that introduce a latent kinematic chain as a model of the DLO and map measurements of the tip position into its latent joint space, in which the dynamic motion model is learned. So far, this approach has the limitation that it cannot handle situations of incomplete sensory information, for instance when occlusion occurs. Consequently, this paper introduces a fusion network architecture capable of making predictions even if sensory information is incomplete. We achieve additional estimation of the latent joint state by learning data-driven inverse kinematics with the help of wrench measurements at the DLO base, and evaluate our approach by simulating occlusion. We demonstrate the computational effectiveness of our approach for in-the-loop control tasks.
|
|
17:00-17:05, Paper WeET23.6 | |
Impedance Primitive-Augmented Hierarchical Reinforcement Learning for Sequential Tasks |
|
Berjaoui Tahmaz, Amin | TU Delft |
Prakash, Ravi | Indian Institute of Science |
Kober, Jens | TU Delft |
Keywords: Reinforcement Learning, Compliance and Impedance Control, Task and Motion Planning
Abstract: This paper presents an Impedance Primitive-augmented hierarchical reinforcement learning framework for efficient robotic manipulation in sequential contact tasks. We leverage this hierarchical structure to sequentially execute behavior primitives with variable stiffness control capabilities for contact tasks. Our proposed approach relies on three key components: an action space enabling variable stiffness control, an adaptive stiffness controller for dynamic stiffness adjustments during primitive execution, and affordance coupling for efficient exploration while encouraging compliance. Through comprehensive training and evaluation, our framework learns efficient stiffness control capabilities and demonstrates improvements in learning efficiency, compositionality in primitive selection, and success rates compared to the state-of-the-art. The training environments include block lifting, door opening, object pushing, and surface cleaning. Real world evaluations further confirm the framework's sim2real capability. This work lays the foundation for more adaptive and versatile robotic manipulation systems, with potential applications in more complex contact-based tasks.
|
|
17:05-17:10, Paper WeET23.7 | |
Plug-And-Play Physics-Informed Learning Using Uncertainty Quantified Port-Hamiltonian Models |
|
Tan, Kaiyuan | Washington University in St.Louis |
Li, Peilun | Vanderbilt University |
Wang, Jun | Washington University in St. Louis |
Beckers, Thomas | Vanderbilt University |
Keywords: Model Learning for Control, AI-Based Methods, Calibration and Identification
Abstract: The ability to predict trajectories of surrounding agents and obstacles is a crucial component in many robotic applications. Data-driven approaches are commonly adopted for state prediction in scenarios where the underlying dynamics are unknown. However, the performance, reliability, and uncertainty of data-driven predictors become compromised when encountering out-of-distribution observations relative to the training data. In this paper, we introduce a Plug-and-Play Physics-Informed Machine Learning (PnP-PIML) framework to address this challenge. Our method employs conformal prediction to identify outlier dynamics and, in that case, switches from a nominal predictor to a physics-consistent model, namely distributed Port-Hamiltonian systems (dPHS). We leverage Gaussian processes to model the energy function of the dPHS, enabling not only the learning of system dynamics but also the quantification of predictive uncertainty through its Bayesian nature. In this way, the proposed framework produces reliable physics-informed predictions even for the out-of-distribution scenarios.
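The switching logic described above can be summarized as: calibrate a conformal threshold on held-out residuals of the nominal data-driven predictor, then fall back to the physics-consistent model whenever a new residual exceeds it. The sketch below simplifies the conformal score and the Gaussian-process dPHS model to opaque callables and is illustrative only.

import numpy as np

def conformal_threshold(calibration_residuals, alpha=0.1):
    # Finite-sample-adjusted (1 - alpha) quantile of the nominal predictor's residuals.
    n = len(calibration_residuals)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(calibration_residuals, level)

def predict_with_fallback(nominal_predictor, physics_predictor, observation,
                          recent_residual, threshold):
    # In-distribution: trust the data-driven model; otherwise use the physics-informed one.
    if recent_residual > threshold:
        return physics_predictor(observation)
    return nominal_predictor(observation)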
|
|
17:10-17:15, Paper WeET23.8 | |
Robust Proximal Adversarial Reinforcement Learning under Model Mismatch |
|
Zhai, Peng | Fudan University |
Wei, Xiaoyi | Fudan University |
Hou, Taixian | FuDan University |
Ji, Xiaopeng | Zhejiang University |
Dong, Zhiyan | Fudan University |
Yi, Jiafu | Hainan University |
Zhang, Lihua | Fudan University |
Keywords: Reinforcement Learning, Robust/Adaptive Control
Abstract: Reinforcement learning (RL) can generate high-performance control policies for complex tasks in simulation through an end-to-end approach. However, the RL policy is not robust to uncertainties caused by modeling mismatch between simulation and real environments, making it difficult to transfer to the real world. In response to the above challenge, this letter introduces a lightweight and efficient robust RL algorithm. The algorithm transforms the optimization objective of the adversary from a long-term cumulative reward to a short-term reward, making the adversary focus on the performance in the near future. Additionally, the adversarial actions are projected onto a finite subset within the perturbation space using projected gradient descent, effectively constraining the adversary's strength and obtaining more robust policies. Extensive experiments in both simulated and real environments show that our algorithm improves the generalization ability of the policy for the modeling mismatch, outperforming the next best prior methods across almost all environments.
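The projection step described above constrains the adversary with projected gradient ascent on a short-horizon objective, keeping perturbations inside a small bounded set; the snippet below is a generic sketch with illustrative bound, step size, and iteration count, not the paper's exact procedure.

import torch

def bounded_adversarial_perturbation(short_term_loss, state, epsilon=0.05,
                                     step_size=0.01, num_steps=5):
    # short_term_loss(state) is differentiable and grows as near-term performance degrades.
    delta = torch.zeros_like(state, requires_grad=True)
    for _ in range(num_steps):
        loss = short_term_loss(state + delta)
        loss.backward()
        with torch.no_grad():
            delta += step_size * delta.grad.sign()   # ascend the short-term loss
            delta.clamp_(-epsilon, epsilon)          # project back onto the bounded set
            delta.grad.zero_()
    return delta.detach()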
|
| |