Last updated on May 14, 2025. This conference program is tentative and subject to change.
Technical Program for Tuesday May 20, 2025
|
TuLB1R Poster Session, Hall A1/A2 |
Add to My Program |
Late Breaking Results 1 |
|
|
|
09:30-09:55, Paper TuLB1R.1 | Add to My Program |
Human Factors Characterization and Robot Task Scheduling in Human-Robot Collaborative Tasks |
|
Wang, Weitian | Montclair State University |
Keywords: Human Factors and Human-in-the-Loop, Human-Centered Robotics, Human-Centered Automation
Abstract: Collaborative robots have been widely used in emerging working contexts such as smart manufacturing. However, enabling robots to understand human factors (HF, e.g., trust, comfort, acceptance) remains a gap in the research community. In this Late Breaking poster presentation, we will show the preliminary results of our work on human factors characterization and robot task scheduling in human-robot collaborative tasks. We found that human factors-related physiological and physical information can be used to characterize HF levels in human-robot partnerships. This work has the potential to improve human-robot collaboration safety and productivity in smart manufacturing contexts.
|
|
09:30-09:55, Paper TuLB1R.2 | Add to My Program |
Generative Fuzzy Rules from Expert Demonstration for Robust Imitation Learning |
|
Lee, Sangmoon | Kyungpook National University |
Kim, Joonsu | Kyungpook National University |
Park, Ju H. | Yeungnam University |
Keywords: Imitation Learning, Neural and Fuzzy Control, Learning from Demonstration
Abstract: Imitation learning allows robots to acquire complex actions by imitating expert demonstrations, removing the need for manual control policy design or reward shaping. It has been successfully applied in manipulation, locomotion, and navigation tasks. However, existing approaches often rely on deep neural networks, which require large amounts of expert data and suffer from training instability, high computational costs, and poor interpretability. We propose a fuzzy rule-based imitation learning method that generates interpretable IF-THEN rules from expert demonstrations. Using the Mamdani product and a weighted average mechanism, actions are produced with fewer learnable parameters. This improves learning efficiency, enhances transparency, and enables dynamic adaptation to new conditions without additional expert data. The approach is validated on a 7-DoF robotic manipulator, showing robust and generalizable action generation under varied scenarios.
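A minimal sketch of the inference step the abstract describes, Mamdani product firing strengths followed by weighted-average defuzzification, assuming Gaussian membership functions and hypothetical rule parameters (not the authors' implementation):

    import numpy as np

    def fuzzy_policy(x, centers, widths, consequents):
        # Membership of each state dimension to each rule's antecedent (Gaussian MFs)
        mu = np.exp(-0.5 * ((x[None, :] - centers) / widths) ** 2)
        # Mamdani product: firing strength of each rule is the product over dimensions
        w = mu.prod(axis=1)                                   # (n_rules,)
        # Weighted average of the per-rule consequent actions (THEN parts)
        return (w[:, None] * consequents).sum(axis=0) / (w.sum() + 1e-9)

    # Hypothetical sizes: 5 rules, 3-D state, 2-D action
    rng = np.random.default_rng(0)
    centers = rng.uniform(-1, 1, (5, 3))       # antecedent centers (learned from demonstrations)
    widths = np.full((5, 3), 0.5)              # antecedent widths
    consequents = rng.uniform(-1, 1, (5, 2))   # per-rule actions
    action = fuzzy_policy(np.array([0.2, -0.1, 0.4]), centers, widths, consequents)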
|
|
09:30-09:55, Paper TuLB1R.3 | Add to My Program |
EffiDynaMix: A Novel Efficient Gaussian Mixture Model for Robot Inertia Model Learning with Dynamics-Motivated Optimal Excitation |
|
Kim, Taehoon | DGIST (Daegu Gyeongbuk Institute of Science & Technology) |
Choi, Kiyoung | Daegu Gyeongbuk Institute of Science and Technology |
Kong, Taejune | DGIST |
Samuel, Kangwagye | Technical University of Munich |
Lee, Hyunwook | Gyeongsang National University |
Oh, Sehoon | DGIST |
Keywords: Data Sets for Robot Learning, Model Learning for Control, Motion Control
Abstract: In this paper, a novel and efficient non-parametric dynamics modeling method, called EffiDynaMix, is proposed. This gray-box, data-driven technique combines the Gaussian Mixture Model (GMM) with the mathematical inertia model of robot behavior. This integration simplifies the creation of training datasets and enhances the model's generalization capabilities. Extensive testing via simulations and experiments on a robot manipulator indicates that EffiDynaMix outperforms conventional data-driven methods such as non-Optimal GMM (nonOpt-GMM), chirp-trained GMM (chirp-GMM), Sparse Gaussian Processes (SGP), Long Short-Term Memory (LSTM) networks, and Radial Basis Function Neural Networks (RBFNN) in both training efficiency and accuracy, particularly in new data scenarios. By incorporating system dynamics, EffiDynaMix not only streamlines the learning process but also ensures smooth adaptation to unfamiliar situations.
|
|
09:30-09:55, Paper TuLB1R.4 | Add to My Program |
Nonlinear Disturbance Observer-Based Robust Control for Flapping-Wing Aerial Manipulators |
|
Asignacion, Abner Jr | Chiba University |
Suzuki, Satoshi | Chiba University |
Keywords: Aerial Systems: Mechanics and Control, Robust/Adaptive Control, Aerial Systems: Applications
Abstract: Flapping-wing aerial manipulators offer promising capabilities for efficient and lightweight robotic applications but remain highly susceptible to modeling uncertainties and external disturbances due to their nonlinear, time-varying dynamics and low Reynolds number operation. These challenges significantly hinder their ability to perform stable manipulation tasks. Although robust control presents a viable solution, its implementation is nontrivial for flapping-wing platforms due to complex dynamics and susceptibility to uncertainties and external disturbances. This work proposes a parallel nonlinear disturbance observer-based (NDOB) robust controller targeting the altitude and longitudinal dynamics of an X-shaped flapping-wing UAV. The controller is designed to suppress undesired motion during manipulation, improving stability and disturbance rejection. Experimental results on the Flapper Nimble+ platform validate the effectiveness of the proposed method, demonstrating accurate trajectory tracking and robustness under wind disturbances and varying payloads.
|
|
09:30-09:55, Paper TuLB1R.5 | Add to My Program |
Adaptive Exafference Via eXtended Reality for Functional Movement Disorder Rehabilitation |
|
Dutta, Anirban | University of Birmingham, UK |
Keywords: Rehabilitation Robotics, Design and Human Factors, Perception-Action Coupling
Abstract: Functional Movement Disorders (FMD) involve impaired voluntary motor control despite no structural neurological damage and are increasingly understood within a predictive coding framework. Disruptions in sensory attenuation and prediction error updating impair the sense of agency in FMD. We present an eXtended Reality (XR)-based biofeedback platform that integrates haptic exafference, cued motor imagery, and adaptive reinforcement to recalibrate sensorimotor control through operant conditioning. Using a Kalman filter and Linear-Quadratic-Gaussian (LQG) control model, we simulate disrupted internal modeling in FMD and demonstrate how motor imagery suggestions for preconditioning and haptic postconditioning can restore control by enhancing state estimation and reducing prediction error. Our co-designed platform incorporates real-time visual and force feedback, and wearable brain imaging to monitor neural correlates of agency. Usability studies and simulation results highlighted the potential of XR-based adaptive control for personalized rehabilitation in FMD, bridging computational neuroscience, robotics, and clinical neurorehabilitation.
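A minimal sketch of the Kalman filter / LQG loop referenced above, with a scalar plant and an innovation-scaling parameter standing in for altered sensory attenuation; all gains and noise values are illustrative assumptions, not the authors' model:

    # Illustrative 1-D discrete-time plant: x' = a*x + b*u + w,  y = x + v
    a, b, q, r = 1.0, 0.1, 0.01, 0.05   # dynamics and process/measurement noise (assumed)
    L_gain = 0.8                        # precomputed LQR feedback gain (assumed)
    attenuation = 0.5                   # scales the Kalman innovation; values < 1 mimic
                                        # reduced updating from prediction error

    def lqg_step(x_hat, P, y):
        u = -L_gain * x_hat             # control computed from the current estimate
        x_pred = a * x_hat + b * u      # predict forward one step
        P_pred = a * P * a + q
        K = P_pred / (P_pred + r)       # Kalman gain
        x_new = x_pred + attenuation * K * (y - x_pred)   # (possibly attenuated) update
        P_new = (1.0 - K) * P_pred
        return x_new, P_new, u

    x_hat, P = 0.0, 1.0                 # initial estimate and covariance
    x_hat, P, u = lqg_step(x_hat, P, y=0.3)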
|
|
09:30-09:55, Paper TuLB1R.6 | Add to My Program |
From Observation to Correction: EEG-fNIRS Microstate Transitions in Surgical Skill Acquisition |
|
Dutta, Anirban | University of Birmingham, UK |
Keywords: Perception-Action Coupling, Surgical Robotics: Planning, Sensorimotor Learning
Abstract: Understanding how the brain transitions between cognitive states during complex tasks is a central challenge in neuroscience for neuroengineering. This study investigates EEG microstate dynamics during the “suturing and intracorporeal knot-tying” task—the most complex skill in the Fundamentals of Laparoscopic Surgery (FLS) program. Using multimodal neuroimaging data (EEG and fNIRS), we examine expertise-dependent brain activation patterns through the lens of a Kalman Filter-based human motor control framework. Microstates—quasi-stable spatial EEG patterns—are mapped to functional brain networks and modeled as elements of internal state prediction, sensory feedback, and error correction. Six EEG-fNIRS microstates were extracted using K-Means clustering and aligned with canonical microstates (A–F), capturing transitions linked to salience, attention, and motor planning. Experts demonstrated efficient microstate transitions during both task initiation and error response, engaging networks like the frontoparietal, salience, and default mode networks, while novices exhibited variable and reactive patterns. The Kalman-based control model provided a unified framework linking neurophysiological microstates with adaptive motor control, highlighting how experience shapes cognitive and sensorimotor dynamics during skill learning.
|
|
09:30-09:55, Paper TuLB1R.7 | Add to My Program |
Efficient 6-DoF Grasp Prediction from RGB Via Hybrid 3D Representation |
|
Li, Yiming | Swansea University |
Ren, Hanchi | Swansea University |
Yang, Yue | Swansea University |
Deng, Jingjing | Durham University |
Xie, Xianghua | Swansea University |
Keywords: Deep Learning in Grasping and Manipulation
Abstract: Reliable object grasping is one of the fundamental tasks in robotics. However, determining grasping pose based on single-image input has long been a challenge due to limited visual information and the complexity of real-world objects. In this work, we propose Triplane Grasping, a fast grasping decision-making method that relies solely on a single RGB image as input. Triplane Grasping creates a hybrid Triplane-Gaussian 3D representation through a point decoder and a triplane decoder, which produce an efficient and high-quality reconstruction of the object to be grasped to meet real-time grasping requirements. We propose to use an end-to-end network to generate 6-DoF parallel-jaw grasp distributions directly from 3D points in the point cloud as potential grasp contacts and anchor the grasp pose in the observed data. Experiments on the OmniObject3D and GraspNet-1Billion datasets demonstrate that our method achieves rapid modeling and grasping pose decision-making for daily objects, and exhibits strong generalization capability.
|
|
09:30-09:55, Paper TuLB1R.8 | Add to My Program |
Towards Improving Open-Source and Benchmarking for Robot Manipulation |
|
Norton, Adam | University of Massachusetts Lowell |
Bekris, Kostas E. | Rutgers, the State University of New Jersey |
Calli, Berk | Worcester Polytechnic Institute |
Dollar, Aaron | Yale University |
Sun, Yu | University of South Florida |
Yanco, Holly | UMass Lowell |
Keywords: Grasping, Performance Evaluation and Benchmarking, Deep Learning in Grasping and Manipulation
Abstract: The Collaborative Open-source Manipulation Performance Assessment for Robotics Enhancement (COMPARE) Ecosystem aims to create greater cohesion among open-source products and improve the modularity of open-source components in the robot manipulation pipeline, enabling more effective performance benchmarking through community-driven standards for software pipeline components, benchmarking practices, objects, datasets, and hardware designs. Online resources for COMPARE include robot-manipulation.org (a landing page for open-source and benchmarking in robot manipulation, including repositories of open-source products and benchmarking assets, event listings for workshops, competitions, and webinars, leaderboards of benchmarking results from across the community, and more) and the COMPARE Slack workspace (the primary online discussion platform for COMPARE, with channels corresponding to each of the relevant open-source and benchmarking thrusts). As part of activating the COMPARE Ecosystem, we are holding a full-day workshop at RSS 2025, Benchmarking Robot Manipulation: Improving Interoperability and Modularity, and hosting a 1-week COMPARE Summer School Cooperative Hackathon and Benchmarking Exercise. This poster provides an overview of the COMPARE Ecosystem project, information about upcoming events, available resources, and a call for engagement with the international robot manipulation community.
|
|
09:30-09:55, Paper TuLB1R.9 | Add to My Program |
Data-Efficient Fine-Tuning for Ultrasound Needle Tracking with Motion Prefix and Tunable Register |
|
Zhang, Yuelin | CUHK |
Tang, Longxiang | Tsinghua University |
Fang, Chengyu | Tsinghua University |
Cheng, Shing Shin | The Chinese University of Hong Kong |
Keywords: Computer Vision for Medical Robotics, Visual Tracking, Deep Learning Methods
Abstract: Ultrasound (US)-guided needle insertion is crucial for minimally invasive interventions, yet robust needle tracking remains challenging due to the scarcity of labeled training data. This paper introduces a data-efficient fine-tuning (DEFT) framework incorporating a motion prefix and a tunable register to reduce dependency on labeled US needle tracking data while maintaining accuracy. The proposed plug-and-play motion prefix enables task-oriented adaptation of historical motion with minimal dependency on training data. By further incorporating the lightweight tunable register, the global context can be adapted to small US datasets without losing behavior learned from the pretraining domain. Evaluations were conducted on collected datasets with motorized and manual insertions using different US machines. By pretraining on large-scale open-world datasets and fine-tuning on limited US data, our model demonstrates state-of-the-art performance in various data-insufficient and out-of-distribution scenarios. To the best of our knowledge, this is the first US needle tracker trained with DEFT, offering a practical solution for clinical deployment where labeled data is scarce.
|
|
09:30-09:55, Paper TuLB1R.10 | Add to My Program |
SemNav: A Semantic Segmentation-Driven Approach to Visual Semantic Navigation |
|
Flor Rodríguez-Rabadán, Rafael | Alcalá University |
Gutiérrez Álvarez, Carlos | Universidad De Alcalá |
Acevedo, Francisco Javier | Universidad De Alcalá |
Lafuente-Arroyo, Sergio | University of Alcalá |
López-Sastre, Roberto J. | University of Alcalá |
Keywords: Vision-Based Navigation, Data Sets for Robot Learning, Imitation Learning
Abstract: Visual Semantic Navigation (VSN) is a key problem in robotics where an agent must find a target object in an unknown environment using mostly visual cues. Most current VSN models are trained in simulated environments, often using rendered scenes that resemble the real world. These models generally rely on raw RGB images, which limits their ability to generalize to real-world scenarios due to domain adaptation issues. To address this, we propose SemNav, a novel approach that uses semantic segmentation as the main visual input. By incorporating high-level semantic information, our model enhances the agent’s perception and decision-making abilities. This allows it to learn more robust navigation policies that generalize better to unseen environments, both in simulation and in the real world. We also introduce a new dataset—the SemNav dataset—specifically designed to train navigation models that use semantic segmentation. We evaluate our method extensively in simulated environments (using Habitat 2.0 and the HM3D dataset) and with real-world robotic platforms. Experiments show that SemNav outperforms current state-of-the-art VSN methods, achieving higher success rates. Additionally, our real-world tests confirm that semantic segmentation significantly helps bridge the sim-to-real gap, making our method a strong candidate for practical VSN applications. We will release the dataset, code, and trained models to the public.
|
|
09:30-09:55, Paper TuLB1R.11 | Add to My Program |
Gait Optimization for Underwater Legged Robots Using Data-Driven Hydrodynamic Modeling and Reinforcement Learning |
|
Song, Seokyong | POSCO Holdings |
Kim, Taesik | Pohang University of Science and Technology (POSTECH) |
Kim, Seungmin | Pohang University of Science and Technology |
Lee, Joonho | Neuromeka |
Yu, Son-Cheol | Pohang University of Science and Technology (POSTECH) |
Keywords: Marine Robotics, Legged Robots, Machine Learning for Robot Control
Abstract: Precise close-contact inspections are critical in underwater environments, where complex dynamics and biofouling pose significant challenges to traditional vehicles. To address these issues, this study develops a Reinforcement Learning-based framework to optimize the gait of an underwater legged robot for accurate and stable locomotion. The simulation environment integrates hydrodynamic forces, buoyancy effects, added mass, and seabed interactions to train robust gait control policies. By designing the action space to regulate phase and amplitude differences within a legged oscillation model, the framework ensures predictable gait patterns. Experimental validation was conducted both in simulation and on a real robot, confirming that the RL-trained policy effectively mitigates hydrodynamic disturbances. The results demonstrated reduced altitude and pitch instability during fast forward walking, as well as improved heading accuracy in curved trajectories.
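A minimal sketch of the oscillator-based action space described above: the policy is assumed to output per-leg amplitudes and phase offsets that modulate a sinusoidal gait generator; the leg count, frequency, and amplitudes are illustrative, not the paper's robot:

    import numpy as np

    def oscillator_gait(t, base_freq, amplitudes, phase_offsets):
        # Per-leg joint targets from a simple sinusoidal oscillation model;
        # the RL action modulates `amplitudes` and `phase_offsets`.
        return amplitudes * np.sin(2 * np.pi * base_freq * t + phase_offsets)

    # Hypothetical hexapod with a tripod-like phase pattern chosen by the policy
    phase_offsets = np.array([0.0, np.pi, 0.0, np.pi, 0.0, np.pi])
    amplitudes = np.full(6, 0.3)        # rad, assumed swing amplitude per leg
    targets = oscillator_gait(t=0.5, base_freq=1.0,
                              amplitudes=amplitudes, phase_offsets=phase_offsets)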
|
|
09:30-09:55, Paper TuLB1R.12 | Add to My Program |
Uncertainty-Aware Planning Using Deep Ensembles and Constrained Trajectory Optimization for Social Navigation |
|
Nayak, Anshul | Virginia Tech |
Eskandarian, Azim | Virginia Tech |
Keywords: Human-Aware Motion Planning, Planning under Uncertainty, Social HRI
Abstract: Human motion is stochastic, and ensuring safe robot navigation in a pedestrian-rich environment requires proactive decision-making. Past research relied on incorporating deterministic future states of surrounding pedestrians, which can be overconfident and lead to unsafe robot behaviour. This paper proposes a predictive uncertainty-aware planner that integrates neural network-based probabilistic trajectory prediction into planning. Our method uses a deep ensemble-based network for probabilistic forecasting of surrounding humans and integrates the predictive uncertainty as constraints into the planner. We compare numerous constraint satisfaction methods within the planner and evaluate its performance on real-world pedestrian datasets. Further, offline robot navigation was carried out on out-of-distribution pedestrian trajectories inside a narrow corridor.
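A minimal sketch of how a deep ensemble's predictions can be collapsed into a mean trajectory and an uncertainty radius that the planner treats as a constraint; the 2-sigma radius and safety margin are illustrative choices, not the paper's exact formulation:

    import numpy as np

    def ensemble_uncertainty(predictions):
        # predictions: (n_models, horizon, 2) predicted pedestrian xy positions
        mean = predictions.mean(axis=0)                 # (horizon, 2)
        var = predictions.var(axis=0).sum(axis=-1)      # total positional variance per step
        radius = 2.0 * np.sqrt(var)                     # ~2-sigma uncertainty radius
        return mean, radius

    def collision_constraint(robot_xy, mean, radius, safety_margin=0.3):
        # Signed constraint per step: values >= 0 are treated as safe waypoints
        dists = np.linalg.norm(robot_xy - mean, axis=-1)
        return dists - (radius + safety_margin)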
|
|
09:30-09:55, Paper TuLB1R.13 | Add to My Program |
Visual Explanation of DRL-Based Mobile Robot Via Augmented Reality |
|
Itaya, Hidenori | Chubu University |
Hirakawa, Tsubasa | Chubu University |
Yamashita, Takayoshi | Chubu University |
Fujiyoshi, Hironobu | Chubu University |
Sugiura, Komei | Keio University |
Keywords: Human-Centered Robotics, Human-Robot Collaboration
Abstract: Deep reinforcement learning (DRL) is one of the methods used to achieve autonomous mobility in robots. Although it effectively achieves high performance, the decision-making process behind robot actions remains unclear. To address this, many studies have been conducted to enable visual analysis through images. However, these images lack a direct connection to physical space, which limits user understanding of robot actions. Therefore, we present a Transformer-based encoder-decoder model that enables a visual explanation of DRL-based robot autonomous movement and projects it onto physical space using augmented reality (AR). In this paper, we demonstrate that the proposed method provides a direct visual explanation for different robot actions and that the visualization using AR effectively enhances users’ understanding of robot actions.
|
|
09:30-09:55, Paper TuLB1R.14 | Add to My Program |
A Multimodal Emotion Interaction Interface for Friendly Collaborative Robots |
|
Murphy, Jordan | Montclair State University |
Li, Rui | Montclair State University |
|
09:30-09:55, Paper TuLB1R.15 | Add to My Program |
Hysteresis-Assisted Shape Morphing for Soft Continuum Robots |
|
Bi, Zheyuan | The University of Sheffield |
Ji, Tianchen | University of Sheffield |
Dogramadzi, Sanja | University of Sheffield |
Cao, Lin | University of Sheffield |
Keywords: Soft Sensors and Actuators, Actuation and Joint Mechanisms, Soft Robot Materials and Design
Abstract: Conventional robot actuation often relies on more actuators for greater dexterity, complicating design and control. While soft robots with passive under-actuation or multi-stable metamaterials simplify actuation, they rely on external forces or limited multi-stable structures. Here, we introduce a Hysteresis-Assisted Shape-Morphing (HasMorph) paradigm for soft continuum robots and integrate it with tip-everting soft-growing robots (SGRs) to enable scalable shape-morphing in confined spaces. HasMorph exploits shape hysteresis—differences in configuration during loading and unloading—to achieve multi-section bending with minimal actuators through an Inverted Zigzag Tendon-Sheath Mechanism. By alternating just two actuators, this mechanism enables billions of stable shape changes in soft continuum robots, vastly expanding their configuration space and workspace without added hardware complexity. By leveraging friction-induced hysteresis, the SGRs can reversibly morph their shape while maintaining frictionless tip growth. Experiments were conducted to demonstrate reversible shape morphing, tip steering, and follow-the-leader tip growth in unstructured spaces. These findings underscore the potential of hysteresis-assisted actuation for dexterous manipulation and navigation of soft continuum robots, paving the way for advanced applications in medical intervention, industrial inspection, and exploration.
|
|
09:30-09:55, Paper TuLB1R.16 | Add to My Program |
Four Principles for Physically Interpretable World Models |
|
Peper, Jordan | University of Florida |
Mao, Zhenjiang | University of Florida |
Geng, Yuang | University of Florida |
Pan, Siyuan | University of Florida |
Ruchkin, Ivan | University of Florida |
Keywords: Deep Learning Methods, Representation Learning, Machine Learning for Robot Control
Abstract: As autonomous systems are increasingly deployed in open and uncertain settings, there is a growing need for trustworthy world models that can reliably predict future high-dimensional observations. The learned latent representations in world models lack direct mapping to meaningful physical quantities and dynamics, limiting their utility and interpretability in downstream planning, control, and safety verification. In this paper, we argue for a fundamental shift from physically informed to physically interpretable world models, and crystallize four principles that leverage symbolic knowledge to achieve these ends: (1) structuring latent spaces according to the physical intent of variables, (2) learning aligned invariant and equivariant representations of the physical world, (3) adapting training to the varied granularity of supervision signals, and (4) partitioning generative outputs to support scalability and verifiability. We experimentally demonstrate the value of each principle on two benchmarks. This paper opens several intriguing research directions to achieve and capitalize on full physical interpretability in world models.
|
|
09:30-09:55, Paper TuLB1R.17 | Add to My Program |
Improving Camera-LiDAR BEV Fusion for Long-Range 3D Object Detection Via Accurate Depth Projection and Feature Balancing |
|
Sagong, Sungpyo | Seoul National University |
Lee, Minhyeong | Seoul National University |
Lee, Dongjun | Seoul National University |
Keywords: Sensor Fusion, Object Detection, Segmentation and Categorization, Computer Vision for Transportation
Abstract: This paper proposes a Bird’s Eye View (BEV)-based camera-LiDAR fusion algorithm to enhance long-range 3D object detection by leveraging LiDAR-to-image depth projection and voxel-wise average pooling. Accurate detection of distant objects is essential for autonomous driving safety, yet existing BEV-based fusion methods suffer from inaccurate depth estimation and imbalanced feature intensities. To address these limitations, the proposed method leverages accurate depth maps generated by projecting LiDAR points onto images for accurate geometric transformations of camera features into 3D space, and employs voxel-wise average pooling to balance feature intensities across distances, producing consistent BEV representations. Evaluations on the nuScenes dataset, using a distance-scaled linear matching threshold, demonstrate significant improvements, notably a 49.2% increase in mean average precision for objects beyond 60 m.
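A minimal sketch of the LiDAR-to-image depth projection step that replaces estimated depth in the camera branch; the intrinsics, extrinsics, and nearest-return handling are generic placeholders, not the paper's exact pipeline:

    import numpy as np

    def lidar_to_depth_map(points_lidar, T_cam_lidar, K, h, w):
        # Project LiDAR points (N, 3) into the camera to build a sparse depth map
        pts = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
        pts_cam = (T_cam_lidar @ pts.T).T[:, :3]        # LiDAR frame -> camera frame
        pts_cam = pts_cam[pts_cam[:, 2] > 0]            # keep points in front of the camera
        uvw = (K @ pts_cam.T).T
        u = (uvw[:, 0] / uvw[:, 2]).astype(int)
        v = (uvw[:, 1] / uvw[:, 2]).astype(int)
        depth = np.zeros((h, w))
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        for uu, vv, z in zip(u[inside], v[inside], pts_cam[inside, 2]):
            if depth[vv, uu] == 0 or z < depth[vv, uu]:  # keep the nearest return per pixel
                depth[vv, uu] = z
        return depth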
|
|
09:30-09:55, Paper TuLB1R.18 | Add to My Program |
A High-Torque-Density Robotic Wrist with Embedded Torque Sensing for Peg-In-Hole Tasks |
|
Tsai, Yi-Shian | National Cheng Kung University |
Chen, Yi-Hung | National Cheng Kung University |
Lan, Chao-Chieh | National Cheng Kung University |
Keywords: Compliant Assembly, Force and Tactile Sensing, Compliant Joints and Mechanisms
Abstract: This paper presents the design and experimental validation of a torque-controlled robotic wrist with high torque density. The proposed wrist features a serial pitch-yaw joint configuration that enhances dexterity while maintaining compactness. The design integrates stepper motors, harmonic geartrains, and compliant mechanisms to optimize torque output and control accuracy. A compliant pulley and a compliant cap are introduced, enabling embedded torque sensing without the need for external sensors, thereby reducing system complexity and improving response time. Experimental results demonstrate the effectiveness of the wrist in torque accuracy, backdrivability, impedance regulation, and misalignment correction during peg-in-hole assembly, highlighting the benefits of the compliant-driven torque sensing approach. Compared to existing robotic wrists, the proposed design achieves a higher torque density. The findings contribute to advancing robotic wrist technology, particularly in applications requiring precise force modulation, high dexterity, and adaptable compliance.
|
|
09:30-09:55, Paper TuLB1R.19 | Add to My Program |
Robot Particle Herding: Fundamental Concepts and Experiments |
|
Lee, Hoi-Yin | The Hong Kong Polytechnic University |
Navarro-Alarcon, David | The Hong Kong Polytechnic University |
Keywords: Dual Arm Manipulation, Manipulation Planning
Abstract: In this paper, we address the problem of manipulating multi-particle aggregates using a bimanual robotic system. Our approach enables the autonomous transport of dispersed particles through a series of shaping and pushing actions using robotically-controlled tools. Achieving this advanced manipulation capability presents two key challenges: high-level task planning and trajectory execution. For task planning, we leverage Vision Language Models (VLMs) to enable primitive actions such as tool affordance grasping and non-prehensile particle pushing. For trajectory execution, we represent the evolving particle aggregate's contour using truncated Fourier series, providing efficient parametrization of its closed shape. We adaptively compute trajectory waypoints based on group cohesion and the geometric centroid of the aggregate, accounting for its spatial distribution and collective motion. Through real-world experiments, we demonstrate the effectiveness of our methodology in actively shaping and manipulating multi-particle aggregates while maintaining high system cohesion.
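A minimal sketch of the truncated Fourier series contour parametrization mentioned above, fit to uniformly sampled boundary points of the aggregate; the number of harmonics is an illustrative choice, and the zeroth-order terms give a simple centroid estimate of the sampled contour:

    import numpy as np

    def fit_fourier_contour(points, n_harmonics=5):
        # points: (N, 2) closed contour samples, assumed uniformly spaced in parameter
        theta = np.linspace(0, 2 * np.pi, len(points), endpoint=False)
        coeffs = []
        for dim in range(2):
            a0 = points[:, dim].mean()
            a = [2 * np.mean(points[:, dim] * np.cos(k * theta)) for k in range(1, n_harmonics + 1)]
            b = [2 * np.mean(points[:, dim] * np.sin(k * theta)) for k in range(1, n_harmonics + 1)]
            coeffs.append((a0, np.array(a), np.array(b)))
        return coeffs                      # centroid estimate: (coeffs[0][0], coeffs[1][0])

    def eval_contour(coeffs, theta):
        out = []
        for a0, a, b in coeffs:
            k = np.arange(1, len(a) + 1)
            out.append(a0 + (a * np.cos(np.outer(theta, k))).sum(1)
                          + (b * np.sin(np.outer(theta, k))).sum(1))
        return np.stack(out, axis=-1)      # reconstructed closed shape, (len(theta), 2)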
|
|
09:30-09:55, Paper TuLB1R.20 | Add to My Program |
The Shape Awakens: Estimating Dynamic Soft Robot States from the Outer Rim |
|
Zheng, Tongjia | University of Toronto |
Burgner-Kahrs, Jessica | University of Toronto |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Applications
Abstract: State estimation for soft continuum robots is challenging due to their infinite-dimensional states (poses, strains, velocities) resulting from continuous deformability, while conventional sensors provide only discrete data. A recent method, called a boundary observer, uses Cosserat rod theory to estimate all robot states by measuring only tip velocity. In this work, we propose a novel boundary observer that instead measures the internal wrench at the robot’s base, leveraging the duality between velocity and internal wrench. Both observers are inspired by energy dissipation, but the base-based approach offers a key advantage: it uses only a 6-axis force/torque sensor at the base, avoiding the need for external sensing systems. Combining tip- and base-based methods further enhances energy dissipation, speeds up convergence, and improves estimation accuracy. We validate the proposed algorithms in experiments where all boundary observers converge to the ground truth within 3 seconds, even with large initial deviations, and they recover from unknown disturbances while effectively tracking high-frequency vibrations.
|
|
09:30-09:55, Paper TuLB1R.21 | Add to My Program |
Neural Correspondence of Impaired Decision-Making in Multiple Cue Judgment System with Decision Support |
|
Chang, Yoo-Sang | North Carolina Agricultural and Technical State University |
Seong, Younho | North Carolina Agricultural and Technical State University |
Yi, Sun | North Carolina A&T State University |
|
09:30-09:55, Paper TuLB1R.22 | Add to My Program |
Good Deep Feature to Track: Self-Supervised Feature Extraction and Tracking in Visual Odometry |
|
Gottam, Sai Puneeth Reddy | RWTH Aachen University |
Zhang, Haoming | RWTH Aachen University |
Keras, Eivydas | RWTH Aachen University |
Keywords: Visual-Inertial SLAM, Deep Learning for Visual Perception, Localization
Abstract: Visual-based localization has made significant progress, yet its performance often drops in large-scale, outdoor, and long-term settings due to factors like lighting changes, dynamic scenes, and low-texture areas. These challenges degrade feature extraction and tracking, which are critical for accurate motion estimation. While learning-based methods such as SuperPoint and SuperGlue show improved feature coverage and robustness, they still face generalization issues with out-of-distribution data. We address this by enhancing deep feature extraction and tracking through self-supervised learning with task-specific loss function design. Our method promotes stable and informative features, improving generalization and reliability in challenging environments.
|
|
09:30-09:55, Paper TuLB1R.23 | Add to My Program |
Uncertainty Aware Ankle Exoskeleton Control |
|
Tourk, Fatima Mumtaza | Northeastern University |
Galoaa, Bishoy | Northeastern University |
Shajan, Sanat | Northeastern University |
Young, Aaron | Georgia Tech |
Everett, Michael | Northeastern University |
Shepherd, Max | Northeastern University |
Keywords: Prosthetics and Exoskeletons, Machine Learning for Robot Control, Rehabilitation Robotics
Abstract: Current exoskeleton controllers are constrained to predefined, cyclic tasks (e.g., flat-ground walking) and struggle to adapt to the variability of real-world environments, with only 4% of rehabilitation studies validated in home/community settings. This limitation poses safety risks, as preprogrammed torque profiles may fail during novel movements. We propose a deep learning-based framework to enable task-agnostic control: a high-level uncertainty classifier-based controller. The classifier detects whether user actions align with "in-distribution" trained tasks (walking, jogging on inclines/declines) or represent novel "out-of-distribution" scenarios, switching between task-specific assistance and safety-optimized torque. Leveraging biomechanical time-series data from the Dephy Exoboot, three temporal convolutional network (TCN) based architectures—ensembles, autoencoders, and GANs—were trained on actuated/unactuated locomotion data. Evaluated offline on a test set (N=3 users) containing trained tasks (treadmill/overground walking/jogging) and novel actions (stairs, jumping, sitting), the model achieved an F1-score of 0.99 and a J-statistic of 0.985. Results demonstrated clear separation between in-distribution (assisted) and out-of-distribution (safety mode) tasks, even during dynamic transitions. Future work will validate the system in online, continuous environments outside the lab, advancing toward deployable exoskeletons for diverse daily activities.
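A minimal sketch of how one of the named architectures (an autoencoder) can act as the in/out-of-distribution switch: reconstruction error on a recent sensor window is compared against a threshold calibrated on in-distribution data; the window shape, model interface, and threshold rule are assumptions, not the authors' tuned system:

    import numpy as np

    def ood_gate(window, autoencoder, threshold):
        # window: (T, n_channels) recent biomechanical samples; `autoencoder` is any
        # model exposing .predict(); `threshold` is, e.g., a high percentile of
        # reconstruction errors measured on held-out in-distribution locomotion data
        recon = autoencoder.predict(window[None])[0]
        err = float(np.mean((window - recon) ** 2))
        return "safety_torque" if err > threshold else "task_assistance"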
|
|
09:30-09:55, Paper TuLB1R.24 | Add to My Program |
Monocular Vision-Based Autonomous Docking Considering Estimation and Motion Capabilities |
|
Im, Jinho | Keimyung University |
Hong, Seonghun | Keimyung University |
Keywords: Autonomous Vehicle Navigation, SLAM
Abstract: Autonomous docking is a crucial capability for achieving long-term and ultimate autonomy in unmanned vehicle operations. Monocular cameras can be considered as one of the most efficient and promising sensing options for autonomous docking because many existing unmanned vehicles have at least one onboard camera for teleoperation or monitoring. However, monocular cameras possess a fundamental limitation as a stand-alone position-sensing device in the state estimation for autonomous docking due to their nature of bearings-only sensing and the lack of range observability. This study proposes an autonomous docking method for nonholonomic vehicles using monocular vision-based bearing-only measurements. The proposed approach computes the control inputs that enable docking while drawing trajectories that simultaneously consider the vehicle’s motion constraints and state observability. Simulation results are shown to validate the proposed method under vehicle docking scenarios with a single-point landmark for the dock and bearing-only sensing.
|
|
09:30-09:55, Paper TuLB1R.25 | Add to My Program |
An Underactuated Mechanism Enabling Two-DOF Thumb Motion with a Single Actuator |
|
Sin, MinKi | Korea Institute of Machinery & Materials |
An, Bohyeon | Korea Institute of Machinery & Materials |
Chu, Jun-Uk | Korea Institute of Machinery & Materials |
Keywords: Prosthetics and Exoskeletons, Tendon/Wire Mechanism, Underactuated Robots
Abstract: Structural lightness and functional versatility are key requirements in robotic prosthetic hands. To achieve both characteristics, this study proposes a novel underactuated thumb mechanism that enables independent 2-DOF motion using a single actuator. The design employs the Geneva mechanism’s intermittent motion to control ab/adduction, while its non-engaged range of motion drives flexion/extension via a wire pulley system. A shape-adaptive grasp mechanism is also implemented to enable both power and pinch grips. The mechanism achieves independent motion on different planes with minimal components. Experimental results demonstrate that the proposed system effectively realizes 2-DOF thumb motion using only one actuator, offering a compact and modular solution for next-generation prosthetic hands.
|
|
09:30-09:55, Paper TuLB1R.26 | Add to My Program |
Proactive Vortex Ring State Management for High Speed Descent UAVs in Mountain Rescue: Vibration Modeling and SVM Based Detection |
|
Sun, Jiawei | Guangxi University |
Zhou, Xiang | Guangxi University |
Zhao, Jiannan | Guangxi University |
Shuang, Feng | Guangxi University |
Keywords: Search and Rescue Robots, Aerial Systems: Applications, Aerial Systems: Perception and Autonomy
Abstract: In mountain rescue operations, unmanned aerial vehicles (UAVs) must repeatedly ascend and descend. However, most UAVs face limits on their vertical descent speed; exceeding this limit can trigger a dangerous airflow condition known as the vortex ring state (VRS). The VRS degrades rotor speed response and increases fuselage vibration, making the UAV uncontrollable. Because pilots or operators cannot perceive the onset of VRS, timing the exit from the VRS is critical yet challenging. To address these issues, we propose a strategy that allows UAVs to maintain controlled descent within the VRS or recover to a hover state. We developed a VRS vibration model by analyzing the relationship between descent rate, rotor speed, and vibration to calculate the vibration under VRS. Computational Fluid Dynamics (CFD) simulations were employed to confirm the VRS flow field and validate the vibration model. In real-flight tests, we descended the UAV at 10 m/s, recording the descent rate and vibration amplitude with an Inertial Measurement Unit (IMU). We assume the high-speed-descent UAV can detect and respond to VRS, and apply Support Vector Machine (SVM) techniques to analyze vibration signal changes, distinguishing VRS characteristics from those observed during hover and climb phases in real-flight data. With this approach verified, we further tested high-speed descents at 12 m/s and 15 m/s. This study reveals the relationship between VRS vibration, rotor speed,
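A minimal sketch of the SVM-based detection step with scikit-learn, using hypothetical per-window features (RMS vertical vibration and descent rate) and illustrative labels rather than the paper's flight data:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Hypothetical features per 1-s window: [RMS vertical vibration, descent rate (m/s)]
    X = np.array([[0.80, 10.0], [0.90, 12.0], [0.10, 0.0], [0.15, -2.0], [0.12, 1.0]])
    y = np.array([1, 1, 0, 0, 0])        # 1 = VRS, 0 = hover/climb (illustrative labels)

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X, y)
    print(clf.predict([[0.85, 11.0]]))   # -> [1], window flagged as VRS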
|
|
09:30-09:55, Paper TuLB1R.27 | Add to My Program |
Real-Time Robot Base Placement Based on 3D Inverse Reachability Map |
|
Choi, JungHyun | University of Seoul |
Lee, Taegyeom | University of Seoul |
Hwang, Myun Joong | University of Seoul |
Keywords: Mobile Manipulation
Abstract: Optimal base placement is a critical problem in mobile manipulation, as the position of the mobile base directly influences task feasibility and manipulability. In this work, we propose a real-time base placement method using a 3D Inverse Reachability Map (IRM). By slicing the 3D IRM into 2D layers and leveraging manipulability-based metrics, our method efficiently identifies optimal base poses and trajectories for multiple task goals. The proposed approach supports dynamic task scenarios and outperforms prior methods in computation time. Experimental results using a 7-DOF Franka Emika Panda robot demonstrate the effectiveness of the method for both static task poses and continuous end-effector trajectories.
|
|
09:30-09:55, Paper TuLB1R.28 | Add to My Program |
Hierarchical Robotic Intelligence Via LLM and RL |
|
Huynh, Truong Nhut | Florida Institute of Technology |
Pham, Tan Hanh | Florida Institute of Technology |
Gutierrez, Hector | Florida Institute of Technology |
Nguyen, Kim-Doang | Florida Institute of Technology |
Keywords: AI-Enabled Robotics, Object Detection, Segmentation and Categorization, Deep Learning in Grasping and Manipulation
Abstract: Achieving robust robotic intelligence for real-world tasks remains a challenge due to the need for precise control and contextual adaptability. This work introduces an integrative AI framework that synergizes GPT-4’s contextual reasoning with reinforcement learning (RL) for enhanced task planning and object manipulation. Using Hello Robot’s Stretch 2, our approach interprets natural language instructions, generates strategic plans, and executes actions in dynamic environments. Experiments across static, dynamic, and complex scenarios demonstrate superior performance, with success rates of 90%, 85%, and 80%, respectively, outperforming standalone RL, GPT-4, and state-of-the-art methods like TidyBot (85%) and OK-Robot (75%). Our framework bridges high-level reasoning and low-level control, reducing completion times and retries, and paves the way for more autonomous, efficient robotic systems.
|
|
09:30-09:55, Paper TuLB1R.29 | Add to My Program |
Robotic Assistant for Image-Guided Treatment of Ankle Joint Dislocations |
|
Li, Gang | Children's National Medical Center |
Fooladi Talari, Hadi | Children's National Medical Center |
China, Debarghya | Johns Hopkins University |
Uneri, Ali | Johns Hopkins University |
Ghanem, Diane | Johns Hopkins Medicine |
Shafiq, Babar | Johns Hopkins Medicine |
Cleary, Kevin | Children's National Medical Center |
Monfaredi, Reza | Children's National Medical Center |
Keywords: Medical Robots and Systems, Mechanism Design, Surgical Robotics: Planning
Abstract: Trauma to the ankle is a common injury and a major source of long-term disability. Each year, over half a million ankle injuries require surgical intervention. In cases involving syndesmotic sprains, surgical manipulation of the tibia and fibula is necessary to properly align and reduce the syndesmosis space. However, current manual reduction techniques—whether open or percutaneous—often result in inaccurate reduction. We propose a novel system that integrates intraoperative low-dose cone-beam computed tomography (CBCT) with 3D-2D image registration and robotic manipulation of the fibula to precisely restore its anatomical alignment with the tibial incisura, while minimizing radiation exposure to both patients and surgical staff. This study reports on the development of the robotic assistant and the accompanying image-based 3D planning and guidance method. Experiments were conducted to evaluate the positioning accuracy of the robotic system and the accuracy of the multi-body 3D-2D registration. The results demonstrate that the robotic system can achieve the required motion with a free-space positioning accuracy of 0.26 ± 0.10 mm. The 3D-2D registration method achieved high accuracy of 0.2 ± 0.8 mm, enabling tracking of the robot and bones to dynamically correct the trajectory during the procedure.
|
|
09:30-09:55, Paper TuLB1R.30 | Add to My Program |
Adaptive Communication Based on Estimated Situation Awareness Improves Performance in Human-Robot Teams |
|
Ali, Arsha | University of Michigan |
Robert, Lionel | University of Michigan |
Tilbury, Dawn | University of Michigan |
Keywords: Human-Robot Teaming
Abstract: Situation awareness is important for decision making and performance in human-robot teams, yet both estimating and improving situation awareness in real-time is an open research area. We present a situation awareness system with a real-time situation awareness estimator and adaptive robot interventions. The effectiveness of the situation awareness system in improving situation awareness and performance is demonstrated through a human-robot teaming experiment.
|
|
TuAT1 Regular Session, 302 |
Add to My Program |
Award Finalists 1 |
|
|
Chair: Walter, Matthew | Toyota Technological Institute at Chicago |
Co-Chair: Corke, Peter | Queensland University of Technology |
|
09:55-10:00, Paper TuAT1.1 | Add to My Program |
Achieving Human Level Competitive Robot Table Tennis |
|
D'Ambrosio, David | Google |
Abeyruwan, Saminda Wishwajith | Google Inc |
Graesser, Laura | Google |
Iscen, Atil | Google |
Ben Amor, Heni | Arizona State University |
Bewley, Alex | Google |
Reed, Barney J. | Stickman Skills Center LLC |
Reymann, Krista | Google Research |
Takayama, Leila | University of California, Santa Cruz |
Tassa, Yuval | University of Washington |
Choromanski, Krzysztof | Google DeepMind Robotics |
Coumans, Erwin | Google Inc |
Jain, Deepali | Robotics at Google |
Jaitly, Navdeep | Google Research |
Jaques, Natasha | Google |
Kataoka, Satoshi | Google LLC |
Kuang, Yuheng | Google DeepMind |
Lazic, Nevena | Deepmind |
Mahjourian, Reza | Waymo |
Moore, Sherry | Google DeepMind |
Oslund, Kenneth | Google |
Shankar, Anish | Google |
Sindhwani, Vikas | Google Brain, NYC |
Vanhoucke, Vincent | Google |
Vesom, Grace | Google DeepMind |
Xu, Peng | Google |
Sanketi, Pannag | Google |
Keywords: Reinforcement Learning, Deep Learning Methods, Physical Human-Robot Interaction
Abstract: Achieving human-level performance on real world tasks is a north star for the robotics community. We present the first learned robot agent that reaches amateur human-level performance in competitive table tennis. Table tennis is a physically demanding sport that takes humans years to master. We contribute (1) a hierarchical and modular policy architecture consisting of (i) low level controllers with their skill descriptors that model their capabilities and (ii) a high level controller that chooses the low level skills, (2) techniques for enabling zero-shot sim-to-real and curriculum building, including an iterative approach (train in sim, deploy in real), and (3) real time adaptation to unseen opponents. Policy performance was assessed through 29 robot vs. human matches of which the robot won 45% (13/29). All humans were unseen players and their skill level varied from beginner to tournament level. Whilst the robot lost all matches vs. the most advanced players, it won 100% of matches vs. beginners and 55% of matches vs. intermediate players, demonstrating solidly amateur human-level performance.
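A minimal sketch of the hierarchical idea described above: low-level skills carry descriptors of their capabilities, and a high-level controller filters and scores them against the incoming ball; the descriptor fields and scoring rule are illustrative, not the authors' policy:

    from dataclasses import dataclass

    @dataclass
    class SkillDescriptor:
        name: str
        returns_topspin: bool
        returns_underspin: bool
        reach_m: float                   # lateral reach the skill was trained for (assumed)

    SKILLS = [SkillDescriptor("forehand_topspin", True, False, 0.6),
              SkillDescriptor("backhand_push", False, True, 0.4)]

    def choose_skill(ball_spin, lateral_offset_m):
        # Keep skills whose descriptor matches the incoming ball, then pick the
        # one with the largest reach margin at the predicted interception point
        ok = [s for s in SKILLS
              if (ball_spin == "topspin" and s.returns_topspin)
              or (ball_spin == "underspin" and s.returns_underspin)]
        ok = [s for s in ok if s.reach_m >= abs(lateral_offset_m)]
        return max(ok, key=lambda s: s.reach_m - abs(lateral_offset_m), default=None)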
|
|
10:00-10:05, Paper TuAT1.2 | Add to My Program |
Robo-DM: Data Management for Large Robot Datasets |
|
Chen, Kaiyuan | University of California, Berkeley |
Fu, Letian | UC Berkeley |
Huang, David | University of California, Berkeley |
Zhang, Yanxiang | University of California, Berkeley |
Chen, Lawrence Yunliang | UC Berkeley |
Huang, Huang | University of California at Berkeley |
Hari, Kush | UC Berkeley |
Balakrishna, Ashwin | Toyota Research Institute |
Xiao, Ted | Google DeepMind |
Sanketi, Pannag | Google |
Kubiatowicz, John | UC Berkeley |
Goldberg, Ken | UC Berkeley |
Keywords: Big Data in Robotics and Automation, Methods and Tools for Robot System Design, Engineering for Robotic Systems
Abstract: Recent work suggests that very large datasets of teleoperated robot demonstrations can train transformer-based models that have the potential to generalize to new scenes, robots, and tasks. However, curating, distributing, and loading large datasets of robot trajectories, which typically consist of video, textual, and numerical modalities - including streams from multiple cameras - remains challenging. We propose Robo-DM, an efficient cloud-based data management toolkit for collecting, sharing, and learning with robot data. With Robo-DM, robot datasets are stored in a self-contained format with Extensible Binary Meta Language (EBML). Robo-DM reduces the size of robot trajectory data, transfer costs, and data load time during training. In particular, compared to the RLDS format used in OXE datasets, Robo-DM’s compression saves space by up to 70x (lossy) and 3.5x (lossless). Robo-DM also accelerates data retrieval by load-balancing video decoding with memory-mapped decoding caches. Compared to LeRobot, a framework that also uses lossy video compression, Robo-DM is up to 50x faster. In fine-tuning Octo, a transformer-based robot policy with 73k episodes with RT-1 data, Robo-DM does not incur any loss in training performance. We physically evaluate a model trained by Robo-DM with lossy compression, a pick-and-place task, and In-Context Robot Transformer. Robo-DM uses 75x compression of the original dataset and does not suffer any reduction in downstream task accuracy. Code and evaluation scripts can be found on the website.
|
|
10:05-10:10, Paper TuAT1.3 | Add to My Program |
No Plan but Everything under Control: Robustly Solving Sequential Tasks with Dynamically Composed Gradient Descent |
|
Mengers, Vito | Technische Universität Berlin |
Brock, Oliver | Technische Universität Berlin |
Keywords: Integrated Planning and Control, Reactive and Sensor-Based Planning, Optimization and Optimal Control
Abstract: We introduce a novel gradient-based approach for solving sequential tasks by dynamically adjusting the underlying myopic potential field in response to feedback and the world's regularities. This adjustment implicitly considers subgoals encoded in these regularities, enabling the solution of long sequential tasks, as demonstrated by solving the traditional planning domain of Blocks World, without any planning. Unlike conventional planning methods, our feedback-driven approach adapts to uncertain and dynamic environments, as demonstrated by one hundred real-world trials involving drawer manipulation. These experiments highlight the robustness of our method compared to planning and show how interactive perception and error recovery naturally emerge from gradient descent without explicitly implementing them. This offers a computationally efficient alternative to planning for a variety of sequential tasks, while aligning with observations on biological problem-solving strategies.
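A minimal sketch of a feedback-driven gradient step on a composed potential, a quadratic pull toward the goal plus a penalty whose weight is re-estimated from feedback when a regularity (for example, a blocking object) applies; the potentials, weights, and numerical gradient are illustrative, not the authors' controller:

    import numpy as np

    def composed_potential(x, goal, obstacle, blocking_weight):
        # Myopic potential: goal attraction plus a term that grows near a blocking object;
        # `blocking_weight` is updated from feedback, which re-shapes the descent direction
        goal_term = 0.5 * np.sum((x - goal) ** 2)
        block_term = blocking_weight * np.exp(-np.sum((x - obstacle) ** 2))
        return goal_term + block_term

    def gradient_step(x, goal, obstacle, blocking_weight, lr=0.1, eps=1e-5):
        grad = np.zeros_like(x)
        for i in range(len(x)):          # simple central-difference gradient
            dx = np.zeros_like(x); dx[i] = eps
            grad[i] = (composed_potential(x + dx, goal, obstacle, blocking_weight)
                       - composed_potential(x - dx, goal, obstacle, blocking_weight)) / (2 * eps)
        return x - lr * grad

    x_next = gradient_step(np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5]), 2.0)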
|
|
10:10-10:15, Paper TuAT1.4 | Add to My Program |
MiniVLN: Efficient Vision-And-Language Navigation by Progressive Knowledge Distillation |
|
Zhu, Junyou | University of Chinese Academy of Sciences |
Qiao, Yanyuan | The University of Adelaide |
Zhang, Siqi | Tongji University |
He, Xingjian | Institute of Automation Chinese Academy of Sciences |
Wu, Qi | University of Adelaide |
Liu, Jing | Institute of Automation, Chinese Academy of Science |
Keywords: Deep Learning Methods, Transfer Learning
Abstract: In recent years, Embodied Artificial Intelligence (Embodied AI) has advanced rapidly, yet the increasing size of models conflicts with the limited computational capabilities of Embodied AI platforms. To address this challenge, we aim to achieve both high model performance and practical deployability. Specifically, we focus on Vision-and-Language Navigation (VLN), a core task in Embodied AI. This paper introduces a two-stage knowledge distillation framework, producing a student model, MiniVLN, and showcasing the significant potential of distillation techniques in developing lightweight models. The proposed method aims to capture fine-grained knowledge during the pretraining phase and navigation-specific knowledge during the fine-tuning phase. Our findings indicate that the two-stage distillation approach is more effective in narrowing the performance gap between the teacher model and the student model compared to single-stage distillation. On the public R2R and REVERIE benchmarks, MiniVLN achieves performance on par with the teacher model while having only about 12% of the teacher model's parameter count.
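A minimal sketch of the standard soft-target distillation loss that knowledge-distillation frameworks of this kind typically build on (temperature-softened KL divergence between teacher and student logits, scaled by T^2); the temperature and the NumPy formulation are assumptions, not MiniVLN's exact objective:

    import numpy as np

    def softmax(z, T=1.0):
        z = z / T
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # KL(teacher || student) on temperature-softened distributions, scaled by T^2
        p_t = softmax(teacher_logits, T)
        p_s = softmax(student_logits, T)
        kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
        return (T ** 2) * kl.mean()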
|
|
10:15-10:20, Paper TuAT1.5 | Add to My Program |
PolyTouch: A Robust Multi-Modal Tactile Sensor for Contact-Rich Manipulation Using Tactile-Diffusion Policies |
|
Zhao, Jialiang | Massachusetts Institute of Technology |
Kuppuswamy, Naveen | Toyota Research Institute |
Feng, Siyuan | Toyota Research Institute |
Burchfiel, Benjamin | Toyota Research Institute |
Adelson, Edward | MIT |
Keywords: Force and Tactile Sensing, Sensorimotor Learning, Learning from Demonstration
Abstract: Achieving robust dexterous manipulation in unstructured domestic environments remains a significant challenge in robotics. Even with state-of-the-art robot learning methods, haptic-oblivious control strategies (i.e. those relying only on external vision and/or proprioception) often fall short due to occlusions, visual complexities, and the need for precise contact interaction control. To address these limitations, we introduce PolyTouch, a novel robot finger that integrates camera-based tactile sensing, acoustic sensing, and peripheral visual sensing into a single design that is compact and durable. PolyTouch provides high-resolution tactile feedback across multiple temporal scales, which is essential for efficiently learning complex manipulation tasks. Experiments demonstrate an at least 20-fold increase in lifespan over commercial tactile sensors, with a design that is both easy to manufacture and scalable. We then use this multi-modal tactile feedback along with visuo-proprioceptive observations to synthesize a tactile-diffusion policy from human demonstrations; the resulting contact-aware control policy significantly outperforms haptic-oblivious policies in multiple contact-aware manipulation policies. This paper highlights how effectively integrating multi-modal contact sensing can hasten the development of effective contact-aware manipulation policies, paving the way for more reliable and versatile domestic robots. More information can be found at https://polytouch.alanz.info.
|
|
10:20-10:25, Paper TuAT1.6 | Add to My Program |
A New Stereo Fisheye Event Camera for Fast Drone Detection and Tracking |
|
Rodrigues Da Costa, Daniel | Université De Picardie Jules Verne |
Robic, Maxime | Université De Picardie Jules Verne |
Vasseur, Pascal | Université De Picardie Jules Verne |
Morbidi, Fabio | Université De Picardie Jules Verne |
Keywords: Omnidirectional Vision, Visual Tracking, Aerial Systems: Applications
Abstract: In this paper, we present a new compact vision sensor consisting of two fisheye event cameras mounted back-to-back, which offers a full 360-degree view of the surrounding environment. We describe the optical design, projection model and practical calibration using the incoming stream of events, of the novel stereo camera, called SFERA. The potential of SFERA for real-time target tracking is evaluated using a Bayesian estimator adapted to the geometry of the sphere. Real-world experiments with a prototype of SFERA, including two synchronized Prophesee EVK4 cameras and a DJI Mavic Air 2 quadrotor, show the effectiveness of the proposed system for aerial surveillance.
|
|
10:25-10:30, Paper TuAT1.7 | Add to My Program |
Learning-Based Adaptive Navigation for Scalar Field Mapping and Feature Tracking |
|
Fuentes, Jose | Florida International University |
Padrao, Paulo | Florida International University |
Redwan Newaz, Abdullah Al | University of New Orleans |
Bobadilla, Leonardo | Florida International University |
Keywords: Marine Robotics, Environment Monitoring and Management, Field Robots
Abstract: Scalar field features such as extrema, contours, and saddle points are essential for applications in environmental monitoring, search and rescue, and resource exploration. Traditional navigation methods often rely on predefined trajectories, leading to inefficient and resource-intensive mapping. This paper introduces a new adaptive navigation framework that leverages learning techniques to enhance exploration efficiency and effectiveness in scalar fields, even under noisy data and obstacles. The framework employs Partial Differential Equations to model scalar fields and a Gaussian Process Regressor to estimate the fields and their gradients, enabling real-time path adjustments and obstacle avoidance. We provide a theoretical foundation for the approach and address several limitations found in existing methods. The effectiveness of our framework is demonstrated through simulation benchmarks and field experiments with an Autonomous Surface Vehicle, showing improved efficiency and adaptability compared to traditional methods and offering a robust solution for real-time environmental monitoring.
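A minimal sketch of the scalar-field estimation step: Gaussian Process regression with an RBF kernel, whose posterior mean has a closed-form gradient the vehicle can follow toward features such as extrema; the kernel, hyperparameters, and sample data are illustrative assumptions:

    import numpy as np

    def rbf(A, B, ls=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ls ** 2)

    def gp_fit(X, y, ls=1.0, noise=1e-2):
        K = rbf(X, X, ls) + noise * np.eye(len(X))
        return np.linalg.solve(K, y)                    # alpha = K^{-1} y

    def gp_mean_and_grad(x_query, X, alpha, ls=1.0):
        k = rbf(x_query[None], X, ls)[0]                # (N,)
        mean = k @ alpha
        # d/dx exp(-||x - xi||^2 / (2 ls^2)) = -(x - xi) / ls^2 * k_i
        grad = (-(x_query[None] - X) / ls ** 2 * k[:, None] * alpha[:, None]).sum(0)
        return mean, grad

    # Illustrative field samples (e.g., temperature) at visited locations
    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    y = np.array([1.0, 2.0, 0.5])
    alpha = gp_fit(X, y)
    mean, grad = gp_mean_and_grad(np.array([0.5, 0.5]), X, alpha)  # follow grad to climb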
|
|
TuAT2 Regular Session, 301 |
Add to My Program |
SLAM 1 |
|
|
Chair: Indelman, Vadim | Technion - Israel Institute of Technology |
Co-Chair: Sartoretti, Guillaume Adrien | National University of Singapore (NUS) |
|
09:55-10:00, Paper TuAT2.1 | Add to My Program |
Measurement Simplification in Rho-POMDP with Performance Guarantees |
|
Yotam, Tom | Technion |
Indelman, Vadim | Technion - Israel Institute of Technology |
Keywords: SLAM, Motion and Path Planning, Autonomous Agents, Foundations of Automation
Abstract: Decision making under uncertainty is at the heart of any autonomous system acting with imperfect information. The cost of solving the decision making problem is exponential in the action and observation spaces, thus rendering it unfeasible for many online systems. This paper introduces a novel approach to efficient decision-making, by partitioning the high-dimensional observation space. Using the partitioned observation space, we formulate analytical bounds on the expected information-theoretic reward, for general belief distributions. These bounds are then used to plan efficiently while keeping performance guarantees. We show that the bounds are adaptive, computationally efficient, and that they converge to the original solution. We extend the partitioning paradigm and present a hierarchy of partitioned spaces that allows greater efficiency in planning. We then propose a specific variant of these bounds for Gaussian beliefs and show a theoretical performance improvement of at least a factor of 4. Finally, we compare our novel method to other state of the art algorithms in active SLAM scenarios, in simulation and in real experiments. In both cases we show a sign
|
|
10:00-10:05, Paper TuAT2.2 | Add to My Program |
VSS-SLAM: Voxelized Surfel Splatting for Geometrically Accurate SLAM |
|
Chen, Xuanhua | Northeastern University |
Zhang, Yunzhou | Northeastern University |
Zhang, Zhiyao | Northeastern University |
Wang, Guoqing | Northeastern University |
Zhao, Bin | Northeastern University |
Wang, Xingshuo | Northeastern University |
Keywords: Deep Learning Methods, Mapping, SLAM
Abstract: Visual Simultaneous Localization and Mapping (SLAM) helps robots estimate their poses and perceive the environment in unknown settings. Recent work has demonstrated that implicit neural radiance fields and 3D Gaussian Splatting (3DGS) offer higher fidelity scene representation than traditional map representations. We propose VSS-SLAM, which utilizes voxelized surfels as the map representation for incremental mapping in unknown environments. This representation effectively addresses the issue of redundant and disordered primitives encountered in previous methods, thereby enhancing geometric accuracy during reconstruction. Specifically, our approach divides the scene using voxels and stores geometric and appearance information in feature vectors at the voxel vertices. Before rendering, these feature vectors are decoded to generate the corresponding surfels. Additionally, we align camera poses through image and depth rendering. Extensive experiments on the Replica and TUM RGBD datasets demonstrate that VSS-SLAM delivers high-fidelity reconstruction and accurate pose estimation in both simulated and real-world environments. Source code will soon be available.
|
|
10:05-10:10, Paper TuAT2.3 | Add to My Program |
New Graph Distance Measures and Matching of Topological Maps for Robotic Exploration |
|
Morbidi, Fabio | Université De Picardie Jules Verne |
Keywords: Mapping, Autonomous Agents, SLAM
Abstract: Comparing graph-structured maps is a task of paramount importance in robotic exploration and cartography, but unfortunately the computational cost of the existing similarity measures, such as the graph edit distance (GED), is prohibitive for large graphs. In this paper, we introduce and characterize three new graph distance measures which satisfy the requirements for a metric. The first one, "LogEig", computes the square root of the sum of the squared logarithms of the generalized eigenvalues of the shifted Laplacian matrices associated with the two graphs, while the second calculates the Bures distance between these positive definite matrices. The third distance, "Rank", computes the rank of the difference of the graph shift operators associated with the two graphs, e.g. the adjacency or the Laplacian matrix. Examples and numerical experiments with graphs from a publicly available dataset show the accuracy and computational efficiency of the new metrics for 2D topological-map matching, compared to the GED. The effect of spectral sparsification on the new graph distance measures is examined as well.
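To make the "LogEig" definition above concrete, here is a minimal sketch assuming the shifted Laplacian means L + I (any shift making the Laplacian positive definite serves the illustration); the paper's exact shift, normalization, and implementation may differ.

```python
# Hedged sketch of the "LogEig" graph distance: the square root of the sum of
# squared logarithms of the generalized eigenvalues of the two shifted
# Laplacians. The shift L + I is an illustrative assumption.
import numpy as np
from scipy.linalg import eigh


def shifted_laplacian(adjacency: np.ndarray) -> np.ndarray:
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency + np.eye(adjacency.shape[0])


def logeig_distance(adj_a: np.ndarray, adj_b: np.ndarray) -> float:
    la, lb = shifted_laplacian(adj_a), shifted_laplacian(adj_b)
    # Generalized eigenvalues of the pencil (la, lb); both matrices are SPD.
    gen_eigvals = eigh(la, lb, eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(gen_eigvals) ** 2)))


if __name__ == "__main__":
    # Two small 4-node graphs differing by one edge.
    a = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
    b = a.copy(); b[0, 3] = b[3, 0] = 1.0
    print(logeig_distance(a, a), logeig_distance(a, b))  # 0.0 for identical graphs
```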
|
|
10:10-10:15, Paper TuAT2.4 | Add to My Program |
EnvoDat: A Large-Scale Multisensory Dataset for Robotic Spatial Awareness and Semantic Reasoning in Heterogeneous Environments |
|
Nwankwo, Linus Ebere | University of Leoben |
Ellensohn, Björn | Montanuniversitaet Leoben |
Dave, Vedant | Montanuniversität Leoben |
Hofer, Peter | Theresianische Militarakademie |
Forstner, Jan | Montanuniversität Leoben |
Villneuve, Marlene | Montanuniversität Leoben |
Galler, Robert | Montanuniversität Leoben |
Rueckert, Elmar | Montanuniversitaet Leoben |
Keywords: Data Sets for SLAM, Data Sets for Robotic Vision, Semantic Scene Understanding
Abstract: To ensure the efficiency of robot autonomy under diverse real-world conditions, a high-quality heterogeneous dataset is essential to benchmark the operating algorithms' performance and robustness. Current benchmarks predominantly focus on urban terrains, specifically for on-road autonomous driving, leaving multi-degraded, densely vegetated, dynamic and feature-sparse environments, such as underground tunnels, natural fields, and modern indoor spaces underrepresented. To fill this gap, we introduce EnvoDat, a large-scale, multi-modal dataset collected in diverse environments and conditions, including high illumination, fog, rain, and zero visibility at different times of the day. Overall, EnvoDat contains 26 sequences from 13 scenes, 10 sensing modalities, over 1.9TB of data, and over 89K fine-grained polygon-based annotations for more than 82 object and terrain classes. We post-processed EnvoDat in different formats that support benchmarking SLAM and supervised learning algorithms, and fine-tuning multimodal vision models. With EnvoDat, we contribute to environment-resilient robotic autonomy in areas where the conditions are extremely challenging. The datasets and other relevant resources can be accessed through https://linusnep.github.io/EnvoDat/.
|
|
10:15-10:20, Paper TuAT2.5 | Add to My Program |
Probabilistic Degeneracy Detection for Point-To-Plane Error Minimization |
|
Hatleskog, Johan | Norwegian University of Science and Technology |
Alexis, Kostas | NTNU - Norwegian University of Science and Technology |
Keywords: SLAM, Probability and Statistical Methods
Abstract: Degeneracies arising from uninformative geometry are known to deteriorate LiDAR-based localization and mapping. This work introduces a new probabilistic method to detect and mitigate the effect of degeneracies in point-to-plane error minimization. The noise on the Hessian of the point-to-plane optimization problem is characterized by the noise on points and surface normals used in its construction. We exploit this characterization to quantify the probability of a direction being degenerate. The degeneracy-detection procedure is used in a new real-time degeneracy-aware iterative closest point algorithm for LiDAR registration, in which we smoothly attenuate updates in degenerate directions. The method's parameters are selected based on the noise characteristics provided in the LiDAR's datasheet. We validate the approach in four real-world experiments, demonstrating that it outperforms state-of-the-art methods at detecting and mitigating the adverse effects of degeneracies. For the benefit of the community, we release the code for the method at: github.com/ntnu-arl/drpm.
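A minimal sketch of the deterministic quantity behind this kind of degeneracy analysis is given below: the 6x6 Gauss-Newton Hessian of the point-to-plane cost, built from points and surface normals, whose weak eigen-directions correspond to poorly constrained motions. The probabilistic test that is the paper's contribution is not reproduced; the fixed eigenvalue threshold here is only a stand-in.

```python
# Build the point-to-plane Hessian and inspect its eigenvalues to find
# degenerate (poorly constrained) directions. Rows of the Jacobian are
# J_i = [(p_i x n_i)^T, n_i^T] for the state ordering [rotation, translation].
import numpy as np


def point_to_plane_hessian(points: np.ndarray, normals: np.ndarray) -> np.ndarray:
    """points, normals: (N, 3). Returns the 6x6 Hessian for [rotation, translation]."""
    jac = np.hstack([np.cross(points, normals), normals])  # (N, 6)
    return jac.T @ jac


def degenerate_directions(hessian: np.ndarray, eig_threshold: float = 1e-3):
    eigvals, eigvecs = np.linalg.eigh(hessian)
    weak = eigvals < eig_threshold
    return eigvals, eigvecs[:, weak]  # columns are poorly constrained directions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Points sampled from a single plane z = 0: x/y translation and yaw are degenerate.
    pts = np.column_stack([rng.uniform(-1, 1, 500), rng.uniform(-1, 1, 500), np.zeros(500)])
    nrm = np.tile([0.0, 0.0, 1.0], (500, 1))
    vals, weak_dirs = degenerate_directions(point_to_plane_hessian(pts, nrm))
    print(np.round(vals, 4), weak_dirs.shape)  # three near-zero eigenvalues
```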
|
|
10:20-10:25, Paper TuAT2.6 | Add to My Program |
SCE-LIO: An Enhanced Lidar Inertial Odometry by Constructing Submap Constraints |
|
Sun, Chao | Beijing Institute of Technology |
Huang, Zhishuai | Beijing Institute of Technology |
Wang, Bo | Shenzhen Automotive Research Institute, BIT |
Xiao, Mancheng | ShenZhen Boundless Sensor Technology Co., Ltd |
Leng, Jianghao | Beijing Institute of Technology |
Li, Jiajun | Shenzhen Automotive Research Institute, Beijing Institute of Tec |
Keywords: SLAM, Mapping, Autonomous Vehicle Navigation
Abstract: In lidar-based Simultaneous Localization and Mapping (SLAM) systems, loop closure detection is crucial for enhancing the accuracy of odometry. However, constraints from loop closure detection are only provided when a loop is detected and can only enhance odometry accuracy at specific moments. Therefore, this paper proposes a lidar inertial odometry system that periodically provides submap constraints to the pose graph and enhances odometry accuracy through pose graph optimization. In the process of creating submap constraints, the system represents lidar keyframes as a collection of submaps containing overlapping information. The optimal pose transformations between submaps, determined using the Iterative Closest Point (ICP) algorithm with point-to-line and point-to-plane methods, are recognized as submap constraints. During the backend optimization phase, submap constraints and adjacent lidar keyframe constraints are integrated into the pose graph. The pose graph is then optimized to achieve the optimal lidar pose estimation. Additionally, to further enhance pose estimation, point-to-plane correspondences are established by considering the differences in normal vectors of feature points between the scan and the map, and an integrated initial positioning module is created by incorporating preintegration and scan-to-scan matching. Results from simulation, public datasets, and vehicle experiments show that the accuracy of the proposed algorithm is significantly improved compared to advanced SLAM algorithms.
|
|
10:25-10:30, Paper TuAT2.7 | Add to My Program |
HDPlanner: Advancing Autonomous Deployments in Unknown Environments through Hierarchical Decision Networks |
|
Liang, Jingsong | National University of Singapore |
Cao, Yuhong | National University of Singapore |
Ma, Yixiao | National University of Singapore |
Zhao, Hanqi | Georgia Institute of Technology |
Sartoretti, Guillaume Adrien | National University of Singapore (NUS) |
Keywords: AI-Based Methods, View Planning for SLAM, Motion and Path Planning
Abstract: In this paper, we introduce HDPlanner, a deep reinforcement learning (DRL) based framework designed to tackle two core and challenging tasks for mobile robots: autonomous exploration and navigation, where the robot must optimize its trajectory adaptively to achieve the task objective through continuous interactions in unknown environments. Specifically, HDPlanner relies on novel hierarchical attention networks to empower the robot to reason about its belief across multiple spatial scales and sequence collaborative decisions, where our networks decompose long-term objectives into short-term informative task assignments and informative path planning. We further propose a contrastive learning-based joint optimization to enhance the robustness of HDPlanner. We empirically demonstrate that HDPlanner significantly outperforms state-of-the-art conventional and learning-based baselines on an extensive set of simulations, including hundreds of test maps and large-scale, complex Gazebo environments. Notably, HDPlanner achieves real-time planning with travel distances reduced by up to 35.7% compared to exploration benchmarks and by up to 16.5% compared to navigation benchmarks. Furthermore, we validate our approach on hardware, where it generates high-quality, adaptive trajectories in both indoor and outdoor environments, highlighting its real-world applicability without additional training.
|
|
TuAT3 Regular Session, 303 |
Add to My Program |
3D Content Capture and Generation 1 |
|
|
Chair: Schieber, Hannah | Human-Centered Computing and Extended Reality, Technical University of Munich, School of Medicine and Health, Klinikum Rechts De |
Co-Chair: Zhu, Minghan | University of Michigan |
|
09:55-10:00, Paper TuAT3.1 | Add to My Program |
WeatherGS: 3D Scene Reconstruction in Adverse Weather Conditions Via Gaussian Splatting |
|
Qian, Chenghao | University of Leeds |
Guo, Yuhu | Carnegie Mellon University |
Li, Wenjing | University of Leeds |
Markkula, Gustav | University of Leeds |
Keywords: Computer Vision for Automation, Computer Vision for Transportation, Visual Learning
Abstract: 3D Gaussian Splatting (3DGS) has gained significant attention for 3D scene reconstruction, but still suffers from complex outdoor environments, especially under adverse weather. This is because 3DGS treats the artifacts caused by adverse weather as part of the scene and will directly reconstruct them, largely reducing the clarity of the reconstructed scene. To address this challenge, we propose WeatherGS, a 3DGS-based framework for reconstructing clear scenes from multi-view images under different weather conditions. Specifically, we explicitly categorize the multi-weather artifacts into dense particles and lens occlusions, which have very different characteristics: the former are caused by snowflakes and rain streaks in the air, while the latter arise from precipitation on the camera lens. In light of this, we propose a dense-to-sparse preprocess strategy, which sequentially removes the dense particles by an Atmospheric Effect Filter (AEF) and then extracts the relatively sparse occlusion masks with a Lens Effect Detector (LED). Finally, we train a set of 3D Gaussians by the processed images and generated masks for excluding occluded areas, and accurately recover the underlying clear scene by Gaussian splatting. We construct a diverse and challenging benchmark to facilitate the evaluation of 3D reconstruction under complex weather scenarios. Extensive experiments on this benchmark demonstrate that our WeatherGS consistently produces high-quality, clean scenes across various weather scenarios, outperforming existing state-of-the-art methods.
|
|
10:00-10:05, Paper TuAT3.2 | Add to My Program |
RL-GSBridge: 3D Gaussian Splatting Based Real2Sim2Real Method for Robotic Manipulation Learning |
|
Wu, Yuxuan | Shanghai Jiao Tong University |
Pan, Lei | China University of Mining and Technology |
Wu, Wenhua | Shanghai Jiao Tong University |
Wang, Guangming | University of Cambridge |
Miao, Yanzi | China University of Mining and Technology |
Xu, Fan | Shanghai Jiao Tong University |
Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: Computer Vision for Automation, Deep Learning in Grasping and Manipulation, Reinforcement Learning
Abstract: Sim-to-Real refers to the process of transferring policies learned in simulation to the real world, which is crucial for achieving practical robotics applications. However, recent Sim2real methods either rely on a large amount of augmented data or large learning models, which is inefficient for specific tasks. In recent years, with the emergence of radiance field reconstruction methods, especially 3D Gaussian splatting, it has become possible to construct realistic real-world scenes. To this end, we propose RL-GSBridge, a novel real-to-sim-to-real framework which incorporates 3D Gaussian Splatting into the conventional RL simulation pipeline, enabling zero-shot sim-to-real transfer for vision-based deep reinforcement learning. We introduce a mesh-based 3D GS method with soft binding constraints, enhancing the rendering quality of mesh models. Then, utilizing a GS editing approach to synchronize the rendering with the physics simulator, RL-GSBridge can accurately reflect the visual interactions of the physical robot. Through a series of sim-to-real experiments, including grasping and pick-and-place tasks, we demonstrate that RL-GSBridge maintains a satisfactory success rate in real-world task completion during sim-to-real transfer. Furthermore, a series of rendering metrics and visualization results indicate that our proposed mesh-based 3D GS reduces artifacts in unstructured objects, demonstrating more realistic rendering performance.
|
|
10:05-10:10, Paper TuAT3.3 | Add to My Program |
High-Quality 3D Creation from a Single Image Using Subject-Specific Knowledge Prior |
|
Huang, Nan | Peking University |
Zhang, Ting | Beijing Normal University |
Yuan, Yuhui | Microsoft Research Asia |
Chen, Dong | Microsoft Research Asia |
Zhang, Shanghang | Peking University |
Keywords: Computer Vision for Automation
Abstract: In this paper, we address the critical bottleneck in robotics caused by the scarcity of diverse 3D data by presenting a novel two-stage approach for generating high-quality 3D models from a single image. This method is motivated by the need to efficiently expand 3D asset creation, particularly for robotics datasets, where the variety of object types is currently limited compared to general image datasets. Unlike previous methods that primarily rely on general diffusion priors, which often struggle to align with the reference image, our approach leverages subject-specific prior knowledge. By incorporating subject-specific priors in both geometry and texture, we ensure precise alignment between the generated 3D content and the reference object. Specifically, we introduce a shading mode-aware prior into the NeRF optimization process, enhancing the geometry and refining texture in the coarse outputs to achieve superior quality. Extensive experiments demonstrate that our method significantly outperforms prior approaches. Our approach is well-suited for applications such as novel view synthesis, text-to-3D, and image-to-3D, particularly in the robotics field where diverse 3D data is essential.
|
|
10:10-10:15, Paper TuAT3.4 | Add to My Program |
DGTR: Distributed Gaussian Turbo-Reconstruction for Sparse-View Vast Scenes |
|
Li, Hao | Northwestern Polytechnical University |
Gao, Yuanyuan | Northwestern Polytechnical University |
Peng, Haosong | Beijing Institute of Technology |
Wu, Chenming | Baidu Research |
Ye, Weicai | Zhejiang University |
Zhan, Yufeng | Beijing Institute of Technology |
Zhao, Chen | Baidu Inc |
Zhang, Dingwen | Northwestern Polytechnical University |
Wang, Jingdong | Baidu |
Han, Junwei | Northwestern Polytechnical University |
Keywords: Computer Vision for Transportation
Abstract: Novel view synthesis (NVS) methods play a key role in large-scale scene reconstruction. However, these methods rely heavily on dense image inputs and prolonged training times, making them unsuitable where computational resources are limited. Moreover, few-shot methods often struggle with poor reconstruction quality in vast environments. This paper introduces DGTR, a novel distributed framework for efficient Gaussian reconstruction of sparse-view vast scenes. Our method partitions the scene into multiple regions, each processed independently by drones with sparse image inputs. Using a feed-forward Gaussian model, we predict high-quality Gaussian primitives, followed by a global alignment algorithm to ensure geometric consistency. Synthetic views and depth priors are incorporated to further enhance training, while a distillation-based model aggregation mechanism enables efficient
|
|
10:15-10:20, Paper TuAT3.5 | Add to My Program |
LiDAR-EDIT: LiDAR Data Generation by Editing the Object Layouts in Real-World Scenes |
|
Ho, Shing-Hei | University of Utah |
Thach, Bao | University of Utah |
Zhu, Minghan | University of Michigan |
Keywords: Computer Vision for Transportation, Data Sets for Robotic Vision, Deep Learning for Visual Perception
Abstract: We present LiDAR-EDIT, a novel paradigm for generating synthetic LiDAR data for autonomous driving. Our framework edits real-world LiDAR scans by introducing new object layouts while preserving the realism of the background environment. Compared to end-to-end frameworks that generate LiDAR point clouds from scratch, LiDAR-EDIT offers users full control over the object layout, including the number, type, and pose of objects, while keeping most of the original real-world background. Our method also provides object labels for the generated data. Compared to novel view synthesis techniques, our framework allows for the creation of counterfactual scenarios with object layouts significantly different from the original real-world scene. LiDAR-EDIT uses spherical voxelization to enforce correct LiDAR projective geometry in the generated point clouds by construction. During object removal and insertion, generative models are employed to fill the unseen background and object parts that were occluded in the original real Lidar scans. Experimental results demonstrate that our framework produces realistic LiDAR scans with practical value for downstream tasks. Project website with open-sourced code: https://sites.google.com/view/lidar-edit
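The sketch below illustrates spherical voxelization in isolation, i.e., binning points by range, azimuth, and elevation so that generated points respect LiDAR projective geometry by construction. The bin sizes and field of view are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of spherical voxelization for LiDAR point clouds. Bin sizes,
# bin counts, and the vertical field of view below are illustrative only.
import numpy as np


def spherical_voxel_indices(points: np.ndarray,
                            range_res: float = 0.5,        # metres per radial bin
                            azimuth_bins: int = 1024,
                            elevation_bins: int = 64,
                            elevation_fov=(-0.43, 0.17)):   # radians, roughly (-25, +10) deg
    """Map (N, 3) Cartesian points to integer (range, azimuth, elevation) voxel indices."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    azimuth = np.arctan2(y, x)                  # [-pi, pi)
    elevation = np.arcsin(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))

    r_idx = (r / range_res).astype(int)
    az_idx = ((azimuth + np.pi) / (2 * np.pi) * azimuth_bins).astype(int) % azimuth_bins
    lo, hi = elevation_fov
    el_idx = np.clip(((elevation - lo) / (hi - lo) * elevation_bins).astype(int),
                     0, elevation_bins - 1)
    return np.stack([r_idx, az_idx, el_idx], axis=1)


if __name__ == "__main__":
    pts = np.random.default_rng(0).normal(scale=20.0, size=(5, 3))
    print(spherical_voxel_indices(pts))
```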
|
|
10:20-10:25, Paper TuAT3.6 | Add to My Program |
TICMapNet: A Tightly Coupled Temporal Fusion Pipeline for Vectorized HD Map Learning |
|
Qiu, Wenzhao | Xi'an Jiaotong University |
Pang, Shanmin | Xi'an Jiaotong University |
Zhang, Hao | Xi'an Jiaotong University |
Fang, Jianwu | Xian Jiaotong University |
Xue, Jianru | Xi'an Jiaotong University |
Keywords: Mapping, Deep Learning for Visual Perception, Visual Learning
Abstract: High-Definition (HD) map construction is essential for autonomous driving to accurately understand the surrounding environment. Most existing methods rely on single-frame inputs to predict the local map, which often fail to effectively capture the temporal correlations between frames. This limitation results in discontinuities and instability in the generated map. To tackle this limitation, we propose a Tightly Coupled temporal fusion Map Network (TICMapNet). TICMapNet breaks down the fusion process into three sub-problems: PV feature alignment, BEV feature adjustment, and Query feature fusion. By doing so, we effectively integrate temporal information at different stages through three plug-and-play modules, using the proposed tightly coupled strategy. Unlike traditional methods, our approach does not rely on camera extrinsic parameters, offering a new perspective for addressing the visual fusion challenge in the field of object detection. Experimental results show that TICMapNet significantly improves upon its single-frame baseline model, achieving at least a 7.0% increase in mAP using just two consecutive frames on the nuScenes dataset, while also showing generalizability across other tasks. The code and demos are available at https://github.com/adasfag/TICMapNet.
|
|
10:25-10:30, Paper TuAT3.7 | Add to My Program |
DynaMoN: Motion-Aware Fast and Robust Camera Localization for Dynamic Neural Radiance Fields |
|
Schischka, Nicolas | Technical University of Munich |
Schieber, Hannah | Human-Centered Computing and Extended Reality, Technical Univers |
Karaoglu, Mert Asim | Technical University of Munich, ImFusion GmbH |
Görgülü, Melih | Technical University of Munich |
Grötzner, Florian | Technical University of Munich |
Ladikos, Alexander | ImFusion |
Navab, Nassir | TU Munich |
Roth, Daniel | Technical University of Munich, Klinikum Rechts Der Isar |
Busam, Benjamin | Technical University of Munich |
Keywords: Localization, Mapping
Abstract: The accurate reconstruction of dynamic scenes with neural radiance fields is significantly dependent on the estimation of camera poses. Widely used structure-from-motion pipelines encounter difficulties in accurately tracking the camera trajectory when faced with separate dynamics of the scene content and the camera movement. To address this challenge, we propose Dynamic Motion-Aware Fast and Robust Camera Localization for Dynamic Neural Radiance Fields (DynaMoN). DynaMoN utilizes semantic segmentation and generic motion masks to handle dynamic content for initial camera pose estimation and statics-focused ray sampling for fast and accurate novel-view synthesis. Our novel iterative learning scheme switches between training the NeRF and updating the pose parameters for an improved reconstruction and trajectory estimation quality. The proposed pipeline shows significant acceleration of the training process. We extensively evaluate our approach on two real-world dynamic datasets, the TUM RGB-D dataset and the BONN RGB-D Dynamic dataset. DynaMoN improves over the state-of-the-art both in terms of reconstruction quality and trajectory accuracy. We plan to make our code public to enhance research in this area.
|
|
TuAT4 Regular Session, 304 |
Add to My Program |
Vision-Based Tactile Sensing 1 |
|
|
Chair: Wang, Dongyi | University of Arkansas |
Co-Chair: Luo, Shan | King's College London |
|
09:55-10:00, Paper TuAT4.1 | Add to My Program |
TransForce: Transferable Force Prediction for Vision-Based Tactile Sensors with Sequential Image Translation |
|
Chen, Zhuo | King's College London |
Ou, Ni | Beijing Institute of Technology |
Zhang, Xuyang | University of Bristol |
Luo, Shan | King's College London |
Keywords: Force and Tactile Sensing
Abstract: Vision-based tactile sensors (VBTSs) provide high-resolution tactile images crucial for robot in-hand manipulation. However, force sensing in VBTSs is underutilized due to the costly and time-intensive process of acquiring paired tactile images and force labels. In this study, we introduce a transferable force prediction model, TransForce, designed to leverage collected image-force paired data for new sensors under varying illumination colors and marker patterns while improving the accuracy of predicted forces, especially in the shear direction. Our model effectively achieves translation of tactile images from the source domain to the target domain, ensuring that the generated tactile images reflect the illumination colors and marker patterns of the new sensors while accurately aligning the elastomer deformation observed in existing sensors, which is beneficial to force prediction of new sensors. As such, a recurrent force prediction model trained with generated sequential tactile images and existing force labels is employed to estimate higher-accuracy forces for new sensors, with the lowest average errors of 0.69 N (5.8% of the full working range) in the x-axis, 0.70 N (5.8%) in the y-axis, and 1.11 N (6.9%) in the z-axis compared with models trained with single images. The experimental results also reveal that the pure marker modality is more helpful than the RGB modality in improving the accuracy of force in the shear direction, while the RGB modality shows better performance in the normal direction.
|
|
10:00-10:05, Paper TuAT4.2 | Add to My Program |
HumanFT: A Human-Like Fingertip Multimodal Visuo-Tactile Sensor |
|
Wu, Yifan | ShanghaiTech University |
Chen, Yuzhou | ShanghaiTech University |
Zhu, Zhengying | ShanghaiTech University |
Qin, Xuhao | Shanghaitech University |
Xiao, Chenxi | ShanghaiTech University |
Keywords: Force and Tactile Sensing, Multi-Modal Perception for HRI, Soft Sensors and Actuators
Abstract: Tactile sensors play a crucial role in enabling robots to interact effectively and safely with objects in everyday tasks. In particular, visuotactile sensors have seen increasing usage in two- and three-fingered grippers due to their high-quality feedback. However, a significant gap remains in the development of sensors suitable for humanoid robots, especially five-fingered dexterous hands. One reason is the challenge of designing and manufacturing sensors that are sufficiently compact. In this paper, we propose HumanFT, a multimodal visuotactile sensor that replicates the shape and functionality of a human fingertip. To bridge the gap between human and robotic tactile sensing, our sensor features real-time force measurements, high-frequency vibration detection, and overtemperature alerts. To achieve this, we developed a suite of fabrication techniques for a new type of elastomer optimized for force propagation and temperature sensing. In addition, our sensor integrates circuits capable of sensing pressure and vibration. These capabilities have been validated through experiments. The proposed design is simple and cost-effective to fabricate. We believe HumanFT can enhance humanoid robots' perception by capturing and interpreting multimodal tactile information.
|
|
10:05-10:10, Paper TuAT4.3 | Add to My Program |
FeelAnyForce: Estimating Contact Force Feedback from Tactile Sensation for Vision-Based Tactile Sensors |
|
Shahidzadeh, Amir Hossein | University of Maryland |
Caddeo, Gabriele Mario | Istituto Italiano Di Tecnologia |
Alapati, Koushik | University of Maryland, College-Park |
Natale, Lorenzo | Istituto Italiano Di Tecnologia |
Fermuller, Cornelia | University of Maryland |
Aloimonos, Yiannis | University of Maryland |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation
Abstract: In this paper, we tackle the problem of estimating 3D contact forces using vision-based tactile sensors. In particular, our goal is to estimate contact forces over a large range (up to 15 N) on any object while generalizing across different vision-based tactile sensors. Thus, we collected a dataset of over 200K indentations using a robotic arm that pressed various indenters onto a GelSight Mini sensor mounted on a force sensor and then used the data to train a multi-head transformer for force regression. Strong generalization is achieved via accurate data collection and multi-objective optimization that leverages depth contact images. Despite being trained only on primitive shapes and textures, the regressor achieves a mean absolute error of 4% on a dataset of unseen real-world objects. We further evaluate our approach's generalization capability to other GelSight Mini and DIGIT sensors, and propose a reproducible calibration procedure for adapting the pre-trained model to other vision-based sensors. Furthermore, the method was evaluated on real-world tasks, including weighing objects and controlling the deformation of delicate objects, which rely on accurate force feedback.
|
|
10:10-10:15, Paper TuAT4.4 | Add to My Program |
VITaL Pretraining: Visuo-Tactile Pretraining for Tactile and Non-Tactile Manipulation Policies |
|
George, Abraham | Carnegie Mellon University |
Gano, Selam | Carnegie Mellon University |
Katragadda, Pranav | Carnegie Mellon University |
Barati Farimani, Amir | Carnegie Mellon University |
Keywords: Force and Tactile Sensing, Deep Learning in Grasping and Manipulation, Imitation Learning
Abstract: Tactile information is a critical tool for dexterous manipulation. As humans, we rely heavily on tactile information to understand objects in our environments and how to interact with them. We use touch not only to perform manipulation tasks but also to learn how to perform these tasks. Therefore, to create robotic agents that can learn to complete manipulation tasks at a human or super-human level of performance, we need to properly incorporate tactile information into both skill execution and skill learning. In this paper, we investigate how we can incorporate tactile information into imitation learning platforms to improve performance on manipulation tasks. We show that incorporating visuo-tactile pretraining improves imitation learning performance, not only for tactile agents (policies that use tactile information at inference), but also for non-tactile agents (policies that do not use tactile information at inference). For these non-tactile agents, pretraining with tactile information significantly improved performance (for example, improving the accuracy on USB plugging from 20% to 85%), reaching a level on par with visuo-tactile agents, and even surpassing them in some cases. For demonstration videos and access to our codebase, see the project website: https://sites.google.com/andrew.cmu.edu/visuo-tactile-pretraining
|
|
10:15-10:20, Paper TuAT4.5 | Add to My Program |
EasyCalib: Simple and Low-Cost In-Situ Calibration for Force Reconstruction with Vision-Based Tactile Sensors |
|
Li, Mingxuan | Tsinghua University |
Zhang, Lunwei | Tsinghua University |
Zhou, Yen Hang | Tsinghua University |
Li, Tiemin | Tsinghua University |
Jiang, Yao | Tsinghua University |
Keywords: Force and Tactile Sensing, Contact Modeling, Haptics and Haptic Interfaces
Abstract: For elastomer-based tactile sensors, represented by vision-based tactile sensors, routine calibration of mechanical parameters (Young's modulus and Poisson's ratio) has been shown to be important for force reconstruction. However, the reliance on existing in-situ calibration methods for accurate force measurements limits their cost-effective and flexible applications. This article proposes a new in-situ calibration scheme that relies only on comparing contact deformation. Based on the detailed derivations of the normal contact and torsional contact theories, we designed a simple and low-cost calibration device, EasyCalib, and validated its effectiveness through extensive finite element analysis. We also explored the accuracy of EasyCalib in the practical application and demonstrated that accurate contact distributed force reconstruction can be realized based on the mechanical parameters obtained. EasyCalib balances low hardware cost, ease of operation, and low dependence on technical expertise and is expected to provide the necessary accuracy guarantees for wide applications of visuotactile sensors.
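As a rough illustration of how a normal-contact law ties force, indentation, and elastic modulus together, the sketch below inverts the classical Hertz relation for a rigid spherical indenter. This is not EasyCalib's procedure, which calibrates by comparing contact deformations against a reference; the measurement values here are made up for illustration.

```python
# Hertzian normal contact for a rigid sphere on an elastic half-space:
# F = (4/3) * E_eff * sqrt(R) * d**1.5. Inverting a single (force, indentation)
# measurement gives the effective modulus; Young's modulus follows for an
# assumed Poisson's ratio and a rigid indenter.
import numpy as np


def hertz_force(indentation_m: float, radius_m: float, e_eff_pa: float) -> float:
    return (4.0 / 3.0) * e_eff_pa * np.sqrt(radius_m) * indentation_m ** 1.5


def effective_modulus_from_measurement(force_n: float, indentation_m: float,
                                        radius_m: float) -> float:
    """Invert the Hertz relation for one (force, indentation) measurement."""
    return 3.0 * force_n / (4.0 * np.sqrt(radius_m) * indentation_m ** 1.5)


def youngs_modulus(e_eff_pa: float, poisson_ratio: float) -> float:
    """Assuming a rigid indenter, 1/E_eff = (1 - nu^2) / E."""
    return e_eff_pa * (1.0 - poisson_ratio ** 2)


if __name__ == "__main__":
    e_eff = effective_modulus_from_measurement(force_n=0.5, indentation_m=0.5e-3,
                                               radius_m=2e-3)
    print(f"E_eff ~ {e_eff/1e6:.2f} MPa, E ~ {youngs_modulus(e_eff, 0.45)/1e6:.2f} MPa")
```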
|
|
10:20-10:25, Paper TuAT4.6 | Add to My Program |
NormalFlow: Fast, Robust, and Accurate Contact-Based Object 6DoF Pose Tracking with Vision-Based Tactile Sensors |
|
Huang, Hung-Jui | Carnegie Mellon University |
Kaess, Michael | Carnegie Mellon University |
Yuan, Wenzhen | University of Illinois |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation
Abstract: Tactile sensing is crucial for robots aiming to achieve human-level dexterity. Among tactile-dependent skills, tactile-based object tracking serves as the cornerstone for many tasks, including manipulation, in-hand manipulation, and 3D reconstruction. In this work, we introduce NormalFlow, a fast, robust, and real-time tactile-based 6DoF tracking algorithm. Leveraging the precise surface normal estimation of vision-based tactile sensors, NormalFlow determines object movements by minimizing discrepancies between the tactile-derived surface normals. Our results show that NormalFlow consistently outperforms competitive baselines and can track low-texture objects like table surfaces. For long-horizon tracking, we demonstrate when rolling the sensor around a bead for 360 degrees, NormalFlow maintains a rotational tracking error of 2.5 degrees. Additionally, we present state-of-the-art tactile-based 3D reconstruction results, showcasing the high accuracy of NormalFlow. We believe NormalFlow unlocks new possibilities for high-precision perception and manipulation tasks that involve interacting with objects using hands. Please also check our supplementary video to see our method in action.
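The rotational core of normal-based alignment can be illustrated with the classic Wahba/Kabsch least-squares problem solved via SVD, sketched below. NormalFlow itself estimates the full 6DoF pose with its own iterative scheme; this snippet only shows why dense, accurate surface normals constrain rotation so strongly.

```python
# Find the rotation that best maps one set of unit normals onto another in the
# least-squares sense (Wahba/Kabsch problem, closed-form via SVD).
import numpy as np


def best_rotation(normals_src: np.ndarray, normals_dst: np.ndarray) -> np.ndarray:
    """normals_*: (N, 3) unit vectors. Returns R minimizing sum ||R n_src - n_dst||^2."""
    h = normals_src.T @ normals_dst           # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))    # guard against reflections
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n_src = rng.normal(size=(200, 3))
    n_src /= np.linalg.norm(n_src, axis=1, keepdims=True)
    angle = np.deg2rad(10.0)
    r_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                       [np.sin(angle),  np.cos(angle), 0.0],
                       [0.0, 0.0, 1.0]])
    n_dst = n_src @ r_true.T
    print(np.allclose(best_rotation(n_src, n_dst), r_true, atol=1e-6))  # True
```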
|
|
TuAT5 Regular Session, 305 |
Add to My Program |
Aerial Robots 1 |
|
|
Chair: Cheng, Bo | Pennsylvania State University |
Co-Chair: Scaramuzza, Davide | University of Zurich |
|
09:55-10:00, Paper TuAT5.1 | Add to My Program |
Nezha-MB: Design and Implementation of a Morphing Hybrid Aerial-Underwater Vehicle |
|
Xu, Zhuxiu | Shanghai Jiao Tong University |
Shen, Yishu | Shanghai Jiao Tong University |
Bi, Yuanbo | Shanghai Jiao Tong University |
Zeng, Baichuan | The Chinese University of Hong Kong |
Zeng, Zheng | Shanghai Jiao Tong University |
Keywords: Marine Robotics, Field Robots, Aerial Systems: Applications
Abstract: Hybrid aerial-underwater vehicles (HAUVs) show great potential thanks to their ability to operate seamlessly in both air and water. However, achieving rapid operability in both media and maintaining stability during the cross-domain phase remains a significant challenge. Inspired by retractable limbs, this paper presents a novel morphing HAUV, Nezha-MB. During the transition phase, Nezha-MB uses a linear actuator combined with a rack-and-pinion system for arm transformation, replacing conventional servo systems. The transformation mechanism accounts for 11% of the total weight. In aerial mode, Nezha-MB exhibits flight performance comparable to a quadrotor configuration. In underwater mode, Nezha-MB retracts its quadrotor arms into a bullet-shaped hull, significantly reducing drag and energy consumption while allowing it to pass through narrow gaps as small as 134 mm in diameter. Simulations and field tests conducted in both the aerial and underwater domains demonstrate
|
|
10:00-10:05, Paper TuAT5.2 | Add to My Program |
From Ceilings to Walls: Universal Dynamic Perching of Small Aerial Robots on Surfaces with Variable Orientations |
|
Habas, Bryan | The Pennsylvania State University |
Brown, Aaron C. | The Pennsylvania State University |
Lee, Donghyeon | The Pennsylvania State University |
Goldman, Mitchell | Penn State University |
Cheng, Bo | Pennsylvania State University |
Keywords: Aerial Systems: Applications, Surveillance Robotic Systems, AI-Enabled Robotics
Abstract: This work demonstrates universal dynamic perching capabilities for quadrotors of various sizes and on surfaces with different orientations. By employing a non-dimensionalization framework and deep reinforcement learning, we systematically assessed how robot size and surface orientation affect landing capabilities. We hypothesized that maintaining geometric proportions across different robot scales ensures consistent perching behavior, which was validated in both simulation and experimental tests. Additionally, we investigated the effects of joint stiffness and damping in the landing gear on perching behaviors and performance. While joint stiffness had minimal impact, joint damping ratios influenced landing success under vertical approaching conditions. The study also identified a critical velocity threshold necessary for successful perching, determined by the robot's maneuverability and leg geometry. Overall, this research advances robotic perching capabilities, offering insights into the role of mechanical design and scaling effects, and lays the groundwork for future drone autonomy and operational efficiency in unstructured environments.
|
|
10:05-10:10, Paper TuAT5.3 | Add to My Program |
Towards Perpetually-Deployable Ubiquitous Aerial Robotics: An Amphibious Self-Sustainable Solar Small-UAS |
|
Carlson, Stephen | University of Nevada, Reno |
Arora, Prateek | University of Nevada, Reno |
Papachristos, Christos | University of Nevada Reno |
Keywords: Field Robots, Aerial Systems: Applications
Abstract: This work deals with the problem of unlocking perpetual deployment capabilities for small-UAS robotics across the diverse settings of the real world and their challenges, encompassing considerations for marine environments alongside the more common terrestrial ones. Via the progress made within this scope, a step towards truly ubiquitous and self-sustainable aerial robotics is accomplished. The work consists of the development of the Gannet Solar-VTOL, a waterproof small-UAS that is capable of resting on the surface of water for prolonged periods of time and over varying temperature ranges, while harvesting solar power to recharge itself. Equally importantly, it integrates a field-proven Self-Sustainable Autonomous System architecture that allows it to hibernate and sustain its battery charge overnight or during periods of solar illumination scarcity, as well as to assess mission-critical parameters (e.g., water surface turbulence, ambient temperature of battery compartment) on the low-power side of the Power Management Stack, and react appropriately. Finally, the robot is equipped with an onboard camera and a Neural Processing Unit that allows it to perform in-field environmental monitoring operations (e.g., wildfire detection). This paper experimentally demonstrates the aforementioned capabilities, and concludes with a presentation of the amphibious small-UAS' long-term deployment within a marine environment in the N.Nevada region, spanning over 3 consecutive days.
|
|
10:10-10:15, Paper TuAT5.4 | Add to My Program |
Autonomous Drone for Dynamic Smoke Plume Tracking |
|
Pal, Srijan Kumar | University of Minnesota |
Sharma, Shashank | University of Minnesota |
Krishnakumar, Nikil | University of Minnesota |
Hong, Jiarong | University of Minnesota |
Keywords: Aerial Systems: Perception and Autonomy, Reinforcement Learning, Vision-Based Navigation
Abstract: This paper presents a novel autonomous drone-based smoke plume tracking system capable of navigating and tracking plumes in highly unsteady atmospheric conditions. The system integrates advanced hardware and software and a comprehensive simulation environment to ensure robust performance in controlled and real-world settings. The quadrotor, equipped with a high-resolution imaging system and an advanced onboard computing unit, performs precise maneuvers while accurately detecting and tracking dynamic smoke plumes under fluctuating conditions. Our software implements a two-phase flight operation: descending into the smoke plume upon detection and continuously monitoring the smoke's movement during in-plume tracking. Leveraging Proportional Integral–Derivative (PID) control and a Proximal Policy Optimization (PPO) based Deep Reinforcement Learning (DRL) controller enables adaptation to plume dynamics. Unreal Engine simulation evaluates performance under various smoke-wind scenarios, from steady flow to complex, unsteady fluctuations, showing that while the PID controller performs adequately in simpler scenarios, the DRL-based controller excels in more challenging environments. Field tests corroborate these findings. This system opens new possibilities for drone-based monitoring in areas like wildfire management and air quality assessment. The successful integration of DRL for real-time decision-making advances autonomous drone control for dynamic environments.
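For context, the baseline controller mentioned above is a standard PID loop; a minimal discrete implementation is sketched below. The controlled variable (horizontal pixel offset of the plume centroid mapped to a yaw-rate command) and the gains are illustrative assumptions, not the authors' values.

```python
# Minimal discrete PID controller for centring a detected plume in the image.
from dataclasses import dataclass


@dataclass
class PID:
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, error: float, dt: float) -> float:
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt if dt > 0 else 0.0
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


if __name__ == "__main__":
    yaw_pid = PID(kp=0.004, ki=0.0005, kd=0.002)   # illustrative gains
    pixel_offset = 120.0                            # plume centroid 120 px right of centre
    for _ in range(5):
        yaw_rate_cmd = yaw_pid.update(pixel_offset, dt=0.05)
        pixel_offset *= 0.6                         # pretend the drone turns toward the plume
        print(f"yaw rate command: {yaw_rate_cmd:.3f} rad/s")
```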
|
|
10:15-10:20, Paper TuAT5.5 | Add to My Program |
EvMAPPER: High-Altitude Orthomapping with Event Cameras |
|
Cladera, Fernando | University of Pennsylvania |
Chaney, Kenneth | University of Pennsylvania |
Hsieh, M. Ani | University of Pennsylvania |
Taylor, Camillo Jose | University of Pennsylvania |
Kumar, Vijay | University of Pennsylvania |
Keywords: Mapping, Field Robots, Aerial Systems: Applications
Abstract: Traditionally, unmanned aerial vehicles (UAVs) rely on CMOS-based cameras to collect images about the world below. One of the most successful applications of UAVs is to generate orthomosaics or orthomaps, in which a series of images are integrated to develop a larger map. However, using CMOS-based cameras with global or rolling shutters means that orthomaps are vulnerable to challenging light conditions, motion blur, and high-speed motion of independently moving objects (IMOs) under the camera. Event cameras are less sensitive to these issues, as their pixels trigger asynchronously on brightness changes. This work introduces the first orthomosaic approach using event cameras. We focus on addressing high-dynamic range and low-light problems in orthomosaics. In contrast to existing methods relying only on CMOS cameras, our approach enables map generation even in challenging light conditions, including direct sunlight and after sunset.
|
|
10:20-10:25, Paper TuAT5.6 | Add to My Program |
Survey of Simulators for Aerial Robots: An Overview and In-Depth Systematic Comparisons |
|
Dimmig, Cora | Johns Hopkins University |
Silano, Giuseppe | Ceske Vysoke Uceni Technicke V Praze, FEL |
McGuire, Kimberly | Bitcraze AB |
Gabellieri, Chiara | University of Twente |
Hoenig, Wolfgang | TU Berlin |
Moore, Joseph | Johns Hopkins University |
Kobilarov, Marin | Johns Hopkins University |
Keywords: Aerial Systems: Perception and Autonomy, Simulation and Animation, Software, Middleware and Programming Environments
Abstract: Uncrewed Aerial Vehicle (UAV) research faces challenges with safety, scalability, costs, and ecological impact when conducting hardware testing. High-fidelity simulators offer a vital solution by replicating real-world conditions to enable the development and evaluation of novel perception and control algorithms. However, the large number of available simulators poses a significant challenge for researchers to determine which simulator best suits their specific use-case, based on each simulator’s limitations and customization readiness. In this paper we present an overview of 44 UAV simulators, including in-depth, systematic comparisons for 14 of the simulators. Additionally, we present a set of decision factors for selection of simulators, aiming to enhance the efficiency and safety of research endeavors.
|
|
10:25-10:30, Paper TuAT5.7 | Add to My Program |
Robotics Meets Fluid Dynamics: A Characterization of the Induced Airflow below a Quadrotor As a Turbulent Jet |
|
Bauersfeld, Leonard | University of Zurich (UZH) |
Muller, Koen | ETH Zürich |
Ziegler, Dominic | IFD, ETH Zürich |
Coletti, Filippo | ETH Zürich |
Scaramuzza, Davide | University of Zurich |
Keywords: Aerial Systems: Applications, Calibration and Identification, Robust/Adaptive Control
Abstract: The widespread adoption of quadrotors for diverse applications, from agriculture to public safety, necessitates an understanding of the aerodynamic disturbances they create. This paper introduces a computationally lightweight model for estimating the time-averaged magnitude of the induced flow below quadrotors in hover. Unlike related approaches that rely on expensive computational fluid dynamics (CFD) simulations or drone-specific, time-consuming empirical measurements, our method leverages classical theory from turbulent flows. By analyzing over 16 hours of flight data from drones of varying sizes within a large motion capture system, we show for the first time that the combined flow from all drone propellers is well-approximated by a turbulent jet after 2.5 drone diameters below the vehicle. Using a novel normalization and scaling, we experimentally identify model parameters that describe a unified mean velocity field below differently sized quadrotors. The model, which requires only the drone's mass, propeller size, and drone size for calculations, accurately describes the far-field airflow over a long range in a very large volume that is impractical to simulate using CFD. Our model offers a practical tool for ensuring safer operations near humans, optimizing sensor placements, and drone control in multi-agent scenarios. We demonstrate the latter by designing a controller that compensates for the downwash of another drone, leading to a four times lower altitude deviation when passing below.
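A hedged sketch of the kind of model described above follows: momentum-theory induced velocity at the rotor plane, a centreline downwash decaying like 1/z below the vehicle, and a self-similar Gaussian radial profile. The decay and spreading constants are placeholders; the paper identifies its own parameters from flight data.

```python
# Classical round-turbulent-jet style model of quadrotor downwash (far field).
import numpy as np

RHO_AIR = 1.225  # kg/m^3
G = 9.81         # m/s^2


def hover_induced_velocity(mass_kg: float, prop_diameter_m: float, n_props: int = 4) -> float:
    """Momentum-theory induced velocity at the rotor plane in hover."""
    disk_area = n_props * np.pi * (prop_diameter_m / 2.0) ** 2
    return np.sqrt(mass_kg * G / (2.0 * RHO_AIR * disk_area))


def jet_velocity(z_below_m, r_radial_m, v0: float, drone_diameter_m: float,
                 decay_b: float = 5.0, spread_s: float = 0.1):
    """Time-averaged downwash magnitude at depth z and radial offset r.
    decay_b and spread_s are placeholder constants, not the paper's values."""
    z = np.maximum(np.asarray(z_below_m, dtype=float), 1e-6)
    centreline = decay_b * v0 * drone_diameter_m / z        # ~1/z decay
    return centreline * np.exp(-(np.asarray(r_radial_m) / (spread_s * z)) ** 2)


if __name__ == "__main__":
    v0 = hover_induced_velocity(mass_kg=0.8, prop_diameter_m=0.127)
    print(f"induced velocity at rotor plane ~ {v0:.1f} m/s")
    print(jet_velocity(z_below_m=[1.0, 2.0, 4.0], r_radial_m=0.0,
                       v0=v0, drone_diameter_m=0.25))
```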
|
|
TuAT6 Regular Session, 307 |
Add to My Program |
Perception for Mobile Robots 1 |
|
|
Chair: Everett, Michael | Northeastern University |
Co-Chair: Liang, Claire Yilan | Cornell University |
|
09:55-10:00, Paper TuAT6.1 | Add to My Program |
CoDynTrust: Robust Asynchronous Collaborative Perception Via Dynamic Feature Trust Modulus |
|
Xu, Yunjiang | Soochow University |
Li, Lingzhi | Soochow University |
Wang, Jin | Soochow University |
Yang, Benyuan | Xidian University |
Wu, ZhiWen | Soochow University |
Chen, Xinhong | City University of Hong Kong |
Wang, Jianping | City University of Hong Kong |
Keywords: Object Detection, Segmentation and Categorization, Multi-Robot Systems, Intelligent Transportation Systems
Abstract: Collaborative perception, which fuses information from multiple agents, can extend the perception range and thus improve perception performance. However, temporal asynchrony in real-world environments, caused by communication delays, clock misalignment, or differences in sampling configuration, can lead to information mismatches. If this is not handled well, collaborative performance degrades and, worse, safety accidents may occur. To tackle this challenge, we propose CoDynTrust, an uncertainty-encoded asynchronous fusion perception framework that is robust to the information mismatches caused by temporal asynchrony. CoDynTrust generates a dynamic feature trust modulus (DFTM) for each region of interest by modeling aleatoric and epistemic uncertainty and by selectively suppressing or retaining single-vehicle features, thereby mitigating information mismatches. We then design a multi-scale fusion module to handle multi-scale feature maps processed by DFTM. Compared to existing works that also consider asynchronous collaborative perception, CoDynTrust combats various forms of low-quality information in temporally asynchronous scenarios and allows uncertainty to be propagated to downstream tasks such as planning and control. Experimental results demonstrate that CoDynTrust significantly reduces the performance degradation caused by temporal asynchrony across multiple datasets, achieving state-of-the-art detection performance even under temporal asynchrony. The code is available at https://github.com/CrazyShout/CoDynTrust.
|
|
10:00-10:05, Paper TuAT6.2 | Add to My Program |
The Devil Is in the Quality: Exploring Informative Samples for Semi-Supervised Monocular 3D Object Detection |
|
Zhang, Zhipeng | KargoBot |
Li, Zhenyu | KAUST |
Wang, Hanshi | CASIA |
Yuan, He | KargoBot |
Wang, Ke | Kargobot.AI |
Fan, Heng | University of North Texas |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, Recognition
Abstract: This paper tackles the challenging problem of semi-supervised monocular 3D object detection with a general framework. Specifically, having observed that the bottleneck of this task lies in the lack of reliable and informative samples from unlabeled data for detector learning, we introduce a simple yet effective 'Augment and Criticize' pipeline that mines abundant informative samples for robust detection. In the 'Augment' stage, we present Augmentation-based Prediction aGgregation (APG), which applies automatically learned transformations to unlabeled images and aggregates detections from various augmented views as pseudo labels. Since not all the pseudo labels from APG are beneficially informative, the subsequent 'Criticize' phase is introduced. In particular, we present the Critical Retraining Strategy (CRS) which, unlike simply filtering pseudo labels using a fixed threshold, employs a learnable network to evaluate the contribution of unlabeled images at different training timestamps. This way, noisy samples detrimental to model evolution can be effectively suppressed. To validate 'Augment-Criticize', we apply it to MonoDLE and MonoFlex, and the two new detectors, dubbed 3DSeMo_DLE and 3DSeMo_FLEX, achieve state-of-the-art results with consistent improvements, evidencing the effectiveness and generality of our approach.
|
|
10:05-10:10, Paper TuAT6.3 | Add to My Program |
MonoCT: Overcoming Monocular 3D Detection Domain Shift with Consistent Teacher Models |
|
Meier, Johannes | Technical University of Munich, DeepScenario |
Inchingolo, Louis | Technical University of Munich |
Dhaouadi, Oussema | Technical University of Munich |
Xia, Yan | Technical University of Munich |
Kaiser, Jacques | DeepScenario |
Cremers, Daniel | Technical University of Munich |
Keywords: Object Detection, Segmentation and Categorization, Computer Vision for Transportation, Deep Learning for Visual Perception
Abstract: We tackle the problem of monocular 3D object detection across different sensors, environments, and camera setups. In this paper, we introduce a novel unsupervised domain adaptation approach, MonoCT, that generates highly accurate pseudo labels for self-supervision. Inspired by our observation that accurate depth estimation is critical to mitigating domain shifts, MonoCT introduces a novel Generalized Depth Enhancement (GDE) module with an ensemble concept to improve depth estimation accuracy. Moreover, we introduce a novel Pseudo Label Scoring (PLS) module by exploring inner-model consistency measurement and a Diversity Maximization (DM) strategy to further generate high-quality pseudo labels for self-training. Extensive experiments on six benchmarks show that MonoCT outperforms existing SOTA domain adaptation methods by large margins (∼21% minimum for AP Mod.) and generalizes well to car, traffic camera and drone views.
|
|
10:10-10:15, Paper TuAT6.4 | Add to My Program |
LiDAR Inertial Odometry and Mapping Using Learned Registration-Relevant Features |
|
Dong, Zihao | Northeastern University |
Pflueger, Jeff | Northeastern University |
Jung, Leonard | Northeastern University |
Thorne, David | University of California, Los Angeles |
Osteen, Philip | U.S. Army Research Laboratory |
Robison, Christopher, Christa | Army Research Laboratory |
Lopez, Brett | University of California, Los Angeles |
Everett, Michael | Northeastern University |
Keywords: AI-Based Methods, Localization, SLAM
Abstract: SLAM is an important capability for many autonomous systems, and modern LiDAR-based methods offer promising performance. However, for long-duration missions, existing works that either take the full pointclouds directly or use extracted features face key tradeoffs in accuracy and computational efficiency (e.g., memory consumption). To address these issues, this paper presents DFLIOM with several key innovations. Unlike previous methods that rely on handcrafted heuristics and hand-tuned parameters for feature extraction, we propose a learning-based approach that selects points relevant to LiDAR SLAM pointcloud registration. Furthermore, we extend our prior work DLIOM with the learned feature extractor and observe that our method enables similar or even better localization performance using only about 20% of the points in the dense point clouds. We demonstrate that DFLIOM performs well on multiple public benchmarks, achieving a 2.4% decrease in localization error and a 57.5% decrease in memory usage compared to the state-of-the-art method (DLIOM). Although extracting features with the proposed network requires extra time, it is offset by faster processing downstream, thus maintaining real-time performance using a 20 Hz LiDAR on our hardware setup. The effectiveness of our learning-based feature extraction module is further demonstrated through comparison with several handcrafted feature extractors.
|
|
10:15-10:20, Paper TuAT6.5 | Add to My Program |
DreamDrive: Generative 4D Scene Modeling from Street View Images |
|
Mao, Jiageng | University of Southern California |
Li, Boyi | UC Berkeley |
Ivanovic, Boris | NVIDIA |
Chen, Yuxiao | Nvidia Research |
Wang, Yan | NVIDIA |
You, Yurong | Cornell University |
Xiao, Chaowei | University of Wisconsin, Madison |
Xu, Danfei | Georgia Institute of Technology |
Pavone, Marco | Stanford University |
Wang, Yue | USC |
Keywords: Computer Vision for Automation, Autonomous Vehicle Navigation, Virtual Reality and Interfaces
Abstract: Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with maintaining 3D visual consistency. In this paper, we present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction, to synthesize generalizable 4D driving scenes and dynamic driving videos with 3D consistency. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references and further elevate them to 4D with a novel hybrid Gaussian representation. Given a driving trajectory, we then render 3D-consistent driving videos via Gaussian splatting. The use of generative priors allows our method to produce high-quality 4D scenes from in-the-wild driving data, while neural rendering ensures 3D-consistent video generation from the 4D scenes. Extensive experiments on nuScenes and in-the-wild driving data demonstrate that DreamDrive can generate controllable and generalizable 4D driving scenes, synthesize novel views of driving videos with high fidelity and 3D consistency, decompose static and dynamic elements in a self-supervised manner, and enhance perception and planning tasks for autonomous driving.
|
|
10:20-10:25, Paper TuAT6.6 | Add to My Program |
DISORF: A Distributed Online 3D Reconstruction Framework for Mobile Robots |
|
Li, Chunlin | University of Toronto |
Fan, Hanrui | University of Toronto |
Huang, Xiaorui | University of Toronto |
Liang, Ruofan | University of Toronto |
Durvasula, Sankeerth | University of Toronto |
Vijaykumar, Nandita | University of Toronto |
Keywords: Visual Learning, Incremental Learning, Mapping
Abstract: We present a framework, DISORF, to enable online 3D reconstruction and visualization of scenes captured by resource-constrained mobile robots and edge devices. To address the limited computing capabilities of edge devices and potentially limited network availability, we design a framework that efficiently distributes computation between the edge device and the remote server. We leverage on-device SLAM systems to generate posed keyframes and transmit them to remote servers that can perform high-quality 3D reconstruction and visualization at runtime by leveraging recent advances in neural 3D methods. We identify a key challenge with online training where naive image sampling strategies can lead to significant degradation in rendering quality. We propose a novel shifted exponential frame sampling method that addresses this challenge for online training. We demonstrate the effectiveness of our framework in enabling high-quality real-time reconstruction and visualization of unknown scenes as they are captured and streamed from cameras in mobile robots and edge devices.
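One plausible reading of the shifted exponential frame sampling idea, recent keyframes drawn with exponentially higher probability while a constant shift keeps older frames from being starved, is sketched below. The exact weighting used by DISORF may differ; this is only a conceptual illustration.

```python
# Conceptual sketch: sample training keyframes with a shifted exponential
# weighting over frame age, so recent frames dominate but old frames are not starved.
import numpy as np


def shifted_exponential_weights(n_keyframes: int, rate: float = 0.1, shift: float = 0.05):
    age = np.arange(n_keyframes)[::-1]            # newest keyframe has age 0
    weights = np.exp(-rate * age) + shift
    return weights / weights.sum()


def sample_training_frames(n_keyframes: int, batch: int, rng=np.random.default_rng()):
    p = shifted_exponential_weights(n_keyframes)
    return rng.choice(n_keyframes, size=batch, p=p)


if __name__ == "__main__":
    print(sample_training_frames(n_keyframes=50, batch=8))
```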
|
|
10:25-10:30, Paper TuAT6.7 | Add to My Program |
Key-Scan-Based Mobile Robot Navigation: Integrated Mapping, Planning, and Control Using Graphs of Scan Regions |
|
Bashkaran Latha, Dharshan | Eindhoven University of Technology |
Arslan, Omur | Eindhoven University of Technology |
Keywords: Reactive and Sensor-Based Planning, Integrated Planning and Control, Motion and Path Planning
Abstract: Safe autonomous navigation in a priori unknown environments is an essential skill for mobile robots to reliably and adaptively perform diverse tasks (e.g., delivery, inspection, and interaction) in unstructured cluttered environments. Hybrid metric-topological maps, constructed as a pose graph of local submaps, offer a computationally efficient world representation for adaptive mapping, planning, and control at the regional level. In this paper, we consider a pose graph of locally sensed star-convex scan regions as a metric-topological map, with star convexity enabling simple yet effective local navigation strategies. We design a new family of safe local scan navigation policies and present a perception-driven feedback motion planning method through the sequential composition of local scan navigation policies, enabling provably correct and safe robot navigation over the union of local scan regions. We introduce a new concept of frontier and bridging scans for automated key scan selection and exploration for integrated mapping and navigation in unknown environments. We demonstrate the effectiveness of our key-scan-based navigation and mapping framework using a mobile robot equipped with a 360° laser range scanner in 2D cluttered environments through numerical ROS-Gazebo simulations and real hardware experiments.
|
|
TuAT7 Regular Session, 309 |
Add to My Program |
Legged Locomotion: Novel Methods |
|
|
Chair: Lynch, Kevin | Northwestern University |
Co-Chair: Kim, Joohyung | University of Illinois Urbana-Champaign |
|
09:55-10:00, Paper TuAT7.1 | Add to My Program |
Angular Divergent Component of Motion: A Step towards Planning Spatial DCM Objectives for Legged Robots |
|
Herron, Connor | Virginia Tech |
Schuller, Robert | German Aerospace Center (DLR) |
Beiter, Benjamin | Virginia Polytechnic Institute and State University |
Griffin, Robert J. | Institute for Human and Machine Cognition (IHMC) |
Leonessa, Alexander | Virginia Tech |
Englsberger, Johannes | DLR (German Aerospace Center) |
Keywords: Humanoid and Bipedal Locomotion, Body Balancing, Whole-Body Motion Planning and Control
Abstract: In this work, the Divergent Component of Motion (DCM) method is expanded to include angular coordinates for the first time. This work introduces the idea of spatial DCM, which adds an angular objective to the existing linear DCM theory. To incorporate the angular component into the framework, a discussion is provided on extending beyond the linear motion of the Linear Inverted Pendulum model (LIPM) towards the Single Rigid Body model (SRBM) for DCM. This work presents the angular DCM theory for a 1D rotation, simplifying the SRBM rotational dynamics to a flywheel to satisfy necessary linearity constraints. The 1D angular DCM is mathematically identical to the linear DCM and defined as an angle which is ahead of the current body rotation based on the angular velocity. This theory is combined into a 3D linear and 1D angular DCM framework, with discussion on the feasibility of simultaneously achieving both sets of objectives. A simulation in MATLAB and hardware results on the TORO humanoid are presented to validate the framework's performance.
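As a quick numeric companion to the abstract, the sketch below evaluates the standard linear DCM of the LIP, xi = c + c_dot/omega0 with omega0 = sqrt(g/z0), and its 1D angular analogue described above, an angle ahead of the current rotation by theta_dot/omega. The flywheel frequency used is an arbitrary illustrative value.

```python
# Linear DCM of the LIP and the 1D angular analogue described in the abstract.
import numpy as np

G = 9.81


def linear_dcm(com_pos, com_vel, com_height: float):
    """xi = c + c_dot / omega0, with omega0 = sqrt(g / z0)."""
    omega0 = np.sqrt(G / com_height)
    return np.asarray(com_pos) + np.asarray(com_vel) / omega0


def angular_dcm(theta: float, theta_dot: float, omega: float) -> float:
    """Angle 'ahead' of the current body rotation by theta_dot / omega."""
    return theta + theta_dot / omega


if __name__ == "__main__":
    print(linear_dcm(com_pos=[0.05, 0.00], com_vel=[0.30, 0.10], com_height=0.9))
    print(angular_dcm(theta=0.1, theta_dot=0.5, omega=3.3))  # illustrative omega
```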
|
|
10:00-10:05, Paper TuAT7.2 | Add to My Program |
Finite-Step Capturability and Recursive Feasibility for Bipedal Walking in Constrained Regions |
|
Kumbhar, Shubham | University of Delaware |
Kulkarni, Abhijeet Mangesh | University of Delaware |
Poulakakis, Ioannis | University of Delaware |
Keywords: Humanoid and Bipedal Locomotion, Legged Robots, Humanoid Robot Systems
Abstract: This paper presents a Model Predictive Control (MPC) formulation for bipedal footstep planning based on the Linear Inverted Pendulum (LIP) model, ensuring recursive feasibility when navigating restricted regions. The proposed approach incorporates capturability and introduces a new constraint that forces the Divergent Component of Motion (DCM) into a finite-step capture region, adjusted between consecutive MPC calls. This constraint enables the MPC to anticipate beyond its prediction horizon, preventing collisions with the walking surface boundaries. We validate the approach through high-fidelity simulations with the bipedal robot Digit, demonstrating recursively feasible MPC footstep planning in restricted regions. Future efforts will extend the approach to general polytopic constraints, thereby facilitating footstep planning in cluttered environments while preserving the MPC's recursive feasibility.
|
|
10:05-10:10, Paper TuAT7.3 | Add to My Program |
Realtime Limb Trajectory Optimization for Humanoid Running through Centroidal Angular Momentum Dynamics |
|
Sovukluk, Sait | TU Wien |
Schuller, Robert | German Aerospace Center (DLR) |
Englsberger, Johannes | DLR (German Aerospace Center) |
Ott, Christian | TU Wien |
Keywords: Humanoid and Bipedal Locomotion, Whole-Body Motion Planning and Control, Optimization and Optimal Control
Abstract: One of the essential aspects of humanoid robot running is determining the limb-swinging trajectories. During the flight phases, where the ground reaction forces are not available for regulation, the limb swinging trajectories are significant for the stability of the next stance phase. Due to the conservation of angular momentum, improper leg and arm swinging results in highly tilted and unsustainable body configurations at the next stance phase landing. In such cases, the robotic system fails to maintain locomotion independent of the stability of the center of mass trajectories. This problem is more apparent for fast and high flight time trajectories. This paper proposes a real-time nonlinear limb trajectory optimization problem for humanoid running. The optimization problem is tested on two different humanoid robot models, and the generated trajectories are verified using a running algorithm for both robots in a simulation environment.
|
|
10:10-10:15, Paper TuAT7.4 | Add to My Program |
Pitching Motion in a Humanoid Robot Using Human-Inspired Shoulder Elastic Energy and Motor Torque Optimization |
|
Nakazawa, Yuri | Waseda University |
Iwamoto, Masaki | Waseda University |
Watanabe, Ryuhya | Waseda University |
Aoki, Riku | Waseda University |
Mineshita, Hiroki | Waseda University |
Otani, Takuya | Shibaura Institute of Technology |
Kawakami, Yasuo | Waseda University |
Lim, Hun-ok | Kanagawa University |
Takanishi, Atsuo | Waseda University |
Keywords: Modeling and Simulating Humans, Humanoid Robot Systems, Human and Humanoid Motion Analysis and Synthesis
Abstract: Humanoid robots that mimic human movement have garnered significant attention in recent years. This study focuses on mimicking the efficient pitching motion of humans by incorporating two main approaches into a humanoid robot: (1) the use of elastic elements to assist joint torque, and (2) the optimization of motor torque to minimize energy consumption. This robot is intended to emulate human physical characteristics, such as mass, link length, and center of gravity, with a particular focus on utilizing the elastic energy generated during shoulder internal and external rotation. A leaf spring is attached in parallel with the motor at the shoulder pitch joint to release the elastic energy stored during shoulder external rotation, thereby assisting internal rotation in a manner similar to human biomechanics. Additionally, motor torque optimization is performed using Fujitsu's Digital Annealer to generate energy-efficient motions. Experiments conducted through simulations and with an actual pitching robot assessed the effectiveness of these technologies in mimicking human-like pitching motion. The results suggest that combining elastic elements with motion optimization techniques enables robots to achieve more efficient human-like movements.
|
|
10:15-10:20, Paper TuAT7.5 | Add to My Program |
Single-Stage Optimization of Open-Loop Stable Limit Cycles with Smooth, Symbolic Derivatives |
|
Saud Ul Hassan, Muhammad | Advanced Micro Devices, Inc |
Hubicki, Christian | Florida State University |
Keywords: Legged Robots, Optimization and Optimal Control, Passive Walking
Abstract: Open-loop stable limit cycles are foundational to legged robotics, providing inherent self-stabilization that minimizes the need for computationally intensive feedback-based gait correction. While previous methods have primarily targeted specific robotic models, this paper introduces a general framework for rapidly generating limit cycles across various dynamical systems, with the flexibility to impose arbitrarily tight stability bounds. We formulate the problem as a single-stage constrained optimization problem and use Direct Collocation to transcribe it into a nonlinear program with closed-form expressions for constraints, objectives, and their gradients. Our method supports multiple stability formulations. In particular, we tested two popular formulations for limit cycle stability in robotics: (1) based on the spectral radius of a discrete return map, and (2) based on the spectral radius of the monodromy matrix, and tested five different constraint-satisfaction formulations of the eigenvalue problem to bound the spectral radius. We compare the performance and solution quality of the various formulations on a robotic swing-leg model, highlighting the Schur decomposition of the monodromy matrix as a method with broader applicability due to weaker assumptions and stronger numerical convergence properties. As a case study, we apply our method on a hopping robot model, generating open-loop stable gaits in under 2 seconds on an Intel Core i7-6700K, while simultaneously minimizing energy consumption even under tight stability constraints.
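As an illustration of the stability quantity being constrained, the sketch below numerically estimates the Jacobian of a Poincaré return map at a fixed point and its spectral radius; the paper instead bounds this quantity inside the optimization with closed-form symbolic derivatives, so the function return_map and the fixed point x_star here are placeholders rather than the paper's models.

    import numpy as np

    def spectral_radius_of_return_map(return_map, x_star, eps=1e-6):
        """Finite-difference Jacobian of a Poincare return map at a fixed point
        x_star; a spectral radius below 1 indicates a locally stable limit cycle."""
        n = len(x_star)
        J = np.zeros((n, n))
        fx = return_map(x_star)
        for i in range(n):
            dx = np.zeros(n)
            dx[i] = eps
            J[:, i] = (return_map(x_star + dx) - fx) / eps
        return np.max(np.abs(np.linalg.eigvals(J)))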
|
|
10:20-10:25, Paper TuAT7.6 | Add to My Program |
Iterative Periodic Running Control through Swept Angle Adjustment with Modified SLIP Model |
|
Kang, Woosong | Korea Institute of Machinery & Materials (KIMM) |
Jeong, Jeil | Korea Advanced Institute of Science and Technology |
Hong, Jeongwoo | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Yeo, Changmin | DGIST |
Park, Dongil | Korea Institute of Machinery and Materials (KIMM) |
Oh, Sehoon | DGIST |
Keywords: Legged Robots, Dynamics, Humanoid and Bipedal Locomotion
Abstract: This paper presents a periodic running control strategy based on a modified Spring-Loaded Inverted Pendulum (SLIP) model to achieve stable running at various velocities. While the traditional SLIP model is valued for its simplicity and intuitive representation of running dynamics, its limitations impede its extension and integration with feedback control systems. To address this, we introduce a novel Quasi-Linearized SLIP model (QLSLIP) that incorporates additional forces in the radial and angular directions to enable stable running across various velocities. This model simplifies the analytical representation of the stance phase and defines the required swept angle for maintaining periodic motion during the flight phase. Using this model, we develop a feedback control system that ensures the stability of QLSLIP-based periodic locomotion, even in the presence of external disturbances. This control framework optimizes trajectories and sustains periodic motion in real-time across diverse scenarios. Additionally, we propose an algorithm to extend this approach to articulated leg mechanisms. The effectiveness of the proposed algorithm is validated through simulations under various conditions, demonstrating improvements in the stability and performance of running.
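For context, the sketch below integrates the stance phase of the classical (unmodified) SLIP model that the QLSLIP above builds on; the QLSLIP adds extra radial and angular forces that are not reproduced here, and all parameter values are illustrative only.

    import numpy as np
    from scipy.integrate import solve_ivp

    # Classical SLIP stance dynamics with the foot at the origin and the leg
    # angle theta measured from the vertical; parameters are illustrative.
    m, k, r0, g = 80.0, 20e3, 1.0, 9.81

    def slip_stance(t, s):
        r, dr, th, dth = s
        ddr = r * dth**2 - g * np.cos(th) + (k / m) * (r0 - r)
        ddth = (-2.0 * dr * dth + g * np.sin(th)) / r
        return [dr, ddr, dth, ddth]

    # touchdown at full leg length, descending CoM, leg swept back by 0.3 rad
    sol = solve_ivp(slip_stance, (0.0, 0.5), [r0, -0.8, -0.3, 1.5], max_step=1e-3)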
|
|
10:25-10:30, Paper TuAT7.7 | Add to My Program |
Efficient, Responsive, and Robust Hopping on Deformable Terrain |
|
Lynch, Daniel | Northwestern University |
Pusey, Jason | U.S. Army Research Laboratory (ARL) |
Gart, Sean | US Army Research Lab |
Umbanhowar, Paul | Northwestern University |
Lynch, Kevin | Northwestern University |
Keywords: Legged Robots, Dynamics, Compliance and Impedance Control, Granular Media
Abstract: Legged robot locomotion is hindered by a mismatch between applications featuring deformable substrates, where legs can outperform wheels or treads, and existing planners and controllers, most of which assume flat, rigid substrates. In this study we focus on the effects of plastic ground deformation on the hop-to-hop energy dynamics of a hopping robot driven by a switched-compliance energy injection controller. We derive a hop-to-hop energy return map, and we use experiments and simulations to validate this map for a real robot hopping on a real deformable substrate. By analyzing the map’s fixed points and eigenvalues, we identify constant-fixed-point surfaces in parameter space that suggest it is possible to tune control parameters for efficiency or responsiveness while targeting a desired gait energy level. We also identify conditions for which the map’s fixed points are globally stable, and we characterize the basins of attraction of fixed points when these conditions are not satisfied. We conclude by discussing the implications of this energy map for planning, control, and estimation for efficient, agile, and robust legged locomotion on deformable terrain.
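The kind of scalar return-map analysis described above can be sketched compactly; the energy map f below is a placeholder for the derived hop-to-hop map, and the bracketing interval is an assumption.

    import numpy as np
    from scipy.optimize import brentq

    def energy_fixed_point(f, lo, hi, eps=1e-6):
        """For a hop-to-hop energy map E_{k+1} = f(E_k): locate a fixed point
        f(E*) = E* in [lo, hi] and test local stability via |f'(E*)| < 1."""
        E_star = brentq(lambda E: f(E) - E, lo, hi)
        slope = (f(E_star + eps) - f(E_star - eps)) / (2.0 * eps)
        return E_star, abs(slope) < 1.0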
|
|
TuAT8 Regular Session, 311 |
Add to My Program |
Medical Robotics 1 |
|
|
Chair: Kuntz, Alan | University of Utah |
Co-Chair: Nanayakkara, Thrishantha | Imperial College London |
|
09:55-10:00, Paper TuAT8.1 | Add to My Program |
Accounting for Hysteresis in the Forward Kinematics of Nonlinearly-Routed Tendon-Driven Continuum Robots Via a Learned Deep Decoder Network |
|
Cho, Brian Y | University of Utah |
Esser, Daniel | Vanderbilt University |
Thompson, Jordan | University of Utah |
Thach, Bao | University of Utah |
Webster III, Robert James | Vanderbilt University |
Kuntz, Alan | University of Utah |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Medical Robots and Systems, Deep Learning Methods
Abstract: Tendon-driven continuum robots have been gaining popularity in medical applications due to their ability to curve around complex anatomical structures, potentially reducing the invasiveness of surgery. However, accurate modeling is required to plan and control the movements of these flexible robots. Physics-based models have limitations due to unmodeled effects, leading to mismatches between model prediction and actual robot shape. Recently proposed learning-based methods have been shown to overcome some of these limitations but do not account for hysteresis, a significant source of error for these robots. To overcome these challenges, we propose a novel deep decoder neural network that predicts the complete shape of tendon-driven robots using point clouds as the shape representation, conditioned on prior configurations to account for hysteresis. We evaluate our method on a physical tendon-driven robot and show that our network model accurately predicts the robot's shape, significantly outperforming a state-of-the-art physics-based model and a learning-based model that does not account for hysteresis.
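A minimal PyTorch-style sketch of the central idea: conditioning a point-cloud decoder on the previous configuration as well as the current one so that hysteresis can be captured. Layer sizes, names, and the input parameterization are hypothetical and do not reproduce the paper's architecture.

    import torch
    import torch.nn as nn

    class HysteresisAwareDecoder(nn.Module):
        """Maps (previous tendon configuration, current configuration) to an
        N x 3 point cloud representing the robot shape."""
        def __init__(self, n_tendons=4, n_points=512):
            super().__init__()
            self.n_points = n_points
            self.net = nn.Sequential(
                nn.Linear(2 * n_tendons, 256), nn.ReLU(),
                nn.Linear(256, 1024), nn.ReLU(),
                nn.Linear(1024, n_points * 3),
            )

        def forward(self, q_prev, q_curr):
            x = torch.cat([q_prev, q_curr], dim=-1)  # conditioning on history
            return self.net(x).view(-1, self.n_points, 3)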
|
|
10:00-10:05, Paper TuAT8.2 | Add to My Program |
Graph-Based Spatial Reasoning for Tracking Landmarks in Dynamic Laparoscopic Environments |
|
Zhang, Jie | Huazhong University of Science and Technology |
Wang, Yiwei | Huazhong University of Science and Technology |
Zhou, Song | Huazhong University of Science and Technology |
Zhao, Huan | Huazhong University of Science and Technology |
Wan, Chidan | Huazhong University of Science and Technology |
Cai, Xiong | Huazhong University of Science and Technology |
Ding, Han | Huazhong University of Science and Technology |
Keywords: Surgical Robotics: Laparoscopy, Semantic Scene Understanding, Medical Robots and Systems
Abstract: Accurate anatomical landmark tracking is crucial yet challenging in laparoscopic surgery due to the changing appearance of landmarks during dynamic tool-anatomy interactions and visual domain shifts between cases. Unlike appearance-based detection methods, this work proposes a novel graph-based approach to reconstruct the entire target landmark area by explicitly modeling the evolving spatial relations over time among scenario entities, including observable regions, surgical tools, and landmarks. Considering tool-anatomy interactions, we present the Tool-Anatomy Interaction Graph (TAI-G), a spatio-temporal graph that captures spatial dependencies among entities, attribute interactions within entities, and temporal dependencies of spatial relations. To mitigate domain shifts, geometric segmentation features are designated as node attributes, representing domain-invariant image information in the graph space. Message passing with attention helps propagate information across TAI-G, enhancing robust tracking by reconstructing landmark data. Evaluated on laparoscopic cholecystectomy, our framework demonstrates effective handling of complex tool-anatomy interactions and visual domain gaps to accurately track landmarks, showing promise in enhancing the stability and reliability of intricate surgical tasks.
|
|
10:05-10:10, Paper TuAT8.3 | Add to My Program |
A Robust Deep Reinforcement Learning Framework for Image-Based Autonomous Guidewire Navigation |
|
Yoo, Sangbaek | KAIST |
Kwon, Hojun | KAIST |
Choi, Jaesoon | Asan Medical Center |
Chang, Dong Eui | KAIST |
Keywords: Reinforcement Learning, Medical Robots and Systems, Vision-Based Navigation
Abstract: Percutaneous coronary intervention (PCI) involves the insertion of a catheter or guidewire into a blood vessel of a patient, which poses a problem as a doctor is exposed to radiation during the procedure. The use of assistive robots has been proposed to address this issue. Furthermore, recent research is progressing toward complete autonomous navigation using deep reinforcement learning (DRL). Nevertheless, existing algorithms face limitations when operating in numerous unseen environments close to real PCI. This study proposes a robust DRL framework for image-based guidewire navigation to overcome these limitations. We introduce a subtasks strategy and domain randomization to improve robustness in various environments. The subtasks strategy consistently addresses complex global tasks by breaking them into subtasks designed using local maps, allowing them to be robustly solved by a single agent. Domain randomization is applied to handle real PCI issues, including variations in vessel geometry, guidewire deformation, and camera settings. By integrating the two novel methods, our DRL algorithm demonstrates superior performance compared to existing methods across various challenging simulation and phantom environments, validating its effectiveness in real-world scenarios. A video of our experiment is available at https://youtu.be/93Q88gESzOY.
|
|
10:10-10:15, Paper TuAT8.4 | Add to My Program |
CTS: A Consistency-Based Medical Image Segmentation Model |
|
Zhang, Kejia | Harbin Engineering University |
Zhang, Lan | Harbin Engineering University |
Pan, Haiwei | Harbin Engineering University |
Keywords: Deep Learning Methods, Computer Vision for Medical Robotics
Abstract: In medical image segmentation tasks, diffusion models have shown great potential. However, mainstream diffusion models exhibit drawbacks including multiple sampling passes and slow prediction. Recently, consistency models, as an independent class of generative networks, have addressed these issues. Compared with diffusion models, consistency models can reduce sampling to a single step, not only achieving comparable generation quality but also significantly accelerating training and prediction. However, they are not directly suited to image segmentation tasks, and their application in the medical imaging field has not yet been investigated. Therefore, this study adopts a consistency model to perform medical image segmentation tasks, and designs a multi-scale feature signal supervision scheme and a loss function to guide the model to convergence. Experiments show that the CTS model achieves better medical image segmentation results.
|
|
10:15-10:20, Paper TuAT8.5 | Add to My Program |
An Adversarial Learning Framework for Reliable Myoelectric Force Estimation under Fatigue |
|
Pan, Huiming | Shanghai Jiao Tong University |
Li, Dongxuan | Shanghai Jiao Tong University |
Chen, Chen | Shanghai Jiao Tong University |
Jiang, Shuo | Tongji University |
Shull, Peter B. | Shanghai Jiao Tong University |
Keywords: Prosthetics and Exoskeletons, Deep Learning Methods, Force and Tactile Sensing
Abstract: Electromyography (EMG) signals are widely used as control inputs for myoelectric exoskeletons. However, muscle fatigue, which can result from prolonged use or heavy loads, significantly affects muscle activation patterns, leading to reduced estimation accuracy. To address this challenge, we propose an adversarial learning framework to enhance grip force estimation under fatigue conditions. The framework consists of three key components: a domain-invariant feature extractor to mitigate domain shifts between non-fatigue and fatigue states, a force estimator to predict grip forces from these domain-invariant features, and a domain discriminator to distinguish between the two domains. The proposed method was evaluated on a dataset collected from eight participants performing gripping tasks under both non-fatigue and fatigue conditions, during which high-density EMG signals and grip forces were recorded simultaneously. Experimental results demonstrated that our method significantly reduced the root mean square error (RMSE) from 0.264 to 0.127, outperforming a baseline model consisting of only the feature extractor and force estimator ( p < 0.01). Additionally, the proposed approach exhibited consistent performance across all participants, highlighting its robustness and generalizability. These findings suggest that the proposed adversarial learning framework effectively enhances grip force estimation accuracy under muscle fatigue, offering a promising solution for improving the reliability and usability of myoelectric exoskeletons.
|
|
10:20-10:25, Paper TuAT8.6 | Add to My Program |
An Origami-Inspired Endoscopic Capsule with Tactile Perception for Early Tissue Anomaly Detection |
|
Ge, Yukun | Imperial College London |
Zong, Rui | Imperial College London |
Chen, Xiaoshuai | Imperial College London |
Nanayakkara, Thrishantha | Imperial College London |
Keywords: Soft Sensors and Actuators, Force and Tactile Sensing, Medical Robots and Systems
Abstract: Video Capsule Endoscopy (VCE) is currently one of the most effective methods for detecting intestinal diseases. However, it is challenging to detect early-stage small nodules with this method because they lack obvious color or shape features. In this letter, we present a new origami capsule endoscope to detect early small intestinal nodules using tactile sensing. Four soft tactile sensors made out of piezoresistive material feed four channels of phase-shifted data that are processed using a particle filter. The particle filter uses an importance assignment template designed using experimental data from six known sizes of nodules. Moreover, the proposed capsule can use shape changes to passively move forward or backward under peristalsis, enabling it to reach any position in the intestine for detection. Experimental results show that the proposed capsule can detect nodules of more than 3 mm in diameter with 100% accuracy.
|
|
10:25-10:30, Paper TuAT8.7 | Add to My Program |
Exploring the Limitations and Implications of the JIGSAWS Dataset for Robot-Assisted Surgery |
|
Hendricks, Antonio | Univeristy of Florida |
Panoff, Maximillian | University of Florida |
Xiao, Kaiwen | University of Florida |
Wang, Zhaoqi | University of Florida |
Wang, Shuo | University of Florida |
Bobda, Christophe | University of Florida |
Keywords: Surgical Robotics: Laparoscopy, Medical Robots and Systems, Performance Evaluation and Benchmarking
Abstract: The JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) dataset has proven to be a foundational component of modern work on the skill analysis of robotic surgeons. In particular, methods using either the system's kinematics or video data have shown to be able to classify operators into distinct experience levels, and recent approaches have even ventured to recover numeric skill ratings assigned to assessment sessions. Although prior works have achieved positive results in these directions, challenges still remain with classification across all three levels of operator training amounts and objective skill rating regressions. To this end, we perform the first statistical analysis of the dataset itself and compile the results here. We find limited relationships between the amount of experience or training of an operator and their performance in JIGSAWS. Moreover, as operator-side kinematics have well-known relationships with their skill, previous works have used both robot and operator-side kinematics to classify operator skill; we find the first explicit relationships between pure robot-side kinematics and surgical performance. Finally, we analyze the robotic kinematic trends associated with high performance in JIGSAWS tasks and present how they may be used as indicators in human and automated surgeon training.
|
|
TuAT9 Regular Session, 312 |
Add to My Program |
Motion Planning 1 |
|
|
Chair: Alonso-Mora, Javier | Delft University of Technology |
Co-Chair: Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
|
09:55-10:00, Paper TuAT9.1 | Add to My Program |
Path Planning Using Instruction-Guided Probabilistic Roadmaps |
|
Bao, Jiaqi | CyberAgent |
Yonetani, Ryo | CyberAgent |
Keywords: Integrated Planning and Learning, Motion and Path Planning, Autonomous Vehicle Navigation
Abstract: This work presents a novel data-driven path planning algorithm named Instruction-Guided Probabilistic Roadmap (IG-PRM). Despite the recent development and widespread use of mobile robot navigation, safe and effective travel of mobile robots still requires significant engineering effort to account for the constraints of robots and their tasks. With IG-PRM, we aim to address this problem by allowing robot operators to specify such constraints through natural language instructions, such as "aim for wider paths" or "mind small gaps". The key idea is to convert such instructions into embedding vectors using large language models (LLMs) and use the vectors as a condition to predict instruction-guided cost maps from occupancy maps. By constructing a roadmap based on the predicted costs, we can find instruction-guided paths via a standard shortest-path search. Experimental results demonstrate the effectiveness of our approach on both synthetic and real-world indoor navigation environments.
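A minimal sketch of the pipeline described above; embed_instruction and cost_model stand in for the paper's LLM embedding and learned cost-map predictor, points are assumed to be given in grid-cell coordinates (x, y), and collision checking is omitted for brevity.

    import numpy as np
    import networkx as nx

    def ig_prm_plan(occupancy, instruction, samples, radius,
                    embed_instruction, cost_model, start, goal):
        """Build a roadmap whose edge weights follow an instruction-conditioned
        cost map, then run a standard shortest-path query from start to goal."""
        z = embed_instruction(instruction)            # text -> embedding vector
        cost_map = cost_model(occupancy, z)           # per-cell traversal cost
        nodes = [np.asarray(p, float) for p in [start, goal] + list(samples)]
        G = nx.Graph()
        for i, p in enumerate(nodes):
            G.add_node(i, pos=p)
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                d = np.linalg.norm(nodes[i] - nodes[j])
                if d <= radius:
                    mid = ((nodes[i] + nodes[j]) / 2).astype(int)
                    G.add_edge(i, j, weight=d * cost_map[mid[1], mid[0]])
        idx = nx.shortest_path(G, 0, 1, weight="weight")   # 0 = start, 1 = goal
        return [nodes[i] for i in idx]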
|
|
10:00-10:05, Paper TuAT9.2 | Add to My Program |
Pushing through Clutter with Movability Awareness of Blocking Obstacles |
|
Weeda, Joris J. | TU Delft |
Bakker, Saray | Delft University of Technology |
Chen, Gang | Delft University of Technology |
Alonso-Mora, Javier | Delft University of Technology |
Keywords: Motion and Path Planning, Collision Avoidance, Integrated Planning and Control
Abstract: Navigation Among Movable Obstacles (NAMO) poses a challenge for traditional path-planning methods when obstacles block the path, requiring push actions to reach the goal. We propose a framework that enables movability-aware planning to overcome this challenge without relying on explicit obstacle placement. Our framework integrates a global Semantic Visibility Graph and a local Model Predictive Path Integral (SVG-MPPI) approach to efficiently sample rollouts, taking into account the continuous range of obstacle movability. A physics engine is adopted to simulate how the rollouts interact with the environment and to generate trajectories with lower contact forces. Qualitative and quantitative experiments suggest that SVG-MPPI outperforms an existing paradigm that uses only binary movability for planning, achieving higher success rates with reduced cumulative contact forces. Our code is available at: https://github.com/tud-amr/SVG-MPPI
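A minimal numpy sketch of the MPPI update at the core of such a planner; in the paper the rollout costs come from physics-engine simulations with movability-aware contact penalties, whereas here costs and noise are generic placeholders.

    import numpy as np

    def mppi_update(u_nominal, noise, costs, lam=1.0):
        """Exponentially weight K sampled rollouts by cost and fold their control
        perturbations (K x T x m) back into the nominal control sequence (T x m)."""
        w = np.exp(-(costs - costs.min()) / lam)
        w /= w.sum()
        return u_nominal + np.tensordot(w, noise, axes=(0, 0))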
|
|
10:05-10:10, Paper TuAT9.3 | Add to My Program |
Improving Efficiency in Path Planning: Tangent Line Decomposition Algorithm |
|
Tian, Yu | The Chinese University of Hong Kong |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: Motion and Path Planning, Collision Avoidance
Abstract: This paper introduces a tangent line decomposition (TLD) algorithm that efficiently finds near-optimal collision-free paths in both 2D and 3D environments. Compared with existing visibility-line-based algorithms, the proposed algorithm introduces the concept of tangent line decomposition, which decomposes complicated planning into many simple steps. In each step, only one key obstacle is taken into consideration. In addition, instead of constructing a complete graph, a best-first search algorithm is used to avoid searching redundant edges. The path planned by the algorithm is not the optimal path; however, following the idea of the Informed RRT* algorithm, the path length planned by TLD can be used as a precondition for other optimal algorithms. In this way, the overall efficiency can be significantly improved. Simulations show that the proposed method outperforms existing methods in planning efficiency and solution quality.
|
|
10:10-10:15, Paper TuAT9.4 | Add to My Program |
Gradient Guided Search for Aircraft Contingency Landing Planning |
|
Tekaslan, Huseyin Emre | Virginia Tech |
Atkins, Ella | University of Michigan |
Keywords: Motion and Path Planning, Aerial Systems: Applications, Aerial Systems: Perception and Autonomy
Abstract: This paper presents a three-dimensional discrete search path planner for fixed-wing aircraft emergency landing planning that manages state-space complexity by incorporating cost gradients to assure descent flight path angle and runway heading alignment constraints are met. Our approach incorporates steady wind and maximizes margin from flight envelope boundaries to accommodate wind variation in a manner commensurate with a loss of thrust condition. A novel multi-objective cost function that combines gradient-based path guidance and population risk metrics is implemented to efficiently enable discrete search to find a robust solution. The proposed method is demonstrated through use cases with population data for a region of Long Island, New York that highlight our algorithm's effectiveness.
|
|
10:15-10:20, Paper TuAT9.5 | Add to My Program |
Search-Based Path Planning in Interactive Environments among Movable Obstacles |
|
Ren, Zhongqiang | Shanghai Jiao Tong University |
Suvonov, Bunyod | Shanghai Jiao Tong University |
Chen, Guofei | Carnegie Mellon University |
He, Botao | University of Maryland |
Liao, Yijie | Shanghai Jiao Tong University |
Fermuller, Cornelia | University of Maryland |
Zhang, Ji | Carnegie Mellon University |
Keywords: Motion and Path Planning
Abstract: This paper investigates Path planning Among Movable Obstacles (PAMO), which seeks a minimum cost collision-free path among static obstacles from start to goal while allowing the robot to push away movable obstacles (i.e., objects) along its path when needed. To develop planners that are complete and optimal for PAMO, the planner has to search a giant state space involving both the location of the robot as well as the locations of the objects, which grows exponentially with respect to the number of objects. This paper leverages a simple yet under-explored idea that, only a small fraction of this giant state space needs to be searched during planning as guided by a heuristic, and most of the objects far away from the robot are intact, which thus leads to runtime efficient algorithms. Based on this idea, this paper introduces two PAMO formulations, i.e., bi-objective and resource constrained problems in an occupancy grid, and develops PAMO*, a planning method with completeness and solution optimality guarantees, to solve the two problems. We then further extend PAMO* to hybrid-state PAMO* to plan in continuous spaces with high-fidelity interaction between the robot and the objects. Our results show that, PAMO* can often find optimal solutions within a second in cluttered maps with up to 400 objects.
|
|
10:20-10:25, Paper TuAT9.6 | Add to My Program |
Neural Encodings for Energy-Efficient Motion Planning |
|
Shah, Deval | The University of British Columbia |
Zhao, Jocelyn | University of British Columbia |
Aamodt, Tor Michael | University of British Columbia |
Keywords: Motion and Path Planning, Energy and Environment-Aware Automation, Deep Learning Methods
Abstract: Neural motion planners can increase motion planning quality and, by reducing collision detection computations, improve runtime. However, when profiled on an accelerator-rich hardware system, neural planning contributes more than 50% of the runtime and 33% of the computation energy consumption, motivating the design of compute- and energy-efficient neural planners. In this work, we propose a neural planner using Binary Encoded Labels (BEL), where a set of binary classifiers is used instead of a typical regression network. Compared to conventional regression-based neural planners, the proposed BEL neural planner reduces neural planning (inference) computation and collision detection checks while maintaining an equal or higher motion planning success rate across various motion planning benchmarks. This computation reduction can improve the energy efficiency of neural planning by 1.4×-21.4×. Finally, we demonstrate the trade-offs between collision detection and neural planning computation to maximize energy efficiency for different hardware configurations.
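A minimal sketch of the binary-encoding idea: the continuous target is quantized and its index written as bits, so a bank of binary classifiers replaces a single regressor. The bit width and decoding rule here are illustrative assumptions, not the paper's exact scheme.

    import numpy as np

    def encode_bel(y, lo, hi, n_bits=8):
        """Quantize y in [lo, hi] and return its binary code as training targets."""
        y = np.asarray(y, dtype=float)
        levels = 2**n_bits - 1
        idx = np.clip(np.round((y - lo) / (hi - lo) * levels).astype(int), 0, levels)
        return ((idx[..., None] >> np.arange(n_bits)) & 1).astype(np.float32)

    def decode_bel(bit_probs, lo, hi):
        """Turn per-bit classifier outputs back into a continuous prediction."""
        n_bits = bit_probs.shape[-1]
        idx = ((bit_probs > 0.5).astype(int) * (2 ** np.arange(n_bits))).sum(axis=-1)
        return lo + idx / (2**n_bits - 1) * (hi - lo)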
|
|
10:25-10:30, Paper TuAT9.7 | Add to My Program |
Rigid Body Path Planning Using Mixed-Integer Linear Programming |
|
Yu, Mingxin | Massachusetts Institute of Technology |
Fan, Chuchu | Massachusetts Institute of Technology |
Keywords: Formal Methods in Robotics and Automation, Motion and Path Planning
Abstract: Navigating rigid body objects through crowded environments can be challenging, especially when narrow passages are present. Existing sampling-based planners and optimization-based methods, such as mixed-integer linear programming (MILP) formulations, suffer from limited scalability with respect to either the size of the workspace or the number of obstacles. To address the scalability issue, we propose a three-stage algorithm that first generates a graph of collision-free convex polytopes in the workspace, then poses a large set of small MILPs to generate viable paths between polytopes, and finally queries a pair of start and end configurations for a feasible path online. The graph of convex polytopes serves as a decomposition of the free workspace, and the number of decision variables in each MILP is limited by restricting each subproblem to two or three free polytopes rather than the entire free region. Our simulation results demonstrate shorter online computation time compared to baseline methods and better scaling with the size of the environment and tunnel width than sampling-based planners in both 2D and 3D environments.
|
|
TuAT10 Regular Session, 313 |
Add to My Program |
Multi-Robot Swarms 1 |
|
|
Chair: Lu, Qi | The University of Texas Rio Grande Valley |
Co-Chair: Hauert, Sabine | University of Bristol |
|
09:55-10:00, Paper TuAT10.1 | Add to My Program |
Strain-Coordinated Formation, Migration, and Encapsulation Behaviors in a Tethered Robot Collective |
|
Cutler, Sadie | Cornell University |
Ma, Danna | Cornell University |
Petersen, Kirstin Hagelskjaer | Cornell University |
Keywords: Distributed Robot Systems, Robust/Adaptive Control, Sensor-based Control
Abstract: Tethers are an underutilized tool in multi-robot systems: tethers can provide power, facilitate retrieval and sensing, and be used to manipulate and gather objects. Starting with the simplest possible configuration, our work explores how agents linked in series by flexible, passive, fixed-length tethers, can use those tethers as sensors to achieve distributed formation control. In this study, we extend upon previous work to show the applicability of strain-coordinated formation control for encapsulation and migration along a global gradient as well as the trade-offs between formation control and taxis in an obstacle-laden environment. Our results indicate significant potential for tethered robot collectives: versatile behaviors that can work on simple, resource-constrained robots or serve as a fallback mechanism in case more sophisticated means of coordination fail.
|
|
10:00-10:05, Paper TuAT10.2 | Add to My Program |
Deep Learning-Enhanced Visual Monitoring in Hazardous Underwater Environments with a Swarm of Micro-Robots |
|
Chen, Shuang | Durham University |
He, Yifeng | The University of Manchester |
Lennox, Barry | The University of Manchester |
Arvin, Farshad | Durham University |
Atapour-Abarghouei, Amir | Durham University |
Keywords: Robotics in Hazardous Fields, Deep Learning for Visual Perception, Data Sets for Robotic Vision
Abstract: Long-term monitoring and exploration of extreme environments, such as underwater storage facilities, is costly, labor-intensive, and hazardous. Automating this process with low-cost, collaborative robots can greatly improve efficiency. These robots capture images from different positions, which must be processed simultaneously to create a spatio-temporal model of the facility. In this paper, we propose a novel approach that integrates data simulation, a multi-modal deep learning network for coordinate prediction, and image reassembly to address the challenges posed by environmental disturbances causing drift and rotation in the robots’ positions and orientations. Our approach enhances the precision of alignment in noisy environments by integrating visual information from snapshots, global positional context from masks, and noisy coordinates. We validate our method through extensive experiments using synthetic data that simulate real-world robotic operations in underwater settings. The results demonstrate very high coordinate prediction accuracy and plausible image assembly, indicating the real-world applicability of our approach. The assembled images provide clear and coherent views of the underwater environment for effective monitoring and inspection, showcasing the potential for broader use in extreme settings, further contributing to improved safety, efficiency, and cost reduction in hazardous field monitoring.
|
|
10:05-10:10, Paper TuAT10.3 | Add to My Program |
CapBot: Enabling Battery-Free Swarm Robotics |
|
Liu, Mengyao | KU Leuven |
Deferme, Lowie | KU Leuven |
Van Eyck, Tom | KU Leuven |
Yang, Fan | KU Leuven |
Abadie, Alexandre | Inria |
Alvarado-Marin, Said | INRIA |
Maksimovic, Filip | INRIA |
Miyauchi, Genki | The University of Sheffield |
Jayakumar, Jessica | University of Sheffield |
Talamali, Mohamed S. | University of Sheffield |
Watteyne, Thomas | Inria |
Gross, Roderich | Technical University of Darmstadt |
Hughes, Danny | KU Leuven |
Keywords: Swarm Robotics, Embedded Systems for Robotic and Automation, Hardware-Software Integration in Robotics
Abstract: Swarm robotics focuses on designing and coordinating large groups of relatively simple robots to perform tasks in a decentralised and collective manner. The swarm provides a resilient and flexible solution for many applications. However, contemporary swarm robots have a significant power problem in that secondary (i.e. rechargeable) batteries are slow to charge and offer lifetimes of only a few years, increasing maintenance costs and pollution due to battery replacement. We imagine a different future, wherein battery-free robots powered by supercapacitors can be recharged in seconds, offer long-life autonomous operation and can rapidly pass charge between one another using trophallaxis. In pursuit of this vision, we contribute the CapBot, a battery-free swarm robot equipped with Mecanum wheels, a Cortex M4F application processor and Bluetooth Low Energy networking. The CapBot fully recharges in 16 s, offers 51 min of autonomous operation at top speed, and can transfer up to 50% of its available charge to a peer via trophallaxis in under 20 s. The CapBot is fully open source, and all software and hardware sources are available online.
|
|
10:10-10:15, Paper TuAT10.4 | Add to My Program |
Express Yourself: Enabling Large-Scale Public Events Involving Multi-Human-Swarm Interaction for Social Applications with MOSAIX |
|
Alhafnawi, Merihan | Princeton University |
Gomez-Gutierrez, Maca | We the Curios |
Hunt, Edmund Robert | University of Bristol |
Lemaignan, Séverin | PAL Robotics |
O'Dowd, Paul Jason | University of Bristol |
Hauert, Sabine | University of Bristol |
Keywords: Swarm Robotics, Social HRI, Art and Entertainment Robotics
Abstract: Robot swarms have the potential to help groups of people with social tasks, given their ability to scale to large numbers of robots and users. Developing multi-human-swarm interaction is therefore crucial to support multiple people interacting with the swarm simultaneously - which is an area that is scarcely researched, unlike single-human, single-robot or single-human, multi-robot interaction. Moreover, most robots are still confined to laboratory settings. In this paper, we present our work with MOSAIX, a swarm of robot Tiles, that facilitated ideation at a science museum. 63 robots were used as a swarm of smart sticky notes, collecting input from the public and aggregating it based on themes, providing an evolving visualization tool that engaged visitors and fostered their participation. Our contribution lies in creating a large-scale (63 robots and 294 attendees) public event, with a completely decentralized swarm system in real-life settings. We also discuss learnings we obtained that might help future researchers create multi-human-swarm interaction with the public.
|
|
10:15-10:20, Paper TuAT10.5 | Add to My Program |
MochiSwarm: A Testbed for Robotic Micro-Blimps in Realistic Environments |
|
Xu, Jiawei | Lehigh University |
Vu, Thong | Lehigh University |
S. D'Antonio, Diego | Lehigh University |
Saldaña, David | Lehigh University |
Keywords: Software-Hardware Integration for Robot Systems, Aerial Systems: Applications, Swarm Robotics
Abstract: Efficient energy management and scalability are critical for aerial robots in tasks such as pickup-and-delivery and surveillance. This paper introduces MochiSwarm, an open-source testbed of light-weight micro robotic blimps designed for multi-robot operation without external localization. We propose a modular system architecture that integrates adaptable hardware, a flexible software framework, and a detachable perception module. The hardware is designed to allow for rapid modifications and sensor integration, while the software supports multiple actuation models and robust communication between a base station and multiple blimps. We showcase a differential-drive module as an example, in which autonomy is enabled by visual servoing using the perception module. A case study of pickup-and-delivery tasks with up to 12 blimps highlights the autonomy of the MochiSwarm without relying on external infrastructures.
|
|
10:20-10:25, Paper TuAT10.6 | Add to My Program |
Exploring Unstructured Environments Using Minimal Sensing on Cooperative Nano-Drones |
|
Arias-Perez, Pedro | Universidad Politécnica De Madrid |
Gautam, Alvika | Texas a & M University |
Fernandez-Cortizas, Miguel | Universidad Politécnica De Madrid |
Perez Saura, David | Computer Vision and Aerial Robotics Group (CVAR), Universidad Politécnica de Madrid |
Saripalli, Srikanth | Texas A&M |
Campoy, Pascual | Computer Vision & Aerial Robotics Group, Universidad Politécnica de Madrid |
Keywords: Aerial Systems: Perception and Autonomy, Micro/Nano Robots, Multi-Robot Systems
Abstract: Recent advances have improved autonomous navigation and mapping under payload constraints, but current multi-robot inspection algorithms are unsuitable for nano-drones, due to their need for heavy sensors and high computational resources. To address these challenges, we introduce ExploreBug, a novel hybrid frontier range-bug algorithm designed to handle limited sensing capabilities for a swarm of nano-drones. This system includes three primary components: a mapping subsystem, an exploration subsystem, and a navigation subsystem. Additionally, an intra-swarm collision avoidance system is integrated to prevent collisions between drones. We validate the efficacy of our approach through extensive simulations and real-world exploration experiments, involving up to seven drones in simulations and three in real-world settings, across various obstacle configurations and with a maximum navigation speed of 0.75 m/s. Our tests prove that the algorithm efficiently completes exploration tasks, even with minimal sensing, across different swarm sizes and obstacle densities. Furthermore, our frontier allocation heuristic ensures an equal distribution of explored areas and paths traveled by each drone in the swarm. We publicly release the source code of the proposed system to foster further developments in mapping and exploration using autonomous nano drones.
|
|
10:25-10:30, Paper TuAT10.7 | Add to My Program |
Continuous Sculpting: Persistent Swarm Shape Formation Adaptable to Local Environmental Changes |
|
Curtis, Andrew | Northwestern |
Yim, Mark | University of Pennsylvania |
Rubenstein, Michael | Northwestern University |
Keywords: Swarms, Path Planning for Multiple Mobile Robots or Agents, Distributed Robot Systems, Shape Formation
Abstract: Despite their growing popularity, swarms of robots remain limited by the operating time of each individual. We present algorithms which allow a human to sculpt a swarm of robots into a shape that persists in space perpetually, independent of onboard energy constraints such as batteries. Robots generate a path through a shape such that robots cycle in and out of the shape. Robots inside the shape react to human initiated changes and adapt the path through the shape accordingly. Robots outside the shape recharge and return to the shape so that the shape can persist indefinitely. The presented algorithms communicate shape changes throughout the swarm using message passing and robot motion. These algorithms enable the swarm to persist through any arbitrary changes to the shape. We describe these algorithms in detail and present their performance in simulation and on a swarm of mobile robots. The result is a swarm behavior more suitable for extended duration, dynamic shape-based tasks in applications such as entertainment, agriculture, and emergency response.
|
|
TuAT11 Regular Session, 314 |
Add to My Program |
Calibration 1 |
|
|
Chair: Mueller, Andreas | Johannes Kepler University |
Co-Chair: Lee, Min Cheol | Pusan National University |
|
09:55-10:00, Paper TuAT11.1 | Add to My Program |
Kinematic Calibration of a Redundant Robot in Closed-Loop System Using Indicated Competitive Swarm Method |
|
Kim, Jaehyung | Pusan National Univ |
Lee, Min Cheol | Pusan National University |
Keywords: Calibration and Identification, Redundant Robots, Kinematics
Abstract: Previous calibration techniques often relied on specialized end-effector tracking devices, such as laser trackers, which can be expensive and impractical in specific environments. Furthermore, research on the calibration of redundant manipulators has been relatively scarce compared to that on non-redundant counterparts. To overcome these limitations, this article introduces a novel method for kinematic calibration of a damaged redundant serial robot, employing indicated competitive swarm optimization with a finite-screw deviation model. The proposed kinematic calibration method utilizes a kinematic closed-loop approach, which identifies axis deviations without using expensive end-effector tracking equipment. Moreover, a competitive-swarm-inspired optimization model is introduced to efficiently identify axis deviations, significantly reducing the required calibration points compared to prior studies and thereby facilitating calibration for redundant manipulators. Both simulations and experiments were conducted to validate the proposed method using a seven-degree-of-freedom redundant serial robot. The results demonstrate the proposed calibration method's effectiveness and practicality, and it can be readily applied to redundant robot calibration.
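For orientation, the sketch below shows one generation of a standard competitive swarm optimizer, in which particles are paired and each loser learns from its winner and the swarm mean; the paper's indicated variant and its finite-screw deviation model are not reproduced, and fitness is a placeholder for the calibration residual.

    import numpy as np

    def cso_step(X, V, fitness, phi=0.1, rng=np.random):
        """One generation of competitive swarm optimization over particles X (n x d)
        with velocities V; returns the updated swarm (minimization)."""
        n, d = X.shape
        f = fitness(X)                       # fitness of every particle
        mean = X.mean(axis=0)
        order = rng.permutation(n)
        for a, b in zip(order[0::2], order[1::2]):
            w, l = (a, b) if f[a] <= f[b] else (b, a)
            r1, r2, r3 = rng.rand(d), rng.rand(d), rng.rand(d)
            V[l] = r1 * V[l] + r2 * (X[w] - X[l]) + phi * r3 * (mean - X[l])
            X[l] = X[l] + V[l]
        return X, V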
|
|
10:00-10:05, Paper TuAT11.2 | Add to My Program |
KFCalibNet: A KansFormer-Based Self-Calibration Network for Camera and LiDAR |
|
Xu, Zejing | Traffic Control Technology Co., Ltd |
Liu, Yiqing | University of Birmingham |
Gao, Ruipeng | Beijing Jiaotong University |
Tao, Dan | Beijing Jiaotong University |
Qi, Peng | Beijing Jiaotong University |
Zhao, Ning | University of Birmingham |
Fu, Zhe | Traffic Control Technology Co., Ltd., Beijing, China |
Keywords: Calibration and Identification, Sensor Fusion, Deep Learning Methods
Abstract: In autonomous driving and robotic navigation, multi-sensor fusion technology has become increasingly mainstream, with precise sensor calibration as its foundation. Traditional calibration methods rely on manual effort or specific targets, limiting adaptability to complex environments. Learning-based calibration methods still face challenges, such as insufficient overlap between the fields of view (FoV) of multiple sensors and suboptimal cross-modal feature association, which hinder accurate parameter regression. Unlike traditional CNN-based networks, we propose a KansFormer-based self-Calibration Network for camera and LiDAR (KFCalibNet) that replaces fixed activation functions and linear transformations with learnable nonlinear activation functions. This enables the extraction of more fine-grained features from both image and point cloud, significantly enhancing the network's robustness in scenarios with limited FoV overlap. We also employ a multihead attention (MHA) module to compute correlations between image and point cloud features, significantly enhancing cross-modal feature association. To reduce learning complexity, we designed KansFormer with FastKAN as the feedforward network, enabling deep fusion and regression of fine-grained cross-modal features for accurate extrinsic calibration. KFCalibNet achieves an absolute average calibration error of 0.0965 cm in translation and 0.0234° in rotation on the KITTI Odometry dataset, outperforming existing state-of-the-art calibration methods. Moreover, its accuracy and generalization capability have been validated across multiple real-world railway lines.
|
|
10:05-10:10, Paper TuAT11.3 | Add to My Program |
Inducing Matrix Sparsity Bias for Improved Dynamic Identification of Parallel Kinematic Manipulators Using Deep Learning |
|
Lahoud, Marcel | Italian Institute of Technology |
Gnad, Daniel | Johannes Kepler University Linz |
Marchello, Gabriele | Istituto Italiano Di Tecnologia |
D'Imperio, Mariapaola | Istituto Italiano Di Tecnologia |
Mueller, Andreas | Johannes Kepler University |
Cannella, Ferdinando | Istituto Italiano Di Tecnologia |
Keywords: Dynamics, Deep Learning Methods, Calibration and Identification
Abstract: Among the many challenges of parallel kinematic manipulators, achieving high-speed and accurate control remains crucial. Estimating their dynamic properties is essential for designing precise and efficient control schemes. Conventional methods for dynamic model identification have been effective, though deep learning approaches have historically faced limitations due to data inefficiencies. However, recent advancements in physics-informed neural networks (PINNs) offer a way to improve both control and the extraction of interpretable physical properties from these robots. In this work, we propose and validate a PINN-based dynamic model for a Delta parallel robot, specifically the ABB IRB 360-6/1600. Our approach incorporates known physical properties, such as mass matrix sparsity, to improve accuracy and computational efficiency in dynamic model identification. To the best of our knowledge, this is the first study applying PINNs to model parallel robots. The method is validated experimentally, and its performance is compared to a validated identification technique for physically consistent identification, demonstrating the effectiveness of this approach for real-world applications in parallel robots.
|
|
10:10-10:15, Paper TuAT11.4 | Add to My Program |
Infield Self-Calibration of Intrinsic Parameters for Two Rigidly Connected IMUs |
|
Huang, Can | XREAL, Inc |
Lai, Wenqian | XREAL |
Guo, Ruonan | XREAL |
Wu, Kejian | XREAL |
Keywords: Calibration and Identification, Sensor Fusion, Localization
Abstract: This paper presents a study on the infield self-calibration of two rigidly connected IMUs' intrinsic parameters, without the aid of any external sensors, equipment, or specialized procedures. Specifically, we consider the calibration of gyroscope biases, gyroscope scale factors, and accelerometer biases, using only IMU data and known extrinsics between the two IMUs. We focus on the observability analysis of this system, and show that all gyroscope intrinsic parameters and a portion of accelerometer biases are observable, with information from both IMUs and sufficient motion. Moreover, we identify the additional unobservable directions in the intrinsic parameters that arise from various degenerate motions. Finally, we validate our observability findings through numerical simulations, and assess our system's calibration accuracy using real-world data.
|
|
10:15-10:20, Paper TuAT11.5 | Add to My Program |
PlaneHEC: Efficient Hand-Eye Calibration for Multi-View Robotic Arm Via Any Point Cloud Plane Detection |
|
Wang, Ye | Xi'an Jiaotong University |
Jing, Haodong | Xi'an Jiaotong University |
Liao, Yang | Xi'an Jiaotong University |
Ma, Yongqiang | Xi'an Jiaotong University |
Zheng, Nanning | Xi'an Jiaotong University |
Keywords: Calibration and Identification, RGB-D Perception, Perception for Grasping and Manipulation
Abstract: Hand-eye calibration is an important task in vision-guided robotic systems and is crucial for determining the transformation matrix between the camera coordinate system and the robot end-effector. Existing methods for multi-view robotic systems usually rely on accurate geometric models or manual assistance, generalize poorly, and can be very complicated and inefficient. Therefore, in this study, we propose PlaneHEC, a generalized hand-eye calibration method that does not require complex models and can be accomplished using only depth cameras, which achieves optimal and fast calibration results using arbitrary planar surfaces such as walls and tables. PlaneHEC introduces hand-eye calibration equations based on planar constraints, which makes it strongly interpretable and generalizable. PlaneHEC also uses a comprehensive solution that starts with a closed-form solution and improves it with iterative optimization, which greatly improves accuracy. We comprehensively evaluated the performance of PlaneHEC in both simulated and real-world environments and compared the results with other point-cloud-based calibration methods, proving its superiority. Our approach achieves universal and fast calibration with an innovative design of computational models, providing a strong contribution to the development of multi-agent systems and embodied intelligence.
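A minimal sketch of the planar constraint that drives such a calibration: depth points sampled on one physical plane, observed from several robot poses, must become coplanar once mapped through the candidate hand-eye transform into the robot base frame. The paper's closed-form initialization and full formulation are not reproduced here; the residual below is only one way to express the constraint.

    import numpy as np
    from scipy.spatial.transform import Rotation as R
    from scipy.optimize import least_squares

    def plane_residuals(x, robot_poses, cam_points):
        """x = [rotvec (3), translation (3)] encodes the camera-to-end-effector
        transform; residuals are point-to-plane distances in the base frame."""
        Rx, tx = R.from_rotvec(x[:3]).as_matrix(), x[3:]
        pts = []
        for (Rb, tb), P in zip(robot_poses, cam_points):   # base<-ee pose, (N,3) points
            P_ee = P @ Rx.T + tx                           # camera frame -> end-effector frame
            pts.append(P_ee @ Rb.T + tb)                   # end-effector frame -> base frame
        pts = np.vstack(pts)
        centered = pts - pts.mean(axis=0)
        normal = np.linalg.svd(centered, full_matrices=False)[2][-1]  # best-fit plane normal
        return centered @ normal

    # sol = least_squares(plane_residuals, x0, args=(robot_poses, cam_points))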
|
|
10:20-10:25, Paper TuAT11.6 | Add to My Program |
Bayesian Optimal Experimental Design for Robot Kinematic Calibration |
|
Das, Ersin | Caltech |
Touma, Thomas | Caltech |
Burdick, Joel | California Institute of Technology |
Keywords: Calibration and Identification, Kinematics
Abstract: This paper develops a Bayesian optimal experimental design for robot kinematic calibration on $\mathbb{S}^3 \times \mathbb{R}^3$. Our method builds upon a Gaussian process approach that incorporates a geometry-aware kernel based on Riemannian Matérn kernels over $\mathbb{S}^3$. To learn the forward kinematics errors via Bayesian optimization with a Gaussian process, we define a geodesic distance-based objective function. Pointwise values of this function are sampled via noisy measurements taken with a camera using fiducial markers on the end-effector and the pose computed with the nominal kinematics. The corrected Denavit-Hartenberg parameters are obtained using an efficient quadratic program that operates on the collected data sets. The effectiveness of the proposed method is demonstrated via simulations and calibration experiments on NASA's Ocean Worlds Lander Autonomy Testbed (OWLAT).
|
|
10:25-10:30, Paper TuAT11.7 | Add to My Program |
Automatic Target-Less Camera-LiDAR Calibration from Motion and Deep Point Correspondences |
|
Petek, Kürsat | University of Freiburg |
Vödisch, Niclas | University of Freiburg |
Meyer, Johannes | University of Freiburg |
Cattaneo, Daniele | University of Freiburg |
Valada, Abhinav | University of Freiburg |
Burgard, Wolfram | University of Technology Nuremberg |
Keywords: Calibration and Identification, Deep Learning Methods, Sensor Fusion
Abstract: Sensor setups of robotic platforms commonly include both camera and LiDAR as they provide complementary information. However, fusing these two modalities typically requires a highly accurate calibration between them. In this paper, we propose MDPCalib which is a novel method for camera-LiDAR calibration that requires neither human supervision nor any specific target objects. Instead, we utilize sensor motion estimates from visual and LiDAR odometry as well as deep learning-based 2D-pixel-to-3D-point correspondences that are obtained without in-domain retraining. We represent camera-LiDAR calibration as an optimization problem and minimize the costs induced by constraints from sensor motion and point correspondences. In extensive experiments, we demonstrate that our approach yields highly accurate extrinsic calibration parameters and is robust to random initialization. Additionally, our approach generalizes to a wide range of sensor setups, which we demonstrate by employing it on various robotic platforms including a self-driving perception car, a quadruped robot, and a UAV. To make our calibration method publicly accessible, we release the code on our project website at https://calibration.cs.uni-freiburg.de.
|
|
TuAT12 Regular Session, 315 |
Add to My Program |
Identifcation and Estimation for Legged Robots |
|
|
Chair: Boularias, Abdeslam | Rutgers University |
Co-Chair: Bekris, Kostas E. | Rutgers, the State University of New Jersey |
|
09:55-10:00, Paper TuAT12.1 | Add to My Program |
Legged Robot State Estimation with Invariant Extended Kalman Filter Using Neural Measurement Network |
|
Youm, Donghoon | Korea Advanced Institute of Science and Technology |
Oh, Hyunsik | Korea Advanced Institute of Science and Technology |
Choi, Suyoung | KAIST |
Kim, HyeongJun | Korea Advanced Institute of Science and Technology |
Jeon, Seunghun | KAIST |
Hwangbo, Jemin | Korean Advanced Institute of Science and Technology |
Keywords: Legged Robots, Deep Learning Methods
Abstract: This paper introduces a novel proprioceptive state estimator for legged robots that combines model-based filters with deep neural networks. In environments where vision systems are not reliable, proprioceptive state estimators become indispensable. Traditionally, proprioceptive state estimators are based on model-based approaches, which rely solely on contact foot kinematics as measurements. In contrast, learning-based approaches have obtained new measurements, such as displacement and covariance, by leveraging real-world data in a supervised manner. In this work, we develop a state estimation framework that trains a neural measurement network (NMN) to estimate the base's linear velocity and foot contact probability, which are then employed as measurements in an invariant extended Kalman filter. Our approach relies solely on simulation data for training, as it allows us to obtain extensive data easily. We address the sim-to-real gap by adapting existing learning techniques and regularization. To validate our proposed method, we conduct hardware experiments using a quadruped robot on four types of terrain: flat, debris, soft, and slippery. In our experiments, the proposed method demonstrates significant improvements over the model-based state estimator, achieving an average reduction in Absolute Trajectory Error (ATE) by 61.8% for position and 8.5% for velocity.
|
|
10:00-10:05, Paper TuAT12.2 | Add to My Program |
Physically-Consistent Parameter Identification of Robots in Contact |
|
Khorshidi, Shahram | University of Bonn |
Elnagdi, Murad | University of Bonn |
Nederkorn, Benno | Technical University of Munich |
Bennewitz, Maren | University of Bonn |
Khadiv, Majid | Technical University of Munich |
Keywords: Legged Robots, Model Learning for Control, Calibration and Identification
Abstract: Accurate inertial parameter identification is crucial for the simulation and control of robots encountering intermittent contacts with the environment. Classically, robots' inertial parameters are obtained from CAD models that are not precise (and sometimes not available, e.g., Spot from Boston Dynamics), hence requiring identification. To do that, existing methods require access to contact force measurement, a modality not present in modern quadruped and humanoid robots. This paper presents an alternative technique that utilizes joint current/torque measurements —a standard sensing modality in modern robots— to identify inertial parameters without requiring direct contact force measurements. By projecting the whole-body dynamics into the null space of contact constraints, we eliminate the dependency on contact forces and reformulate the identification problem as a linear matrix inequality that can handle physical and geometrical constraints. We compare our proposed method against a common black-box identification method using a deep neural network and show that incorporating physical consistency significantly improves the sample efficiency and generalizability of the model. Finally, we validate our method on the Spot quadruped robot across various locomotion tasks, showcasing its accuracy and generalizability in real-world scenarios over different gaits.
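A minimal sketch of the null-space projection idea described above, assuming the dynamics are available in a standard regressor form linear in the inertial parameters. It uses plain least squares rather than the paper's linear-matrix-inequality formulation with physical-consistency constraints, and all names are illustrative.

```python
import numpy as np
from scipy.linalg import null_space

def projected_identification(Y_list, tau_list, Jc_list, S):
    """Stack null-space-projected regressor equations and solve for theta.

    Y_list   : list of (nv, p) regressor matrices, linear in inertial params
    tau_list : list of (na,) joint torques (e.g., mapped from motor currents)
    Jc_list  : list of (nc, nv) contact Jacobians at each sample
    S        : (na, nv) actuation selection matrix
    """
    A_rows, b_rows = [], []
    for Y, tau, Jc in zip(Y_list, tau_list, Jc_list):
        N = null_space(Jc)             # columns span the null space of Jc
        P = N.T                        # projector that eliminates Jc^T * F_c
        A_rows.append(P @ Y)
        b_rows.append(P @ (S.T @ tau))
    A = np.vstack(A_rows)
    b = np.concatenate(b_rows)
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta
```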
|
|
10:05-10:10, Paper TuAT12.3 | Add to My Program |
Contact Force Estimation for a Leg-Wheel Transformable Robot with Varying Contact Points |
|
Shen, Yi-Syuan | National Taiwan University |
Yu, Wei-Shun | National Taiwan University |
Lin, Pei-Chun | National Taiwan University |
Keywords: Legged Robots, Dynamics, Multi-Contact Whole-Body Motion Planning and Control
Abstract: Accurate estimation of contact forces is crucial for the effective control of quadrupedal robots, especially in complex locomotion scenarios. In this paper, we introduce a novel force estimation technique for robots equipped with transformable leg-wheels. Unlike conventional methods that focus on forces at specific contact points, our approach expresses varying contact points through a simplified kinematic model and derives the corresponding Jacobian matrices. This allows us to apply the virtual work method to evaluate contact forces across the entire surface of the leg-wheel, including the tips, sides, and other contact regions. This adaptability is particularly advantageous in hybrid locomotion modes, where different parts of the leg-wheel interact with the terrain. The proposed method is highly efficient, relying solely on motor current and position feedback without the need for additional sensors. We validate our approach through simulations and real-world experiments, demonstrating its accuracy, robustness, and applicability under diverse operational conditions.
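For a rigid model, the virtual work relation mentioned above reduces to solving tau = J^T F for the contact force. A small illustrative sketch, with hypothetical torque-constant and friction terms:

```python
import numpy as np

def estimate_contact_force(J_contact, motor_current, kt, tau_friction=0.0):
    """Estimate the contact force from joint torques via the virtual work principle.

    J_contact     : (m, n) Jacobian of the assumed contact point on the leg-wheel
    motor_current : (n,) measured motor currents
    kt            : (n,) torque constants mapping current to joint torque
    tau_friction  : modeled friction torque to subtract (scalar or (n,))
    """
    tau = kt * motor_current - tau_friction      # joint torques from current feedback
    # Virtual work: tau = J^T F  ->  solve for F in the least-squares sense
    F, *_ = np.linalg.lstsq(J_contact.T, tau, rcond=None)
    return F
```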
|
|
10:10-10:15, Paper TuAT12.4 | Add to My Program |
Simultaneous Collision Detection and Force Estimation for Dynamic Quadrupedal Locomotion |
|
Zhou, Ziyi | Georgia Institute of Technology |
Di Cairano, Stefano | Mitsubishi Electric Research Laboratories |
Wang, Yebin | Mitsubishi Electric Research Laboratories |
Berntorp, Karl | Mitsubishi Electric Research Labs |
Keywords: Legged Robots, Motion Control
Abstract: In this paper, we address the simultaneous collision detection and force estimation problem for quadrupedal locomotion using joint encoder information and the robot dynamics only. We design an interacting multiple-model Kalman filter (IMM-KF) that estimates the external force exerted on the robot and multiple possible contact modes. The method is invariant to any gait pattern design. Our approach leverages pseudo-measurement information of the external forces based on the robot dynamics and encoder information. Based on the estimated contact mode and external force, we design a reflex motion and an admittance controller for the swing leg to avoid collisions by adjusting the leg’s reference motion. Additionally, we implement a force-adaptive model predictive controller to enhance balancing. Simulation ablation studies and experiments show the efficacy of the approach.
|
|
10:15-10:20, Paper TuAT12.5 | Add to My Program |
PROBE: Proprioceptive Obstacle Detection and Estimation While Navigating in Clutter |
|
Metha Ramesh, Dhruv | Rutgers University |
Sivaramakrishnan, Aravind | Amazon Fulfillment Technology & Robotics |
Keskar, Shreesh | Rutgers University |
Bekris, Kostas E. | Rutgers, the State University of New Jersey |
Yu, Jingjin | Rutgers University |
Boularias, Abdeslam | Rutgers University |
Keywords: Legged Robots, Sensorimotor Learning, Mapping
Abstract: In critical applications, including search-and-rescue in degraded environments, blockages can be prevalent and prevent the effective deployment of certain sensing modalities, particularly vision, due to occlusion and the constrained range of view of onboard camera sensors. To enable robots to tackle these challenges, we propose a new approach, Proprioceptive Obstacle Detection and Estimation while navigating in clutter (PROBE), which instead relies only on the robot’s proprioception to infer the presence or absence of occluded rectangular obstacles while predicting their dimensions and poses in SE(2). The proposed approach is a Transformer neural network that receives as input a history of applied torques and sensed whole-body movements of the robot and returns a parameterized representation of the obstacles in the environment. The effectiveness of PROBE is evaluated on simulated environments in Isaac Gym and with a real Unitree Go1 quadruped robot. The project webpage can be found at https://dhruvmetha.github.io/legged-probe/
|
|
10:20-10:25, Paper TuAT12.6 | Add to My Program |
Fast Decentralized State Estimation for Legged Robot Locomotion Via EKF and MHE |
|
Xiong, Xiaobin | University of Wisconsin Madison |
Kang, Jiarong | University of Wisconsin Madison |
Wang, Yi | Columbia University |
Keywords: Legged Robots, Sensor Fusion
Abstract: In this paper, we present a fast and decentralized state estimation framework for the control of legged locomotion. The nonlinear estimation of the floating base states is decentralized to an orientation estimation via Extended Kalman Filter (EKF) and a linear velocity estimation via Moving Horizon Estimation (MHE). The EKF fuses the inertial sensor with vision to estimate the floating base orientation. The MHE uses the estimated orientation with all the sensors within a time window in the past to estimate the linear velocities based on a time-varying linear dynamics formulation of the states of interest with state constraints. More importantly, a marginalization method based on the optimization structure of the full information filter (FIF) is proposed to convert the equality-constrained FIF to an equivalent MHE. This decoupling of state estimation promotes the desired balance of computation efficiency, accuracy of estimation, and the inclusion of state constraints. The proposed method is shown to be capable of providing accurate state estimation for several legged robots, including the highly dynamic hopping robot PogoX, the bipedal robot Cassie, and the quadrupedal robot Unitree Go1, at a frequency of 200 Hz with a window interval of 0.1 s.
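To give a feel for the MHE step, here is a sketch of an unconstrained sliding-window least-squares problem for the linear velocity. The paper's formulation additionally handles state constraints and an FIF-based marginalization (arrival cost), so this is only a simplified stand-in with illustrative weights.

```python
import numpy as np

def mhe_velocity(a_world, v_meas, dt, w_dyn=1.0, w_meas=1.0):
    """Unconstrained sliding-window least squares for base linear velocity.

    a_world : (N, 3) gravity-compensated accelerations in the world frame
    v_meas  : (N + 1, 3) velocity pseudo-measurements (e.g., leg odometry)
    Decision variables are v_0 ... v_N, stacked as one vector of length 3*(N+1).
    """
    N = a_world.shape[0]
    n = 3 * (N + 1)
    rows, rhs = [], []
    for k in range(N):                       # dynamics: v_{k+1} - v_k = a_k * dt
        M = np.zeros((3, n))
        M[:, 3 * k:3 * k + 3] = -np.eye(3)
        M[:, 3 * (k + 1):3 * (k + 1) + 3] = np.eye(3)
        rows.append(w_dyn * M)
        rhs.append(w_dyn * a_world[k] * dt)
    for k in range(N + 1):                   # measurements: v_k ~ v_meas[k]
        M = np.zeros((3, n))
        M[:, 3 * k:3 * k + 3] = np.eye(3)
        rows.append(w_meas * M)
        rhs.append(w_meas * v_meas[k])
    sol, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
    return sol.reshape(N + 1, 3)[-1]         # most recent velocity estimate
```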
|
|
TuAT13 Regular Session, 316 |
Add to My Program |
Assistive Robotics 1 |
|
|
Chair: Cabrera, Maria Eugenia | University of Massachusetts Lowell |
Co-Chair: Xiao, Chenzhang | University of Illinois at Urbana-Champaign |
|
09:55-10:00, Paper TuAT13.1 | Add to My Program |
Elderly Bodily Assistance Robot (E-BAR): A Robot System for Body-Weight Support, Ambulation Assistance, and Fall Catching, without the Use of a Harness |
|
Bolli, Roberto | MIT |
Asada, Harry | MIT |
Keywords: Physically Assistive Devices, Domestic Robotics, Mechanism Design
Abstract: As over 11,000 people turn 65 each day in the U.S., our country, like many others, is facing growing challenges in caring for elderly persons, further exacerbated by a major shortfall of care workers. To address this, we introduce an eldercare robot (E-BAR) capable of lifting a human body, assisting with postural changes/ambulation, and catching a user during a fall, all without the use of any wearable device or harness. Our robot is the first to integrate these 3 tasks, and is capable of lifting the full weight of a human outside of the robot’s base of support (across gaps and obstacles). In developing E-BAR, we interviewed nurses and care professionals and conducted user-experience tests with elderly persons. Based on their functional requirements, the design parameters were optimized using a computational model and trade-off analysis. We developed a novel 18-bar linkage to lift a person from a floor to a standing position along a natural trajectory, while providing maximal mechanical advantage at key points. An omnidirectional, non-holonomic drive base, in which the wheels could be oriented to passively maximize floor grip, enabled the robot to resist lateral forces without active compensation. With a minimum width of 38 cm, the robot’s small footprint allowed it to navigate the typical home environment. Four airbags were used to catch and stabilize a user during a fall in less than 250 ms. We demonstrate E-BAR's utility in multiple typical home scenarios, including getting into/out of a bathtub, bending to reach for objects, sit-to-stand transitions, and ambulation.
|
|
10:00-10:05, Paper TuAT13.2 | Add to My Program |
A Cane-Mounted System for Dynamic Orientation Prediction for Correcting Incorrect Cane-Tapping by Visually Challenged Persons |
|
Singh, Gagandeep | Yardi School of Artificial Intelligence, Indian Institute of Tec |
Nadir, Mohd | Indian Institute of Technology |
Chanana, Piyush | School of Information Technology, Indian Institute of Technology |
Paul, Rohan | Indian Institute of Technology Delhi |
Keywords: Wearable Robotics, AI-Based Methods, Health Care Management
Abstract: People with visual impairments rely on Electronic Travel Aids (ETAs), such as sensor-equipped guide canes, for safe and effective navigation. Misalignment or improper handling of these devices can reduce their effectiveness, increasing the risk of collisions and injuries. This paper presents an AI-based embedded system designed to predict and correct the orientation of a guide cane in real time. By integrating an Inertial Measurement Unit (IMU) with a neural network, the system continuously monitors the cane's lateral angle and orientation while providing feedback to help the user self-correct. The feedback is proportional to the degree of error, guiding users to maintain proper cane positioning during mobility. The device logs data that can be visualized remotely, offering mobility trainers valuable insights into the user's navigation patterns. Evaluation by visually impaired users demonstrated that the system effectively aided in real-time orientation correction. This system represents a significant advancement in improving the safety and independence of individuals with visual impairments through wearable ETAs.
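A toy sketch of error-proportional feedback of the kind described above; the deadband, gain, and output scaling are invented for illustration and are not the authors' tuned values.

```python
def cane_feedback(predicted_angle_deg, target_angle_deg=0.0,
                  deadband_deg=5.0, gain=0.02):
    """Map the predicted lateral cane angle error to a feedback level in [0, 1]."""
    error = predicted_angle_deg - target_angle_deg
    if abs(error) <= deadband_deg:
        return 0.0                            # within tolerance: no feedback
    level = gain * (abs(error) - deadband_deg)
    return min(level, 1.0)                    # intensity proportional to error
```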
|
|
10:05-10:10, Paper TuAT13.3 | Add to My Program |
SRL-Gym: A Morphology and Controller Co-Optimization Framework for Supernumerary Robotic Limbs in Load-Bearing Locomotion |
|
Meng, Lingyi | University of Chinese Academy of Sciences |
Zheng, Enhao | Institute of Automation, Chinese Academy of Sciences |
Li, Xiong | Tencent |
Zhang, Zhong | City University of Hong Kong |
Keywords: Wearable Robotics, Physically Assistive Devices, Human and Humanoid Motion Analysis and Synthesis
Abstract: Supernumerary Robotic Limbs (SRLs) can assist human motions by providing extra degrees of freedom (DoFs) and body support. The extra DoFs lead to a larger design space in structure and control policies, which is complex and time-consuming to explore with the traditional manual design process. In this pilot study, we proposed a novel morphology-controller co-optimization framework to automatically generate and optimize the SRL structure based on the locomotion task input. There are two layers, with the inner layer optimizing the controller to achieve human-robot synchronization, and the outer layer optimizing the morphology parameters for performance enhancement. We validated the proposed framework through simulations using SRLs in a load-bearing locomotion task. The results demonstrate that the controller optimization can automatically generate realistic gait patterns and stable human-robot synchronization, while the SRLs significantly improve the user’s load-bearing capability. Additionally, the co-optimization process reduces both the manufacturing cost of the SRL and the torque on the joints. This approach shows potential for exhaustive exploration of the design space and acceleration of the design process. Future work will focus on a more realistic SRL generative design model and on achieving Sim2Real transfer for practical use.
|
|
10:10-10:15, Paper TuAT13.4 | Add to My Program |
Adaptive Walker: User Intention and Terrain Aware Intelligent Walker with High-Resolution Tactile and IMU Sensor |
|
Choi, Yunho | Gwangju Institute of Science and Technology |
Hwang, Seokhyun | University of Washington |
Moon, Jaeyoung | Gwangju Institute of Science and Technology |
Lee, Hosu | Gyeongsang National University |
Yeo, Dohyeon | Gwangju Institute of Science and Technology |
Seong, Minwoo | Gwangju Institute of Science and Technology |
Luo, Yiyue | University of Washington |
Kim, SeungJun | Gwangju Institute of Science and Technology |
Matusik, Wojciech | MIT |
Rus, Daniela | MIT |
Kim, Kyung-Joong | Gwangju Institute of Science and Technology |
Keywords: Physically Assistive Devices, Rehabilitation Robotics, Machine Learning for Robot Control
Abstract: In this paper, we present an adaptive walker system designed to address limitations in current intelligent walker technologies. While recent advancements have been made in this field, existing systems often struggle to seamlessly interpret user intent for speed control and lack adaptability across diverse scenarios and terrain. Our proposed solution incorporates high-resolution tactile sensors, deep learning algorithms, IMU sensors, and linear motors to dynamically adjust to the user’s intentions and terrain changes. The system is capable of predicting the user’s desired speed with an error margin of only 20.99%, relying solely on tactile input from hand and arm contact points. Additionally, it maintains the walker’s horizontal stability with an error of less than 1 degree by adjusting leg lengths in response to variations in ground angle. This adaptive walker enhances user safety and comfort, particularly for individuals with reduced strength or cognitive abilities, and offers reliable assistance on uneven terrain such as uphill and downhill paths.
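As a rough geometric illustration of leveling by leg-length adjustment, the sketch below maps measured pitch and roll to per-leg offsets. The sign conventions, leg ordering, and planar geometry are assumptions for illustration, not the authors' mechanism model.

```python
import numpy as np

def leveling_leg_offsets(pitch_rad, roll_rad, wheelbase, track_width):
    """Compute per-leg length offsets that level the walker frame.

    Positive offsets extend a leg. Legs ordered: front-left, front-right,
    rear-left, rear-right. Simple planar geometry only.
    """
    dz_pitch = 0.5 * wheelbase * np.tan(pitch_rad)    # front/rear correction
    dz_roll = 0.5 * track_width * np.tan(roll_rad)    # left/right correction
    return np.array([
        -dz_pitch - dz_roll,   # front-left
        -dz_pitch + dz_roll,   # front-right
        +dz_pitch - dz_roll,   # rear-left
        +dz_pitch + dz_roll,   # rear-right
    ])
```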
|
|
10:15-10:20, Paper TuAT13.5 | Add to My Program |
IMRL: Integrating Visual, Physical, Temporal, and Geometric Representations for Enhanced Food Acquisition |
|
Liu, Rui | University of Maryland |
Mahammad, Zahiruddin | University of Maryland College Park |
Bhaskar, Amisha | University of Maryland, College Park |
Tokekar, Pratap | University of Maryland |
Keywords: Human-Centered Robotics, Representation Learning, Imitation Learning
Abstract: Robotic assistive feeding holds significant promise for improving the quality of life for individuals with eating disabilities. However, acquiring diverse food items under varying conditions and generalizing to unseen food presents unique challenges. Existing methods that rely on surface-level geometric information (e.g., bounding box and pose) derived from visual cues (e.g., color, shape, and texture) often lack adaptability and robustness, especially when foods share similar physical properties but differ in visual appearance. We employ imitation learning (IL) to learn a policy for food acquisition. Existing methods employ IL or Reinforcement Learning (RL) to learn a policy based on off-the-shelf image encoders such as ResNet-50. However, such representations are not robust and struggle to generalize across diverse acquisition scenarios. To address these limitations, we propose a novel approach, IMRL (Integrated Multi-Dimensional Representation Learning), which integrates visual, physical, temporal, and geometric representations to enhance the robustness and generalizability of IL for food acquisition. Our approach captures food types and physical properties (e.g., solid, semi-solid, granular, liquid, and mixture), models temporal dynamics of acquisition actions, and introduces geometric information to determine optimal scooping points and assess bowl fullness. IMRL enables IL to adaptively adjust scooping strategies based on context, improving the robot’s capability to handle diverse food acquisition scenarios. Experiments on a real robot demonstrate our approach’s robustness and adaptability across various foods and bowl configurations, including zero-shot generalization to unseen settings. Our approach achieves an improvement of up to 35% in success rate compared with the best-performing baseline. More details can be found on our website https://ruiiu.github.io/imrl.
|
|
10:20-10:25, Paper TuAT13.6 | Add to My Program |
An Interactive Hands-Free Controller for a Riding Ballbot to Enable Simple Shared Control Tasks |
|
Xiao, Chenzhang | University of Illinois at Urbana-Champaign |
Song, Seung Yun | University of Illinois at Urbana-Champaign |
Chen, Yu | University of Illinois at Urbana-Champaign |
Mansouri, Mahshid | University of Illinois at Urbana-Champaign |
Ramos, Joao | University of Illinois at Urbana-Champaign |
Norris, William | University of Illinois Urbana-Champaign |
Hsiao-Wecksler, Elizabeth T. | University of Illinois at Urbana-Champaign |
Keywords: Physically Assistive Devices, Physical Human-Robot Interaction, Human-Centered Robotics
Abstract: Our team developed a riding ballbot (called PURE) that is dynamically stable, omnidirectional, and driven by lean-to-steer control. A hands-free admittance control scheme (HACS) was previously integrated to allow riders with different torso functions to control the robot's movements via torso leaning and twisting. Such an interface requires motor coordination skills and could result in collisions with obstacles due to low proficiency. Hence, a shared controller (SC) that limits the speed of PURE could be helpful to ensure the safety of riders. However, the self-balancing dynamics of PURE could result in a weak control authority of its motion, in which the torso motion of the rider could easily result in poor tracking of the command speed dictated by the shared controller. Thus, we proposed an interactive hands-free admittance control scheme (iHACS), which added two modules to the HACS to improve the speed-tracking performance of PURE: control gain personalization module and interaction compensation module. Human riding tests of simple tasks, idle-keeping and speed-limiting, were conducted to compare the performance of HACS and iHACS. Two manual wheelchair users and two able-bodied individuals participated in this study. They were instructed to use "adversarial" torso motions that would tax the SC's ability to keep the ballbot idling or below a set speed, i.e., competing objectives between rider and robot. In the idle-keeping tasks, iHACS demonstrated minimal translational motion and low command speed tracking RMSE, even with significant torso lean angles. During the speed-limiting task, where the commanded speed was saturated at 0.5 m/s, the system achieved an average maximum speed of 1.1 m/s with iHACS, compared with that of over 1.9 m/s with HACS. These results suggest that iHACS can enhance PURE's control authority over the rider, which enables PURE to provide physical interactions back to the rider and results in a collaborative rider-robot synergy.
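A compact sketch of an admittance law with a shared-control speed limit, in the spirit of HACS/iHACS as described above; the virtual mass, damping, and saturation values are placeholders, and the interaction-compensation and gain-personalization modules of iHACS are not modeled.

```python
import numpy as np

def admittance_speed_command(lean_input, v_prev, dt, m_virtual=60.0,
                             b_virtual=25.0, v_limit=0.5):
    """One step of a hands-free admittance law with a shared-control speed limit.

    lean_input : torso lean/twist interpreted as a virtual force input
    v_prev     : previous commanded speed (m/s)
    """
    v_dot = (lean_input - b_virtual * v_prev) / m_virtual   # M*v_dot + B*v = F
    v_cmd = v_prev + v_dot * dt
    return float(np.clip(v_cmd, -v_limit, v_limit))         # shared-control saturation
```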
|
|
10:25-10:30, Paper TuAT13.7 | Add to My Program |
Garment Diffusion Models for Robot-Assisted Dressing |
|
Kotsovolis, Stelios | Imperial College London |
Demiris, Yiannis | Imperial College London |
Keywords: Physical Human-Robot Interaction, Model Learning for Control, Human-Centered Robotics
Abstract: Robots have the potential to assist people with disabilities and the elderly. One of the most common and burdensome tasks for caregivers is dressing. Two challenges of robot-assisted dressing are modeling the dynamics of garments and handling visual occlusions that obstruct the perception of the full state of the garment due to the proximity between the garment, the robot, and the human. In this paper, we propose a diffusion-based dynamics model for garments during robot-assisted dressing that can deal with partial point cloud observations. The diffusion model, conditioned on the observation and the robot's action, is used to predict a full point cloud of the garment's opening of the future state. The model is utilized in a model predictive controller that is trained iteratively with model-based reinforcement learning. In our experiments, we examine a common problem of dressing: the insertion of a garment's sleeve on an arm. As demonstrated by the performed experiments, the proposed diffusion-based model predictive controller can be effectively used for robot-assisted dressing and handle visual occlusions. Moreover, our approach is highly sample-efficient. Specifically, the controller achieved 91.2% success rate in the examined dressing task with less than 100 sampled trajectories. Real-world experiments demonstrate that the proposed method can adapt to the sim-to-real gap and generalize well to novel garments and configurations of the body.
|
|
TuAT14 Regular Session, 402 |
Add to My Program |
Tracking and Prediction 1 |
|
|
Chair: Dionigi, Alberto | University of Perugia |
Co-Chair: Tang, Chen | University of California Berkeley |
|
09:55-10:00, Paper TuAT14.1 | Add to My Program |
Pedestrian Intention and Trajectory Prediction in Unstructured Traffic Using IDD-PeD |
|
Bokkasam, Ruthvik | IIIT Hyderabad |
Gangisetty, Shankar | IIIT Hyderabad |
Abdul Hafez, A. H. | Hasan Kalyoncu University |
Jawahar, C.V. | IIIT, Hyderabad |
Keywords: Data Sets for Robotic Vision, Vision-Based Navigation, Intelligent Transportation Systems
Abstract: With the rapid advancements in autonomous driving, accurately predicting pedestrian behavior has become essential for ensuring safety in complex and unpredictable traffic conditions. The growing interest in this challenge highlights the need for comprehensive datasets that capture unstructured environments, enabling the development of more robust prediction models to enhance pedestrian safety and vehicle navigation. In this paper, we introduce an Indian driving pedestrian dataset designed to address the complexities of modeling pedestrian behavior in unstructured environments, such as illumination changes, occlusion of pedestrians, unsignalized scene types and vehicle-pedestrian interactions. The dataset provides high-level and detailed low-level comprehensive annotations focused on pedestrians requiring the ego-vehicle’s attention. Evaluation of the state-of-the-art intention prediction methods on our dataset shows a significant performance drop of up to 15%, while trajectory prediction methods underperform with an increase of up to 1208 MSE, compared with their performance on standard pedestrian datasets. Additionally, we present exhaustive quantitative and qualitative analysis of intention and trajectory baselines. We believe that our dataset will open new challenges for the pedestrian behavior research community to build robust models. Project Page: https://cvit.iiit.ac.in/research/projects/cvit-projects/iddped
|
|
10:00-10:05, Paper TuAT14.2 | Add to My Program |
Visual-Linguistic Reasoning for Pedestrian Trajectory Prediction |
|
Shenkut, Dereje | Carnegie Mellon University |
Vijaya Kumar, B.V.K | Carnegie Mellon University |
Keywords: Intelligent Transportation Systems
Abstract: Accurate prediction of pedestrian trajectories is crucial as autonomous vehicles become more prevalent on roads. The dynamic nature of urban environments and the less predictable behavior of pedestrians present significant challenges in developing reliable prediction models. Earlier methods relying on recurrent neural networks (RNNs) and long short-term memory (LSTM) networks have shown promise, but often fail to fully take advantage of the rich visual and contextual information available in real-world scenarios. Recent advances in vision-language models (VLMs) offer new opportunities to improve pedestrian trajectory prediction by incorporating multimodal reasoning capabilities. This paper introduces a novel approach that uses a powerful pre-trained VLM to improve the estimation of pedestrian trajectories. Specifically, we first enable learning of semantically useful scene context and high-level reasoning features via vision-language model fine-tuning on specific prompts using road scenes with pedestrians. Next, with the learned VLM features and the pedestrian's past trajectory history, we predict future trajectories using an encoder-decoder head. Through experiments with the first-person datasets JAAD and PIE, we show that utilizing visual-linguistic semantics via a pre-trained vision-language model outperforms previous methods in both deterministic and stochastic trajectory prediction setups.
|
|
10:05-10:10, Paper TuAT14.3 | Add to My Program |
Curb Your Attention: Causal Attention Gating for Robust Trajectory Prediction in Autonomous Driving |
|
Ahmadi, Ehsan | University of Alberta |
Mercurius, Ray Coden | University of Toronto |
Mohamad Alizadeh Shabestary, Soheil | Huawei Technologies Canada |
Rezaee, Kasra | Huawei Technologies |
Rasouli, Amir | Huawei Technologies Canada |
Keywords: Intelligent Transportation Systems, Autonomous Vehicle Navigation, Autonomous Agents
Abstract: Trajectory prediction models in autonomous driving are vulnerable to perturbations from non-causal agents whose actions should not affect the ego-agent’s behavior. Such perturbations can lead to incorrect predictions of other agents’ trajectories, potentially compromising the safety and efficiency of the ego-vehicle’s decision-making process. Motivated by this challenge, we propose Causal tRajecTory predICtion (CRiTIC), a novel model that utilizes a causal discovery network to identify inter-agent causal relations over a window of past time steps. To incorporate discovered causal relationships, we propose a novel Causal Attention Gating mechanism to selectively filter information in the proposed Transformer-based architecture. We conduct extensive experiments on two autonomous driving benchmark datasets to evaluate the robustness of our model against non-causal perturbations and its generalization capacity. Our results indicate that the robustness of predictions can be improved by up to 54% without a significant detriment to prediction accuracy. Lastly, we demonstrate the superior domain generalizability of the proposed model, which achieves up to 29% improvement in cross-domain performance. These results underscore the potential of our model to enhance both robustness and generalization capacity for trajectory prediction in diverse autonomous driving domains.
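The gating idea can be pictured as ordinary attention whose weights are multiplied by per-agent causal scores before renormalization. A small numpy sketch, with the gate assumed to come from some causal discovery module that is not reproduced here; this is an illustration of the mechanism, not the CRiTIC architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(q, K, V, causal_gate):
    """Single-query attention over agent tokens with a causal gate in [0, 1].

    q           : (d,) query for the target agent
    K, V        : (n_agents, d) keys and values for surrounding agents
    causal_gate : (n_agents,) per-agent scores from a causal discovery module;
                  agents judged non-causal receive gates near zero.
    """
    scores = K @ q / np.sqrt(q.size)
    attn = softmax(scores)
    attn = attn * causal_gate                  # suppress non-causal agents
    attn = attn / (attn.sum() + 1e-9)          # renormalize
    return attn @ V
```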
|
|
10:10-10:15, Paper TuAT14.4 | Add to My Program |
Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking |
|
Ishaq, Ayesha | Mohamed Bin Zayed University of Artificial Intelligence |
Boudjoghra, Mohamed El Amine | Mohamed Bin Zayed University of Artificial Intelligence |
Lahoud, Jean | MBZUAI |
Khan, Fahad | Linkoping University |
Khan, Salman | CSIRO |
Cholakkal, Hisham | MBZUAI |
Anwer, Rao | MBZUAI |
Keywords: Visual Tracking, Visual Learning, Deep Learning for Visual Perception
Abstract: 3D multi-object tracking plays a critical role in autonomous driving by enabling the real-time monitoring and prediction of multiple objects' movements. Traditional 3D tracking systems are typically constrained by predefined object categories, limiting their adaptability to novel, unseen objects in dynamic environments. To address this limitation, we introduce open-vocabulary 3D tracking, which extends the scope of 3D tracking to include objects beyond predefined categories. We formulate the problem of open-vocabulary 3D tracking and introduce dataset splits designed to represent various open-vocabulary scenarios. We propose a novel approach that integrates open-vocabulary capabilities into a 3D tracking framework, allowing for generalization to unseen object classes. Our method effectively reduces the performance gap between tracking known and novel objects through strategic adaptation. Experimental results demonstrate the robustness and adaptability of our method in diverse outdoor driving scenarios. To the best of our knowledge, this work is the first to address open-vocabulary 3D tracking, presenting a significant advancement for autonomous systems in real-world settings.
|
|
10:15-10:20, Paper TuAT14.5 | Add to My Program |
Asynchronous Multi-Object Tracking with an Event Camera |
|
Apps, Angus | Australian National University |
Wang, Ziwei | Australian National University |
Perejogin, Vladimir | Defence Science and Technology Organisation |
Molloy, Timothy L. | Australian National University |
Mahony, Robert | Australian National University |
Keywords: Object Detection, Segmentation and Categorization, Visual Tracking, Data Sets for Robotic Vision
Abstract: Event cameras are ideal sensors for enabling robots to detect and track objects in highly dynamic environments due to their low latency output, high temporal resolution, and high dynamic range. In this paper, we present the Asynchronous Event Multi-Object Tracking (AEMOT) algorithm for detecting and tracking multiple objects by processing individual raw events asynchronously. AEMOT detects salient event blob features by identifying regions of consistent optical flow using a novel Field of Active Flow Directions built from the Surface of Active Events. Detected features are tracked as candidate objects using the recently proposed Asynchronous Event Blob (AEB) tracker in order to construct small intensity patches of each candidate object. A novel learnt validation stage promotes or discards candidate objects based on classification of their intensity patches, with promoted objects having their position, velocity, size, and orientation estimated at their event rate. We evaluate AEMOT on a new Bee Swarm Dataset, where it tracks dozens of small bees with precision and recall performance exceeding that of alternative event-based detection and tracking algorithms by over 37%. Source code and the labelled event Bee Swarm Dataset will be open sourced.
|
|
10:20-10:25, Paper TuAT14.6 | Add to My Program |
Co-MTP: A Cooperative Trajectory Prediction Framework with Multi-Temporal Fusion for Autonomous Driving |
|
Zhang, Xinyu | Tongji University |
Zhou, Zewei | University of California, Los Angeles |
Wang, Zaoyi | Tongji University |
Ji, Yangjie | Tongji University, College of Automotive Studies |
Huang, Yanjun | Tongji University |
Chen, Hong | Tongji University |
Keywords: Computer Vision for Transportation, Sensor Fusion, Deep Learning Methods
Abstract: Vehicle-to-everything (V2X) technologies have become an ideal paradigm to extend the perception range and see through occlusion. Existing efforts focus on single-frame cooperative perception; however, how to capture temporal cues between frames with V2X to facilitate the prediction task, and even the planning task, is still underexplored. In this paper, we introduce Co-MTP, a general cooperative trajectory prediction framework with multi-temporal fusion for autonomous driving, which leverages the V2X system to fully capture the interaction among agents in both the history and future domains to benefit planning. In the history domain, V2X can complement the incomplete history trajectory in single-vehicle perception, and we design a heterogeneous graph transformer to learn the fusion of the history features from multiple agents and capture the history interaction. Moreover, the goal of prediction is to support future planning. Thus, in the future domain, V2X can provide the prediction results of surrounding objects, and we further extend the graph transformer to capture the future interaction among the ego planning and the other vehicles' intentions and obtain the final future scenario state under a certain planning action. We evaluate the Co-MTP framework on the real-world dataset V2X-Seq, and the results show that Co-MTP achieves state-of-the-art performance and that both history and future fusion can greatly benefit prediction. Our code is available on our project website: https://xiaomiaozhang.github.io/Co-MTP/
|
|
10:25-10:30, Paper TuAT14.7 | Add to My Program |
Predictive Spliner: Data-Driven Overtaking in Autonomous Racing Using Opponent Trajectory Prediction |
|
Baumann, Nicolas | ETH |
Ghignone, Edoardo | ETH |
Hu, Cheng | Zhejiang University |
Hildisch, Benedict | ETH Zurich |
Hämmerle, Tino | ETH Zürich |
Bettoni, Alessandro | University |
Carron, Andrea | ETH Zurich |
Xie, Lei | State Key Laboratory of Industrial Control Technology, Zhejiang |
Magno, Michele | ETH Zurich |
Keywords: Wheeled Robots, Collision Avoidance, Embedded Systems for Robotic and Automation
Abstract: Head-to-head racing against opponents is a challenging and emerging topic in the domain of autonomous racing. We propose Predictive Spliner, a data-driven overtaking planner designed to enhance competitive performance by anticipating opponent behavior. Using GP regression, the method learns and predicts the opponent’s trajectory, enabling the ego vehicle to calculate safe and effective overtaking maneuvers. Experimentally validated on a 1:10 scale autonomous racing platform, Predictive Spliner outperforms commonly employed overtaking algorithms by overtaking opponents at up to 83.1% of its own speed, being on average 8.4% faster than the previous best-performing method. Additionally, it achieves an average success rate of 84.5%, which is 47.6% higher than the previous best-performing method. The proposed algorithm maintains computational efficiency, making it suitable for real-time robotic applications. These results highlight the potential of Predictive Spliner to enhance the performance and safety of autonomous racing vehicles. The code for Predictive Spliner is available at: https://github.com/ForzaETH/predictive-spliner.
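A minimal sketch of GP-based opponent prediction using scikit-learn; the kernel choice and the one-dimensional progress parameterization are simplifying assumptions rather than the Predictive Spliner implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

def predict_opponent(t_hist, s_hist, t_future):
    """Fit a GP to the opponent's observed progress and extrapolate it.

    t_hist   : (N,) past timestamps
    s_hist   : (N,) observed opponent track progress (or lateral offset)
    t_future : (M,) query times over the planner's overtaking horizon
    """
    kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(1e-3)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(t_hist.reshape(-1, 1), s_hist)
    mean, std = gp.predict(t_future.reshape(-1, 1), return_std=True)
    return mean, std      # std can inflate safety margins around the opponent
```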
|
|
TuAT15 Regular Session, 403 |
Add to My Program |
Surgical Robotics: Continuum Robots |
|
|
Chair: Rodrigue, Hugo | Seoul National University |
Co-Chair: Park, Sukho | DGIST |
|
09:55-10:00, Paper TuAT15.1 | Add to My Program |
Workspace Expansion of Magnetic Soft Continuum Robot Using Movable Opposite Magnet |
|
Park, Joo-Won | DGIST |
Kee, Hyeonwoo | DGIST |
Park, Sukho | DGIST |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Soft Robot Materials and Design, Micro/Nano Robots
Abstract: Recently, in the minimally invasive surgery field, magnetic soft continuum robots (MSCRs) have been actively studied, which are driven by an external magnetic field acting on a small magnet attached to the robot tip. In addition, an MSCR with opposite magnets (MSCR-OM) has been reported for high dexterity, which has a small permanent magnet attached to the end of the MSCR and an additional opposite magnet fixed in the middle. To overcome the limitations of the existing MSCR and MSCR-OM and improve the workspace, we proposed a magnetic soft continuum robot with a movable opposite magnet (MSCR-MOM) with a 2.2 mm diameter and 10 cm length, that can change the position of the opposite magnet. In this study, an analytical model of the proposed MSCR-MOM was presented, and through simulation and various experiments, its characteristics were analyzed and the workspace expansion was validated. In addition, the clinical applicability of the proposed MSCR-MOM was verified through phantom experiments. In the future, we expect that the proposed MSCR-MOM will be developed into a medical catheter that can be applied in various procedures through miniaturization and various clinical application studies.
|
|
10:00-10:05, Paper TuAT15.2 | Add to My Program |
Sim4EndoR: A Reinforcement Learning Centered Simulation Platform for Task Automation of Endovascular Robotics |
|
Yao, Tianliang | Tongji University |
Ban, Madaoji | The University of Hong Kong |
Lu, Bo | Soochow University |
Pei, Zhiqiang | University of Shanghai for Science and Technology |
Qi, Peng | Tongji University |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Modeling, Control, and Learning for Soft Robots
Abstract: Robotic-assisted percutaneous coronary intervention (PCI) holds considerable promise for elevating precision and safety in cardiovascular procedures. Nevertheless, current systems heavily depend on human operators, resulting in variability and the potential for human error. To tackle these challenges, Sim4EndoR, an innovative reinforcement learning (RL) based simulation environment, is first introduced to bolster task-level autonomy in PCI. This platform offers a comprehensive and risk-free environment for the development, evaluation, and refinement of potential autonomous systems, enhancing data collection efficiency and minimizing the need for costly hardware trials. A notable aspect of the groundbreaking Sim4EndoR is its reward function, which takes into account the anatomical constraints of the vascular environment, utilizing the geometric characteristics of vessels to steer the learning process. By seamlessly integrating advanced physical simulations with neural network-driven policy learning, Sim4EndoR fosters efficient sim-to-real translation, paving the way for safer, more consistent robotic interventions in clinical practice, ultimately improving patient outcomes.
|
|
10:05-10:10, Paper TuAT15.3 | Add to My Program |
Design and Implementation of a Snake Robot for Cranial Surgery |
|
Law, Jones | University of Toronto |
Stickley, Emma | The Hospital for Sick Children |
Gondokaryono, Radian | University of Toronto |
Looi, Thomas | Hospital for Sick Children |
Diller, Eric D. | University of Toronto |
Podolsky, Dale | University of Toronto |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Tendon/Wire Mechanism
Abstract: Craniosynostosis involves premature fusion of the cranial sutures resulting in abnormal skull morphology and elevated intracranial pressure. Surgical intervention is necessary to correct the skull shape and to allow for unrestricted brain growth. This study presents a novel snake robot designed for minimally invasive cranial osteotomies featuring two articulating bending segments. The end-effector comprises a bone-punch for bone-cutting, a dural and scalp retractor, as well as channels for an endoscope and an instrument. The robot’s bending mechanism is driven by tendons and utilizes geared linkages to facilitate a smooth curved shape. Pre-tensioned antagonistic tendons allow the robot to modulate its stiffness to adapt to external loads. A follow-the-leader algorithm was implemented to guide the robot along a skull cutting path. Experimental results demonstrated that at maximum bending of 60° for segment 1 and 90° for segment 2 there was a 15.9° and 11.5° error, respectively. Position errors ranged from 2.5 to 21.5 mm when tracing a curved path. The tool increased stiffness with tendon pre-tensioning from 20-100 N during bent configurations q1 and q2 for segments 1 and 2, respectively, at [q1, q2] = [0°, 30°] and [30°, 60°]. Tip deflection reduced from 0.42 to 0.03 cm and 0.37 to 0.10 cm during axial loading and from 11.40 to 3.88 cm and 3.62 to 0.48 cm during radial loading for each configuration, respectively. Ex vitro trials demonstrated the robot's ability to perform simulated osteotomies on skull models to 68-73% of desired path lengths with a maximum deviation of 8 mm.
|
|
10:10-10:15, Paper TuAT15.4 | Add to My Program |
Single-Fiber Optical Frequency Domain Reflectometry (OFDR) Shape Sensing of Continuum Manipulators with Planar Bending |
|
Tavangarifard, Mobina | The University of Texas at Austin |
Rodriguez Ovalle, Wendy | The University of Texas at Austin |
Alambeigi, Farshid | University of Texas at Austin |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Soft Sensors and Actuators
Abstract: To address the challenges associated with shape sensing of continuum manipulators (CMs) using Fiber Bragg Grating (FBG) optical fibers, we present a unique shape sensing assembly utilizing solely a single Optical Frequency Domain Reflectometry (OFDR) fiber attached to a flat nitinol wire (NiTi). Integrating this easy-to-manufacture unique sensor with a long and soft CM with 170 mm length, we performed different experiments to evaluate its C-, J-, and S-shape reconstruction ability. Results demonstrate phenomenal shape reconstruction accuracy for the performed C-shape (< 3.14 mm tip error, < 2.54 mm shape error), J-shape (< 1.91 mm tip error, < 1.11 mm shape error), and S-shape (< 1.74 mm tip error, < 1.40 mm shape error) experiments.
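For planar bending, shape reconstruction from a measured curvature profile amounts to integrating curvature into heading and heading into position. A short sketch under that assumption, ignoring strain-to-curvature calibration and sensor placement details:

```python
import numpy as np

def reconstruct_planar_shape(kappa, ds):
    """Integrate a sampled curvature profile into a planar centerline.

    kappa : (N,) curvature measured along the fiber at arc-length spacing ds
    Returns (N+1, 2) x-y points of the reconstructed shape, starting at the base.
    """
    theta = np.concatenate([[0.0], np.cumsum(kappa * ds)])   # bending angle
    x = np.concatenate([[0.0], np.cumsum(np.cos(theta[:-1]) * ds)])
    y = np.concatenate([[0.0], np.cumsum(np.sin(theta[:-1]) * ds)])
    return np.stack([x, y], axis=1)
```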
|
|
10:15-10:20, Paper TuAT15.5 | Add to My Program |
Learning-Based Tip Contact Force Estimation for FBG-Embedded Continuum Robots |
|
Roshanfar, Majid | Postdoctoral Research Fellow at the Hospital for Sick Children ( |
Fekri, Pedram | Concordia University |
Nguyen, Robert Hideki | The Hospital for Sick Children |
He, Changyan | University of Newcastle, Australia |
Kang, Paul Hoseok | University of Toronto |
Drake, James | Hospital for Sick Children, University of Toronto |
Diller, Eric D. | University of Toronto |
Looi, Thomas | Hospital for Sick Children |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Haptics and Haptic Interfaces
Abstract: Knowledge of the tip contact force in continuum robots, which are often used as medical instruments, is critical for clinical applications. It enhances the interventionalist's decision-making, navigation efficiency, and procedural safety. However, accurately determining the tip contact force in conventionally sized instruments remains challenging. This study introduces a learning-based method for estimating the external contact force at the tip of a continuum robot. By leveraging curvature and bending angle data from a multi-core fiber equipped with fiber Bragg gratings (FBGs) embedded inside the Nitinol tube, the method maps these inputs to the corresponding tip force in 3D. Experiments conducted on an FBG-embedded Nitinol rod validate the feasibility of the proposed method, yielding Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) values of 20.9 mN², 2.7 mN, and 4.6 mN, respectively, which represent a 26% improvement compared to the learning-based vision methodology.
|
|
10:20-10:25, Paper TuAT15.6 | Add to My Program |
Three-Dimension Tip Force Perception and Axial Contact Location Identification for Flexible Endoscopy Using Tissue-Compliant Soft Distal Attachment Cap Sensors |
|
Zhang, Tao | Chinese University of Hong Kong |
Yang, Yang | Sichuan University |
Yang, Yang | The Chinese University of Hong Kong |
Gao, Huxin | National University of Singapore |
Lai, Jiewen | The Chinese University of Hong Kong |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore (NUS) |
Keywords: Medical Robots and Systems, Force and Tactile Sensing
Abstract: In endoluminal surgeries, inserting a flexible endoscope is one of the fundamental procedures. During this process, vision remains the primary feedback, while the perception of tactile magnitude and location is insufficient. This limitation can hinder the clinician’s efficiency when navigating the endoscope through various segments of the natural lumens. To address this issue, we propose a fiber Bragg grating (FBG)–based tissue-compliant sensor cap with multi-mode sensing capabilities, including contact location identification at the terminal surface and the three-dimensional contact force perception at the tip. The soft sensor cap can be affixed to the standard endoscope tip, like a distal attachment cap, for easy installation. Utilizing the relative contact location information, operators can adjust the steerable segment of the endoscope when transitioning from one segment of a natural orifice to a narrower segment, which may be obstructed by constricted lumens. A finite element analysis simulation and the corresponding calibration process based on learning-based approaches have been carried out. The FBG-based sensor can perceive the tip contact force and identify the axial contact location with high precision, where the force perception error is less than 3%, and the contact location identification accuracy is 98.8%. The experimental results demonstrate the potential of the proposed sensing mechanism to be applied in surgeries requiring endoscope insertions.
|
|
10:25-10:30, Paper TuAT15.7 | Add to My Program |
MPC Design of a Continuum Robot for Pulmonary Interventional Surgery Using Koopman Operators |
|
Song, Yuhua | Southeast University |
Zhu, Lifeng | Southeast University |
Li, Jinfeng | Zhuhai Hanglok Medical Technology Co., Ltd |
Deng, Jiawei | Hanglok-Tech Co., Ltd |
Wang, Cheng | Hanglok-Tech Co. Ltd |
Song, Aiguo | Southeast University |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Modeling, Control, and Learning for Soft Robots, Optimization and Optimal Control
Abstract: This study focuses on the flexible tube of a bronchoscope robot used in pulmonary intervention surgery, which is treated as a continuum robot. The dynamics model is proposed based on the Koopman operator, leveraging real data to solve for the system matrix parameters accurately. To enhance control precision, we designed a model predictive control (MPC) algorithm aimed at tracking the desired curvature and deflection angles of the flexible tube. The MPC controller uses real-time data from electromagnetic sensors to adjust the tube shape, ensuring accurate and responsive manipulation. The effectiveness of the proposed algorithm is validated through extensive experiments conducted on the Binary experimental platform, demonstrating significant improvements in tracking performance and operational reliability compared to traditional open-loop control methods.
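A sketch of how a Koopman-style (EDMD) linear model could be fit from data for use inside an MPC. The polynomial lifting functions and the plain least-squares fit are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np

def lift(x):
    """Example polynomial lifting of the tube state (curvature, deflection angle)."""
    k, a = x
    return np.array([k, a, k * a, k**2, a**2, 1.0])

def fit_koopman(X, U, X_next):
    """Fit lifted linear dynamics  z_{k+1} = A z_k + B u_k  by least squares (EDMD).

    X, X_next : (N, 2) state samples at steps k and k+1;  U : (N, m) inputs.
    """
    Z = np.array([lift(x) for x in X])            # (N, nz)
    Zn = np.array([lift(x) for x in X_next])
    G = np.hstack([Z, U])                         # (N, nz + m)
    AB, *_ = np.linalg.lstsq(G, Zn, rcond=None)   # (nz + m, nz)
    nz = Z.shape[1]
    A = AB[:nz].T                                 # (nz, nz)
    B = AB[nz:].T                                 # (nz, m)
    return A, B             # linear model usable as the prediction model in an MPC
```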
|
|
TuAT16 Regular Session, 404 |
Add to My Program |
Manipulation 1 |
|
|
Chair: Holladay, Rachel | University of Pennsylvania |
Co-Chair: Saveriano, Matteo | University of Trento |
|
09:55-10:00, Paper TuAT16.1 | Add to My Program |
A Perturbation-Robust Framework for Admittance Control of Robotic Systems with High-Stiffness Contacts and Heavy Payload |
|
Samuel, Kangwagye | Technical University of Munich |
Haninger, Kevin | Fraunhofer IPK |
Oboe, Roberto | University of Padova |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Oh, Sehoon | DGIST |
Keywords: Compliance and Impedance Control, Human-Robot Collaboration, Motion Control
Abstract: Applications involving serial manipulators, in both co-manipulation with humans and autonomous operation tasks, require the robot to render high admittance so as to minimize contact forces and maintain stable contacts with high-stiffness surfaces. This can be achieved through admittance control; however, inner loop dynamics limit the bandwidth within which the desired admittance can be rendered from the outer loop. Moreover, perturbations affect the admittance control performance, whereas other system-specific limitations, such as “black box” PD position control in typical industrial manipulators, hinder the implementation of more advanced control methods. To address these challenges, this paper introduces a perturbation-robust multisensor framework designed for serial manipulators engaged in contact-rich tasks involving heavy payloads. Within this framework, we introduce a generalized perturbation-robust observer (PROB) that exploits the joint velocity measurements and the inner loop velocity control model, and accommodates the varying stiffness of contacts through contact force measurements. Three PROBs, including a novel Combined Dynamics Observer (CDYOB), are presented. The CDYOB can render wide-range admittance without bandwidth limitations from the inner loop. Theoretical analyses and experiments with an industrial robot validate the effectiveness of the proposed method.
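As one concrete way to picture a perturbation observer on a single joint, the sketch below implements a first-order generalized-momentum disturbance observer; this is a textbook construction for illustration and is not the paper's CDYOB.

```python
class MomentumDisturbanceObserver:
    """Disturbance observer on a nominal inertia model J_n * dv/dt = u + d.

    Uses the generalized-momentum form to avoid differentiating the velocity signal.
    """
    def __init__(self, J_n, gain, dt):
        self.J_n, self.K, self.dt = J_n, gain, dt
        self.integral = 0.0          # running integral of (u + d_hat)
        self.d_hat = 0.0

    def update(self, u, v_meas):
        self.integral += (u + self.d_hat) * self.dt
        p = self.J_n * v_meas                      # measured generalized momentum
        self.d_hat = self.K * (p - self.integral)  # first-order disturbance estimate
        return self.d_hat
```

The estimate d_hat would then be fed back (e.g., subtracted from the torque command) so that the outer admittance loop sees dynamics closer to the nominal model.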
|
|
10:00-10:05, Paper TuAT16.2 | Add to My Program |
Tension Maintenance Mechanism for Control Consistency of Twisted String Actuation-Based Hyper-Redundant Manipulator |
|
Cho, Minjae | KAIST |
Yi, Yesung | Korea Advanced Institute of Science and Technology |
Kyung, Ki-Uk | Korea Advanced Institute of Science & Technology (KAIST) |
Keywords: Tendon/Wire Mechanism, Redundant Robots, Mechanism Design
Abstract: Hyper-redundant manipulators have been developed for hazardous environment exploration due to their flexibility and high agility in the workplace. In this research, we designed a hyper-redundant manipulator by integrating Twisted String Actuators (TSAs) and Rolling Contact Joints (RCJs) to overcome the limitations of traditional cable-driven systems, such as difficulties with long-distance power transmission, and to achieve high payload capability with a compact design. To prevent instantaneous tension loss caused by slack and to enhance control consistency of the manipulator by preserving the relationship between the contraction ratio of the TSA and motor rotations, we proposed a tension maintenance mechanism using compression springs at the distal end of the manipulator. Additionally, to reduce losses from string contact friction, spring sheaths were inserted along the joint holes. Our approaches enhance the repeatability and position controllability of the manipulator. We noted a 33.5% reduction of error in the repeatability test, along with 35.9% and 38.8% improvements in piecewise position control accuracy and precision compared to a conventional manipulator, respectively, leading to enhanced controllability. We also experimentally verified that the proposed manipulator can maintain its trajectory with a variance of less than 2.83% up to a 1600 g payload. Overall, our manipulator has the potential to expand the exploration environments in which robots can be used by simultaneously demonstrating large payload capacity and controllability.
|
|
10:05-10:10, Paper TuAT16.3 | Add to My Program |
The Franka Emika Robot: A Standard Platform in Robotics Research |
|
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Compliance and Impedance Control, Force Control, Performance Evaluation and Benchmarking
Abstract: Over the last decade, industrial robots have evolved from well-established position-controlled systems to collaborative and soft robots. In 2017 we introduced the tactile lightweight robot system Franka Emika Robot, characterized by advanced safety control, force sensing, joint torque and force control, and hand-guiding performance. In the meantime, the system has become a well-adopted reference platform for robotics research in AI and machine learning, manipulation, control, human-robot interaction, and motion planning. It features multiple functional and widely used interfaces, including 1kHz real-time joint torque control access or precise kinematic and dynamic models. Furthermore, it became a crystallization point of a research ecosystem since the system's affordability further lowered the entrance barrier to high-performance joint torque-controlled robots. In this article, a quantitative analysis and discussion of the use of the system in worldwide research labs over the last five years, its impact on the creation of a compatible software ecosystem, and examples of milestone experiments made possible with the robot are given. The robotics community benefits from understanding
|
|
10:10-10:15, Paper TuAT16.4 | Add to My Program |
MeshDMP: Motion Planning on Discrete Manifolds Using Dynamic Movement Primitives |
|
Dalle Vedove, Matteo | University of Trento |
Abu-Dakka, Fares | New York University Abu Dhabi |
Palopoli, Luigi | University of Trento |
Fontanelli, Daniele | University of Trento |
Saveriano, Matteo | University of Trento |
Keywords: Learning from Demonstration, Constrained Motion Planning
Abstract: An open problem in industrial automation is to reliably perform tasks requiring in-contact movements with complex workpieces, as current solutions lack the ability to seamlessly adapt to the workpiece geometry. In this paper, we propose a Learning from Demonstration approach that allows a robot manipulator to learn and generalise motions across complex surfaces by leveraging differential mathematical operators on discrete manifolds to embed information on the geometry of the workpiece extracted from triangular meshes, and extend the Dynamic Movement Primitives (DMPs) framework to generate motions on the mesh surfaces. We also propose an effective strategy to adapt the motion to different surfaces, by introducing an isometric transformation of the learned forcing term. The resulting approach, namely MeshDMP, is evaluated both in simulation and real experiments, showing promising results in typical industrial automation tasks like car surface polishing.
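For readers unfamiliar with DMPs, the sketch below shows a minimal single-DoF Euclidean DMP with a radial-basis forcing term (the weights would normally be fit from a demonstration, e.g., by locally weighted regression). MeshDMP replaces the Euclidean operators with discrete-manifold counterparts, which this toy version does not capture; all parameter values are illustrative.

```python
import numpy as np

class DMP1D:
    """Minimal single-DoF discrete DMP with a radial-basis forcing term."""
    def __init__(self, n_basis=20, alpha=25.0, beta=6.25, alpha_x=3.0):
        self.alpha, self.beta, self.alpha_x = alpha, beta, alpha_x
        self.c = np.exp(-alpha_x * np.linspace(0, 1, n_basis))   # basis centers
        self.h = 1.0 / np.diff(self.c, append=self.c[-1] * 0.5)**2
        self.w = np.zeros(n_basis)    # forcing-term weights (fit from a demo)

    def _psi(self, x):
        return np.exp(-self.h * (x - self.c)**2)

    def rollout(self, y0, g, tau, dt, T):
        y, yd, x = y0, 0.0, 1.0
        traj = []
        for _ in range(int(T / dt)):
            psi = self._psi(x)
            f = (psi @ self.w) / (psi.sum() + 1e-9) * x * (g - y0)  # forcing term
            ydd = (self.alpha * (self.beta * (g - y) - tau * yd) + f) / tau**2
            yd += ydd * dt
            y += yd * dt
            x += -self.alpha_x * x / tau * dt       # canonical system
            traj.append(y)
        return np.array(traj)
```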
|
|
10:15-10:20, Paper TuAT16.5 | Add to My Program |
Robotic Sim-To-Real Transfer for Long-Horizon Pick-And-Place Tasks in the Robotic Sim2Real Competition |
|
Yang, Ming | University of Chinese Academy of Sciences |
Cao, Hongyu | Tianjin University |
Zhao, Lixuan | Tianjin University |
Zhang, Chenrui | Tianjin University |
Chen, Yaran | Institute of Automation, Chinese Academy of Sciences |
Keywords: Engineering for Robotic Systems, Mobile Manipulation, Perception for Grasping and Manipulation
Abstract: This paper presents a fully autonomous robotic system that performs sim-to-real transfer in complex long-horizon tasks involving navigation, recognition, grasping, and stacking in an environment with multiple obstacles. The key feature of the system is the ability to overcome typical sensing and actuation discrepancies during sim-to-real transfer and to achieve consistent performance without any algorithmic modifications. To accomplish this, a lightweight noise-resistant visual perception system and a nonlinearity-robust servo system are adopted. We conduct a series of tests in both simulated and real-world environments. The visual perception system achieves a processing time of 11 ms per frame due to its lightweight nature, and the servo system achieves sub-centimeter accuracy with the proposed controller. Both exhibit high consistency during sim-to-real transfer. Our robotic system took first place in the mineral searching task of the Robotic Sim2Real Challenge hosted at ICRA 2024. The simulator is available from the competition committee at https://github.com/AIR-DISCOVER/ICRA2024-Sim2Real-RM, and all code and competition videos can be accessed via our GitHub repository at https://github.com/Bob-Eric/rmus2024_solution_ZeroBug.
|
|
10:20-10:25, Paper TuAT16.6 | Add to My Program |
Towards Autonomous Data Annotation and System-Agnostic Robotic Grasping Benchmarking with 3D-Printed Fixtures |
|
Boerdijk, Wout | German Aerospace Center (DLR) |
Durner, Maximilian | German Aerospace Center DLR |
Sakagami, Ryo | German Aerospace Center (DLR) |
Lehner, Peter | German Aerospace Center (DLR) |
Triebel, Rudolph | German Aerospace Center (DLR) |
Keywords: Deep Learning for Visual Perception, Grasping, Data Sets for Robotic Vision
Abstract: The interaction of robots with their environment requires robust object-centric perception capabilities, typically achieved using learning-based methods trained on synthetic data. However, real-world deployment demands evaluating these capabilities in relevant environments, often involving extensive manual annotation for a quantitative analysis. Additionally, standardized evaluations for robotic tasks, such as grasping, need reproducible object scene configurations and performance benchmarks. We propose a solution to both problems by temporarily employing 3D-printed components, so-called fixtures, which can be designed for any rigid object. Once the scene is set up and object poses are extracted, the fixtures are removed, leaving the natural scene without any artificial distractions. The presented approach is seamlessly applicable to pre-determined configurations of multiple objects, which enables precise re-building of scenes with consistent object-to-object relations. Our suggested annotation procedure achieves strong pose accuracy solely on RGB images without any manual involvement. We evaluate and show the usability of the proposed fixtures for automated real-world data annotation to fine-tune a detector and for benchmarking object pose estimation algorithms for robotic grasping. Code and fixture meshes for 3D printing are available at https://github.com/DLR-RM/fixture_generation.
|
|
10:25-10:30, Paper TuAT16.7 | Add to My Program |
From Instantaneous to Predictive Control: A More Intuitive and Tunable MPC Formulation for Robot Manipulators |
|
Ubbink, Johan Bernard | KU Leuven |
Viljoen, Ruan Matthys | KU Leuven |
Aertbelien, Erwin | KU Leuven |
Decré, Wilm | Katholieke Universiteit Leuven |
De Schutter, Joris | KU Leuven |
Keywords: Optimization and Optimal Control, Sensor-based Control, Motion Control
Abstract: Model predictive control (MPC) has become increasingly popular for the control of robot manipulators due to its improved performance compared to instantaneous control approaches. However, tuning these controllers remains a significant hurdle. To address this, we propose a practical MPC formulation which retains the more interpretable tuning parameters of the instantaneous control approach while enhancing performance through a prediction horizon. The formulation is motivated by a simple example, highlighting the practical tuning challenges associated with typical MPC approaches and showing how the proposed formulation alleviates these challenges. Furthermore, the formulation is validated on a surface-following task, illustrating its applicability to industrially relevant scenarios. Although the research is presented in the context of robot manipulator control, we anticipate that the formulation is more broadly applicable.
|
|
TuAT17 Regular Session, 405 |
Add to My Program |
Prosthetics and Physically Assistive Devices |
|
|
Chair: Hirata, Yasuhisa | Tohoku University |
Co-Chair: Thomas, Gray | Texas A&M University |
|
09:55-10:00, Paper TuAT17.1 | Add to My Program |
A Control Framework for Accurate Mechanical Impedance Rendering with Series-Elastic Joints in Prosthetic Actuation Applications |
|
Harris, Isaac | University of Michigan |
Rouse, Elliott | University of Michigan |
Gregg, Robert D. | University of Michigan |
Thomas, Gray | Texas A&M University |
Keywords: Compliance and Impedance Control, Compliant Joints and Mechanisms, Prosthetics and Exoskeletons
Abstract: In addition to lifting up the body during gait, human legs provide stabilizing torques that can be modeled as a spring-damper mechanical impedance. While powered prosthetic leg actuators can also imitate spring-damper behaviors, the rendered impedance can be quite different from the desired impedance, stemming from unmodeled torques in the transmission (e.g., sliding friction, bearing damping, gear inefficiency, etc.). Moreover, for powered prostheses to mimic human joint impedance, they will need actuators that accurately render a wide range of mechanical impedances in a variety of ground contact conditions, including nearly free-swinging behavior in swing phase and stiff spring-like behavior in stance phase. For series-elastic prosthetic leg actuators, as in the Open-Source Leg (OSL), these sudden output inertia changes present a challenge for traditional cascaded impedance control. In this paper, we propose a solution based on disturbance observers and full state feedback (FSF) impedance control. With transmission disturbances attenuated, the FSF controller can use pole-zero placement to specify the actuator impedance that couples to the uncertain joint inertia. We validate our control framework on an OSL-like two-actuator dynamometry testbed.
|
|
10:00-10:05, Paper TuAT17.2 | Add to My Program |
Concept and Prototype Development of Adaptive Touch Walking Support Robot for Maximizing Human Physical Potential |
|
Terayama, Junya | Tohoku University |
Ravankar, Ankit A. | Tohoku University |
Salazar Luces, Jose Victorio | Tohoku University |
Tafrishi, Seyed Amir | Cardiff University |
Hirata, Yasuhisa | Tohoku University |
Keywords: Human-Centered Robotics, Physically Assistive Devices, Motion Control
Abstract: We propose a new walking support robot concept, "Nimbus Guardian," designed to enhance the mobility of both healthy and frail elderly individuals who can walk independently. The proposed robot differs from traditional walker-type or cane-type aids by offering adaptive, minimal touch support based on the user's walking dynamics. Our goal is to realize versatile touch support for the user as a preliminary step toward developing the adaptive touch walking support robot. To achieve this, we established a categorization system for walking support touch, outlining the specific types of assistance required for our robot. Based on this categorization, we developed a prototype that improves the versatility of touch support (touch point, force, and initiator), adapting to the user's body. Our prototype is equipped to offer multiple touch support parts, adjusting to the user's physique. For versatile touch capabilities, we designed a motion control algorithm that includes a controller which directs the robot's wheel movements according to the chosen support points, and a state machine that provides multiple arm placements and movements. We have experimentally implemented this motion control algorithm in our prototype. Through experiments, we verified the touch versatility and discussed the prototype's utility and potential for further development.
|
|
10:05-10:10, Paper TuAT17.3 | Add to My Program |
Learning and Online Replication of Grasp Forces from Electromyography Signals for Prosthetic Finger Control |
|
Arbaud, Robin | HRI2 Lab., Istituto Italiano Di Tecnologia ; Dept. of Informatic |
Motta, Elisa | Italian Institute of Technology |
Avaro, Marco | Airworks S.r.l |
Picinich, Stefano | Airworks |
Lorenzini, Marta | Istituto Italiano Di Tecnologia |
Ajoudani, Arash | Istituto Italiano Di Tecnologia |
Keywords: Prosthetics and Exoskeletons, Human Factors and Human-in-the-Loop, Force Control
Abstract: Partial hand amputations significantly affect the physical and psychosocial well-being of individuals, yet intuitive control of externally powered prostheses remains an open challenge. To address this gap, we developed a force-controlled prosthetic finger activated by electromyography (EMG) signals. The prototype, constructed around a wrist brace, functions as a supernumerary finger placed near the index, allowing for early-stage evaluation on unimpaired subjects. A neural network-based model was then implemented to estimate fingertip forces from EMG inputs, allowing for online adjustment of the prosthetic finger grip strength. The force estimation model was validated through experiments with ten participants, demonstrating its effectiveness in predicting forces. Additionally, online trials with four users wearing the prosthesis exhibited precise control over the device. Our findings highlight the potential of using EMG-based force estimation to enhance the functionality of prosthetic fingers.
|
|
10:10-10:15, Paper TuAT17.4 | Add to My Program |
Integrated Motion State Prediction for Sit-To-Stand and Stand-To-Sit Motions Toward Effective Power Assist Control |
|
Ren, Kai | Kyoto University |
Nakamura, Yuichi | Kyoto University |
Kondo, Kazuaki | Kyoto University |
Shimonishi, Kei | Kyoto University |
Ito, Takahide | RIKEN |
Furukawa, Jun-ichiro | Guardian Robot Project, RIKEN |
An, Qi | The University of Tokyo |
Keywords: Intention Recognition, Behavior-Based Systems, Physically Assistive Devices
Abstract: Sit-to-stand and stand-to-sit motions are important in daily activities. However, elderly individuals often find these motions difficult to perform with declining lower limb strength, which considerably reduces their quality of life. In this study, a sensing method for controlling robotic assistive devices was proposed. This method utilizes electromyographic measurements and a deep neural network to predict motion initiation, and it estimates the timing of triggering assistive devices. Experimental results indicate that four muscle synergy patterns are required to represent the sit-to-stand and stand-to-sit motions together, with two of them being shared between both movements. Subsequently, a long short-term memory network was designed to forecast these two motions, and the result indicates that the prediction accuracy reached 92.95% ± 0.83% with a forecasting time of 300 ms.
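For readers unfamiliar with muscle synergy extraction, the sketch below shows one common way such patterns are obtained from EMG envelopes using non-negative matrix factorization; the synthetic data, channel count, and use of scikit-learn NMF are illustrative assumptions and not taken from the paper.

```python
# Illustrative sketch only: extracting four muscle synergies from EMG envelopes
# with non-negative matrix factorization. Data are synthetic; the choice of NMF
# is an assumption for illustration, not the authors' pipeline.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
emg = np.abs(rng.normal(size=(1000, 8)))   # time samples x muscle channels (synthetic envelopes)

nmf = NMF(n_components=4, init="nndsvda", max_iter=500)
activations = nmf.fit_transform(emg)       # time-varying synergy activations, shape (1000, 4)
synergies = nmf.components_                # synergy weights over muscles, shape (4, 8)
print(activations.shape, synergies.shape)
```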
|
|
10:15-10:20, Paper TuAT17.5 | Add to My Program |
On Chain Driven, Adaptive, Underactuated Fingers for the Development of Affordable, Robust Humanlike Prosthetic Hands |
|
Heinemann, Trevor | University of Auckland |
Wallace, Raymond | The University of Auckland |
Liarokapis, Minas | The University of Auckland |
Keywords: Prosthetics and Exoskeletons, Medical Robots and Systems
Abstract: Amputations and limb loss can have detrimental effects on personal well-being. Although prosthetic devices can offer significant benefits, helping amputees regain some of their lost dexterity, they often lack the required affordability and durability. Current affordable prosthetic designs have trended towards underactuation, which leads to stable grasps but is often characterized by low durability. In this paper, a new chain-driven, adaptive, underactuated finger design is proposed for the development of affordable and highly durable prosthetic hands. The transmission mechanism employed is in the form of a steel roller chain. The finger phalanges are constructed of 3D-printed PLA, and finger flexion is produced by pulling the internally routed roller chain that is rerouted using sprockets. In total, six 3D-printed PLA sprockets are used for chain routing, with an emphasis on high force transmission. The performance of the proposed chain-driven finger was experimentally validated and compared with an analogous tendon-driven version. The metrics employed for this comparison were longevity, pinch grasp efficiency, force response, and maximum force capability. The chain-driven finger was shown to have a higher maximum transmissible force, better long-term durability, and no elongation issues (unlike tendons, which are prone to elongation). The cost to manufacture the chain-driven robotic finger is only 91 USD, making it an excellent solution for affordable prostheses.
|
|
10:20-10:25, Paper TuAT17.6 | Add to My Program |
Force Myography Based Torque Estimation in Human Knee and Ankle Joints |
|
Marquardt, Charlotte | Karlsruhe Institute of Technology (KIT) |
Schulz, Arne | Karlsruhe Institute of Technology (KIT) |
Dezman, Miha | Karlsruhe Institute of Technology |
Kurz, Gunther | Karlsruhe Institute of Technology (KIT) |
Stein, Thorsten | Karlsruhe Institute of Technology, Center |
Asfour, Tamim | Karlsruhe Institute of Technology (KIT) |
Keywords: Prosthetics and Exoskeletons, Physically Assistive Devices, Wearable Robotics
Abstract: The online adaptation of exoskeleton control based on muscle activity sensing offers a promising approach to personalizing exoskeleton behavior using the user's biosignals. While electromyography (EMG)-based methods have demonstrated improvements in joint torque estimation, EMG sensors require direct skin contact and extensive post-processing. In contrast, force myography (FMG) measures normal forces resulting from changes in muscle volume due to muscle activity. We propose an FMG-based method to estimate knee and ankle joint torques by integrating joint angles and velocities with muscle activity data. We learn a model for joint torque estimation using Gaussian process regression (GPR). The effectiveness of the proposed FMG-based method is validated on isokinetic motions performed by ten participants. The model is compared to a baseline model that uses only joint angle and velocity, as well as a model augmented by EMG data. The results indicate that incorporating FMG into exoskeleton control can improve joint torque estimation for the ankle and knee under novel task characteristics within a single participant. Although the findings suggest that this approach may not improve the generalizability of estimates between multiple participants, they highlight the need for further research into its potential applications in exoskeleton control.
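As a rough illustration of Gaussian process regression in this role, a minimal sketch follows; the input layout (joint angle, velocity, FMG channels), kernel choice, and synthetic data are assumptions made for illustration, not the authors' model.

```python
# Minimal sketch, assuming a generic scikit-learn setup with synthetic data:
# regressing joint torque from joint angle, joint velocity, and FMG channels.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                               # [angle, velocity, 4 FMG channels] (synthetic)
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=200)     # synthetic torque targets

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

tau_mean, tau_std = gpr.predict(X[:5], return_std=True)     # torque estimate with uncertainty
print(tau_mean, tau_std)
```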
|
|
10:25-10:30, Paper TuAT17.7 | Add to My Program |
Adaptive Ankle-Foot Prosthesis with Passive Agonist-Antagonist Design |
|
Crotti, Matteo | Istituto Italiano Di Tecnologia |
Pace, Anna | Istituto Italiano Di Tecnologia |
Grioli, Giorgio | Istituto Italiano Di Tecnologia |
Bicchi, Antonio | Fondazione Istituto Italiano Di Tecnologia |
Catalano, Manuel Giuseppe | Istituto Italiano Di Tecnologia |
Keywords: Prosthetics and Exoskeletons
Abstract: The development of prosthetic feet that closely replicate the natural biomechanics of the human foot remains a significant challenge in prosthetics engineering. This paper presents the design and testing of a novel agonist-antagonist architecture for the ankle joint of a passive prosthetic foot featuring an adaptive sole. The ankle mechanism, inspired by the dynamics of the human leg-ankle-foot complex, utilizes compliant elements in an agonist-antagonist configuration to passively achieve an ankle torque close to that of a sound ankle without the need for external actuation. Concurrently, the adaptive sole adjusts its shape in response to different terrains, potentially improving stability and comfort for the user. The theoretical model underlying the proposed design is presented, followed by a preliminary validation through simulations. Finally, a prototype based on the new architecture is tested by a healthy subject using customized walking boots, demonstrating its potential to improve the functional performance of prosthetic feet in diverse environments.
|
|
TuAT18 Regular Session, 406 |
Add to My Program |
Intelligent Transportation and Smart Cities |
|
|
Chair: Fanti, Maria Pia | Politecnico Di Bari |
Co-Chair: Koide, Kenji | National Institute of Advanced Industrial Science and Technology |
|
09:55-10:00, Paper TuAT18.1 | Add to My Program |
V2X-DGW: Domain Generalization for Multi-Agent Perception under Adverse Weather Conditions |
|
Li, Baolu | Cleveland State University |
Li, Jinlong | Cleveland State University |
Liu, Xinyu | Cleveland State University |
Xu, Runsheng | UCLA |
Tu, Zhengzhong | Texas A&M University |
Guo, Jiacheng | Cleveland State University |
Zou, Qin | Wuhan University |
Li, Xiaopeng | University of Wisconsin-Madison |
Yu, Hongkai | Cleveland State University |
Keywords: Intelligent Transportation Systems, Computer Vision for Transportation, Cooperating Robots
Abstract: Current LiDAR-based Vehicle-to-Everything (V2X) multi-agent perception systems have shown significant success on 3D object detection. While these models perform well in the clean weather they were trained on, they struggle in unseen adverse weather conditions due to the domain gap. In this paper, we propose a Domain Generalization based approach, named V2X-DGW, for LiDAR-based 3D object detection on multi-agent perception systems under adverse weather conditions. Our research aims not only to maintain favorable multi-agent performance in clean weather but also to promote performance in unseen adverse weather conditions by learning only on clean weather data. To realize Domain Generalization, we first introduce Adaptive Weather Augmentation (AWA) to mimic unseen adverse weather conditions, and then propose two alignments for generalizable representation learning: Trust-region Weather-invariant Alignment (TWA) and Agent-aware Contrastive Alignment (ACA). To evaluate this research, we add fog, rain, and snow conditions to two public multi-agent datasets using physics-based models, resulting in two new datasets: OPV2V-w and V2XSet-w. Extensive experiments demonstrate that our V2X-DGW achieves significant improvements under unseen adverse weather conditions.
|
|
10:00-10:05, Paper TuAT18.2 | Add to My Program |
The Automation of Uncrewed Aircraft Systems Traffic Management Calibration Based on Experimental Platform Data |
|
Henderson, Thomas C. | University of Utah |
Sacharny, David | University of Utah |
Mello, Chad | US Air Force Academy |
Raley, William | University of Utah |
Keywords: Automation Technologies for Smart Cities, Intelligent Transportation Systems, Autonomous Vehicle Navigation
Abstract: Many countries are developing an Urban Air Mobility (UAM) capability defining an Uncrewed Aircraft Systems (UAS) Traffic Management (UTM) architecture to allow safe UAS services in urban environments (e.g., delivery, inspection, air taxis, etc.). The main considerations are airworthiness, operator certification, air traffic management, C2 Link, detect and avoid (DAA), safety management, and security. In addition, if thousands of simultaneous UAS flights are to be achieved, it is not possible for them to be controlled individually by human operators. This makes it necessary to have a rigorous and safe automation methodology to handle such a number of flights. A lane-based airspace structure has been proposed which reduces the complexity of strategic deconfliction by providing UAS agents with a set of pre-defined airway corridors called lanes. This yields collateral benefits including UAS information privacy, robust contingency handling exploiting the lane structure, as well as improved observability and control of the airspace. A robust set of UTM parameters and policies must be determined based on the performance characteristics of the deployed UAS platforms, and a methodology which constitutes a first step toward this end is proposed and demonstrated here. In order to realize this approach, a set of initial experiments has been performed to determine the constraints imposed by the UTM on UAS platform capabilities and vice versa. Initial implementation parameters and policies are defined. The major contribution here is a methodology to calibrate UTM safety parameters (e.g., headway, platform speed) in terms of specific platform models' operational characteristics. That is, UTM parameters are a function of the platform and not some arbitrarily imposed values. Safety uncertainty is then characterized by the calibration method.
|
|
10:05-10:10, Paper TuAT18.3 | Add to My Program |
TS-DETR: Traffic Sign Detection Based on Positive and Negative Sample Augmentation |
|
Lin, Ching-Lun | National Chung Cheng University |
Lin, Huei-Yung | National Taipei University of Technology |
Wang, Chieh-Chih | National Yang Ming Chiao Tung University |
Keywords: Intelligent Transportation Systems, Autonomous Vehicle Navigation
Abstract: Traffic sign detection plays an essential role in advanced driver assistance systems (ADAS) and self-driving vehicles. Typically, deep neural networks are employed to analyze road scene images captured by an onboard camera. However, due to the significant variation in appearance of different traffic signs, the classification of highly similar patterns is still a challenging task. To address these issues, this paper presents an end-to-end traffic sign detection framework based on DETR. The proposed network incorporates data augmentation and negative sample learning to mitigate the problem of data imbalance and effectively enhance the model's recognition capability. A UASPP (Upsample Atrous Pyramid Pooling) module is introduced to integrate multi-scale features and global information. In the experiments, the performance evaluation demonstrates an improvement in mAP of 3.9% on TT100K and 36.3% on GTSDB compared to state-of-the-art methods. The code and datasets are available at https://github.com/chinglun/TS-DETR.
|
|
10:10-10:15, Paper TuAT18.4 | Add to My Program |
A User Based HVAC System Management through Blockchain Technology and Model Predictive Control (I) |
|
Olivieri, Giuseppe | Politecnico Di Bari |
Volpe, Gaetano | Politecnico Di Bari |
Mangini, Agostino Marcello | Politecnico Di Bari |
Fanti, Maria Pia | Politecnico Di Bari |
Keywords: Automation Technologies for Smart Cities, Building Automation, Energy and Environment-Aware Automation
Abstract: This paper introduces an innovative approach to designing a user-based Heating, Ventilation, and Air-Conditioning (HVAC) system management connected with the District Energy Management System. By classifying users into dynamic energy consumption classes that reward energy efficiency and penalize excessive use, users can modify their behavior to move to a less expensive and more virtuous consumption class. To this aim, a blockchain platform determines the rewards and penalties and, via a K-means clustering algorithm, categorizes users into their respective groups. Then, a Class Follower Problem is formulated and solved by a Model Predictive Control (MPC) strategy integrated with a Long Short-Term Memory network as a predictive model. If the users follow the suggestions proposed by the controller, i.e., the thermostat set-points and the time intervals in which the HVAC system must be switched off or on, the users can be placed in a more virtuous consumption class. A case study conducted within an energy district in Bari (Italy) shows how the proposed architectural framework tuned thermal regulation in intelligent buildings while concurrently achieving energy optimization.
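The consumption-class assignment step can be pictured with a small K-means sketch like the one below; the hourly-profile features, number of classes, and synthetic data are assumptions for illustration rather than the paper's configuration.

```python
# Hedged sketch: grouping users into energy consumption classes with K-means.
# Features, class count, and data are assumptions made for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
profiles = rng.gamma(shape=2.0, scale=5.0, size=(100, 24))   # synthetic hourly kWh profiles per user

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(profiles)
classes = km.labels_             # consumption class assigned to each user
centroids = km.cluster_centers_  # representative consumption profile per class
print(np.bincount(classes), centroids.shape)
```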
|
|
10:15-10:20, Paper TuAT18.5 | Add to My Program |
Non-Parametric GNSS Integer Ambiguity Estimation Via Positional Likelihood Field Marginalization |
|
Takanose, Aoki | National Institute of Advanced Industrial Science and Technology |
Koide, Kenji | National Institute of Advanced Industrial Science and Technology |
Oishi, Shuji | National Institute of Advanced Industrial Science and Technology |
Yokozuka, Masashi | Nat. Inst. of Advanced Industrial Science and Technology |
Keywords: Localization, Autonomous Agents, Intelligent Transportation Systems
Abstract: In this paper, we propose a non-parametric method for estimating the posterior distribution of GNSS integer ambiguity. It is difficult to estimate the posterior probability of discrete integer ambiguities directly from carrier phase observations due to the unclear domain definition. We thus introduce a positional likelihood field that accumulates the ambiguity function method values in the positional space and then estimate the integer ambiguity distributions by marginalizing the likelihood over the entire position. Because the positional likelihood field is defined in the positional space, it enables straightforward accumulation of the carrier phase likelihood. To correctly estimate the posterior distribution, however, a sufficient sample density is required, which results in a large computational cost. The proposed method enables large-scale sampling by taking advantage of GPU parallel processing. Experimental results demonstrate that the proposed method enables accurate and robust estimation of integer ambiguity distributions, contributing to improved centimeter-level position estimation accuracy. In addition, the histograms provided quantitative evidence of events in urban environments where the integer ambiguity is not uniquely determined.
|
|
10:20-10:25, Paper TuAT18.6 | Add to My Program |
Whenever, Wherever: Towards Orchestrating Crowd Simulations with Spatio-Temporal Spawn Dynamics |
|
Kreutz, Thomas | Technical University Darmstadt |
Mühlhäuser, Max | Technical University of Darmstadt |
Sanchez Guinea, Alejandro | TU Darmstadt |
Keywords: Automation Technologies for Smart Cities, Modeling and Simulating Humans, Simulation and Animation
Abstract: Realistic crowd simulations are essential for immersive virtual environments, relying on both individual behaviors (microscopic dynamics) and overall crowd patterns (macroscopic characteristics). While recent data-driven methods like deep reinforcement learning improve microscopic realism, they often overlook critical macroscopic features such as crowd density and flow, which are governed by spatio-temporal spawn dynamics, namely, when and where agents enter a scene. Traditional methods, like random spawn rates, stochastic processes, or fixed schedules, are not guaranteed to capture the underlying complexity and often lack diversity and realism. To address this issue, we propose a novel approach called nTPP-GMM that models spatio-temporal spawn dynamics using Neural Temporal Point Processes (nTPPs) coupled with a spawn-conditional Gaussian Mixture Model (GMM) for agent spawn and goal positions. We evaluate our approach by orchestrating crowd simulations of three diverse real-world datasets with nTPP-GMM. Our experiments demonstrate that orchestration with nTPP-GMM leads to realistic simulations that reflect real-world crowd scenarios and allow crowd analysis.
|
|
10:25-10:30, Paper TuAT18.7 | Add to My Program |
RMP-YOLO: A Robust Motion Predictor for Partially Observable Scenarios Even If You Only Look Once |
|
Sun, Jiawei | National University of Singapore |
Li, Jiahui | National University of Singapore |
Liu, Tingchen | National University of Singapore |
Yuan, Chengran | National Universtiy of Singapore |
Sun, Shuo | National University of Singapore |
Huang, Zefan | National University of Singapore |
Wong, Anthony | Moovita |
Tee, Keng Peng | Moovita |
Ang Jr, Marcelo H | National University of Singapore |
Keywords: Autonomous Vehicle Navigation, Integrated Planning and Learning, Intelligent Transportation Systems
Abstract: We introduce RMP-YOLO, a unified framework designed to provide robust motion predictions even with incomplete input data. Our key insight stems from the observation that complete and reliable historical trajectory data plays a pivotal role in ensuring accurate motion prediction. Therefore, we propose a new paradigm that prioritizes the reconstruction of intact historical trajectories before feeding them into the prediction modules. Our approach introduces a novel scene tokenization module to enhance the extraction and fusion of spatial and temporal features. Following this, our proposed recovery module reconstructs agents' incomplete historical trajectories by leveraging local map topology and interactions with nearby agents. The reconstructed, clean historical data is then integrated into the downstream prediction modules. Our framework is able to effectively handle missing data of varying lengths and remains robust against observation noise, while maintaining high prediction accuracy. Furthermore, our recovery module is compatible with existing prediction models, ensuring seamless integration. Extensive experiments validate the effectiveness of our approach, and deployment in real-world autonomous vehicles confirms its practical utility. In the 2024 Waymo Motion Prediction Competition, our method, RMP-YOLO, achieves state-of-the-art performance, securing third place. Our code is open-source at https://github.com/ggosjw/RMP-YOLO.
|
|
TuAT19 Regular Session, 407 |
Add to My Program |
Visual-Inertial Odometry |
|
|
Chair: Myung, Hyun | KAIST (Korea Advanced Institute of Science and Technology) |
Co-Chair: Sanchez-Lopez, Jose Luis | University of Luxembourg |
|
09:55-10:00, Paper TuAT19.1 | Add to My Program |
Leg Exoskeleton Odometry Using a Limited FOV Depth Sensor |
|
Elnecave Xavier, Fabio | MINES Paris / Wandercraft |
Viozelange, Matis | Wandercraft |
Burger, Guillaume | Wandercraft |
Petriaux, Marine | Wandercraft |
Deschaud, Jean-Emmanuel | ARMINES |
Goulette, François | MINES ParisTech |
Keywords: Sensor Fusion, Mapping, Prosthetics and Exoskeletons
Abstract: For leg exoskeletons to operate effectively in real-world environments, they must be able to perceive and understand the terrain around them. However, unlike other legged robots, exoskeletons face specific constraints on where depth sensors can be mounted due to the presence of a human user. These constraints lead to a limited Field Of View (FOV) and greater sensor motion, making odometry particularly challenging. To address this, we propose a novel odometry algorithm that integrates proprioceptive data from the exoskeleton with point clouds from a depth camera to produce accurate elevation maps despite these limitations. Our method builds on an extended Kalman filter (EKF) to fuse kinematic and inertial measurements, while incorporating a tailored iterative closest point (ICP) algorithm to register new point clouds with the elevation map. Experimental validation with a leg exoskeleton demonstrates that our approach reduces drift and enhances the quality of elevation maps compared to a purely proprioceptive baseline, while also outperforming a more traditional point cloud map-based variant.
|
|
10:00-10:05, Paper TuAT19.2 | Add to My Program |
Improving Monocular Visual-Inertial Initialization with Structureless Visual-Inertial Bundle Adjustment |
|
Song, Junlin | University of Luxembourg |
Richard, Antoine | University of Luxembourg |
Olivares-Mendez, Miguel A. | Interdisciplinary Centre for Security, Reliability and Trust - U |
Keywords: Localization
Abstract: Monocular visual inertial odometry (VIO) has facilitated a wide range of real-time motion tracking applications, thanks to the small size of the sensor suite and low power consumption. To successfully bootstrap VIO algorithms, the initialization module is extremely important. Most initialization methods rely on the reconstruction of 3D visual point clouds. These methods suffer from high computational cost as the state vector contains both motion states and 3D feature points. To address this issue, some researchers recently proposed a structureless initialization method, which can solve the initial state without recovering the 3D structure. However, this method potentially compromises performance due to the decoupled estimation of rotation and translation, as well as linear constraints. To improve its accuracy, we propose a novel structureless visual-inertial bundle adjustment to further refine the previous structureless solution. Extensive experiments on real-world datasets show that our method significantly improves VIO initialization accuracy, while maintaining real-time performance.
|
|
10:05-10:10, Paper TuAT19.3 | Add to My Program |
ORB-SfMLearner: ORB-Guided Self-Supervised Visual Odometry with Selective Online Adaptation |
|
Jin, Yanlin | Sichuan University, Rice University |
Ju, Rui-Yang | National Taiwan University |
Liu, Haojun | Carnegie Mellon University |
Zhong, Yuzhong | Sichuan University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, SLAM
Abstract: Deep visual odometry, despite extensive research, still faces limitations in accuracy and generalizability that prevent its broader application. To address these challenges, we propose an Oriented FAST and Rotated BRIEF (ORB)-guided visual odometry with selective online adaptation named ORB-SfMLearner. We present a novel use of ORB features for learning-based ego-motion estimation, leading to more robust and accurate results. We also introduce a cross-attention mechanism to enhance the explainability of PoseNet and reveal that the driving direction of the vehicle can be explained through the attention weights. To improve generalizability, our selective online adaptation allows the network to rapidly and selectively adjust to the optimal parameters across different domains. Experimental results on the KITTI and vKITTI datasets show that our method outperforms previous state-of-the-art deep visual odometry methods in terms of ego-motion accuracy and generalizability.
|
|
10:10-10:15, Paper TuAT19.4 | Add to My Program |
QVIO2: Quantized MAP-Based Visual-Inertial Odometry |
|
Peng, Yuxiang | University of Delaware |
Chen, Chuchu | University of Delaware |
Huang, Guoquan (Paul) | University of Delaware |
Keywords: Visual-Inertial SLAM, Localization, SLAM
Abstract: Energy-efficient visual-inertial motion tracking on SWAP-constrained edge devices (e.g., drones and AR glasses) is essential but challenging. Our previous work introduced the first-of-its-kind quantized visual-inertial odometry (QVIO), utilizing either raw measurement quantization (zQVIO) or single-bit residual quantization (rQVIO). While QVIO has demonstrated significant data transfer reduction with competitive performance, it has limitations. Specifically, zQVIO directly quantizes raw measurements into multi-bit values but requires ad-hoc inflation of the measurement noise to account for quantization errors. On the other hand, rQVIO is limited to single-bit measurements with a certain accuracy loss. This work introduces QVIO2 to address these issues. The proposed QVIO2 improves the data quantization strategies and derives a Maximum A Posteriori (MAP) quantized estimator that rigorously handles both multi-bit and single-bit, raw and residual quantized measurements in a unified manner. These improvements lead to more communication-efficient and accurate systems. Additionally, we optimize the communication protocol to further reduce data transfer by eliminating unnecessary transmissions. Extensive numerical and experimental results demonstrate reduced communication requirements and improved accuracy. Compared to the previous QVIO system, zQVIO2 achieves the same accuracy with a 30% reduction in data transfer, while rQVIO2 improves accuracy without increasing data communication. In real-world scenarios, our new zQVIO2 and rQVIO2 have demonstrated nearly no accuracy loss with only 4.6 bits and 3.5 bits of data communication, achieving compression rates of 7× and 9.1×.
|
|
10:15-10:20, Paper TuAT19.5 | Add to My Program |
Is Iteration Worth It? Revisit Its Impact in Sliding-Window VIO |
|
Chen, Chuchu | University of Delaware |
Peng, Yuxiang | University of Delaware |
Huang, Guoquan (Paul) | University of Delaware |
Keywords: Visual-Inertial SLAM, Sensor Fusion, Localization
Abstract: Visual-inertial odometry (VIO), which fuses noisy inertial readings and camera measurements to provide 3D motion tracking, is a foundational component in many autonomous applications. With the increasing use of next-generation edge devices (e.g., IoT devices, nano drones, and mobile robotics) that are constrained by limited power, resources, and multi-tasking demands, balancing computational efficiency and accuracy in VIO estimators has become more critical than ever. Historically, state estimation algorithms have been developed using either optimization-based or filtering-based methods, with the key distinction being the ability to relinearize measurements and correct state estimates iteratively. It has been widely claimed that iterative methods improve accuracy by reducing error through relinearization, at a higher computational demand; conversely, filtering methods are more efficient but may suffer from significant linearization errors. However, these trade-offs have not been thoroughly examined in the context of visual-inertial motion tracking. In this paper, we conduct the first comprehensive study on the impact of iterative algorithms in sliding-window VIO. We analyze the relinearization of IMU and camera measurements separately, providing insights into how each affects system performance. By considering key factors such as system observability and measurement processes, we offer a deeper understanding of VIO estimator behavior. Our findings, supported by proof-of-concept real-world tests, provide practical guidelines for balancing accuracy and efficiency, helping practitioners determine when to prioritize iterative methods or simpler filtering approaches while encouraging researchers and engineers to rethink VIO design for optimal resource allocation.
|
|
10:20-10:25, Paper TuAT19.6 | Add to My Program |
EAR-SLAM: Environment-Aware Robust Localization System for Terrestrial-Aerial Bimodal Vehicles |
|
He, Wenjun | Harbin Engineering University |
Wang, XingPeng | ZheJiang University |
Wang, Pengfei | Huzhou Institute of Zhejiang University |
Zhang, Tianfu | Huzhou Institute of Zhejiang University |
Xu, Chao | Zhejiang University |
Gao, Fei | Zhejiang University |
Cao, Yanjun | Zhejiang University, Huzhou Institute of Zhejiang University |
Keywords: SLAM, Visual-Inertial SLAM
Abstract: Terrestrial-aerial bimodal vehicles (TABVs) have attracted great attention because of their advantages over single-modal robots. TABVs can provide superior obstacle avoidance capability (flying mode) and safe mobility with long duration (ground mode), offering enhanced adaptability and flexibility in various challenging environments. However, a robust localization approach remains the bottleneck to stably applying TABVs in real-world tasks. In this paper, we present an environment-aware robust localization system specifically designed for passive-wheel-based TABVs, which feature two passive wheels alongside a standard quadrotor. The localization system tightly integrates data from multiple sensors, including a camera, Inertial Measurement Units (IMUs), encoders, and single-point laser distance sensors. First, we introduce a terrain-aware odometer model that accurately estimates the terrain slope and the vehicle's velocity by fusing gyroscope, encoder, and single-point laser measurements. Then, we propose an anomaly-aware method that senses anomalous sensors and dynamically adjusts the optimization weights accordingly. By explicitly estimating the environmental conditions, such as ground terrain slopes and visual information quality, the robot can achieve accurate and robust localization results on the ground. To validate our localization approach, we conducted extensive experiments across various challenging scenarios, demonstrating the effectiveness and reliability of our system for real-world applications.
|
|
10:25-10:30, Paper TuAT19.7 | Add to My Program |
DynaVINS++: Robust Visual-Inertial State Estimator in Dynamic Environments by Adaptive Truncated Least Squares and Stable State Recovery |
|
Song, Seungwon | Hyundai Motor Company |
Lim, Hyungtae | Massachusetts Institute of Technology |
Lee, Alex | Sookmyung Women’s University |
Myung, Hyun | KAIST (Korea Advanced Institute of Science and Technology) |
Keywords: Visual-Inertial SLAM, Sensor Fusion, SLAM
Abstract: Despite extensive research on robust visual-inertial navigation systems (VINS) in dynamic environments, many approaches remain vulnerable to objects that suddenly start moving, which are referred to as abruptly dynamic objects. In addition, most approaches have considered the effect of dynamic objects only at the feature association level. In this study, we observed that the state estimation diverges when errors from false correspondences owing to moving objects incorrectly propagate into the IMU bias terms. To overcome these problems, we propose a robust VINS framework called DynaVINS++, which employs a) an adaptive truncated least squares method that adaptively adjusts the truncation range using both feature association and IMU preintegration to effectively minimize the effect of dynamic objects while reducing the computational cost, and b) stable state recovery with a bias consistency check to correct misestimated IMU bias and to prevent the divergence caused by abruptly dynamic objects. As verified on both public and real-world datasets, our approach shows promising performance in dynamic environments, including scenes with abruptly dynamic objects.
|
|
TuAT20 Regular Session, 408 |
Add to My Program |
Teleoperation |
|
|
Chair: Fiorini, Paolo | University of Verona |
Co-Chair: Cui, Yuchen | University of California, Los Angeles |
|
09:55-10:00, Paper TuAT20.1 | Add to My Program |
A Pragmatic Approach to Bi-Directional Impedance Reflection Telemanipulation Control: Design and User Study |
|
Lieftink, Robin | TNO, University of Twente |
Falcone, Sara | University of Twente |
Van Der Walt, Christophe | University of Twente |
Van Erp, Jan | TNO |
Dresscher, Douwe | University of Twente |
Keywords: Telerobotics and Teleoperation, Haptics and Haptic Interfaces, Human Factors and Human-in-the-Loop
Abstract: Force feedback generally increases the effectiveness of execution and the sense of embodiment in telemanipulation systems. However, systems with force feedback are vulnerable to time delays, reducing their transparency and stability. In this paper, we implement a bi-directional impedance reflection controller, a concept that was presented as early as 1989 by Blake Hannaford [1] but was never fully implemented. In this method, the simplified impedances of the operator and the environment are estimated and reflected back to the remote robot and haptic interface, respectively. A trajectory predictor is added to compensate for the delayed motion. We then evaluated the effectiveness of the system in a user study, comparing it to a system with a classical bilateral impedance controller with passivity layers. Three time-delay groups (0, 10, and 20 ms one-way delay) of 10 participants each executed different tasks with both controllers. The results show that the bi-directional impedance reflection controller performs significantly better in the 10 ms and 20 ms time-delay groups in terms of task performance, user experience, and sense of embodiment. We conclude that this study is the first to show that bi-directional impedance reflection is robust to time delays of at least 20 ms.
|
|
10:00-10:05, Paper TuAT20.2 | Add to My Program |
3D Whole-Body Pose Estimation Using Graph High-Resolution Network for Humanoid Robot Teleoperation |
|
Zhang, Mingyu | Sun Yat-Sen University |
Gao, Qing | Sun Yat-Sen University |
Lai, Yuanchuan | Sun Yat-Sen University |
Zhang, Ye | Sun Yat-Sen University |
Chang, Tao | National University of Defense Technology |
Guo, Yulan | Sun Yat-Sen University |
Keywords: Deep Learning for Visual Perception, Gesture, Posture and Facial Expressions, Human Detection and Tracking
Abstract: In the realm of robotics, teleoperation plays a pivotal role in performing high-risk or intricate tasks, and obtaining precise 3D whole-body poses is crucial for this purpose. Traditional two-stage methods have limitations in estimating different body parts, leading to complex systems and higher estimation errors. To address these issues, this paper introduces a novel framework called Graph High-Resolution Network (GraphHRNet) for accurate 3D whole-body pose estimation, which is essential for the teleoperation of humanoid robots. GraphHRNet effectively captures global structural information and local details by integrating a High-Resolution Module and a Multi-branch Regression Module. The High-Resolution Module utilizes an enhanced graph convolution kernel to fuse multi-scale features, capturing global information, while the Multi-branch Regression Module focuses on refining and predicting accurate 3D coordinates for intricate body parts such as the hands and face. Experimental results on the H3WB dataset demonstrate that GraphHRNet surpasses state-of-the-art (SOTA) methods in 3D whole-body pose estimation, significantly improving performance. Furthermore, the paper explores the potential application of this approach in a teleoperation system for humanoid robots, providing an intuitive and high-fidelity solution for remotely executing complex tasks. The code is publicly available at https://github.com/Z-mingyu/GraphHRNet.git
|
|
10:05-10:10, Paper TuAT20.3 | Add to My Program |
Towards Real-Time Generation of Delay-Compensated Video Feeds for Outdoor Mobile Robot Teleoperation |
|
Chakraborty, Neeloy | University of Illinois at Urbana-Champaign |
Fang, Yixiao | University of Illinois at Urbana-Champaign |
Schreiber, Andre | University of Illinois Urbana-Champaign |
Ji, Tianchen | University of Illinois at Urbana-Champaign |
Huang, Zhe | University of Illinois at Urbana-Champaign |
Mihigo, Aganze | University of Illinois at Urbana-Champaign |
Wall, Cassidy | University of Illinois at Urbana Champaign |
Almana, Abdulrahman | University of Illinois Urbana-Champaign |
Driggs-Campbell, Katherine | University of Illinois at Urbana-Champaign |
Keywords: Field Robots, Telerobotics and Teleoperation, Deep Learning for Visual Perception
Abstract: Teleoperation is an important technology to enable supervisors to control agricultural robots remotely. However, environmental factors in dense crop rows and limitations in network infrastructure hinder the reliability of data streamed to teleoperators. These issues result in delayed and variable frame rate video feeds that often deviate significantly from the robot's actual viewpoint. We propose a modular learning-based vision pipeline to generate delay-compensated images in real-time for supervisors. Our extensive offline evaluations demonstrate that our method generates more accurate images compared to state-of-the-art approaches in our setting. Additionally, ours is one of the few works to evaluate a delay-compensation method in outdoor field environments with complex terrain on data from a real robot in real-time. Resulting videos and code are provided at https://sites.google.com/illinois.edu/comp-teleop.
|
|
10:10-10:15, Paper TuAT20.4 | Add to My Program |
ForceMimic: Force-Centric Imitation Learning with Force-Motion Capture System for Contact-Rich Manipulation |
|
Liu, Wenhai | Shanghai Jiao Tong University |
Wang, Junbo | Shanghai Jiao Tong University |
Wang, Yiming | Shanghai Jiao Tong University |
Wang, Weiming | Shanghai Jiao Tong University |
Lu, Cewu | ShangHai Jiao Tong University |
Keywords: Imitation Learning, Force Control, Deep Learning in Grasping and Manipulation
Abstract: In most contact-rich manipulation tasks, humans apply time-varying forces to the target object, compensating for inaccuracies in the vision-guided hand trajectory. However, current robot learning algorithms primarily focus on trajectory-based policies, with limited attention given to learning force-related skills. To address this limitation, we introduce ForceMimic, a force-centric robot learning system, providing a natural, force-aware and robot-free robotic demonstration collection system, along with a hybrid force-motion imitation learning algorithm for robust contact-rich manipulation. Using the proposed ForceCapture system, an operator can peel a zucchini in 5 minutes, while force-feedback teleoperation takes over 13 minutes and struggles with task completion. With the collected data, we propose HybridIL to train a force-centric imitation learning model, equipped with a hybrid force-position control primitive to fit the predicted wrench-position parameters during robot execution. Experiments demonstrate that our approach enables the model to learn a more robust policy under the contact-rich task of vegetable peeling, increasing the success rate by 54.5% relative to state-of-the-art pure-vision-based imitation learning. Hardware, code, data and more results can be found on the project website at https://forcemimic.github.io.
|
|
10:15-10:20, Paper TuAT20.5 | Add to My Program |
How to Train Your Robots? the Impact of Demonstration Modality on Imitation Learning |
|
Li, Haozhuo | Stanford University |
Cui, Yuchen | University of California, Los Angeles |
Sadigh, Dorsa | Stanford University |
Keywords: Imitation Learning, Learning from Demonstration, Data Sets for Robot Learning
Abstract: Imitation learning is a promising approach for learning robot policies with user-provided data. The way demonstrations are provided, i.e., the demonstration modality, influences the quality of the data. While existing research shows that kinesthetic teaching (physically guiding the robot) is preferred by users for its intuitiveness and ease of use, the majority of existing manipulation datasets were collected through teleoperation via a VR controller or spacemouse. In this work, we investigate how different demonstration modalities impact downstream learning performance as well as user experience. Specifically, we compare low-cost demonstration modalities including kinesthetic teaching, teleoperation with a VR controller, and teleoperation with a spacemouse controller. We experiment with three table-top manipulation tasks with different motion constraints. We evaluate and compare imitation learning performance using data from different demonstration modalities, and collect subjective feedback on user experience. Our results show that kinesthetic teaching is rated the most intuitive for controlling the robot and provides the cleanest data for the best downstream learning performance. However, it is not preferred for large-scale data collection due to the physical load. Based on this insight, we propose a simple data collection scheme that relies on a small number of kinesthetic demonstrations mixed with data collected through teleoperation to achieve the best overall learning performance while maintaining low data-collection effort.
|
|
10:20-10:25, Paper TuAT20.6 | Add to My Program |
The Impact of Stress and Workload on Human Performance in Robot Teleoperation Tasks |
|
Yi Ting, Sam | Georgia Institute of Technology |
Hedlund-Botti, Erin | Georgia Institute of Technology |
Natarajan, Manisha | Georgia Institute of Technology |
Heard, Jamison | Rochester Institute of Technology |
Gombolay, Matthew | Georgia Institute of Technology |
Keywords: Telerobotics and Teleoperation, Cognitive Human-Robot Interaction, Human Factors and Human-in-the-Loop, Human-Centered Robotics
Abstract: Advances in robot teleoperation have enabled groundbreaking innovations in many fields, such as space exploration, healthcare, and disaster relief. The human operator's performance plays a key role in the success of any teleoperation task, with prior evidence suggesting that operator stress and workload can impact task performance. As robot teleoperation is currently deployed in safety-critical domains, it is essential to analyze how different stress and workload levels impact the operator. We are unaware of any prior work investigating how both stress and workload impact teleoperation performance. We conducted a novel study (n=24) to jointly manipulate users' stress and workload and analyze the user's performance through objective and subjective measures. Our results indicate that, as stress increased, over 70% of our participants performed better up to a moderate level of stress; yet, the majority of participants performed worse as the workload increased. Importantly, our experimental design elucidated that stress and workload have related yet distinct impacts on task performance, with workload mediating the effects of distress on performance (p<.05).
|
|
10:25-10:30, Paper TuAT20.7 | Add to My Program |
Adaptive User Interface with Parallel Neural Networks for Robot Teleoperation |
|
SharafianArdakani, Payman | UofL |
Hanafy, Mohamed A. | University of Louisville |
Kondaurova, Irina | UofL |
Ashary, Ali | University of Louisville |
Rayguru, Madan Mohan | Delhi Technological University |
Popa, Dan | University of Louisville |
Keywords: Telerobotics and Teleoperation, Human Performance Augmentation, Virtual Reality and Interfaces
Abstract: In recent years, human-robot interaction (HRI) has become an increasingly important field of research. The human experience during HRI tasks like teleoperation or turn-taking largely depends on the interface design between the robot and the user. Designing an intuitive user interface (UI) between an arbitrary M-dimensional input device and an N-degree of freedom (DOF) robot remains a significant challenge. This paper proposes a novel UI design approach named the Parallel Neural Networks Adaptive User Interface (PNNUI). PNNUI utilizes two parallel neural networks to learn and then improve the teleoperation performance of users by minimizing task completion time and maximizing motion smoothness. Our method is designed to learn an unintuitive input-output map between user interface hardware and the robot by minimizing task completion time in an offline unsupervised learning scheme based on Neural Networks (NNs) and Genetic Algorithms. Secondly, PNNUI minimizes teleoperation jerk online by adapting the weights of a parallel neural network. We experimentally evaluated the resulting UI for teleoperating a 3-DOF nonholonomic robot through a conventional joystick with three inputs. Twenty human subjects operated the robot along an obstacle course in several conditions. The statistical analysis of the user trial data shows that PNNUI improves the human experience in robot teleoperation by maximizing smoothness while maintaining the completion time of the offline learning scheme. Furthermore, the abstract nature of our formulation enables the customization of performance measures, which extends its applicability to other interface devices and HRI tasks, particularly those that are not intuitive to start with.
|
|
TuAT21 Regular Session, 410 |
Add to My Program |
Reinforcement Learning 1 |
|
|
Chair: Song, WenZhan | University of Georgia |
Co-Chair: Kantaros, Yiannis | Washington University in St. Louis |
|
09:55-10:00, Paper TuAT21.1 | Add to My Program |
Hierarchical Visual Policy Learning for Long-Horizon Robot Manipulation in Densely Cluttered Scenes |
|
Wang, Hecheng | Fudan University |
Qi, Lizhe | Fudan University |
Wang, Ziheng | Academy for Engineering & Technology, Fudan University |
Ren, Jiankun | Fudan University |
Li, Wei | Fudan University |
Sun, Yunquan | Fudan University |
Keywords: Reinforcement Learning, Imitation Learning, Deep Learning in Grasping and Manipulation
Abstract: In this work, we focus on addressing the long-horizon packing tasks in densely cluttered scenes. Such tasks require policies to effectively manage severe occlusions among objects and continually produce precise actions based on visual observations. We propose a vision-based Hierarchical policy for Cluttered-scene Long-horizon Manipulation (HCLM). It employs a high-level policy and three options to select and instantiate three parameterized action primitives: push, pick, and place. We first train the two-stream pick and place options by behavior cloning (BC). Subsequently, we use hierarchical reinforcement learning (HRL) to train the high-level policy and push option. During HRL, we propose a Spatially Extended Q-update (SEQ) to augment the updates for the push option and a Two-Stage Update Scheme (TSUS) to alleviate the non-stationary transition problem in updating the high-level policy. We demonstrate that HCLM significantly outperforms baseline methods in terms of success rate and efficiency in diverse tasks both in simulation and real world. The ablation studies also validate the key roles of SEQ and TSUS in HRL.
|
|
10:00-10:05, Paper TuAT21.2 | Add to My Program |
AERAS: Adaptive Experience Replay with Attention-Based Sequence Embedding for Improved Multi-Agent Reinforcement Learning |
|
Xie, Zaipeng | Hohai University |
Shen, Sitong | Hohai University |
Wang, Yaowu | Hohai University |
Fang, Wenhao | Hohai University |
Song, WenZhan | University of Georgia |
Keywords: Reinforcement Learning, Agent-Based Systems, Autonomous Agents
Abstract: Multi-agent systems in non-stationary environments face challenges due to rapidly changing dynamics, leading to quick obsolescence of experiences in the replay buffer. To address this, we propose the Adaptive Experience Replay with Attention-Based Sequence Embedding (AERAS) framework, which integrates sequence embedding with an attention mechanism to prioritize experiences based on their relevance. By assigning adaptive weights, AERAS emphasizes relevant experiences while diminishing the impact of outdated ones, enhancing efficiency and learning performance in multi-agent reinforcement learning. Evaluations on the StarCraft II Multi-Agent Challenge and Google Research Football environments show that AERAS consistently outperforms state-of-the-art methods, achieving faster convergence and higher win rates. Ablation studies confirm the essential roles of sequence embedding and attention mechanisms in boosting AERAS's robustness and adaptability, underscoring its effectiveness in managing non-stationary environments within multi-agent systems.
|
|
10:05-10:10, Paper TuAT21.3 | Add to My Program |
Multi-Type Preference Learning: Empowering Preference-Based Reinforcement Learning with Equal Preferences |
|
Liu, Ziang | East China Normal University |
Xu, Junjie | East China Normal University |
Wu, XingJiao | East China Normal University |
Yang, Jing | East China Normal University |
He, Liang | East China Normal University |
Keywords: Reinforcement Learning, Human Factors and Human-in-the-Loop, Deep Learning Methods
Abstract: Preference-based reinforcement learning (PBRL) learns directly from the preferences of human teachers regarding agent behaviors, without needing meticulously designed reward functions. However, existing PBRL methods often learn primarily from explicit preferences, neglecting the possibility that teachers may choose equal preferences. This neglect may hinder the agent's understanding of the teacher's perspective on the task, leading to the loss of important information. To address this issue, we introduce the Equal Preference Learning Task, which optimizes the neural network by promoting similar reward predictions when the behaviors of two agents are labeled as equal preferences. Building on this task, we propose a novel PBRL method, Multi-Type Preference Learning (MTPL), which allows simultaneous learning from equal preferences while leveraging existing methods for learning from explicit preferences. To validate our approach, we design experiments applying MTPL to four existing state-of-the-art baselines across ten locomotion and robotic manipulation tasks in the DeepMind Control Suite. The experimental results indicate that simultaneous learning from both equal and explicit preferences enables the PBRL method to more comprehensively understand the feedback from teachers, thereby enhancing feedback efficiency. Project page: https://github.com/FeiCuiLengMMbb/paper_MTPL
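One way to picture learning from equal preferences is a Bradley-Terry-style preference loss with soft 0.5/0.5 targets, as in the hedged sketch below; this is a generic illustration of the concept, not the MTPL implementation.

```python
# Minimal sketch (not the authors' code): a preference loss over predicted
# segment returns that also accepts "equally preferred" labels (label = 0.5),
# which pushes the two reward predictions toward each other.
import torch

def preference_loss(ret_a, ret_b, label):
    """ret_a, ret_b: predicted returns for two behavior segments, shape [B].
    label: 1.0 if A is preferred, 0.0 if B is preferred, 0.5 if equal."""
    logits = torch.stack([ret_a, ret_b], dim=-1)
    log_p = torch.log_softmax(logits, dim=-1)
    target = torch.stack([label, 1.0 - label], dim=-1)
    return -(target * log_p).sum(dim=-1).mean()

ret_a = torch.randn(8, requires_grad=True)
ret_b = torch.randn(8, requires_grad=True)
labels = torch.full((8,), 0.5)        # all pairs labeled as equally preferred
loss = preference_loss(ret_a, ret_b, labels)
loss.backward()
print(loss.item())
```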
|
|
10:10-10:15, Paper TuAT21.4 | Add to My Program |
Neural Lyapunov Function Approximation with Self-Supervised Reinforcement Learning |
|
McCutcheon, Luc Harold Lucien | University of Surrey |
Gharesifard, Bahman | UCLA |
Fallah, Saber | University of Surrey |
Keywords: Reinforcement Learning, Robot Safety, Machine Learning for Robot Control
Abstract: Control Lyapunov functions are traditionally used to design a controller which ensures convergence to a desired state, yet deriving these functions for nonlinear systems remains a complex challenge. This paper presents a novel, sample-efficient method for neural approximation of nonlinear Lyapunov functions, leveraging self-supervised Reinforcement Learning (RL) to enhance training data generation, particularly for inaccurately represented regions of the state space. The proposed approach employs a data-driven World Model to train Lyapunov functions from off-policy trajectories. The method is validated on both standard and goal-conditioned robotic tasks, demonstrating faster convergence and higher approximation accuracy compared to the state-of-the-art neural Lyapunov approximation baseline. The code is available at: https://github.com/CAV-Research-Lab/SACLA.git
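The Lyapunov conditions that such a neural approximator must satisfy can be turned into training losses; the following is a generic sketch (the margin, decrease rate, and use of off-policy transition pairs are our assumptions, not the paper's exact formulation):

import torch

def lyapunov_losses(V, x, x_next, x_goal, margin=1e-3, alpha=0.1):
    # V: neural network mapping states to scalar values
    v, v_next, v_goal = V(x).squeeze(-1), V(x_next).squeeze(-1), V(x_goal).squeeze(-1)
    loss_goal = v_goal.pow(2).mean()                                # V(x*) should be ~0 at the goal
    loss_positive = torch.relu(margin - v).mean()                   # V(x) > 0 away from the goal
    loss_decrease = torch.relu(v_next - (1.0 - alpha) * v).mean()   # V decreases along transitions
    return loss_goal + loss_positive + loss_decrease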
|
|
10:15-10:20, Paper TuAT21.5 | Add to My Program |
SuPLE: Robot Learning with Lyapunov Rewards |
|
Nguyen, Phu | San Jose State University |
Polani, Daniel | University of Hertfordshire |
Tiomkin, Stas | Texas Tech University |
Keywords: Reinforcement Learning, Machine Learning for Robot Control
Abstract: The reward function is an essential component in robot learning. Reward directly affects the sample and computational complexity of learning, and the quality of a solution. The design of informative rewards requires domain knowledge, which is not always available. We use the properties of the dynamics to produce a system-appropriate reward without adding external assumptions. Specifically, we explore an approach that utilizes the Lyapunov exponents of the system dynamics to generate a system-immanent reward. We demonstrate that the Sum of the Positive Lyapunov Exponents (SuPLE) is a strong candidate for the design of such a reward. We develop a computational framework for the derivation of this reward, and demonstrate its effectiveness on classical benchmarks for sample-based stabilization of various dynamical systems. It eliminates the need to start the training trajectories at arbitrary states, also known as auxiliary exploration. While the latter is a common practice in simulated robot learning, it is impractical in real robotic systems, since these typically start from natural rest states, such as a pendulum at the bottom or a robot on the ground, and cannot be easily initialized at arbitrary states. Comparing the performance of SuPLE to commonly-used reward functions, we observe that the latter fail to find a solution without auxiliary exploration, even for the task of swinging up the double pendulum and keeping it stable at the upright position, a prototypical scenario for multi-linked robots. SuPLE-induced rewards offer a novel route for effective robot learning in typical, as opposed to highly specialized or fine-tuned, scenarios. Our code is publicly available for reproducibility and further research.
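For reference, the quantities behind such a reward can be written as follows (standard definitions of finite-time Lyapunov exponents; the exact way the paper turns them into a per-step reward may differ):

\[
\lambda_i(T) \;=\; \frac{1}{T}\,\ln \sigma_i\!\Big(\prod_{t=0}^{T-1} J_t\Big),
\qquad
r_{\mathrm{SuPLE}} \;=\; \sum_{i:\,\lambda_i>0} \lambda_i ,
\]

where \(J_t = \partial f/\partial x\,\big|_{x_t}\) is the Jacobian of the dynamics along the trajectory and \(\sigma_i(\cdot)\) denotes the \(i\)-th singular value of the accumulated Jacobian.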
|
|
10:20-10:25, Paper TuAT21.6 | Add to My Program |
SpeedTuning: Speeding up Policy Execution with Lightweight Reinforcement Learning |
|
Yuan, David D. | Stanford University |
Zhao, Zihao | Stanford University |
Burns, Kaylee | Stanford University |
Finn, Chelsea | Stanford University |
Keywords: Reinforcement Learning, Imitation Learning, Deep Learning Methods
Abstract: While learned robotic policies hold promise for advancing generalizable manipulation, their practical deployment is often hindered by suboptimal execution speeds. Imitation learning policies are inherently limited by hardware constraints and the speed of the operator during data collection. In addition, there are no established methods for accelerating policies learned via imitation, and the empirical relationship between execution speed and task success remains underexplored. To address these issues, we introduce SpeedTuning, a reinforcement learning framework specifically designed to enhance the speed of manipulation policies. SpeedTuning learns to predict the optimal execution speed for actions, thereby complementing a base policy without necessitating additional data collection. We provide empirical evidence that SpeedTuning achieves substantial improvements in execution speed, exceeding 2.4x speed-up, while preserving an adequate success rate compared to both the original task policy and straightforward speed-up methods such as linear interpolation at a fixed speed. We evaluate our approach across a diverse set of dynamic and precise tasks, including pouring, throwing, and picking, demonstrating its effectiveness and robustness in enhancing real-world robotic manipulation. Videos and code are available at https://github.com/DaivdYuan/SpeedTuning
|
|
10:25-10:30, Paper TuAT21.7 | Add to My Program |
Simplifying Reward Design in Complex Robotics: Average-Reward Maximum Entropy Reinforcement Learning |
|
Choe, Jean Seong Bjorn | Korea University |
Choi, Bumkyu | Korea University |
Kim, Jong-kook | Korea University |
Keywords: Reinforcement Learning, Underactuated Robots, Robust/Adaptive Control
Abstract: This paper presents a novel approach to addressing the control challenges of underactuated systems, focusing on the swing-up and stabilisation tasks on the double pendulum system. We propose the Average-Reward Entropy Advantage Policy Optimisation (AR-EAPO), a model-free reinforcement learning (RL) algorithm that integrates the strengths of average-reward RL and maximum entropy RL (MaxEnt RL). The average-reward criterion allows the use of a simple reward function by naturally promoting long-term goals, while MaxEnt RL encourages the robustness of the policy. We validate our approach through simulations, consistently outperforming standard RL baselines and traditional control methods. Also, we provide preliminary test results on real double pendulum hardware. Additional experiments on MuJoCo environments further demonstrate AR-EAPO's efficacy on general continuous control tasks. This work underscores the potential of the average-reward criterion in simplifying control design while achieving superior results.
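In its generic form, the combined criterion referred to above is (a reference sketch of the average-reward MaxEnt objective; the paper's algorithmic details such as the advantage estimator are not reproduced):

\[
\rho(\pi) \;=\; \lim_{T\to\infty}\frac{1}{T}\,
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} r(s_t,a_t) \;+\; \alpha\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big)\right],
\]

i.e. the policy maximizes the long-run average of reward plus entropy rather than a discounted sum, which is what allows a simple reward function to still capture long-term goals.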
|
|
TuAT22 Regular Session, 411 |
Add to My Program |
Learning Based Planning for Manipulation 1 |
|
|
Chair: Hermans, Tucker | University of Utah |
Co-Chair: Pompili, Dario | Rutgers University |
|
09:55-10:00, Paper TuAT22.1 | Add to My Program |
Multi-Stage Reinforcement Learning for Non-Prehensile Manipulation |
|
Wang, Dexin | Shandong University |
Liu, Chunsheng | Shandong University |
Chang, Faliang | Shandong University |
Huan, Hengqiang | Shandong University |
Cheng, Kun | Shandong University |
Keywords: Grasping, Manipulation Planning, Reinforcement Learning
Abstract: Manipulating objects without grasping them facilitates complex tasks, known as non-prehensile manipulation. Most previous methods are limited to learning a single skill to manipulate objects with primitive shapes, and cannot handle flexible object manipulation that requires a combination of multiple skills. We explore skill-unconstrained non-prehensile manipulation, and propose Multi-stage Reinforcement Learning for Non-prehensile Manipulation (MRLNM), which computes an intermediate state between the initial and goal states and divides the task into multiple stages for sequential learning. At each stage, the policy takes the desired 6-DOF object pose as the goal and proposes a spatially-continuous action, allowing the robot to explore arbitrary skills to accomplish the task. To handle objects with different shapes, we propose a State-Goal Fusion Representation (SGF-Representation) to represent observations and goals as point clouds with motion, which improves the policy's perception of scene layout and task goal. To improve sample efficiency, we propose a Spatially-Reachable Distance Metric (SR-Distance) to approximately measure the shortest distance between two points without intersecting the scene. We evaluate MRLNM on an occluded grasping task which aims to grasp the object in initially occluded configurations. MRLNM demonstrates strong generalization to unseen objects with shapes outside the training distribution and can be transferred to the real world with zero-shot transfer, achieving a 95% success rate.
|
|
10:00-10:05, Paper TuAT22.2 | Add to My Program |
Points2Plans: From Point Clouds to Long-Horizon Plans with Composable Relational Dynamics |
|
Huang, Yixuan | University of Utah |
Agia, Christopher George | Stanford University |
Wu, Jimmy | Princeton University |
Hermans, Tucker | University of Utah |
Bohg, Jeannette | Stanford University |
Keywords: Deep Learning in Grasping and Manipulation, Mobile Manipulation, Manipulation Planning
Abstract: We present Points2Plans, a framework for composable planning with a relational dynamics model that enables robots to solve long-horizon manipulation tasks from partial-view point clouds. Given a language instruction and a point cloud of the scene, our framework initiates a hierarchical planning procedure, whereby a language model generates a high-level plan and a sampling-based planner produces constraint-satisfying continuous parameters for manipulation primitives sequenced according to the high-level plan. Key to our approach is the use of a relational dynamics model as a unifying interface between the continuous and symbolic representations of states and actions, thus facilitating language-driven planning from high-dimensional perceptual input such as point clouds. Whereas previous relational dynamics models require training on datasets of multi-step manipulation scenarios that align with the intended test scenarios, Points2Plans uses only single-step simulated training data while generalizing zero-shot to a variable number of steps during real-world evaluations. We evaluate our approach on tasks involving geometric reasoning, multi-object interactions, and occluded object reasoning in both simulated and real-world settings. Results demonstrate that Points2Plans offers strong generalization to unseen long-horizon tasks in the real world, where it solves over 85% of evaluated tasks while the next best baseline solves only 50%.
|
|
10:05-10:10, Paper TuAT22.3 | Add to My Program |
Retrieval-Augmented Hierarchical In-Context Reinforcement Learning and Hindsight Modular Reflections for Task Planning with LLMs |
|
Sun, Chuanneng | Rutgers University |
Huang, Songjun | Rutgers University |
Liu, Haiqiao | Rutgers University |
Gong, Jie | Rutgers University |
Pompili, Dario | Rutgers University |
Keywords: AI-Based Methods, Reinforcement Learning, Agent-Based Systems
Abstract: Large Language Models (LLMs) have demonstrated remarkable abilities in various language tasks, making them promising candidates for decision-making in robotics. Inspired by Hierarchical Reinforcement Learning (HRL), we propose Retrieval-Augmented Hierarchical in-context reinforcement Learning (RAHL), a novel framework in which an LLM-based high-level policy decomposes complex tasks into sub-tasks on-the-fly. The sub-tasks, defined by goals, are assigned to the low-level policy to complete. To improve the agent's performance in multi-episode execution, we propose Hindsight Modular Reflection (HMR), where, instead of reflecting on the full trajectory, we let the agent reflect on shorter sub-trajectories to improve reflection efficiency. We evaluate the decision-making ability of the proposed RAHL in three benchmark environments--ALFWorld, Webshop, and HotpotQA. Results show that RAHL can achieve 9%, 42%, and 10% performance improvement in 5 episodes of execution over strong baselines. Furthermore, we also implemented RAHL on the Boston Dynamics SPOT robot. The experiment shows that the robot is able to scan the environment, find doorways, and navigate to new rooms controlled by the LLM policy.
|
|
10:10-10:15, Paper TuAT22.4 | Add to My Program |
Automatic Behavior Tree Expansion with LLMs for Robotic Manipulation |
|
Styrud, Jonathan | ABB |
Iovino, Matteo | ABB Corporate Research |
Norrlöf, Mikael | Linköping University |
Björkman, Mårten | KTH |
Smith, Claes Christian | KTH Royal Institute of Technology |
Keywords: AI-Enabled Robotics, AI-Based Methods, Behavior-Based Systems
Abstract: Robotic systems for manipulation tasks are increasingly expected to be easy to configure for new tasks or unpredictable environments, while keeping a transparent policy that is readable and verifiable by humans. We propose the method BEhavior TRee eXPansion with Large Language Models to dynamically and automatically expand and configure Behavior Trees as policies for robot control. The method utilizes an LLM to resolve errors outside the task planner's capabilities, both during planning and execution. We show that the method is able to solve a variety of tasks and failures and permanently update the policy to handle similar problems in the future.
|
|
10:15-10:20, Paper TuAT22.5 | Add to My Program |
LLM-As-BT-Planner: Leveraging LLMs for Behavior Tree Generation in Robot Task Planning |
|
Ao, Jicong | Technical University Munich |
Wu, Fan | Technical University of Munich |
Wu, Yansong | Technische Universität München |
Swikir, Abdalla | Mohamed Bin Zayed University of Artificial Intelligence |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Behavior-Based Systems, AI-Enabled Robotics, Assembly
Abstract: Robotic assembly tasks remain an open challenge due to their long horizon nature and complex part relations. Behavior trees (BTs) are increasingly used in robot task planning for their modularity and flexibility, but creating them manually can be effort-intensive. Large language models (LLMs) have recently been applied to robotic task planning for generating action sequences, yet their ability to generate BTs has not been fully investigated. To this end, we propose LLM-as-BT-Planner, a novel framework that leverages LLMs for BT generation in robotic assembly task planning. Four in-context learning methods are introduced to utilize the natural language processing and inference capabilities of LLMs for producing task plans in BT format, reducing manual effort while ensuring robustness and comprehensibility. Additionally, we evaluate the performance of fine-tuned smaller LLMs on the same tasks. Experiments in both simulated and real-world settings demonstrate that our framework enhances LLMs' ability to generate BTs, improving success rate through in-context learning and supervised fine-tuning.
|
|
10:20-10:25, Paper TuAT22.6 | Add to My Program |
Enhancing Multi-Agent Systems Via Reinforcement Learning with LLM-Based Planner and Graph-Based Policy |
|
Jia, Ziqi | Tsinghua University |
Li, Junjie | Huazhong University of Science and Technology |
Qu, Xiaoyang | Ping An Technology (Shenzhen) |
Wang, Jianzong | Ping An Technology (Shenzhen) |
Keywords: AI-Based Methods, AI-Enabled Robotics, Multi-Robot Systems
Abstract: Multi-agent systems (MAS) have shown great potential in executing complex tasks, but coordination and safety remain significant challenges. Multi-Agent Reinforcement Learning (MARL) offers a promising framework for agent collaboration, but it faces difficulties in handling complex tasks and designing reward functions. The introduction of Large Language Models (LLMs) has brought stronger reasoning and cognitive abilities to MAS, but existing LLM-based systems struggle to respond quickly and accurately in dynamic environments. To address these challenges, we propose LGC-MARL, a framework that efficiently combines LLMs and MARL. This framework decomposes complex tasks into executable subtasks and achieves efficient collaboration among multiple agents through graph-based coordination. Specifically, LGC-MARL consists of two main components: an LLM planner and a graph-based collaboration meta policy. The LLM planner transforms complex task instructions into a series of executable subtasks, evaluates the rationality of these subtasks using a critic model, and generates an action dependency graph. The graph-based collaboration meta policy facilitates communication and collaboration among agents based on the action dependency graph, and adapts to new task environments through meta-learning. Experimental results on the AI2-THOR simulation platform demonstrate the superior performance and scalability of LGC-MARL in completing various complex tasks.
|
|
10:25-10:30, Paper TuAT22.7 | Add to My Program |
A Black-Box Physics-Informed Estimator Based on Gaussian Process Regression for Robot Inverse Dynamics Identification |
|
Giacomuzzo, Giulio | University of Padova |
Carli, Ruggero | University of Padova |
Romeres, Diego | Mitsubishi Electric Research Laboratories |
Dalla Libera, Alberto | University of Padova |
Keywords: Dynamics, Calibration and Identification, Model Learning for Control, Gaussian Process Regression
Abstract: Learning the inverse dynamics of robots directly from data, adopting a black-box approach, is interesting for several real-world scenarios where limited knowledge about the system is available. In this paper, we propose a black-box model based on Gaussian Process (GP) Regression for the identification of the inverse dynamics of robotic manipulators. The proposed model relies on a novel multidimensional kernel, called the Lagrangian Inspired Polynomial (LIP) kernel. The LIP kernel is based on two main ideas. First, instead of directly modeling the inverse dynamics components, we model as GPs the kinetic and potential energy of the system. The GP prior on the inverse dynamics components is derived from those on the energies by applying the properties of GPs under linear operators. Second, as regards the energy prior definition, we prove a polynomial structure of the kinetic and potential energy, and we derive a polynomial kernel that encodes this property. As a consequence, the proposed model can also estimate the kinetic and potential energy without requiring any labels on these quantities. Results on simulation and on two real robotic manipulators, including a 7 DOF Franka Emika Panda, confirm the effectiveness of the proposed approach.
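The mechanism described above can be summarized as follows (a sketch based on standard Lagrangian mechanics and the stated property of GPs under linear operators; the notation and the independence assumption on the two energy priors are ours):

\[
\mathcal{L}(q,\dot q) \;=\; \mathcal{T}(q,\dot q) - \mathcal{V}(q),
\qquad
\tau_i \;=\; \frac{d}{dt}\frac{\partial \mathcal{L}}{\partial \dot q_i} \;-\; \frac{\partial \mathcal{L}}{\partial q_i},
\]

so the map from the energies to the joint torques is a linear differential operator \(\mathcal{G}\). Placing GP priors with kernels \(k_{\mathcal{T}}\) and \(k_{\mathcal{V}}\) on the kinetic and potential energy then induces a GP prior on \(\tau\) with covariance of the form \(k_\tau(x,x') = \mathcal{G}_x \mathcal{G}_{x'}\big[k_{\mathcal{T}}(x,x') + k_{\mathcal{V}}(x,x')\big]\), which is why the energies can be estimated without labels on them.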
|
|
TuAT23 Regular Session, 412 |
Add to My Program |
Autonomous Vehicle Navigation 1 |
|
|
Chair: Kunze, Lars | UWE Bristol |
Co-Chair: Otte, Michael W. | University of Maryland |
|
09:55-10:00, Paper TuAT23.1 | Add to My Program |
Annealed Winner-Takes-All for Motion Forecasting |
|
Xu, Yihong | Valeo.ai |
Letzelter, Victor | Telecom ParisTech, Valeo AI |
Chen, Mickaël | Valeo |
Zablocki, Eloi | Valeo |
Cord, Matthieu | Sorbonne Université, Valeo.ai |
Keywords: Autonomous Vehicle Navigation, Computer Vision for Automation, Vision-Based Navigation
Abstract: In autonomous driving, motion prediction aims at forecasting the future trajectories of nearby agents, helping the ego vehicle to anticipate behaviors and drive safely. A key challenge is generating a diverse set of future predictions, commonly addressed using data-driven models with Multiple Choice Learning (MCL) architectures and Winner-Takes-All (WTA) training objectives. However, these methods face initialization sensitivity and training instabilities. Additionally, to compensate for limited performance, some approaches rely on training with a large set of hypotheses, requiring a post-selection step during inference to significantly reduce the number of predictions. To tackle these issues, we take inspiration from annealed MCL, a recently introduced technique that improves the convergence properties of MCL methods through an annealed Winner-Takes-All loss (aWTA). In this paper, we demonstrate how the aWTA loss can be integrated with state-of-the-art motion forecasting models to enhance their performance using only a minimal set of hypotheses, eliminating the need for the cumbersome post-selection step. Our approach can be easily incorporated into any trajectory prediction model normally trained using WTA and yields significant improvements. To facilitate the application of our approach to future motion forecasting models, the code will be made publicly available upon acceptance.
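A minimal sketch of an annealed Winner-Takes-All loss (our simplification of the general annealed-MCL idea; the paper's exact distance, weighting, and temperature schedule may differ): instead of updating only the closest hypothesis, every hypothesis receives a softmin weight whose temperature is annealed toward zero, recovering plain WTA in the limit.

import torch

def awta_loss(hypotheses, target, temperature):
    # hypotheses: (B, K, D) predicted futures, target: (B, D) ground truth
    errors = ((hypotheses - target.unsqueeze(1)) ** 2).sum(dim=-1)     # (B, K) per-hypothesis error
    weights = torch.softmax(-errors.detach() / temperature, dim=-1)    # soft winner assignment
    return (weights * errors).sum(dim=-1).mean()

# temperature is typically decayed over training, e.g. temperature = t0 * decay ** epoch (assumed schedule)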
|
|
10:00-10:05, Paper TuAT23.2 | Add to My Program |
Causal Contrastive Learning with Data Augmentations for Imitation-Based Planning |
|
Xin, Haojie | Xi'an Jiaotong University |
Zhang, Xiaodong | Xidian University |
Yan, Songyang | Xi'an Jiaotong University |
Sun, Jun | Singapore Management University |
Yang, Zijiang | University of Science and Technology of China |
Keywords: Autonomous Vehicle Navigation, Intelligent Transportation Systems
Abstract: Motion planning is a difficult task, especially when generating feasible future trajectories in complex and interactive scenarios. While recent advancements in imitation-based planning have shown significant progress, this approach often encounters causal confusion in dynamic traffic environments. This confusion will cause the planner to incorrectly associate certain actions with outcomes, leading to suboptimal or unsafe plans. To address this, we introduce a novel framework called C2L, which improves the planner’s latent Causal understanding by incorporating Contrastive Learning and counterfactual data augmentation. Additionally, we propose a shortcut eliminator to extract copycat-free features from history states, reducing the impact of temporal spurious correlations. We validate our method on the nuPlan and interPlan benchmarks, with extensive experiments demonstrating that C2L delivers highly competitive performance compared to state-of-the-art methods.
|
|
10:05-10:10, Paper TuAT23.3 | Add to My Program |
Learning Multiple Probabilistic Decisions from Latent World Model in Autonomous Driving |
|
Xiao, Lingyu | Southeast University |
Liu, Jiang-Jiang | Baidu |
Yang, Sen | Baidu |
Li, Xiaofan | Baidu |
Ye, Xiaoqing | Baidu Inc |
Yang, Wankou | Southeast University |
Wang, Jingdong | Baidu |
Keywords: Autonomous Vehicle Navigation, Integrated Planning and Control, Intelligent Transportation Systems
Abstract: The autoregressive world model exhibits robust generalization capabilities in vectorized scene understanding but encounters difficulties in deriving actions due to insufficient uncertainty modeling and self-delusion. In this paper, we explore the feasibility of deriving decisions from an autoregressive world model by addressing these challenges through the formulation of multiple probabilistic hypotheses. We propose LatentDriver, a framework that models the environment's next states and the ego vehicle's possible actions as a mixture distribution, from which a deterministic control signal is then derived. By incorporating mixture modeling, the stochastic nature of decision-making is captured. Additionally, the self-delusion problem is mitigated by providing intermediate actions sampled from a distribution to the world model. Experimental results on the recently released closed-loop benchmark Waymax demonstrate that LatentDriver surpasses state-of-the-art reinforcement learning and imitation learning methods, achieving expert-level performance. The code and models will be made available at https://github.com/Sephirex-X/LatentDriver.
|
|
10:10-10:15, Paper TuAT23.4 | Add to My Program |
Autonomous Wheel Loader Navigation Using Goal-Conditioned Actor-Critic MPC |
|
Mäki-Penttilä, Aleksi | Tampere University |
Toulkani, Naeim Ebrahimi | Tampere University |
Ghabcheloo, Reza | Tampere University |
Keywords: Autonomous Vehicle Navigation, Optimization and Optimal Control, Motion and Path Planning
Abstract: This paper proposes a novel control method for an autonomous wheel loader, enabling time-efficient navigation to an arbitrary goal pose. Unlike prior works which combine high-level trajectory planners with Model Predictive Control (MPC), we directly enhance the planning capabilities of MPC by incorporating a cost function derived from Actor-Critic Reinforcement Learning (RL). Specifically, we first train an RL agent to solve the pose reaching task in simulation, then transfer the learned planning knowledge to an MPC by incorporating the trained neural network critic as both the stage and terminal cost. We show through comprehensive simulations that the resulting MPC inherits the time-efficient behavior of the RL agent, generating trajectories that compare favorably against those found using trajectory optimization. We also deploy our method on a real-world wheel loader, where we demonstrate successful navigation in various scenarios.
|
|
10:15-10:20, Paper TuAT23.5 | Add to My Program |
Unlock the Power of Unlabeled Data in Language Driving Model |
|
Wang, Chaoqun | The Chinese University of Hong Kong, Shenzhen |
Yang, Jie | The Chinese University of Hong Kong, Shenzhen |
Hong, Xiaobin | Nanjing University |
Zhang, Ruimao | The Chinese University of Hong Kong (Shenzhen) |
Keywords: Autonomous Vehicle Navigation, Vision-Based Navigation, Intelligent Transportation Systems
Abstract: Recent Vision-based Large Language Models (VisionLLMs) for autonomous driving have seen rapid advancements. However, such progress depends heavily on large-scale, high-quality annotated data, which is costly and labor-intensive. To address this issue, we propose unlocking the value of abundant yet unlabeled data to improve the language-driving model in a semi-supervised learning manner. Specifically, we first introduce a series of template-based prompts to extract scene information, generating questions that create pseudo-answers for the unlabeled data based on a model trained with limited labeled data. Next, we propose a Self-Consistency Refinement method to improve the quality of these pseudo-annotations, which are later used for further training. By utilizing a pre-trained VisionLLM (e.g., InternVL), we build a strong Language Driving Model (LDM) for driving scene question-answering, outperforming previous state-of-the-art methods. Extensive experiments on the DriveLM benchmark show that our approach performs well with just 5% labeled data, achieving competitive performance against models trained with full datasets. In particular, our LDM achieves 44.85% performance with limited labeled data, increasing to 54.27% when using unlabeled data, while models trained with full datasets reach 60.68% on the DriveLM benchmark.
|
|
10:20-10:25, Paper TuAT23.6 | Add to My Program |
CAFE-AD: Cross-Scenario Adaptive Feature Enhancement for Trajectory Planning in Autonomous Driving |
|
Zhang, Junrui | University of Science & Technology of China |
Wang, Chenjie | Institute of Artificial Intelligence, Hefei Comprehensive National Science Center |
Peng, Jie | University of Science and Technology of China |
Li, Haoyu | University of Science and Technology of China |
Ji, Jianmin | University of Science and Technology of China |
Zhang, Yu | University of Science and Technology of China |
Zhang, Yanyong | University of Science and Technology of China |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Intelligent Transportation Systems
Abstract: Imitation learning based planning tasks on the nuPlan dataset have gained great interest due to their potential to generate human-like driving behaviors. However, open-loop training on the nuPlan dataset tends to cause causal confusion during closed-loop testing, and the dataset also presents a long-tail distribution of scenarios. These issues introduce challenges for imitation learning. To tackle these problems, we introduce CAFE-AD, a Cross-Scenario Adaptive Feature Enhancement for Trajectory Planning in Autonomous Driving method, designed to enhance feature representation across various scenario types. We develop an adaptive feature pruning module that ranks feature importance to capture the most relevant information while reducing the interference of noisy information during training. Moreover, we propose a cross-scenario feature interpolation module that enhances scenario information to introduce diversity, enabling the network to alleviate over-fitting in dominant scenarios. We evaluate our method CAFE-AD, on the challenging public nuPlan Test14-Hard closed-loop simulation benchmark. The results demonstrate that CAFE-AD outperforms state-of-the-art methods including rule-based and hybrid planners, and exhibits the potential in mitigating the impact of long-tail distribution within the dataset. Additionally, we further validate its effectiveness in real-world environments. The code and models will be made available at https://github.com/AlniyatRui/CAFE-AD.
|
|
10:25-10:30, Paper TuAT23.7 | Add to My Program |
Beyond Simulation: Benchmarking World Models for Planning and Causality in Autonomous Driving |
|
Schofield, Hunter | York University |
Elmahgiubi, Mohammed | Huawei Technologies Inc |
Rezaee, Kasra | Huawei Technologies |
Shan, Jinjun | York University |
Keywords: Autonomous Vehicle Navigation, Autonomous Agents, Motion and Path Planning
Abstract: World models have become increasingly popular in acting as learned traffic simulators. Recent work has explored replacing traditional traffic simulators with world models for policy training. In this work, we explore the robustness of existing metrics to evaluate world models as traffic simulators to see if the same metrics are suitable for evaluating a world model as a pseudo-environment for policy training. Specifically, we analyze the metametric employed by the Waymo Open Sim-Agents Challenge (WOSAC) and compare world model predictions on standard scenarios where the agents are fully or partially controlled by the world model (partial replay). Furthermore, since we are interested in evaluating the ego action-conditioned world model, we extend the standard WOSAC evaluation domain to include agents that are causal to the ego vehicle. Our evaluations reveal a significant number of scenarios where top-ranking models perform well under no perturbation but fail when the ego agent is forced to replay the original trajectory. To address these cases, we propose new metrics to highlight the sensitivity of world models to uncontrollable objects and evaluate the performance of world models as pseudo-environments for policy training and analyze some state-of-the-art world models under these new metrics.
|
|
TuAT24 Regular Session, 401 |
Add to My Program |
Testing and Validation |
|
|
Chair: Hollis, Ralph | Carnegie Mellon University |
Co-Chair: Heckman, Christoffer | University of Colorado at Boulder |
|
09:55-10:00, Paper TuAT24.1 | Add to My Program |
Enhancing Repeatability and Reliability of Accelerated Risk Assessment in Robot Testing |
|
Capito, Linda | Transportation Research Center Inc. C/o NHTSA |
Castillo, Guillermo A. | The Ohio State University |
Weng, Bowen | Iowa State University |
Keywords: Probability and Statistical Methods, Performance Evaluation and Benchmarking, Legged Robots
Abstract: Risk assessment of a robot in controlled environments, such as laboratories and proving grounds, is a common means to assess, certify, validate, verify, and characterize the robots' safety performance before, during, and even after their commercialization in the real-world. A standard testing program that acquires the risk estimate is expected to be (i) repeatable, such that it obtains similar risk assessments of the same testing subject among multiple trials or attempts with the similar testing effort by different stakeholders, and (ii) reliable against a variety of testing subjects produced by different vendors and manufacturers. Both repeatability and reliability are fundamental and crucial for a testing algorithm's validity, fairness, and practical feasibility, especially for standardization. However, these properties are rarely satisfied or ensured, especially as the subject robots become more complex, uncertain, and varied. This issue was present in traditional risk assessments through Monte-Carlo sampling, and remains a bottleneck for the recent accelerated risk assessment methods, primarily those using importance sampling. This study aims to enhance existing accelerated testing frameworks by proposing a new algorithm that provably integrates repeatability and reliability with the already established formality and efficiency. It also features demonstrations assessing the risk of instability from frontal impacts, initiated by push-over disturbances on a controlled inverted pendulum and a 7-DoF planar bipedal robot Rabbit managed by various control algorithms.
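For context, the importance-sampling risk estimate that such accelerated testing builds on has the generic form below (background material, not the repeatability/reliability algorithm proposed in the paper):

\[
\hat{P}_{\mathrm{fail}} \;=\; \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\{\text{run } x_i \text{ fails}\}\,\frac{p(x_i)}{q(x_i)},
\qquad x_i \sim q,
\]

where \(p\) is the nominal distribution over test conditions and \(q\) is a proposal that over-samples risky conditions; the variance of this estimator across repeated test campaigns is what makes repeatability nontrivial.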
|
|
10:00-10:05, Paper TuAT24.2 | Add to My Program |
Learning-Based Bayesian Inference for Testing of Autonomous Systems |
|
Parashar, Anjali | MIT |
Yin, Ji | Georgia Institute of Technology |
Dawson, Charles | MIT |
Tsiotras, Panagiotis | Georgia Tech |
Fan, Chuchu | Massachusetts Institute of Technology |
Keywords: Formal Methods in Robotics and Automation, Robot Safety, Hybrid Logical/Dynamical Planning and Verification
Abstract: For the safe operation of robotic systems, it is important to accurately understand their failure modes using prior testing. Hardware testing of robotic infrastructure is known to be slow and costly. Instead, failure prediction in simulation can help to analyze the system before deployment. Conventionally, large-scale naive Monte Carlo simulations are used for testing; however, this method is only suitable for testing average system performance. For safety-critical systems, worst-case performance is more crucial as failures are often rare events, and the size of test batches increases substantially as failures become more rare. Rare-event sampling methods can be helpful; however, they exhibit slow convergence and cannot handle constraints. This research introduces a novel sampling-based testing framework for autonomous systems which bridges these gaps by utilizing a discretized gradient-based second-order Langevin algorithm combined with learning-based techniques for constrained sampling of failure modes. Our method can predict more diverse failures by exploring the search space efficiently and ensures feasibility with respect to temporal and implicit constraints. We demonstrate the use of our testing methodology on two categories of testing problems, via simulations and hardware experiments. Our method discovers up to 2X as many failures as naive Random Walk sampling, with only half of the sample size.
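As background for the sampler mentioned above, a discretized second-order (underdamped) Langevin update over a scenario parameter x with potential U (e.g. low U near failures) takes the generic form below; the constraint handling and learned components of the paper are not reproduced here:

\[
v_{k+1} \;=\; (1-\gamma\eta)\,v_k \;-\; \eta\,\nabla_x U(x_k) \;+\; \sqrt{2\gamma\eta}\,\xi_k,
\qquad \xi_k \sim \mathcal{N}(0, I),
\]
\[
x_{k+1} \;=\; x_k + \eta\, v_{k+1},
\]

with step size \(\eta\) and friction coefficient \(\gamma\).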
|
|
10:05-10:10, Paper TuAT24.3 | Add to My Program |
Foundation Models for Rapid Autonomy Validation |
|
Farid, Alec | Princeton |
Schleede, Peter | Zoox |
Huang, Aaron | Zoox |
Heckman, Christoffer | University of Colorado at Boulder |
Keywords: Performance Evaluation and Benchmarking, Deep Learning Methods, Representation Learning
Abstract: We are motivated by the problem of autonomous vehicle performance validation. A key challenge is that an autonomous vehicle requires testing in every kind of driving scenario it could encounter, including rare events, to provide a strong case for safety and show there is no edge-case pathological behavior. Autonomous vehicle companies rely on potentially millions of miles driven in realistic simulation to expose the driving stack to enough miles to estimate rates and severity of collisions. To address scalability and coverage, we propose the use of a behavior foundation model, specifically a masked autoencoder (MAE), trained to reconstruct driving scenarios. We leverage the foundation model in two complementary ways: we (i) use the learned embedding space to group qualitatively similar scenarios together and (ii) fine-tune the model to label scenario difficulty based on the likelihood of a collision upon simulation. We use the difficulty scoring as importance weighting for the groups of scenarios. The result is an approach which can more rapidly estimate the rates and severity of collisions by prioritizing hard scenarios while ensuring exposure to every kind of driving scenario.
|
|
10:10-10:15, Paper TuAT24.4 | Add to My Program |
The Mini Wheelbot: A Testbed for Learning-Based Balancing, Flips, and Articulated Driving |
|
Hose, Henrik | Institute for Data Science in Mechanical Engineering (DSME), RWTH Aachen University |
Weisgerber, Jan Luca | RWTH Aachen |
Trimpe, Sebastian | RWTH Aachen University |
Keywords: Wheeled Robots, Underactuated Robots, Machine Learning for Robot Control
Abstract: The Mini Wheelbot is a balancing, reaction wheel unicycle robot designed as a testbed for learning-based control. It is an unstable system with highly nonlinear yaw dynamics, non-holonomic driving, and discrete contact switches in a small, powerful, and rugged form factor. The Mini Wheelbot can use its wheels to stand up from any initial orientation - enabling automatic environment resets in repetitive experiments and even challenging half flips. We illustrate the effectiveness of the Mini Wheelbot as a testbed by implementing two popular learning-based control algorithms. First, we showcase Bayesian optimization for tuning the balancing controller. Second, we use imitation learning from an expert nonlinear MPC that uses gyroscopic effects to reorient the robot and can track higher-level velocity and orientation commands. The latter allows the robot to drive around based on user commands - for the first time in this class of robots. The Mini Wheelbot is not only compelling for testing learning-based control algorithms, but it is also just fun to work with, as demonstrated in the video of our experiments.
|
|
10:15-10:20, Paper TuAT24.5 | Add to My Program |
The Impact of Sensor Faults on Connected Autonomous Vehicle Localization |
|
Kuwada, Shinsaku | Illinois Institute of Technology |
Joerger, Mathieu | Virginia Tech |
Spenko, Matthew | Illinois Institute of Technology |
Keywords: Probability and Statistical Methods, Localization, Multi-Robot Systems
Abstract: Connected autonomous vehicles (CAVs) can provide benefits over individual vehicles for precise navigation, especially in GNSS-denied environments. CAV collaboration can enhance estimation accuracy, but the safety of collaborative localization in the presence of undetected sensor faults remains underexplored. This paper introduces an integrity monitoring method for CAV collaborative localization in both centralized and decentralized implementations. Fault models for landmark and relative measurements are described, and the probability of hazardous misleading information, or integrity risk, is derived. Simulation and experimental results for notional two-CAV scenarios indicate that collaborative localization reduces integrity risk and enhances navigation safety.
|
|
10:20-10:25, Paper TuAT24.6 | Add to My Program |
Realistic Extreme Behavior Generation for Improved AV Testing |
|
Dyro, Robert | Stanford University |
Foutter, Matthew | Stanford University |
Li, Ruolin | Stanford |
Di Lillo, Luigi | Swiss Reinsurance Company, Ltd; Autonomous Systems Lab, Stanford |
Schmerling, Edward | Stanford University |
Zhou, Xilin | Swiss Re |
Pavone, Marco | Stanford University |
Keywords: Performance Evaluation and Benchmarking, Optimization and Optimal Control
Abstract: This work introduces a framework to diagnose the strengths and shortcomings of Autonomous Vehicle (AV) collision avoidance technology with synthetic yet realistic potential collision scenarios adapted from real-world, collision-free data. Our framework generates counterfactual collisions with diverse crash properties, e.g., crash angle and velocity, between an adversary and a target vehicle by adding perturbations to the adversary's predicted trajectory from a learned AV behavior model. Our main contribution is to ground these adversarial perturbations in realistic behavior as defined through the lens of data-alignment in the behavior model's parameter space. Then, we cluster these synthetic counterfactuals to identify plausible and representative collision scenarios to form the basis of a test suite for downstream AV system evaluation. We demonstrate our framework using two state-of-the-art behavior prediction models as sources of realistic adversarial perturbations, and show that our scenario clustering evokes interpretable failure modes from a baseline AV policy under evaluation.
|
|
10:25-10:30, Paper TuAT24.7 | Add to My Program |
Limits of Specifiability for Sensor-Based Robotic Planning Tasks |
|
Sakcak, Basak | University of Oulu |
Shell, Dylan | Texas A&M University |
O'Kane, Jason | Texas A&M University |
Keywords: Formal Methods in Robotics and Automation, Reactive and Sensor-Based Planning, Task and Motion Planning
Abstract: There is now a large body of techniques, many based on formal methods, for describing and realizing complex robotics tasks, including those involving a variety of rich goals and time-extended behavior. This paper explores the limits of what sorts of tasks are specifiable, examining how the precise grounding of specifications, that is, whether the specification is given in terms of the robot's states, its actions and observations, its knowledge, or some other information, is crucial to whether a given task can be specified. While prior work included some description of particular choices for this grounding, our contribution treats this aspect as a first-class citizen: we introduce notation to deal with a large class of problems, and examine how the grounding affects what tasks can be posed. The results demonstrate that certain classes of tasks are specifiable under different combinations of groundings.
|
|
TuBT1 Regular Session, 302 |
Add to My Program |
Award Finalists 2 |
|
|
Chair: Smart, William D. | Oregon State University |
Co-Chair: Asada, Harry | MIT |
|
11:15-11:20, Paper TuBT1.1 | Add to My Program |
Human-Agent Joint Learning for Efficient Robot Manipulation Skill Acquisition |
|
Luo, Shengcheng | Shanghai Jiao Tong University |
Peng, Quanquan | Shanghai Jiao Tong University |
Lv, Jun | Shanghai Jiao Tong University |
Hong, Kaiwen | University of Illinois at Urbana Champaign |
Driggs-Campbell, Katherine | University of Illinois at Urbana-Champaign |
Lu, Cewu | Shanghai Jiao Tong University |
Li, Yong-Lu | Shanghai Jiao Tong University |
Keywords: AI-Based Methods, Deep Learning in Grasping and Manipulation, Human-Robot Collaboration
Abstract: Employing a teleoperation system for gathering demonstrations offers the potential for more efficient learning of robot manipulation. However, teleoperating a robot arm equipped with a dexterous hand or gripper presents inherent challenges due to the task's high dimensionality, complexity of motion, and differences between physiological structures. In this study, we introduce a novel system for joint learning between human operators and robots that enables human operators to share control of a robot end-effector with a learned assistive agent, simplifying the data collection process and facilitating simultaneous human demonstration collection and robot manipulation training. As data accumulates, the assistive agent gradually learns. Consequently, less human effort and attention are required, enhancing the efficiency of the data collection process. It also allows the human operator to adjust the control ratio to achieve a trade-off between manual and automated control. We conducted experiments in both simulated environments and physical real-world settings. Through user studies and quantitative evaluations, it is evident that the proposed system could enhance data collection efficiency and reduce the need for human adaptation while ensuring the collected data is of sufficient quality for downstream tasks. For more details, please refer to our project page: https://norweig1an.github.io/HAJL.github.io/.
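The control sharing described above can be pictured with a simple blending rule (an illustrative sketch; the function, the linear blend, and the meaning of the ratio are our assumptions rather than the authors' implementation):

import numpy as np

def blended_command(human_cmd, agent_cmd, control_ratio):
    # control_ratio in [0, 1]: 1.0 -> fully manual teleoperation, 0.0 -> fully autonomous assistance
    human_cmd = np.asarray(human_cmd, dtype=float)
    agent_cmd = np.asarray(agent_cmd, dtype=float)
    return control_ratio * human_cmd + (1.0 - control_ratio) * agent_cmd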
|
|
11:20-11:25, Paper TuBT1.2 | Add to My Program |
To Ask or Not to Ask: Human-In-The-Loop Contextual Bandits with Applications in Robot-Assisted Feeding |
|
Banerjee, Rohan | Cornell University |
Jenamani, Rajat Kumar | Cornell University |
Vasudev, Sidharth | Cornell University |
Nanavati, Amal | University of Washington |
Dimitropoulou, Katherine | Columbia University |
Dean, Sarah | Cornell University |
Bhattacharjee, Tapomayukh | Cornell University |
Keywords: Human Factors and Human-in-the-Loop, Reinforcement Learning, Physically Assistive Devices
Abstract: Robot-assisted bite acquisition involves picking up food items with varying shapes, compliance, sizes, and textures. Fully autonomous strategies may not generalize efficiently across this diversity. We propose leveraging feedback from the care recipient when encountering novel food items. However, frequent queries impose a workload on the user. We formulate human-in-the-loop bite acquisition within a contextual bandit framework and introduce LinUCB-QG, a method that selectively asks for help using a predictive model of querying workload based on query types and timings. This model is trained on data collected in an online study involving 14 participants with mobility limitations, 3 occupational therapists simulating physical limitations, and 89 participants without limitations. We demonstrate that our method better balances task performance and querying workload compared to autonomous and always-querying baselines and adjusts its querying behavior to account for higher workload in users with mobility limitations. This is validated through experiments in a simulated food dataset and a user study with 19 participants, including one with severe mobility limitations. Please check out our project website at: https://emprise.cs.cornell.edu/hilbiteacquisition/.
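For reference, the base LinUCB rule that LinUCB-QG extends selects the action with the highest upper confidence bound on expected reward. The sketch below is the standard algorithm only (the querying-workload model and query-gating logic of the paper are not shown), and the names are our own:

import numpy as np

def linucb_select(arm_features, A, b, alpha=1.0):
    # arm_features: dict arm -> context feature vector x for this round
    # A[arm] = I + sum of x x^T, b[arm] = sum of r * x observed so far for that arm
    best_arm, best_ucb = None, -np.inf
    for arm, x in arm_features.items():
        A_inv = np.linalg.inv(A[arm])
        theta = A_inv @ b[arm]                              # ridge-regression reward estimate
        ucb = theta @ x + alpha * np.sqrt(x @ A_inv @ x)    # mean + exploration bonus
        if ucb > best_ucb:
            best_arm, best_ucb = arm, ucb
    return best_arm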
|
|
11:25-11:30, Paper TuBT1.3 | Add to My Program |
Point and Go: Intuitive Reference Frame Reallocation in Mode Switching for Assistive Robotics |
|
Wang, Allie | University of Alberta |
Jiang, Chen | University of Alberta |
Przystupa, Michael | University of Alberta |
Valentine, Justin | University of Alberta |
Jagersand, Martin | University of Alberta |
Keywords: Rehabilitation Robotics, Kinematics, Physically Assistive Devices
Abstract: Operating high degree of freedom robots can be difficult for users of wheelchair mounted robotic manipulators. Mode switching in Cartesian space has several drawbacks such as unintuitive control reference frames, separate translation and orientation control, and limited movement capabilities that hinder performance. We propose Point and Go mode switching, which reallocates the Cartesian mode switching reference frames into a more intuitive action space comprised of new translation and rotation modes. We use a novel sweeping motion to point the gripper, which defines the new translation axis along the robot base frame's horizontal plane. This creates an intuitive 'point and go' translation mode that allows the user to easily perform complex, human-like movements without switching control modes. The system's rotation mode combines position control with a refined end-effector oriented frame that provides precise and consistent robot actions in various end-effector poses. We verified its effectiveness through initial experiments, followed by a three-task user study that compared our method to Cartesian mode switching and a state of the art learning method. Results show that Point and Go mode switching reduced completion times by 31%, pauses by 41%, and mode switches by 33%, while receiving significantly favorable responses in user surveys.
|
|
11:30-11:35, Paper TuBT1.4 | Add to My Program |
RoboCrowd: Scaling Robot Data Collection through Crowdsourcing |
|
Mirchandani, Suvir | Stanford University |
Yuan, David D. | Stanford University |
Burns, Kaylee | Stanford University |
Islam, Md Sazzad | Stanford University |
Zhao, Zihao | Stanford University |
Finn, Chelsea | Stanford University |
Sadigh, Dorsa | Stanford University |
Keywords: Telerobotics and Teleoperation, Data Sets for Robot Learning, Human Factors and Human-in-the-Loop
Abstract: In recent years, imitation learning from large-scale human demonstrations has emerged as a promising paradigm for training robot policies. However, the burden of collecting large quantities of human demonstrations is significant in terms of collection time and the need for access to expert operators. We introduce a new data collection paradigm, RoboCrowd, which distributes the workload by utilizing crowdsourcing principles and incentive design. RoboCrowd helps enable scalable data collection and facilitates more efficient learning of robot policies. We build RoboCrowd on top of ALOHA (Zhao et al. 2023)---a bimanual platform that supports data collection via puppeteering---to explore the design space for crowdsourcing in-person demonstrations in a public environment. We propose three classes of incentive mechanisms to appeal to users' varying sources of motivation for interacting with the system: material rewards, intrinsic interest, and social comparison. We instantiate these incentives through tasks that include physical rewards, engaging or challenging manipulations, as well as gamification elements such as a leaderboard. We conduct a large-scale, two-week field experiment in which the platform is situated in a university cafe. We observe significant engagement with the system---over 200 individuals independently volunteered to provide a total of over 800 interaction episodes. Our findings validate the proposed incentives as mechanisms for shaping users' data quantity and quality. Further, we demonstrate that the crowdsourced data can serve as useful pre-training data for policies fine-tuned on expert demonstrations---boosting performance up to 20% compared to when this data is not available. These results suggest the potential for RoboCrowd to reduce the burden of robot data collection by carefully implementing crowdsourcing and incentive design principles. Videos are available at https://robocrowd.github.io.
|
|
11:35-11:40, Paper TuBT1.5 | Add to My Program |
How Sound-Based Robot Communication Impacts Perceptions of Robotic Failure |
|
Crider, Jai'La Lee | Oregon State University |
Preston, Rhian | Oregon State University |
Fitter, Naomi T. | Oregon State University |
Keywords: Social HRI, Robot Companions, Natural Dialog for HRI
Abstract: One challenge in human-robot interaction is selecting communication methods that fit a given robotic system and avoid overpromising. For example, verbal speech provides a clear and easy-to-understand communication method, but can inflate expectations of robot abilities. Is verbal speech the ultimate option? Might other tactics provide similar advantages with fewer downsides? The presented work focuses on addressing these important questions by 1) quantifying any inflated opinions of robots that use verbal speech and 2) gathering perspectives on alternative nonverbal sound-based communication tactics (as a means to potentially shrink gaps between expected and actual robot performance). We conducted a within-subjects online study that varied robot communication modes in videos of successful and unsuccessful mock tasks by a modern commercial robot. Assessments of robot competence and trust after an observed robot failure were higher for verbal robots, but we observed less decline in competence and trust ratings due to the failure for a nonverbal robot using character-like sound (compared to a robot using verbal communication). Human-robot interaction practitioners can use our results to design effective and robust communication strategies for robots.
|
|
11:40-11:45, Paper TuBT1.6 | Add to My Program |
Obstacle-Avoidant Leader Following with a Quadruped Robot |
|
Scheidemann, Carmen | ETH Zurich |
Werner, Lennart | ETH Zürich |
Reijgwart, Victor | ETH Zurich |
Cramariuc, Andrei | ETHZ |
Chomarat, Joris | ETH Zurich |
Chiu, Jia-Ruei | ETH Zurich |
Siegwart, Roland | ETH Zurich |
Hutter, Marco | ETH Zurich |
Keywords: Human-Centered Robotics, Human Detection and Tracking, Legged Robots
Abstract: Personal mobile robotic assistants are expected to find wide applications in industry and healthcare. For example, people with limited mobility can benefit from robots helping with daily tasks, or construction workers can have robots perform precision monitoring tasks on-site. However, manually steering a robot while in motion requires significant concentration from the operator, especially in tight or crowded spaces. This reduces walking speed, and the constant need for vigilance increases fatigue and, thus, the risk of accidents. This work presents a virtual leash with which a robot can naturally follow an operator. We use a sensor fusion based on a custom-built RF transponder, RGB cameras, and a LiDAR. In addition, we customize a local avoidance planner for legged platforms, which enables us to navigate dynamic and narrow environments. We successfully validate on the ANYmal platform the robustness and performance of our entire pipeline in real-world experiments.
|
|
TuBT2 Regular Session, 301 |
Add to My Program |
Transfer and Continual Learning |
|
|
Chair: Gupta, Abhishek | University of Washington |
Co-Chair: Nemlekar, Heramb | Virginia Tech |
|
11:15-11:20, Paper TuBT2.1 | Add to My Program |
Semantic Cross-Pose Correspondence from a Single Example |
|
Hadjivelichkov, Denis | University College London |
Zwane, Sicelukwanda Njabuliso Tunner | University College London |
Deisenroth, Marc Peter | University College London |
Agapito, Lourdes | University College London |
Kanoulas, Dimitrios | University College London |
Keywords: Representation Learning, Transfer Learning, Learning from Demonstration
Abstract: This article focuses on predicting how an object can be transformed to a semantically meaningful pose relative to another object, given only one or a few examples. Current pose correspondence methods rely on vast 3D object datasets and do not actively consider semantic information, which limits the objects to which they can be applied. We present a novel method for learning cross-object pose correspondence. The proposed method detects interacting object parts, performs one-shot part correspondence, and uses geometric and visual-semantic features. Given one example of two objects posed relative to each other, the model can learn how to transfer the demonstrated relations to unseen object instances.
|
|
11:20-11:25, Paper TuBT2.2 | Add to My Program |
H2O+: An Improved Framework for Hybrid Offline-And-Online RL with Dynamics Gaps |
|
Niu, Haoyi | Tsinghua University |
Ji, Tianying | Tsinghua University |
Bingqi, Liu | Beihang University |
Zhao, Haocheng | Tsinghua University |
Zhu, Xiangyu | Tsinghua University |
Zheng, Jianying | Beihang University |
Huang, Pengfei | Tsinghua University |
Zhou, Guyue | Tsinghua University |
Hu, Jianming | Tsinghua University |
Zhan, Xianyuan | Tsinghua University |
Keywords: Transfer Learning, Reinforcement Learning, Machine Learning for Robot Control
Abstract: Solving real-world complex tasks using reinforcement learning (RL) without high-fidelity simulation environments or large amounts of offline data can be quite challenging. Online RL agents trained in imperfect simulation environments can suffer from severe sim-to-real issues. Offline RL approaches, although they bypass the need for simulators, often impose demanding requirements on the size and quality of the offline datasets. The recently emerged hybrid offline-and-online RL provides an attractive framework that enables joint use of limited offline data and an imperfect simulator for transferable policy learning. In this paper, we develop a new algorithm, called H2O+, which offers great flexibility to bridge various choices of offline and online learning methods, while also accounting for dynamics gaps between the real and simulation environments. Through extensive simulation and real-world robotics experiments, we demonstrate the superior performance and flexibility of H2O+ over advanced cross-domain online and offline RL algorithms.
|
|
11:25-11:30, Paper TuBT2.3 | Add to My Program |
M2Distill: Multi-Modal Distillation for Lifelong Imitation Learning |
|
Roy, Kaushik | CSIRO |
Dissanayake, Akila | Commonwealth Scientific and Industrial Research Organisation |
Tidd, Brendan | CSIRO |
Moghadam, Peyman | CSIRO |
Keywords: Continual Learning, Incremental Learning, Imitation Learning
Abstract: Lifelong imitation learning for manipulation tasks poses significant challenges due to distribution shifts that occur in incremental learning steps. Existing methods often focus on unsupervised skill discovery to construct an ever-growing skill library or distillation from multiple policies, which can lead to scalability issues as diverse manipulation tasks are continually introduced and may fail to ensure a consistent latent space throughout the learning process, leading to catastrophic forgetting of previously learned skills. In this paper, we introduce M2Distill, a multi-modal distillation-based method for lifelong imitation learning focusing on preserving consistent latent space across vision, language, and action distributions throughout the learning process. By regulating the shifts in latent representations across different modalities from previous to current steps, and reducing discrepancies in Gaussian Mixture Model (GMM) policies between consecutive learning steps, we ensure that the learned policy retains its ability to perform previously learned tasks while seamlessly integrating new skills. Extensive evaluations on the LIBERO lifelong imitation learning benchmark suites, including LIBERO-OBJECT, LIBERO-GOAL, and LIBERO-SPATIAL, demonstrate that our method consistently outperforms prior state-of-the-art methods across all evaluated metrics.
|
|
11:30-11:35, Paper TuBT2.4 | Add to My Program |
Expert-Enhanced Masked Point Modeling for Point Cloud Self-Supervised Learning |
|
Liu, Yujun | Tsinghua University |
Zha, Yaohua | Tsinghua University |
Li, Naiqi | Tsinghua University |
Tao, Dai | Shenzhen University |
Chen, Bin | Harbin Institute of Technology, Shenzhen |
Xia, Shu-Tao | Tsinghua University |
Keywords: Transfer Learning, Object Detection, Segmentation and Categorization, Deep Learning Methods
Abstract: Recently, learning-based point cloud analysis has played a crucial role in robotic perception. Masked Point Modeling (MPM), owing to its powerful representational capabilities, has become the mainstream point cloud self-supervised learning method. However, existing MPM-based methods often suffer from the problem of negative transfer, due to the disparity in semantic distribution between upstream and downstream data. To address this issue, we propose an expert enhancement strategy for existing MPM-based methods. Specifically, we insert a Sparse Mixture of Experts (SMoE) layer after each block of the backbone network, which utilizes a multi-branch expert architecture with routers that allocate data of different semantics to the appropriate experts for analysis. During the pre-training phase, our expert-enhanced model not only learns universal 3D representations for the backbone network but also acquires powerful semantic routing capabilities for all expert layers. In the fine-tuning phase, we freeze all backbones and conduct end-to-end fine-tuning solely on our expert layers to adaptively select the experts most relevant to the semantics of each downstream sample. Extensive downstream experiments demonstrate the superiority of our method, which outperforms the baseline (Point-MAE) by 5.16%, 5.86%, and 4.62% on three variants of ScanObjectNN while utilizing only 12% of its trainable parameters. Our code is released at https://github.com/chenchen1104/point_e2mae.
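The sketch below shows a generic top-k sparse mixture-of-experts layer of the kind described, inserted after a backbone block with a residual connection; the dimensions, expert structure, and routing details are assumptions, not the released code.

```python
# Generic top-k Sparse Mixture-of-Experts layer (illustrative).
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, dim, n_experts=8, k=2, hidden=256):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                        # x: (tokens, dim)
        scores = self.router(x).softmax(dim=-1)  # routing probabilities
        topv, topi = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):               # send each token to its top-k experts
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot, None] * expert(x[mask])
        return x + out                           # residual connection around the MoE
```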
|
|
11:35-11:40, Paper TuBT2.5 | Add to My Program |
3D Dense Captioning Via Prototypical Momentum Distillation |
|
Mi, Jinpeng | USST |
Wang, Ying | University of Shanghai for Science and Technology |
Jin, Shaofei | University of Shanghai for Science and Technology |
Zhang, Shiming | University of Shanghai for Science and Technology |
Wei, Xian | East China Normal University |
Zhang, Jianwei | Hamburg University |
Keywords: Transfer Learning, Deep Learning for Visual Perception
Abstract: 3D dense captioning aims to describe the crucial regions in 3D visual scenes in the form of natural language. Recent prevailing approaches achieve promising results by leveraging complicated structures incorporated with large-scale models, which require abundant parameters and pose challenges for practical application. Besides, with limited training data, 3D dense captioners are often susceptible to overfitting, directly degrading caption generation performance. Drawing inspiration from the recent advancements in knowledge distillation, we propose a novel approach termed Prototypical Momentum Distillation (PMD) to prompt the model to generate more detailed captions. PMD incorporates Momentum Distillation (MD) with an Uncertainty-aware Prototype-anchored Clustering (UPC) strategy to transfer knowledge while considering the uncertainty of the teacher knowledge. Specifically, we employ the original captioner as the student model and maintain an Exponential Moving Average (EMA) copy of the captioner as the teacher model to impart knowledge as auxiliary supervision of the student. To mitigate the misleading effect of uncertain knowledge, the UPC strategy clusters the distilled knowledge according to its confidence. We then transfer the rearranged knowledge from the teacher to guide the training route of the student. We conduct extensive experiments and ablation studies on two widely used benchmark datasets, ScanRefer and Nr3D. Experimental results demonstrate that PMD outperforms all state-of-the-art approaches on the benchmarks with MLE training, highlighting its effectiveness.
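The EMA teacher update that momentum distillation relies on is a standard recipe; a generic sketch follows, with parameter names assumed for illustration rather than taken from the PMD code.

```python
# Generic exponential-moving-average teacher update for momentum distillation.
import torch

@torch.no_grad()
def update_ema_teacher(student, teacher, momentum=0.999):
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```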
|
|
11:40-11:45, Paper TuBT2.6 | Add to My Program |
DUOLINGO: Dynamics Utilization for Online Translation of Actions |
|
Vemuri, Karthikeya | University of Washington |
Wu, Alan | University of Washington |
Thareja, Arnav | University of Washington |
Chen, Zoey | University of Washington |
Good, Ian | University of Washington |
Lipton, Jeffrey | Northeastern University |
Gupta, Abhishek | University of Washington |
Keywords: Continual Learning, Transfer Learning, Robust/Adaptive Control
Abstract: Robots in the real world experience wear and tear, leading to changing system dynamics. This challenge is particularly exacerbated for non-rigid systems such as soft robots or robotic systems made of meta-materials with hysteresis. This setting results in a challenging problem for most learning-based controllers, which typically rely on the assumption that the system dynamics remain fixed over time. In the absence of explicit mechanisms to account for this change in dynamics, learning-based control algorithms show considerable degradation in performance over time. In this work, we consider a particular class of dynamics shift in under-actuated systems that is localized to the dynamics of the fully actuated robot itself while leaving the dynamics of the environment unchanged. This captures real-world phenomena such as fatigue or hysteresis in robotic systems. In this setting, we propose an efficient algorithm that can account for dynamics shift. Using a simple calibration procedure, we propose a technique for learning a non-linear "action-translation" model that can capture the localized shift in dynamics. This enables continual learning and transfer despite considerable dynamics shift during the learning process. We demonstrate the efficacy of this procedure on several tasks in simulation, as well as a real-world robotic system - a 4 DoF electrically driven handed shearing auxetic (HSA) platform.
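A heavily simplified sketch of what an action-translation model can look like: from a short calibration run, fit a mapping that converts actions valid under the original dynamics into equivalent actions under the worn dynamics. The polynomial least-squares form and all names below are assumptions for illustration, not the paper's model.

```python
# Sketch: fit an action-translation map from paired calibration actions.
import numpy as np

def fit_action_translation(a_nominal, a_equivalent, degree=2):
    """a_nominal: (N, d) actions under the original dynamics;
    a_equivalent: (N, d) actions found via calibration to have the same effect now."""
    def features(a):
        return np.hstack([a ** p for p in range(1, degree + 1)]
                         + [np.ones((len(a), 1))])
    W, *_ = np.linalg.lstsq(features(a_nominal), a_equivalent, rcond=None)
    return lambda a: features(a) @ W   # translate nominal actions online
```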
|
|
TuBT3 Regular Session, 303 |
Add to My Program |
Field Robotics: Forestry and Mining |
|
|
Chair: Vu, Minh Nhat | TU Wien, Austria |
Co-Chair: Sharf, Inna | McGill University |
|
11:15-11:20, Paper TuBT3.1 | Add to My Program |
DigiForests: A Longitudinal LiDAR Dataset for Forestry Robotics |
|
Malladi, Meher Venkata Ramakrishna | University of Bonn |
Chebrolu, Nived | University of Oxford |
Scacchetti, Irene | University of Bonn |
Lobefaro, Luca | University of Bonn |
Guadagnino, Tiziano | University of Bonn |
Casseau, Benoit | University of Oxford |
Oh, Haedam | University of Oxford |
Freißmuth, Leonard | Technical University Munich |
Karppinen, Markus | PreFor Ltd |
Schweier, Janine | Swiss Federal Institute for Forest, Snow and Landscape Research |
Leutenegger, Stefan | Technical University of Munich |
Behley, Jens | University of Bonn |
Stachniss, Cyrill | University of Bonn |
Fallon, Maurice | University of Oxford |
Keywords: Robotics and Automation in Agriculture and Forestry, Data Sets for Robotic Vision
Abstract: Forests are vital to our ecosystems, acting as carbon sinks, climate stabilizers, biodiversity centers, and wood sources. Due to their scale, monitoring and managing forests is labor-intensive. Forestry robotics offers the potential for enabling efficient and sustainable forestry practices through automation. Despite increasing interest in this field, the scarcity of robotics datasets and benchmarks in forest environments is hampering progress in this domain. In this paper, we present a real-world, longitudinal dataset for forestry robotics that enables the development and comparison of approaches for various relevant applications, ranging from semantic interpretation to estimating traits relevant to forestry management. The dataset consists of multiple recordings of the same plots in a forest in Switzerland during three different growth periods. We recorded the data with a mobile 3D LiDAR scanning setup. Additionally, we provide semantic annotations of trees, shrubs, and ground, instance-level annotations of trees, as well as more fine-grained annotations of tree stems and crowns. Furthermore, we provide reference field measurements of traits relevant to forestry management for a subset of the trees. Together with the data, we also provide open-source baseline panoptic segmentation and tree trait estimation approaches to enable the community to bootstrap further research and simplify comparisons in this domain.
|
|
11:20-11:25, Paper TuBT3.2 | Add to My Program |
Near Time-Optimal Hybrid Motion Planning for Timber Cranes |
|
Ecker, Marc-Philip | TU Wien, Austrian Institute of Technology |
Bischof, Bernhard | Austrian Institute of Technology |
Vu, Minh Nhat | TU Wien, Austria |
Froehlich, Christoph | Austrian Institute of Technology |
Glück, Tobias | AIT Austrian Institute of Technology GmbH |
Kemmetmueller, Wolfgang | TU Wien |
Keywords: Robotics and Automation in Agriculture and Forestry, Motion and Path Planning
Abstract: Efficient, collision-free motion planning is essential for automating large-scale manipulators like timber cranes. They come with unique challenges such as hydraulic actuation constraints and passive joints—factors that are seldom addressed by current motion planning methods. This paper introduces a novel approach for time-optimal, collision-free hybrid motion planning for a hydraulically actuated timber crane with passive joints. We enhance the via-point-based stochastic trajectory optimization (VP-STO) algorithm to include pump flow rate constraints and develop a novel collision cost formulation to improve robustness. The effectiveness of the enhanced VP-STO as an optimal single-query global planner is validated by comparison with an informed RRT* algorithm using a time-optimal path parameterization (TOPP). The overall hybrid motion planner combines this global planner with a gradient-based local planner that follows the global reference and systematically accounts for the passive joint dynamics in both collision avoidance and sway damping.
|
|
11:25-11:30, Paper TuBT3.3 | Add to My Program |
An Ultra-Light Seedling Planting Mechanism for Use in Aerial Reforestation |
|
Lloyd, Steffan | Norwegian Institute of Bioeconomy Research (NIBIO) |
Astrup, Rasmus | Norwegian Institute for Bioeconomy Research (NIBIO) |
Keywords: Robotics and Automation in Agriculture and Forestry, Aerial Systems: Applications, Mechanism Design
Abstract: This article presents a novel, ultralight tree planting mechanism for use on an aerial vehicle. Current tree planting operations are typically performed manually, and existing automated solutions use large land-based vehicles or excavators which cause significant site damage and are limited to open, clear-cut plots. Our device uses a high-pressure compressed air power system and a novel double-telescoping design to achieve a weight of only 8 kg: well within the payload capacity of medium to large drones. This article describes the functionality and key components of the device and validates its feasibility through experimental testing. We propose this mechanism as a cost-effective, highly scalable solution that avoids ground damage, produces minimal emissions, and can operate equally well on open clear-cut sites as in denser, selectively-harvested forests.
|
|
11:30-11:35, Paper TuBT3.4 | Add to My Program |
Towards Autonomous Wood-Log Grasping with a Forestry Crane: Simulator and Benchmarking |
|
Vu, Minh Nhat | TU Wien, Austria |
Wachter, Alexander | TU Wien |
Ebmer, Gerald | TU Wien |
Ecker, Marc-Philip | TU Wien, Austrian Institute of Technology |
Glück, Tobias | AIT Austrian Institute of Technology GmbH |
Nguyen, Anh | University of Liverpool |
Kemmetmueller, Wolfgang | TU Wien |
Kugi, Andreas | TU Wien |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation
Abstract: Forestry machines operated in forest production environments face challenges when performing manipulation tasks, especially regarding the complicated dynamics of underactuated crane systems and the different sizes of logs to be grasped. This study investigates the feasibility of using reinforcement learning for forestry crane manipulators in grasping and lifting a varying-diameter wood log in a simulation environment. The MuJoCo physics engine creates realistic scenarios, including modeling a forestry crane with 8 degrees of freedom from CAD data and wood logs of different sizes. Our results show the successful implementation of a velocity controller for log grasping by deep reinforcement learning using a curriculum strategy. Given the six degree-of-freedom (6-DoF) pose of the wood log, i.e., its 3D Cartesian position and orientation, the proposed control strategy exhibits a success rate of 96% when grasping logs of different diameters and under random initial configurations of the forestry crane. In addition, reward functions and reinforcement learning baselines are investigated to provide an open-source benchmark for the community in large-scale manipulation tasks. A video with several demonstrations can be seen at https://www.acin.tuwien.ac.at/en/d18a/.
|
|
11:35-11:40, Paper TuBT3.5 | Add to My Program |
Designing Experimental Setup Emulating Log-Loader Manipulator and Implementing Anti-Sway Trajectory Planner |
|
Jebellat, Iman | McGill University |
Sideris, George | McGill University |
Saif, Rafid | McGill University |
Sharf, Inna | McGill University |
Keywords: Robotics and Automation in Agriculture and Forestry, Manipulation Planning, Mechanism Design
Abstract: Forestry machines are not easily accessible for experimentation or demonstration of research results. These mobile robots are massive, very expensive, and require a large outdoor space and permits to operate. These factors hinder conducting experiments on real forestry robots. Thus, it is essential to design experimental setups utilizing easily accessible robots in indoor labs that can effectively replicate the behavior of interest of a forestry machine. We design a setup to resemble log-loader crane and grapple motions using a Kinova Jaco2 arm by manufacturing a specialized end-effector that attaches passively to the Jaco2 arm. The passively attached grapple causes undesirable sway, which is problematic and dangerous in forestry operations. To address the sway problem, we employ dynamic programming to develop an anti-sway motion planner and validate its performance for different point-to-point maneuvers in our experimental setup. We also repeat each experiment at least 6 times to ensure the repeatability and reliability of the experiments. The experimental results showcase the excellent sway-damping performance of our planner and also the very good repeatability of our experiments.
|
|
11:40-11:45, Paper TuBT3.6 | Add to My Program |
FRAME: A Modular Framework for Autonomous Map Merging: Advancements in the Field (I) |
|
Stathoulopoulos, Nikolaos | Luleå University of Technology |
Lindqvist, Björn | Luleå University of Technology |
Koval, Anton | Luleå University of Technology |
Agha-mohammadi, Ali-akbar | NASA-JPL, Caltech |
Nikolakopoulos, George | Luleå University of Technology |
Keywords: Multi-Robot Systems, Field Robots
Abstract: In this article, a novel approach for merging 3-D point cloud maps in the context of egocentric multirobot exploration is presented. Unlike traditional methods, the proposed approach leverages state-of-the-art place recognition and learned descriptors to efficiently detect overlap between maps, eliminating the need for the time-consuming global feature extraction and feature matching process. The estimated overlapping regions are used to calculate a homogeneous rigid transform, which serves as an initial condition for the general iterative closest point (GICP) point cloud registration algorithm to refine the alignment between the maps. The advantages of this approach include faster processing time, improved accuracy, and increased robustness in challenging environments. Furthermore, the effectiveness of the proposed framework is successfully demonstrated through multiple field missions of robot exploration in a variety of different underground environments.
|
|
TuBT4 Regular Session, 304 |
Add to My Program |
Vision-Based Tactile Sensors 2 |
|
|
Chair: Moghadam, Peyman | CSIRO |
Co-Chair: Jenkin, Michael | York University |
|
11:15-11:20, Paper TuBT4.1 | Add to My Program |
Evetac: An Event-Based Optical Tactile Sensor for Robotic Manipulation |
|
Funk, Niklas Wilhelm | TU Darmstadt |
Helmut, Erik | Technische Universität Darmstadt |
Chalvatzaki, Georgia | Technische Universität Darmstadt |
Calandra, Roberto | TU Dresden |
Peters, Jan | Technische Universität Darmstadt |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Deep Learning in Robotics and Automation, Event-based Sensing
Abstract: Optical tactile sensors have recently become popular. They provide high spatial resolution, but struggle to offer fine temporal resolutions. To overcome this shortcoming, we study the idea of replacing the RGB camera with an event-based camera and introduce a new event-based optical tactile sensor called Evetac. Along with hardware design, we develop touch processing algorithms to process its measurements online at 1000 Hz. We devise an efficient algorithm to track the elastomer’s deformation through the imprinted markers despite the sensor’s sparse output. Benchmarking experiments demonstrate Evetac’s capabilities of sensing vibrations up to 498 Hz, reconstructing shear forces, and significantly reducing data rates compared to RGB optical tactile sensors. Moreover, Evetac’s output and the marker tracking provide meaningful features for learning data-driven slip detection and prediction models. The learned models form the basis for a robust and adaptive closed-loop grasp controller capable of handling a wide range of objects. We believe that fast and efficient event-based tactile sensors like Evetac will be essential for bringing human-like manipulation capabilities to robotics.
|
|
11:20-11:25, Paper TuBT4.2 | Add to My Program |
Shape-Space Deformer: Unified Visuo-Tactile Representations for Robotic Manipulation of Deformable Objects |
|
Collins, Sean Michael Varian | CSIRO |
Tidd, Brendan | CSIRO |
Baktashmotlagh, Mahsa | UQ |
Moghadam, Peyman | CSIRO |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Representation Learning
Abstract: Accurate modeling of object deformations is crucial for a wide range of robotic manipulation tasks, where interacting with soft or deformable objects is essential. Current methods struggle to generalize to unseen forces or adapt to new objects, limiting their utility in real-world applications. We propose Shape-Space Deformer, a unified representation for encoding a diverse range of object deformations using template augmentation to achieve robust, fine-grained reconstructions that are resilient to outliers and unwanted artifacts. Our method improves generalization to unseen forces and can rapidly adapt to novel objects, significantly outperforming existing approaches. We perform extensive experiments to test a range of force generalisation settings and evaluate our method's ability to reconstruct unseen deformations, demonstrating significant improvements in reconstruction accuracy and robustness. Our approach is suitable for real-time performance, making it ready for downstream manipulation applications.
|
|
11:25-11:30, Paper TuBT4.3 | Add to My Program |
Depth Estimation through Translucent Surfaces |
|
Dai, Siyu | Amazon |
Lou, Xibai | Amazon.com LLC |
Nilsson, Petter | Amazon |
Thakar, Shantanu | Amazon.com |
Meeker, Cassie | Columbia University |
Gordon, Ariel | Amazon |
Kong, Xiangxin | Amazon |
Zhang, Jenny | Amazon Robotics |
Knoerlein, Benjamin | Amazon |
Ruguan, Liu | Amazon.com |
Chandrashekhar, Bhavana mysore | Amazon |
Karumanchi, Sisir | Amazon |
Keywords: Perception for Grasping and Manipulation, Data Sets for Robotic Vision, Logistics
Abstract: In this paper, we tackle the novel computer vision problem of depth estimation through a translucent barrier. This is an important problem for robotics when manipulating objects through plastic wrapping, or when predicting the depth of items behind a translucent barrier for manipulation. We propose two approaches for providing depth prediction models the ability to see through translucent barriers: removing translucent barriers through image inpainting before passing to standard depth prediction models as input, and directly training depth models with images with translucent barriers. We show that image inpainting allows standard learned monocular and stereo depth estimation models to achieve 3 cm MAE for predicting depth of shelved items behind plastic, whereas training with real images with translucent barriers allows them to achieve centimeter or sub-centimeter MAE. We demonstrate in real robot experiments that depth-aided space estimation allows the robot to place 46% additional items into shelves with translucent barriers. This paper also provides a publicly available dataset of objects occluded by translucent barriers in a tabletop environment and a shelf environment which will allow others to contribute to this novel problem that's critical for many robotic manipulation applications including suction gripping and item packing.
|
|
11:30-11:35, Paper TuBT4.4 | Add to My Program |
Multimodal and Force-Matched Imitation Learning with a See-Through Visuotactile Sensor |
|
Ablett, Trevor | University of Toronto |
Limoyo, Oliver | University of Toronto |
Sigal, Adam | McGill University |
Jilani, Affan | McGill University |
Kelly, Jonathan | University of Toronto |
Siddiqi, Kaleem | McGill University |
Hogan, Francois | Massachusetts Institute of Technology |
Dudek, Gregory | McGill University |
Keywords: Force and Tactile Sensing, Learning from Demonstration, Deep Learning in Robotics and Automation, Imitation Learning
Abstract: Contact-rich tasks continue to present many challenges for robotic manipulation. In this work, we leverage a multimodal visuotactile sensor within the framework of imitation learning (IL) to perform contact-rich tasks that involve relative motion (e.g., slipping and sliding) between the end-effector and the manipulated object. We introduce two algorithmic contributions, tactile force matching and learned mode switching, as complementary methods for improving IL. Tactile force matching enhances kinesthetic teaching by reading approximate forces during the demonstration and generating an adapted robot trajectory that recreates the recorded forces. Learned mode switching uses IL to couple visual and tactile sensor modes with the learned motion policy, simplifying the transition from reaching to contacting. We perform robotic manipulation experiments on four door-opening tasks with a variety of observation and algorithm configurations to study the utility of multimodal visuotactile sensing and our proposed improvements. Our results show that the inclusion of force matching raises average policy success rates by 62.5%, visuotactile mode switching by 30.3%, and visuotactile data as a policy input by 42.5%, emphasizing the value of see-through tactile sensing for IL, both for data collection to allow force matching, and for policy execution to enable accurate task feedback.
|
|
11:35-11:40, Paper TuBT4.5 | Add to My Program |
DotTip: Enhancing Dexterous Robotic Manipulation with a Tactile Fingertip Featuring Curved Perceptual Morphology |
|
Zheng, Haoran | Zhejiang University |
Shi, Xiaohang | Zhejiang University |
Bao, Ange | Zhejiang University |
Jin, Yongbin | ZJU-Hangzhou Global Scientific and Technological Innovation Center |
Zhao, Pei | Zhejiang University |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Dexterous Manipulation
Abstract: Tactile sensing technologies enable robots to interact with the environment in increasingly nuanced and dexterous ways. A significant gap in this domain is the absence of curved tactile sensors, which are essential for performing sophisticated manipulation tasks. In this study, we present DotTip, a tactile fingertip featuring a three-dimensional curved perceptual surface that closely mimics human fingertip morphology. A convolutional neural network-based deep learning framework precisely calculates the contact angles and forces from the sensor tactile images, achieving mean errors of 1.56 degrees and 0.28 N, respectively. DotTip's performance is evaluated in real-world tasks, demonstrating its efficacy in tactile servoing, slip prevention, and grasping, along with the more challenging benchmark task of controlling a joystick. These findings demonstrate that DotTip possesses superior 3D tactile sensing capabilities necessary for fine-grained dexterous manipulations compared to its flat counterparts.
|
|
11:40-11:45, Paper TuBT4.6 | Add to My Program |
Visual-Tactile Inference of 2.5D Object Shape from Marker Texture |
|
Jilani, Affan | McGill University |
Hogan, Francois | Massachusetts Institute of Technology |
Morissette, Charlotte | McGill University |
Dudek, Gregory | McGill University |
Jenkin, Michael | York University |
Siddiqi, Kaleem | McGill University |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Recognition
Abstract: Visual-tactile sensing affords abundant capabilities for contact-rich object manipulation tasks including grasping and placing. Here we introduce a shape-from-texture inspired contact shape estimation approach for visual-tactile sensors equipped with visually distinct membrane markers. Under a perspective projection camera model, measurements related to the change in marker separation upon contact are used to recover surface shape. Our approach allows for shape sensing in real time, without requiring network training or complex assumptions related to lighting, sensor geometry or marker placement. Experiments show that the surface contact shape recovered is qualitatively and quantitatively consistent with those obtained through the use of photometric stereo, the current state of the art for shape recovery in visual-tactile sensors. Importantly, our approach is applicable to a large family of sensors not equipped with photometric stereo hardware, and also to those with semi-transparent membranes. The recovery of surface shape affords new capabilities to these sensors for robotic applications, such as the estimation of contact and slippage in object manipulation tasks and the use of force matching for kinesthetic teaching using multimodal visual-tactile sensing.
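The core geometric relation the abstract appeals to is that, under a pinhole camera, the membrane depth at a marker pair is proportional to the known physical marker spacing divided by its apparent pixel spacing. A tiny worked example follows; the numbers and names are illustrative assumptions, not the paper's estimator.

```python
# Worked example: depth from apparent marker spacing under a pinhole model.
def depth_from_marker_spacing(focal_px, marker_spacing_m, pixel_spacing_px):
    # Z = f * d_world / d_image for a roughly fronto-parallel marker pair.
    return focal_px * marker_spacing_m / pixel_spacing_px

# e.g. f = 600 px and a 1.5 mm marker pitch observed 18 px apart -> Z = 0.05 m
print(depth_from_marker_spacing(600.0, 0.0015, 18.0))
```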
|
|
TuBT5 Regular Session, 305 |
Add to My Program |
Aerial Robots 2 |
|
|
Chair: Smeur, Ewoud | Delft University of Technology |
Co-Chair: Weiss, Stephan | Universität Klagenfurt |
|
11:15-11:20, Paper TuBT5.1 | Add to My Program |
STHN: Deep Homography Estimation for UAV Thermal Geo-Localization with Satellite Imagery |
|
Xiao, Jiuhong | New York University |
Zhang, Ning | TII |
Tortei, Daniel | Technology Innovation Institute |
Loianno, Giuseppe | New York University |
Keywords: Deep Learning for Visual Perception, Aerial Systems: Applications, Localization
Abstract: Accurate geo-localization of Unmanned Aerial Vehicles (UAVs) is crucial for outdoor applications including search and rescue operations, power line inspections, and environmental monitoring. The vulnerability of Global Navigation Satellite Systems (GNSS) signals to interference and spoofing necessitates the development of additional robust localization methods for autonomous navigation. Visual Geo-localization (VG), leveraging onboard cameras and reference satellite maps, offers a promising solution for absolute localization. Specifically, Thermal Geo-localization (TG), which relies on image-based matching between thermal imagery and satellite databases, stands out by utilizing infrared cameras for effective nighttime localization. However, the efficiency and effectiveness of current TG approaches are hindered by dense sampling on satellite maps and geometric noise in thermal query images. To overcome these challenges, we introduce STHN, a novel UAV thermal geo-localization approach that employs a coarse-to-fine deep homography estimation method. This method attains reliable thermal geo-localization within a 512-meter radius of the UAV's last known location even with a challenging 11% size ratio between thermal and satellite images, despite the presence of indistinct textures and self-similar patterns. We further show how our research significantly enhances UAV thermal geo-localization performance and robustness against geometric noise under low-visibility conditions in the wild. The code is made publicly available.
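One way to picture the final localization step, assuming a homography H mapping the thermal query image into the satellite reference crop has already been estimated: project the query-image centre to obtain the UAV position in satellite-crop pixels. This is a hedged illustration, not the STHN pipeline itself.

```python
# Sketch: UAV position from an estimated thermal-to-satellite homography.
import numpy as np

def localize_with_homography(H, query_w, query_h):
    center = np.array([query_w / 2.0, query_h / 2.0, 1.0])  # query centre, homogeneous
    p = H @ center
    return p[:2] / p[2]   # position in satellite-crop pixel coordinates
```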
|
|
11:20-11:25, Paper TuBT5.2 | Add to My Program |
Vision Transformers for End-To-End Vision-Based Quadrotor Obstacle Avoidance |
|
Bhattacharya, Anish | University of Pennsylvania, GRASP |
Rao, Nishanth Arun | University of Pennsylvania |
Parikh, Dhruv Ketan | University of Pennsylvania |
Kunapuli, Pratik | University of Pennsylvania |
Wu, Yuwei | University of Pennsylvania |
Tao, Yuezhan | University of Pennsylvania |
Matni, Nikolai | University of Pennsylvania |
Kumar, Vijay | University of Pennsylvania |
Keywords: Vision-Based Navigation, Deep Learning for Visual Perception, Aerial Systems: Perception and Autonomy
Abstract: We demonstrate the capabilities of an attention-based end-to-end approach for high-speed quadrotor obstacle avoidance in dense, cluttered environments, with comparison to various state-of-the-art learning architectures. Quadrotor unmanned aerial vehicles (UAVs) have tremendous maneuverability when flown fast; however, as flight speed increases, traditional vision-based navigation via independent mapping, planning, and control modules breaks down due to increased sensor noise, compounding errors, and increased processing latency. Thus, learning-based, end-to-end vision-to-control networks have been shown to be effective for online control of these fast robots through cluttered environments. We train and compare convolutional, U-Net, and recurrent architectures against vision transformer (ViT) models for depth image-to-control in high-fidelity simulation, observing that ViT models are more effective than others as quadrotor speeds increase and in generalization to unseen environments, while the addition of recurrence further improves performance while reducing quadrotor energy cost across all speeds. We assess performance at speeds of 1-7 m/s in simulation and hardware. To the best of our knowledge, this is the first work to utilize vision transformers for end-to-end vision-based quadrotor control.
|
|
11:25-11:30, Paper TuBT5.3 | Add to My Program |
DroneDiffusion: Robust Quadrotor Dynamics Learning with Diffusion Models |
|
Das, Avirup | University of Manchester |
Yadav, Rishabh Dev | The University of Manchester |
Sun, Sihao | Delft University of Technology |
Sun, Mingfei | The University of Manchester |
Kaski, Samuel | Aalto University, University of Manchester |
Pan, Wei | The University of Manchester |
Keywords: Machine Learning for Robot Control, Model Learning for Control, Robust/Adaptive Control
Abstract: An inherent fragility of quadrotor systems stems from model inaccuracies and external disturbances. These factors hinder performance and compromise the stability of the system, making precise control challenging. Existing model-based approaches either make deterministic assumptions, utilize Gaussian-based representations of uncertainty, or rely on nominal models, all of which often fall short in capturing the complex, multimodal nature of real-world dynamics. This work introduces DroneDiffusion, a novel framework that leverages conditional diffusion models to learn quadrotor dynamics, formulated as a sequence generation task. DroneDiffusion achieves superior generalization to unseen, complex scenarios by capturing the temporal nature of uncertainties and mitigating error propagation. We integrate the learned dynamics with an adaptive controller for trajectory tracking with stability guarantees. Extensive experiments in both simulation and real-world flights demonstrate the robustness of the framework across a range of scenarios, including unfamiliar flight paths and varying payloads, velocities, and wind disturbances.
|
|
11:30-11:35, Paper TuBT5.4 | Add to My Program |
FlightForge: Advancing UAV Research with Procedural Generation of High-Fidelity Simulation and Integrated Autonomy |
|
Čapek, David | Czech Technical University in Prague |
Hrnčíř, Jan | Czech Technical University in Prague |
Baca, Tomas | Ceske Vysoke Uceni Technicke V Praze, FEL |
Jirkal, Jakub | Czech Technical University in Prague |
Vonasek, Vojtech | Czech Technical University in Prague |
Penicka, Robert | Czech Technical University in Prague |
Saska, Martin | Czech Technical University in Prague |
Keywords: Software Tools for Benchmarking and Reproducibility, Aerial Systems: Perception and Autonomy, Software, Middleware and Programming Environments
Abstract: Robotic simulators play a crucial role in the development and testing of autonomous systems, particularly in the realm of Uncrewed Aerial Vehicles (UAVs). However, existing simulators often lack high-level autonomy, hindering their immediate applicability to complex tasks such as autonomous navigation in unknown environments. This limitation stems from the challenge of integrating realistic physics, photorealistic rendering, and diverse sensor modalities into a single simulation environment. At the same time, existing photorealistic UAV simulators mostly use hand-crafted environments of limited size, which prevents the testing of long-range missions. This restricts the usage of existing simulators to only low-level tasks such as control and collision avoidance. To this end, we propose FlightForge, a novel open-source UAV simulator. FlightForge offers advanced rendering capabilities, diverse control modalities, and, foremost, procedural generation of environments. Moreover, the simulator is already integrated with a fully autonomous UAV system capable of long-range flights in cluttered unknown environments. The key innovation lies in novel procedural environment generation and seamless integration of high-level autonomy into the simulation environment. Experimental results demonstrate superior sensor rendering capability compared to existing simulators, as well as the ability to navigate autonomously in almost infinite environments.
|
|
11:35-11:40, Paper TuBT5.5 | Add to My Program |
AIVIO: Closed-Loop, Object-Relative Navigation of UAVs with AI-Aided Visual Inertial Odometry |
|
Jantos, Thomas | University of Klagenfurt |
Scheiber, Martin | University of Klagenfurt |
Brommer, Christian | University of Klagenfurt |
Allak, Eren | University of Klagenfurt |
Weiss, Stephan | Universität Klagenfurt |
Steinbrener, Jan | Universität Klagenfurt |
Keywords: AI-Based Methods, Vision-Based Navigation, Autonomous Vehicle Navigation
Abstract: Object-relative mobile robot navigation is essential for a variety of tasks, e.g. autonomous critical infrastructure inspection, but requires the capability to extract semantic information about the objects of interest from raw sensory data. While deep learning-based (DL) methods excel at inferring semantic object information from images, such as class and relative 6 degree of freedom (6-DoF) pose, they are computationally demanding and thus often not suitable for payload constrained mobile robots. In this letter we present a real-time capable unmanned aerial vehicle (UAV) system for object-relative, closed-loop navigation with a minimal sensor configuration consisting of an inertial measurement unit (IMU) and RGB camera. Utilizing a DL-based object pose estimator, solely trained on synthetic data and optimized for companion board deployment, the object-relative pose measurements are fused with the IMU data to perform object-relative localization. We conduct multiple real-world experiments to validate the performance of our system for the challenging use case of power pole inspection. An example closed-loop flight is presented in the supplementary video.
|
|
11:40-11:45, Paper TuBT5.6 | Add to My Program |
Unified Incremental Nonlinear Controller for the Transition Control of a Hybrid Dual-Axis Tilting Rotor Quad-Plane |
|
Mancinelli, Alessandro | Delft University of Technology |
Remes, Bart | Delft University of Technology |
de Croon, Guido | Delft University of Technology |
Smeur, Ewoud | Delft University of Technology |
Keywords: Tilt rotor UAVs, Optimization and Optimal Control, Control Architectures and Programming, Aerial Systems: Mechanics and Control
Abstract: Overactuated Tilt Rotor Unmanned Aerial Vehicles are renowned for exceptional wind resistance and a broad operational range, which poses complex control challenges due to non-affine dynamics. Traditional solutions employ multi-state switched logic controllers for transitions. Our study introduces a novel unified incremental nonlinear controller for overactuated dual-axis tilting rotor quad-planes, seamlessly managing pitch, roll, and physical actuator commands. The nonlinear control allocation problem is addressed using a sequential quadratic programming iterative optimization algorithm, well-suited for nonlinear actuator effectiveness in thrust vectoring vehicles. The controller design integrates desired roll and pitch angle inputs as an additional degree of freedom during slow airspeed phases. At high airspeed, the roll and pitch angles cannot be chosen freely and are set by the controller. We incorporate an angle of attack protection logic to prevent wing stall and a yaw rate reference model for coordinated turns. Flight tests confirm the controller's effectiveness in transitioning from hovering to forward flight, achieving desired vertical and lateral accelerations, and reverting to hover.
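The nonlinear control-allocation step described above can be pictured as an iterative constrained least-squares problem. The sketch below uses SciPy's SLSQP solver (an SQP method) with a hypothetical actuator-effectiveness model and bounds; it is an illustration of the idea, not the flight-code implementation.

```python
# Sketch: nonlinear control allocation for a thrust-vectoring vehicle.
import numpy as np
from scipy.optimize import minimize

def allocate(wrench_cmd, effectiveness, u0, bounds):
    """Find actuator commands u (e.g. rotor thrusts and tilt angles) whose nonlinear
    effectiveness best reproduces the commanded wrench, within actuator bounds."""
    cost = lambda u: float(np.sum((effectiveness(u) - wrench_cmd) ** 2))
    return minimize(cost, u0, method="SLSQP", bounds=bounds).x
```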
|
|
TuBT6 Regular Session, 307 |
Add to My Program |
Perception for Mobile Robots 2 |
|
|
Chair: O'Kane, Jason | Texas A&M University |
Co-Chair: Wang, Wenshan | Carnegie Mellon University |
|
11:15-11:20, Paper TuBT6.1 | Add to My Program |
Graph2Nav: 3D Object-Relation Graph Generation to Robot Navigation |
|
Shan, Tixiao | SRI International |
Rajvanshi, Abhinav | SRI International |
Mithun, Niluthpol Chowdhury | SRI International |
Chiu, Han-Pang | SRI International |
Keywords: Semantic Scene Understanding, AI-Enabled Robotics
Abstract: We propose Graph2Nav, a real-time 3D object-relation graph generation framework, for autonomous navigation in the real world. Our framework fully generates and exploits both 3D objects and a rich set of semantic relationships among objects in a 3D layered scene graph, which is applicable to both indoor and outdoor scenes. It learns to generate 3D semantic relations among objects by leveraging and advancing state-of-the-art 2D panoptic scene graph works into the 3D world via 3D semantic mapping techniques. This approach avoids the training-data constraints of prior methods that learn 3D scene graphs directly from 3D data. We conduct experiments to validate the accuracy in locating 3D objects and labeling object-relations in our 3D scene graphs. We also evaluate the impact of Graph2Nav by integrating it with SayNav, a state-of-the-art planner based on large language models, on an unmanned ground robot performing object search tasks in real environments. Our results demonstrate that modeling object relations in our scene graphs improves search efficiency in these navigation tasks.
|
|
11:20-11:25, Paper TuBT6.2 | Add to My Program |
Transferring Visual Knowledge: Semi-Supervised Instance Segmentation for Object Navigation across Varying Height Viewpoints |
|
Zheng, Qiu | The Chinese University of HongKong, Shenzhen |
Hu, Junjie | The Chinese University of Hong Kong, Shenzhen |
Liu, Yuming | The Chinese University of Hong Kong, Shenzhen |
Zeng, Zengfeng | Baidu Inc., Beijing, China |
Wang, Fan | Baidu International Technology (Shenzhen) Co., Ltd |
Lam, Tin Lun | The Chinese University of Hong Kong, Shenzhen |
Keywords: Object Detection, Segmentation and Categorization, Vision-Based Navigation, Autonomous Vehicle Navigation
Abstract: The object navigation task requires robots to understand the semantic regularities in their environments. However, existing modular object navigation frameworks rely on instance segmentation models trained at fixed camera height viewpoints, limiting generalization performance and increasing labeling costs for new height viewpoints. To tackle this issue, we propose a semi-supervised method that transfers knowledge from a source height to a target height, minimizing the need for additional labels. Our approach introduces three key innovations: i) a projection policy to enhance the teacher model's detection capabilities at the target height, ii) a dynamic weight mechanism that emphasizes high-confidence pseudo-labels to reduce overfitting, and iii) a prototype contrast transferring method to transfer knowledge effectively. Experiments on the Habitat-Matterport 3D (HM3D) dataset show our method outperforms state-of-the-art semi-supervised techniques, improving both segmentation accuracy and navigation performance. The code is available at: https://github.com/FreeformRobotics/TransferKnowledge.
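One way to implement the "dynamic weight" idea of emphasizing high-confidence pseudo-labels is sketched below; the focal-style weighting and all names are assumptions for illustration, not the released code.

```python
# Sketch: confidence-weighted pseudo-label loss for semi-supervised segmentation.
import torch
import torch.nn.functional as F

def weighted_pseudo_label_loss(student_logits, teacher_probs, gamma=2.0):
    conf, pseudo = teacher_probs.max(dim=1)               # per-pixel confidence / label
    ce = F.cross_entropy(student_logits, pseudo, reduction="none")
    return (conf.pow(gamma) * ce).mean()                  # down-weight uncertain labels
```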
|
|
11:25-11:30, Paper TuBT6.3 | Add to My Program |
An Algorithm for Geometric Navigation Planning under Uncertainty Using Terrain Boundary Detection |
|
Carley, Bennett | Texas A&M University |
Bamgbelu, Adeolayemi | Texas A&M University |
Zhang, XiMing | Texas A&M University |
O'Kane, Jason | Texas A&M University |
Keywords: Reactive and Sensor-Based Planning, Planning under Uncertainty, Marine Robotics
Abstract: We explore a navigation planning problem under uncertainty for a simple robot with extremely limited sensing. Our robot can turn subject to significant proportional error and move forward. As it moves in an environment with a known terrain map, the robot can detect changes in the terrain at its current position. Given an initial pose and a goal segment, the robot should find some sequence of actions to travel reliably from start to goal, if such a sequence exists. The resulting plan should guarantee the robot reaches the goal segment despite any movement errors experienced within some known error bound. In this paper, we propose an algorithm to find such an action sequence, implement and evaluate this algorithm, and present evidence for the feasibility of such an algorithm in an underwater navigation setting.
|
|
11:30-11:35, Paper TuBT6.4 | Add to My Program |
ProMi: An Efficient Prototype-Mixture Baseline for Few-Shot Segmentation with Bounding-Box Annotations |
|
Chiaroni, Florent | Thales |
Ayub, Ali | Concordia University |
Ahmad, Ola | Thales Canada |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, Learning Categories and Concepts
Abstract: In robotics applications, few-shot segmentation is crucial because it allows robots to perform complex tasks with minimal training data, facilitating their adaptation to diverse, real-world environments. However, pixel-level annotation of even a small number of images is highly time-consuming and costly. In this paper, we present a novel few-shot binary segmentation method based on bounding-box annotations instead of pixel-level labels. We introduce ProMi, an efficient prototype-mixture-based method that treats the background class as a mixture of distributions. Our approach is simple, training-free, and effective, accommodating coarse annotations with ease. Compared to existing baselines, ProMi achieves the best results across different datasets with significant gains, demonstrating its effectiveness. Furthermore, we present qualitative experiments tailored to real-world mobile robot tasks, demonstrating the applicability of our approach in such scenarios. Our code: https://github.com/ThalesGroup/promi.
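In the spirit of the prototype-mixture idea, a training-free baseline can build one foreground prototype from pixels inside the support boxes and a small mixture of background prototypes from pixels outside, then assign each query pixel to its nearest prototype. The sketch below is an assumption-laden illustration, not the ProMi code.

```python
# Illustrative prototype-mixture few-shot segmentation from box-level supervision.
import numpy as np

def segment_with_prototypes(support_feats, fg_mask, query_feats, n_bg=3, iters=10):
    """support_feats, query_feats: (H*W, C) L2-normalized features;
    fg_mask: (H*W,) bool marking pixels inside the support bounding boxes."""
    fg_proto = support_feats[fg_mask].mean(0, keepdims=True)
    bg = support_feats[~fg_mask]
    bg_protos = bg[np.random.choice(len(bg), n_bg, replace=False)]
    for _ in range(iters):                                  # simple k-means for the bg mixture
        assign = np.argmax(bg @ bg_protos.T, axis=1)
        bg_protos = np.stack([bg[assign == k].mean(0) if np.any(assign == k) else bg_protos[k]
                              for k in range(n_bg)])
    protos = np.vstack([fg_proto, bg_protos])
    return np.argmax(query_feats @ protos.T, axis=1) == 0   # True where foreground wins
```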
|
|
11:35-11:40, Paper TuBT6.5 | Add to My Program |
IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes |
|
Zhang, Haochen | Carnegie Mellon University |
Zantout, Nader | Carnegie Mellon University |
Kachana, Pujith | Carnegie Mellon University |
Zhang, Ji | Carnegie Mellon University |
Wang, Wenshan | Carnegie Mellon University |
Keywords: Semantic Scene Understanding, Data Sets for Robotic Vision, Vision-Based Navigation
Abstract: With the recent rise of large language models, vision-language models, and other general foundation models, there is growing potential for multimodal, multi-task robotics that can operate in diverse environments given natural language input. One such application is indoor navigation using natural language instructions. However, despite recent progress, this problem remains challenging due to the 3D spatial reasoning and semantic understanding required. Additionally, the language used may be imperfect or misaligned with the scene, further complicating the task. To address this challenge, we curate a benchmark dataset, IRef-VLA, for Interactive Referential Vision and Language-guided Action in 3D Scenes with imperfect references. IRef-VLA is the largest real-world dataset for the referential grounding task, consisting of over 11.5K scanned 3D rooms from existing datasets, 7.6M heuristically generated semantic relations, and 4.7M referential statements. Our dataset also contains semantic object and room annotations, scene graphs, navigable free space annotations, and is augmented with statements where the language has imperfections or ambiguities. We verify the generalizability of our dataset by evaluating with state-of-the-art models to obtain a performance baseline and also develop a graph-search baseline to demonstrate the performance bound and generation of alternatives using scene-graph knowledge. With this benchmark, we aim to provide a resource for 3D scene understanding that aids the development of robust, interactive navigation systems. The dataset and all source code are publicly released.
|
|
11:40-11:45, Paper TuBT6.6 | Add to My Program |
PTS-Map: Probabilistic Terrain State Map for Uncertainty-Aware Traversability Mapping in Unstructured Environments |
|
Kim, Dong-Wook | Seoul National University |
Son, E-In | Seoul National University |
Kim, Chan | Seoul National University |
Hwang, Ji-Hoon | Seoul National University |
Seo, Seung-Woo | Seoul National University |
Keywords: Field Robots, Autonomous Vehicle Navigation, Probability and Statistical Methods
Abstract: Traversability mapping for autonomous navigation in unstructured environments has been widely investigated for decades. However, it remains challenging due to the uncertainty in geometry perception and the simplified representation of traversability maps that fail to capture detailed structures of environments. We propose PTS-Map, a 2.5D probabilistic terrain state map to address these issues. PTS-Map sequentially updates the ground surface state and above-ground elevation state, explicitly distinguishing the geometric features of ground and obstacles. During state updates, we introduce a novel ground uncertainty estimation to mitigate the effects of unreliable feature measurements. By effectively designing the terrain states and addressing the uncertainty of the ground surface, PTS-Map constructs a temporally consistent traversability map that provides precise ground conditions and vertical features relevant to navigation. Experiments are conducted in various large-scale unstructured environments with distinct characteristics. PTS-Map outperforms other state-of-the-art methods in success rate and efficiency by constructing a precise traversability map of the environments.
|
|
TuBT7 Regular Session, 309 |
Add to My Program |
Marine Robotics 1 |
|
|
Chair: Fischer, Tobias | Queensland University of Technology |
Co-Chair: Clement, Benoit | ENSTA, Institut Polytechnique De Paris |
|
11:15-11:20, Paper TuBT7.1 | Add to My Program |
Efficient Non-Myopic Layered Bayesian Optimization for Large-Scale Bathymetric Informative Path Planning |
|
Wallén Kiessling, Alexander | Royal Institute of Technology (KTH) |
Torroba Balmori, Ignacio | KTH Royal Institute of Technology |
Sidrane, Chelsea | KTH Royal Institute of Technology |
Stenius, Ivan | KTH |
Tumova, Jana | KTH Royal Institute of Technology |
Folkesson, John | KTH |
Keywords: Marine Robotics, Reactive and Sensor-Based Planning, Mapping
Abstract: Informative path planning (IPP) applied to bathymetric mapping allows AUVs to focus on feature-rich areas to quickly reduce uncertainty and increase mapping efficiency. Existing methods based on Bayesian optimization (BO) over Gaussian Process (GP) maps work well in small scenarios, but they are short-sighted and computationally heavy when mapping larger areas, hindering deployment in real applications. To overcome this, we present a 2-layered BO IPP method that performs non-myopic, real-time planning in a tree-search fashion over large Stochastic Variational GP maps, while respecting the AUV motion constraints and accounting for localization uncertainty. Our framework outperforms the standard industrial lawn-mowing pattern and a myopic baseline in a set of hardware-in-the-loop (HIL) experiments on an embedded platform over real bathymetry.
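For reference, the myopic building block that such a planner extends is a Bayesian-optimization-style acquisition over a GP depth map: pick the next sounding location where the map is most uncertain. The sketch below uses a plain scikit-learn GP rather than the large stochastic variational GP maps described above, and all names are illustrative assumptions.

```python
# Minimal myopic acquisition over a GP bathymetry map (illustrative only).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def next_waypoint(X_obs, depth_obs, candidates):
    gp = GaussianProcessRegressor().fit(X_obs, depth_obs)
    _, std = gp.predict(candidates, return_std=True)
    return candidates[np.argmax(std)]   # survey where predictive uncertainty is highest
```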
|
|
11:20-11:25, Paper TuBT7.2 | Add to My Program |
Visual Lidar Recursive Online Tracker (ViLiROT) for Autonomous Surface Vessels |
|
Hilmarsen, Henrik | NTNU |
Dalhaug, Nicholas | Norwegian University of Science and Technology |
Nygård, Trym Anthonsen | NTNU |
Brekke, Edmund | NTNU |
Stahl, Annette | Norwegian University of Science and Technology (NTNU) |
Mester, Rudolf | NTNU Trondheim |
Keywords: Marine Robotics, Visual Tracking, Collision Avoidance
Abstract: We propose a multi-sensor fusion pipeline for multiple object tracking in autonomous surface vessels using lidar and camera data. Our approach follows the tracking-by-detection paradigm, leveraging the precision of lidar for accurate state estimation and camera data for robust association. The method addresses issues with false tracks from lidar returns by suppressing non-moving objects on the basis of optical flow. We compare the proposed pipeline against prior work, particularly in the use of lidar and stereo cameras as depth modalities, demonstrating its effectiveness in improving tracking performance.
|
|
11:25-11:30, Paper TuBT7.3 | Add to My Program |
Open-Set Semantic Uncertainty Aware Metric-Semantic Graph Matching |
|
Singh, Kurran | Massachusetts Institute of Technology |
Leonard, John | MIT |
Keywords: Marine Robotics
Abstract: Underwater object-level mapping requires incorporating visual foundation models to handle the uncommon and often previously unseen object classes encountered in marine scenarios. In this work, a metric of semantic uncertainty for open-set object detections produced by visual foundation models is calculated and then incorporated into an object-level uncertainty tracking framework. Object-level uncertainties and geometric relationships between objects are used to enable robust object-level loop closure detection for unknown object classes. The above loop closure detection problem is formulated as a graph matching problem. While graph matching, in general, is NP-Complete, a solver for an equivalent formulation of the proposed graph matching problem as a graph editing problem is tested on multiple challenging underwater scenes. Results for this solver as well as three other solvers demonstrate that the proposed methods are feasible for real-time use in marine environments for robust, open-set, multi-object, semantic-uncertainty-aware loop closure detection. Further experimental results on the KITTI dataset demonstrate that the method generalizes to large-scale terrestrial scenes.
|
|
11:30-11:35, Paper TuBT7.4 | Add to My Program |
AI-Enhanced Automatic Design of Efficient Underwater Gliders |
|
Chen, Peter Yichen | MIT |
Ma, Pingchuan | MIT CSAIL |
Hagemann, Niklas | Massachusetts Institute of Technology |
Romanishin, John | MIT |
Wang, Wei | University of Wisconsin-Madison |
Rus, Daniela | MIT |
Matusik, Wojciech | MIT |
Keywords: Deep Learning Methods, Machine Learning for Robot Control, Marine Robotics
Abstract: The development of novel autonomous underwater gliders has been hindered by limited shape diversity, primarily due to the reliance on traditional design tools that depend heavily on manual trial and error. Building an automated design framework is challenging due to the complexities of representing glider shapes and the high computational costs associated with modeling complex solid-fluid interactions. In this work, we introduce an AI-enhanced automated computational framework designed to overcome these limitations by enabling the creation of underwater robots with non-trivial hull shapes. Our approach involves an algorithm that co-optimizes both shape and control signals, utilizing a reduced-order geometry representation and a differentiable neural-network-based fluid surrogate model. This end-to-end design workflow facilitates rapid iteration and evaluation of hydrodynamic performance, leading to the discovery of optimal and complex hull shapes across various control settings. We validate our method through wind tunnel experiments and swimming pool gliding tests, demonstrating that our computationally designed gliders surpass manually designed counterparts in terms of energy efficiency. By addressing challenges in efficient shape representation and neural fluid surrogate models, our work paves the way for the development of highly efficient underwater gliders, with significant implications for ocean exploration and environmental monitoring.
|
|
11:35-11:40, Paper TuBT7.5 | Add to My Program |
EnKode: Active Learning of Unknown Flows with Koopman Operators |
|
Li, Alice Kate | University of Pennsylvania |
Costa Silva, Thales | University of Pennsylvania |
Hsieh, M. Ani | University of Pennsylvania |
Keywords: Environment Monitoring and Management, Marine Robotics, Dynamics
Abstract: In this letter, we address the task of adaptive sampling to model vector fields. When modeling environmental phenomena with a robot, gathering high resolution information can be resource intensive. Actively gathering data and modeling flows with the data is a more efficient alternative. However, in such scenarios, data is often sparse and thus requires flow modeling techniques that are effective at capturing the relevant dynamical features of the flow to ensure high prediction accuracy of the resulting models. To accomplish this effectively, regions with high informative value must be identified. We propose EnKode, an active sampling approach based on Koopman Operator theory and ensemble methods that can build high quality flow models and effectively estimate model uncertainty. For modeling complex flows, EnKode provides comparable or better estimates of unsampled flow regions than Gaussian Process Regression models with hyperparameter optimization. Additionally, our active sensing scheme provides more accurate flow estimates than comparable strategies that rely on uniform sampling. We evaluate EnKode using three common benchmarking systems: the Bickley Jet, Lid-Driven Cavity flow with an obstacle, and real ocean currents from the National Oceanic and Atmospheric Administration (NOAA).
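A minimal sketch of the Koopman building block (extended dynamic mode decomposition, EDMD) follows: lift paired flow snapshots with a feature map and fit a linear operator K by least squares; an ensemble of such fits can then supply the uncertainty used for active sampling. The feature map and names are illustrative assumptions, not the EnKode implementation.

```python
# Sketch: EDMD fit of an approximate Koopman operator from snapshot pairs.
import numpy as np

def edmd(X, Y, lift):
    """X, Y: (N, d) snapshots, with Y the one-step successor of X."""
    PX, PY = lift(X), lift(Y)                    # (N, D) lifted observables
    K, *_ = np.linalg.lstsq(PX, PY, rcond=None)  # PY ≈ PX @ K
    return K

# Example lifting with trigonometric observables plus a constant term.
lift = lambda Z: np.hstack([Z, np.sin(Z), np.cos(Z), np.ones((len(Z), 1))])
```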
|
|
TuBT8 Regular Session, 311 |
Add to My Program |
Medical Robotics 2 |
|
|
Chair: Doulgeri, Zoe | Aristotle University of Thessaloniki |
Co-Chair: Liu, Fei | University of Tennessee Knoxville |
|
11:15-11:20, Paper TuBT8.1 | Add to My Program |
A Cylindrical Halbach Array Magnetic Actuation System for Longitudinal Robot Actuation across 2D Workplane |
|
Sun, Hongzhe | The Chinese University of Hong Kong |
Cheng, Shing Shin | The Chinese University of Hong Kong |
Keywords: Medical Robots and Systems, Automation at Micro-Nano Scales, Mechanism Design
Abstract: Magnetic actuation has been widely investigated for miniature robot control due to its wireless control capability. As a permanent magnet (PM) actuation system, the Halbach array can provide strong and controllable magnetic fields with a large motion workspace. However, existing cylindrical Halbach array systems can only generate axial force along their central axis and require the workspace (i.e., patient anatomy) to be manipulated inside the system for any useful robot manipulation, severely limiting their application in robotic surgery. In this work, we introduce a cylindrical Halbach array actuation system capable of generating a magnetic field with longitudinal gradients across a 2-dimensional (2D) workplane instead of only along the central axis, effectively extending the longitudinal force actuation coverage from 1D to a 2D plane. This is achieved by optimizing the magnet sizes and roll angles of the Halbach rows arranged circumferentially around the system. Co-alignment between the field and gradient directions is also achieved through proper configuration of the magnet pitch angles along each Halbach row, resulting in tip-leading robot motion capability. A series of model-based simulations were performed during the optimization process and later verified experimentally. The actuation system was experimentally demonstrated to stably drive a 2 mm-diameter magnetic robot longitudinally at different locations within the workplane and at different velocities. This represents a significant advancement towards deploying cylindrical Halbach array systems for robot manipulation in clinical cases.
|
|
11:20-11:25, Paper TuBT8.2 | Add to My Program |
Towards Autonomous Verification: Integrating Cognitive AI and Semantic Digital Twins in Medical Robotics |
|
Mania, Patrick | University of Bremen |
Neumann, Michael | Uni Bremen |
Kenghagho Kenfack, Franklin | University of Bremen |
Beetz, Michael | University of Bremen |
Keywords: Medical Robots and Systems, Service Robotics, Computer Vision for Medical Robotics
Abstract: In medical laboratory environments, where precision and safety are critical, the deployment of autonomous robots requires not only accurate object manipulation but also the ability to verify task success to comply with regulatory requirements. This paper introduces a novel imagination-enabled perception framework that integrates cognitive AI with semantic digital twins to allow medical robots to simulate task outcomes, compare them with real-world results, and autonomously verify the success of their actions. Our approach addresses challenges related to handling small and transparent objects commonly found in sterility testing kits and other related consumables. By enhancing the RoboKudo perception system with parthood-based reasoning, we enable more accurate task verification through focused attention on object subparts. Experiments show that our system significantly improves performance compared to traditional object-centric methods, increasing accuracy in complex environments without the need for extensive retraining. This work demonstrates a novel concept in making robotic systems more adaptable and reliable for critical tasks in medical laboratories.
|
|
11:25-11:30, Paper TuBT8.3 | Add to My Program |
SC-Former: A Segmentation Convolution Transformer for Lung Surgery Robots |
|
Li, Nanyu | Broncus Medical |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: For lung surgery robots, the precise segmentation of pulmonary fissures is very important. Damaging the interlobar fissures during surgery can have serious consequences. Accurately segmenting weak and abnormal fissures commonly found in clinical CT scans remains a challenging task. To solve this problem, we developed a novel Convolution Transformer for accurate fissure segmentation (SC-Former). The proposed SC-Former adopts an encoder, attention block, and decoder structure. First, we designed an encoder with a hybrid CNN-transformer block that ingeniously amalgamates coordinate convolution and coordinate transformer modules to effectively capture both local and global feature information. Second, we introduced long skip connections through our designed attention block at each layer of the encoder-decoder structure to emphasize the field of view for fissures. Third, we added a distance map strategy to alleviate the challenge of training the network to suppress false positives arising from the complex textures in the lung. Fourth, we developed a multi-scale supervision strategy for independent prediction at various decoder levels, effectively integrating multi-scale semantic information to facilitate the segmentation of weak and abnormal fissures. Because of the lack of open-source inter-pulmonary fissure datasets, we collected 3D CT scans from 400 participants in a clinical trial and created a new high-quality dataset, the BMI dataset. Extensive experiments on this dataset demonstrated the clear superiority of our method over several state-of-the-art competitors. An ablation study also validated the effectiveness and robustness of each part of SC-Former.
|
|
11:30-11:35, Paper TuBT8.4 | Add to My Program |
Passive Bilateral Surgical Teleoperation with RCM and Spatial Constraints in the Presence of Time Delays |
|
Kastritsi, Theodora | Istituto Italiano Di Tecnologia |
Prapavesis Semetzidis, Theofanis | Aristotle University of Thessaloniki |
Doulgeri, Zoe | Aristotle University of Thessaloniki |
Keywords: Surgical Robotics: Laparoscopy, Telerobotics and Teleoperation, Physical Human-Robot Interaction, Passivity, Stability and Performance
Abstract: The primary issue in bilateral teleoperation setups is the existence of communication delays, which can destabilize the system. We address this challenge in the case of a bilateral leader-follower surgical setup, where the surgeon uses a haptic device as the leader robot to manipulate the surgical instrument held by a general-purpose manipulator, the follower robot. The follower robot is equipped with an elongated tool that passes through a small incision into the patient's body, where sensitive structures may exist. These structures may include organs, arteries, or veins that require protection during surgery. To address this challenge, we propose a bilateral control framework that is proven to maintain passivity, ensure bounded tracking errors between the leader and follower robots, and impose remote center of motion and spatial constraints related to the sensitive structures, all in the presence of constant and variable communication delays. Experimental results in a virtual intraoperative environment, using a point cloud of a kidney and its surrounding vessels, demonstrate the effectiveness of our control scheme under various communication delay scenarios.
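For background on why delays threaten passivity, the sketch below shows the classical wave-variable (scattering) transformation, a standard device for keeping a delayed bilateral channel passive. It is shown only for context under an assumed wave impedance b; it is not the constraint-based framework proposed in this paper.

```python
# Background sketch: classical wave-variable (scattering) transformation.
import numpy as np

def encode_wave(x_dot, force, b=50.0):
    """Encode velocity and force into forward/backward waves (impedance b)."""
    u = (b * x_dot + force) / np.sqrt(2.0 * b)
    v = (b * x_dot - force) / np.sqrt(2.0 * b)
    return u, v

def decode_wave(u, v, b=50.0):
    """Recover velocity and force from the wave pair."""
    x_dot = (u + v) / np.sqrt(2.0 * b)
    force = np.sqrt(b / 2.0) * (u - v)
    return x_dot, force

# the transmitted power F * x_dot equals (u^2 - v^2) / 2, so delaying u and v
# cannot inject energy into the channel: this is the passivity argument
x_dot, force = 0.02, 1.5
u, v = encode_wave(x_dot, force)
assert abs(force * x_dot - 0.5 * (u ** 2 - v ** 2)) < 1e-12
```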
|
|
TuBT9 Regular Session, 312 |
Add to My Program |
Motion Planning 2 |
|
|
Chair: Salzman, Oren | Technion |
Co-Chair: Kousik, Shreyas | Georgia Institute of Technology |
|
11:15-11:20, Paper TuBT9.1 | Add to My Program |
Direction Informed Trees (DIT*): Optimal Path Planning Via Direction Filter and Direction Cost Heuristic |
|
Zhang, Liding | Technical University of Munich |
Chen, Kejia | Technical University of Munich |
Cai, Kuanqi | Technical University of Munich |
Zhang, Yu | Technical University of Munich |
Dang, Yixuan | Technische Universität München |
Wu, Yansong | Technische Universität München |
Bing, Zhenshan | Technical University of Munich |
Wu, Fan | Technical University of Munich |
Haddadin, Sami | Technical University of Munich |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: Motion and Path Planning, Manipulation Planning, Task and Motion Planning
Abstract: Optimal path planning requires finding a series of feasible states from the starting point to the goal to optimize objectives. Popular path planning algorithms, such as Effort Informed Trees (EIT*), employ effort heuristics to guide the search. Effective heuristics are accurate and computationally efficient, but achieving both can be challenging due to their conflicting nature. This paper proposes Direction Informed Trees (DIT*), a sampling-based planner that focuses on optimizing the search direction for each edge, resulting in goal bias during exploration. We define edges as generalized vectors and integrate similarity indexes to establish a directional filter that selects the nearest neighbors and estimates direction costs. The estimated direction cost heuristics are utilized in edge evaluation. This strategy allows the exploration to share directional information efficiently. DIT* converges faster than existing single-query, sampling-based planners on tested problems in R^4 to R^16 and has been demonstrated in real-world environments with various planning tasks. A video showcasing our experimental results is available at: https://youtu.be/2SX6QT2NOek.
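The sketch below illustrates the general idea of a direction filter and a direction-weighted edge cost using cosine similarity. Thresholds, weights, and function names are assumptions for illustration, not the DIT* implementation.

```python
# Illustrative sketch: keep only neighbors whose edge direction roughly agrees
# with the goal direction, then rank them with a direction-inflated edge cost.
import numpy as np

def direction_filter(x, neighbors, goal, cos_threshold=0.3):
    """Discard neighbors whose edge points too far away from the goal."""
    to_goal = (goal - x) / (np.linalg.norm(goal - x) + 1e-9)
    kept = []
    for n in neighbors:
        edge = (n - x) / (np.linalg.norm(n - x) + 1e-9)
        if float(edge @ to_goal) >= cos_threshold:     # cosine similarity index
            kept.append(n)
    return kept

def direction_cost(x, n, goal, w_dir=0.5):
    """Edge length inflated when the edge points away from the goal."""
    to_goal = (goal - x) / (np.linalg.norm(goal - x) + 1e-9)
    edge = n - x
    length = np.linalg.norm(edge)
    cos_sim = float(edge @ to_goal) / (length + 1e-9)
    return length * (1.0 + w_dir * (1.0 - cos_sim))

x, goal = np.zeros(4), np.ones(4)                      # a state in R^4, as in the benchmarks
neighbors = [np.random.uniform(-1, 2, size=4) for _ in range(20)]
candidates = sorted(direction_filter(x, neighbors, goal),
                    key=lambda n: direction_cost(x, n, goal))
```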
|
|
11:20-11:25, Paper TuBT9.2 | Add to My Program |
Optimal Motion Planning for a Class of Dynamical Systems |
|
Rousseas, Panagiotis | National Technical University of Athens |
Bechlioulis, Charalampos | University of Patras |
Kyriakopoulos, Kostas | New York University - Abu Dhabi |
Keywords: Motion and Path Planning, Optimization and Optimal Control
Abstract: A novel method for optimal motion planning for a class of dynamical systems is proposed in this work. Our approach is based on the design of a provably safe and convergent actor structure, which is optimized via a policy iteration method. The proposed actor has wide applications, from control of mechanical systems to providing acceleration commands for more complex robotic platforms. Extra care is taken to provide theoretical guarantees, and the scheme is validated against an existing sampling-based planner.
|
|
11:25-11:30, Paper TuBT9.3 | Add to My Program |
Asymptotically-Optimal Multi-Query Path Planning for a Polygonal Robot |
|
Zhang, Duo | Rutgers University |
Ye, Zihe | Rutgers University |
Yu, Jingjin | Rutgers University |
Keywords: Motion and Path Planning, Constrained Motion Planning, Computational Geometry
Abstract: Shortest-path roadmaps, also known as reduced visibility graphs, provide a highly efficient multi-query method for computing optimal paths in two-dimensional environments. Combined with Minkowski sum computations, shortest-path roadmaps can compute optimal paths for a translating robot in 2D. In this study, we explore the intuitive idea of stacking up a set of reduced visibility graphs at different orientations for a polygonal holonomic robot to support the fast computation of near-optimal paths, allowing simultaneous 2D translation and rotation. The resulting algorithm, rotation-stacked visibility graph (RVG), is shown to be resolution-complete and asymptotically optimal. Extensive computational experiments show RVG significantly outperforms state-of-the-art single- and multi-query sampling-based methods on both computation time and solution optimality fronts.
|
|
11:30-11:35, Paper TuBT9.4 | Add to My Program |
Asymptotically Optimal Sampling-Based Motion Planning through Anytime Incremental Lazy Bidirectional Heuristic Search |
|
Wang, Yi | University of New Hampshire |
Mu, Bingxian | University of New Hampshire |
Salzman, Oren | Technion |
Keywords: Motion and Path Planning
Abstract: This paper introduces Bidirectional Lazy Informed Trees (BLIT*), the first algorithm to incorporate anytime incremental lazy bidirectional heuristic search (Bi-HS) into batch-wise sampling-based motion planning (Bw-SBMP). BLIT* operates on batches of informed states (states that can potentially improve the cost of the incumbent solution) structured as an implicit random geometric graph (RGG). The computational cost of collision detection is mitigated via a new lazy edge-evaluation strategy by focusing on states near obstacles. Experimental results, especially in high dimensions, show that BLIT* outperforms existing Bw-SBMP planners by efficiently finding an initial solution and effectively improving the quality as more computational resources are available.
|
|
11:35-11:40, Paper TuBT9.5 | Add to My Program |
Propagative Distance Optimization for Motion Planning |
|
Chen, Yu | Carnegie Mellon University |
Xu, Jinyun | Carnegie Mellon University |
Cai, Yilin | Georgia Institute of Technology |
Wong, Ting-Wei | Carnegie Mellon University |
Ren, Zhongqiang | Shanghai Jiao Tong University |
Choset, Howie | Carnegie Mellon University |
Shi, Guanya | Carnegie Mellon University |
Keywords: Motion and Path Planning, Constrained Motion Planning
Abstract: This paper focuses on the motion planning problem for serial articulated robots with revolute joints under kinematic constraints. Many motion planners leverage iterative local optimization methods but are often trapped in local minima due to non-convexity of the problem. A key reason for the non-convexity is the trigonometric term when parameterizing the kinematics using joint angles. Recent distance-based formulation can eliminate these trigonometric terms by formulating the kinematics based on distances, and has shown superior performance against classic joint angle based formulations in domains like inverse kinematics (IK). However, distance-based kinematics formulations have not yet been studied for motion planning, and naively applying them for motion planning may lead to poor computational efficiency. In particular, IK seeks one configuration while motion planning seeks a sequence of configurations, which greatly increases the scale of the underlying optimization problem. This paper proposes Propagative Distance Optimization for Motion Planning (PDOMP), which addresses the challenge by (i) introducing a new compact representation that reduces the number of variables in the distance-based formulation, and (ii) leveraging the chain structure to efficiently compute forward kinematics and Jacobians of the robot among waypoints along a path.
|
|
11:40-11:45, Paper TuBT9.6 | Add to My Program |
Dynamically Feasible Path Planning in Cluttered Environments Via Reachable Bézier Polytopes |
|
Csomay-Shanklin, Noel | California Institute of Technology |
Compton, William | California Institute of Technology |
Ames, Aaron | California Institute of Technology |
Keywords: Motion and Path Planning, Optimization and Optimal Control, Legged Robots
Abstract: The deployment of robotic systems in real world environments requires the ability to quickly produce paths through cluttered, non-convex spaces. These planned trajectories must be both kinematically feasible (i.e., collision free) and dynamically feasible (i.e., satisfy the underlying system dynamics), necessitating a consideration of both the free space and the dynamics of the robot in the path planning phase. In this work, we explore the application of reachable Bezier polytopes as an efficient tool for generating trajectories satisfying both kinematic and dynamic requirements. Furthermore, we demonstrate that by offloading specific computation tasks to the GPU, such an algorithm can meet tight real time requirements. We propose a layered control architecture that efficiently produces collision free and dynamically feasible paths for nonlinear control systems, and demonstrate the framework on the tasks of 3D hopping in a cluttered environment.
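The containment argument behind Bezier-based planning can be illustrated briefly: a Bezier curve lies in the convex hull of its control points, so checking a linear polytope constraint at the control points certifies the whole trajectory. The sketch below is a minimal, conservative version of that test, not the authors' GPU implementation.

```python
# Minimal sketch: certify that a Bezier trajectory stays inside a polytope
# {p : A p <= b} by checking only its control points (convex hull property).
import numpy as np

def bezier_point(ctrl, t):
    """De Casteljau evaluation of a Bezier curve at parameter t."""
    pts = np.array(ctrl, dtype=float)
    while len(pts) > 1:
        pts = (1 - t) * pts[:-1] + t * pts[1:]
    return pts[0]

def curve_in_polytope(ctrl, A, b):
    """Sufficient (conservative) containment test via the control points."""
    return bool(np.all(A @ np.asarray(ctrl, float).T <= b[:, None] + 1e-12))

# unit box polytope {p : -1 <= p_i <= 1} written as A p <= b
A = np.vstack([np.eye(3), -np.eye(3)])
b = np.ones(6)
ctrl = [np.array([-0.8, -0.5, 0.0]), np.array([0.0, 0.9, 0.2]), np.array([0.7, 0.1, -0.3])]
assert curve_in_polytope(ctrl, A, b)
assert np.all(np.abs(bezier_point(ctrl, 0.5)) <= 1.0)
```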
|
|
TuBT10 Regular Session, 313 |
Add to My Program |
Multi-Robot Systems 1 |
|
|
Chair: Quattrini Li, Alberto | Dartmouth College |
Co-Chair: Grosu, Radu | TU Wien |
|
11:15-11:20, Paper TuBT10.1 | Add to My Program |
CoPeD--Advancing Multi-Robot Collaborative Perception: A Comprehensive Dataset in Real-World Environments |
|
Zhou, Yang | New York University |
Quang, Long | U.S. DEVCOM Army Research Laboratory |
Nieto-Granda, Carlos | DEVCOM U.S. Army Research Laboratory |
Loianno, Giuseppe | New York University |
Keywords: Data Sets for Robotic Vision, Deep Learning for Visual Perception, Multi-Robot Systems
Abstract: In the past decade, single-robot perception has made significant advancements, yet multi-robot collaborative perception remains largely unexplored. It involves fusing compressed, intermittent, limited, heterogeneous, and asynchronous environmental information across multiple robots to enhance overall perception, despite challenges like sensor noise, occlusions, and sensor failures. One major hurdle has been the lack of real-world datasets. This paper presents a pioneering and comprehensive real-world multi-robot collaborative perception dataset to boost research in this area. Our dataset leverages the untapped potential of air-ground robot collaboration featuring distinct spatial viewpoints, complementary robot mobilities, coverage ranges, and sensor modalities. It features raw sensor inputs, pose estimation, and optional high-level perception annotation, thus accommodating diverse research interests. Compared to existing datasets predominantly designed for Simultaneous Localization and Mapping (SLAM), our setup ensures a diverse range and adequate overlap of sensor views to facilitate the study of multi-robot collaborative perception algorithms. We demonstrate the value of this dataset qualitatively through multiple collaborative perception tasks. We believe this work will unlock the potential research of high-level scene understanding through multi-modal collaborative perception in multi-robot settings.
|
|
11:20-11:25, Paper TuBT10.2 | Add to My Program |
Generalized Synchronized Active Learning for Multi-Agent-Based Data Selection on Mobile Robotic Systems |
|
Schmidt, Sebastian | BMW |
Stappen, Lukas | BMW Group Research and Technology |
Schwinn, Leo | Technical University Munich |
Günnemann, Stephan | Technical University of Munich |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Deep Learning Methods
Abstract: In mobile robotics, perception in uncontrolled environments like autonomous driving is a central hurdle. Existing active learning frameworks can help enhance perception by efficiently selecting data samples for labeling but are often constrained by the necessity of full data availability in data centers, hindering real-time on-field adaptations. To address this, our work unveils a novel active learning formulation optimized for multi-robot settings. It harnesses the collaborative power of several robotic agents, considerably enhancing data acquisition and synchronization processes. Experimental evidence indicates that our approach markedly surpasses traditional active learning frameworks by up to 2.5 percentage points with 90% fewer data uploads, delivering new possibilities for advancements in the realms of mobile robotics and autonomous systems.
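A minimal sketch of the selection step such a framework relies on is shown below: each robot scores locally collected frames by predictive entropy and uploads only the most informative ones under a fixed budget. The scoring rule and budget are illustrative assumptions, not the paper's synchronized multi-agent method.

```python
# Illustrative sketch: entropy-based on-robot data selection under an upload budget.
import numpy as np

def predictive_entropy(probs):
    """probs: (N, C) softmax outputs for N candidate frames."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def select_for_upload(probs, budget):
    """Indices of the most uncertain frames, limited to the upload budget."""
    scores = predictive_entropy(probs)
    return np.argsort(scores)[::-1][:budget]

rng = np.random.default_rng(1)
probs = rng.dirichlet(alpha=np.ones(10), size=500)   # fake softmax outputs for 500 frames
upload_idx = select_for_upload(probs, budget=50)     # the other 90% of frames stay on the robot
```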
|
|
11:25-11:30, Paper TuBT10.3 | Add to My Program |
Scenario-Based Curriculum Generation for Multi-Agent Driving |
|
Brunnbauer, Axel | TU Wien |
Berducci, Luigi | TU Wien |
Priller, Peter | AVL List GmbH |
Nickovic, Dejan | AIT Austrian Institute of Technology |
Grosu, Radu | TU Wien |
Keywords: Reinforcement Learning, Intelligent Transportation Systems, Software Tools for Benchmarking and Reproducibility
Abstract: The automated generation of diversified training scenarios has been an important ingredient in many complex learning tasks, especially in real-world application domains such as autonomous driving, where auto-curriculum generation is considered vital for obtaining robust and general policies. However, crafting traffic scenarios with multiple, heterogeneous agents is typically considered a tedious and time-consuming task, especially in more complex simulation environments. To this end, we introduce MATS-Gym, a multi-agent training framework for autonomous driving that uses partial-scenario specifications to generate traffic scenarios with a variable number of agents which are executed in CARLA, a high-fidelity driving simulator. MATS-Gym reconciles scenario execution engines, such as Scenic and ScenarioRunner, with established multi-agent training frameworks where the interaction between the environment and the agents is modeled as a partially observable stochastic game. Furthermore, we integrate MATS-Gym with techniques from unsupervised environment design to automate the generation of adaptive auto-curricula, which is the first application of such algorithms to the domain of autonomous driving. The code is available at https://github.com/AutonomousDrivingExaminer/mats-gym.
|
|
11:30-11:35, Paper TuBT10.4 | Add to My Program |
ZeroCAP: Zero-Shot Multi-Robot Context Aware Pattern Formation Via Large Language Models |
|
Venkatesh, L.N Vishnunandan | Purdue University |
Min, Byung-Cheol | Purdue University |
Keywords: Learning Categories and Concepts, Multi-Robot Systems, Planning, Scheduling and Coordination
Abstract: Incorporating language comprehension into robotic operations unlocks significant advancements in robotics, but also presents distinct challenges, particularly in executing spatially oriented tasks like pattern formation. This paper introduces ZeroCAP, a novel system that integrates large language models with multi-robot systems for zero-shot context-aware pattern formation. Grounded in the principles of language-conditioned robotics, ZeroCAP leverages the interpretative power of language models to translate natural language instructions into actionable robotic configurations. This approach combines vision-language models, cutting-edge segmentation techniques, and shape descriptors, enabling the realization of complex, context-driven pattern formations in the realm of multi-robot coordination. Through extensive experiments, we demonstrate the system's proficiency in executing complex context-aware pattern formations across a spectrum of tasks, from surrounding and caging objects to infilling regions. This not only validates the system's capability to interpret and implement intricate context-driven tasks but also underscores its adaptability and effectiveness across varied environments and scenarios. The experimental videos and additional information about this work can be found at https://sites.google.com/view/zerocap/home.
|
|
11:35-11:40, Paper TuBT10.5 | Add to My Program |
Multi-Robot Collaboration through Reinforcement Learning and Abstract Simulation |
|
Labiosa, Adam | University of Wisconsin-Madison |
Hanna, Josiah | University of Wisconsin -- Madison |
Keywords: Reinforcement Learning, Cooperating Robots, Machine Learning for Robot Control
Abstract: Teams of people coordinate to perform complex tasks by forming abstract mental models of world and agent dynamics. The use of abstract models contrasts with much recent work in robot learning that uses a high-fidelity simulator and reinforcement learning (RL) to obtain policies for physical robots. Motivated by this difference, we investigate the extent to which so-called abstract simulators can be used for multi-agent reinforcement learning (MARL) and the resulting policies successfully deployed on teams of physical robots. An abstract simulator models the robot's target task at a high level of abstraction and discards many details of the world that could impact optimal decision-making. Policies are trained in an abstract simulator and then transferred to the physical robot by making use of separately obtained low-level perception and motion control modules. We identify three key categories of modifications to the abstract simulator that enable policy transfer to physical robots: simulation fidelity enhancements, training optimizations, and simulation stochasticity. We then run an empirical study with extensive ablations to determine the value of each modification category for enabling policy transfer in cooperative robot soccer tasks. We also compare the performance of policies produced by our method with a well-tuned non-learning-based behavior architecture from the annual RoboCup competition and find that our approach leads to a similar level of performance. Broadly, we show that MARL can be used to train cooperative physical robot behaviors using highly abstract models of the world.
|
|
11:40-11:45, Paper TuBT10.6 | Add to My Program |
Graph-Based Decentralized Task Allocation for Multi-Robot Target Localization |
|
Peng, Juntong | Purdue University |
Viswanath, Hrishikesh | Purdue University |
Bera, Aniket | Purdue University |
Keywords: Machine Learning for Robot Control, Deep Learning Methods, Constrained Motion Planning
Abstract: We introduce a new graph neural operator-based approach for task allocation in a system of heterogeneous robots composed of Unmanned Ground Vehicles (UGVs) and Unmanned Aerial Vehicles (UAVs). The proposed model, GATAR, or Graph Attention Task AllocatoR, aggregates information from neighbors in the multi-robot system, with the aim of achieving globally optimal target localization. Being decentralized, our method is highly robust and adaptable to situations where the number of robots and the number of tasks may change over time. We also propose a heterogeneity-aware preprocessing technique to model the heterogeneity of the system. The experimental results demonstrate the effectiveness and scalability of the proposed approach in a range of simulated scenarios generated by varying the number of UGVs and UAVs and the number and location of the targets. We show that a single model can handle a heterogeneous robot team with the number of robots ranging between 2 and 12, while outperforming the baseline architectures.
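Below is a minimal sketch of attention-weighted neighbor aggregation, the core operation behind graph-attention task allocators of this kind. Dimensions, the scoring nonlinearity, and variable names are assumptions for illustration, not the GATAR architecture.

```python
# Illustrative sketch: decentralized graph-attention aggregation over robot neighbors.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_aggregate(h, adj, W, a):
    """h: (N, F) robot features, adj: (N, N) adjacency, W: (F, Fp), a: (2*Fp,)."""
    z = h @ W                                           # projected features
    out = np.zeros_like(z)
    for i in range(len(h)):
        nbrs = np.flatnonzero(adj[i])                   # each robot uses only local neighbors
        scores = np.array([a @ np.concatenate([z[i], z[j]]) for j in nbrs])
        scores = np.where(scores > 0, scores, 0.2 * scores)   # LeakyReLU, as in standard graph attention
        alpha = softmax(scores)                          # attention weights over neighbors
        out[i] = alpha @ z[nbrs]
    return out

rng = np.random.default_rng(0)
N, F, Fp = 6, 8, 16                                      # e.g. 6 robots (UGVs and UAVs mixed)
h = rng.normal(size=(N, F))
adj = (rng.random((N, N)) > 0.5).astype(float)
np.fill_diagonal(adj, 1.0)
W = rng.normal(size=(F, Fp))
a = rng.normal(size=2 * Fp)
agg_features = attention_aggregate(h, adj, W, a)         # per-robot features for task scoring
```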
|
|
TuBT11 Regular Session, 314 |
Add to My Program |
Human-Robot Interaction 1 |
|
|
Chair: Del Bue, Alessio | Istituto Italiano Di Tecnologia |
Co-Chair: Choi, Sungjoon | Korea University |
|
11:15-11:20, Paper TuBT11.1 | Add to My Program |
Towards Embedding Dynamic Personas in Interactive Robots: Masquerading Animated Social Kinematic (MASK) |
|
Park, Jeongeun | Korea University |
Jeong, Taemoon | Korea University |
Kim, Hyeonseong | Korea University |
Byun, Taehyun | Korea University |
Shin, Seungyoun | Korea University |
Choi, Keunjun | Rainbow Robotics |
Kwon, Jaewoon | NAVER LABS |
Lee, Taeyoon | Boston Dynamics AI Institute |
Pan, Matthew | Queen's University |
Choi, Sungjoon | Korea University |
Keywords: Social HRI, Gesture, Posture and Facial Expressions, Design and Human Factors
Abstract: This paper presents the design and development of an innovative interactive robotic system to enhance audience engagement using character-like personas. Built upon the foundations of persona-driven dialog agents, this work extends the agent’s application to the physical realm, employing robots to provide a more captivating and interactive experience. The proposed system, named the Masquerading Animated Social Kinematic (MASK), leverages an anthropomorphic robot which interacts with guests using non-verbal interactions, including facial expressions and gestures. A behavior generation system based upon a finite-state machine structure effectively conditions robotic behavior to convey distinct personas. The MASK framework integrates a perception engine, a behavior selection engine, and a comprehensive action library to enable real-time, dynamic interactions with minimal human intervention in behavior design. Throughout the user subject studies, we examined whether the users could recognize the intended character in both personality- and film-character-based persona conditions. We conclude by discussing the role of personas in interactive agents and the factors to consider for creating an engaging user experience.
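To illustrate the kind of structure a finite-state behavior generator provides, here is a toy persona-conditioned state machine. States, events, and action names are invented for the example and are not taken from MASK.

```python
# Toy sketch: finite-state machine whose action library is conditioned on a persona.
import random

class PersonaFSM:
    def __init__(self, persona):
        self.persona = persona              # e.g. "cheerful" or "grumpy" (illustrative)
        self.state = "idle"
        # (state, perceived event) -> next state
        self.transitions = {
            ("idle", "person_detected"): "greeting",
            ("greeting", "person_left"): "idle",
            ("greeting", "person_close"): "engaged",
            ("engaged", "person_left"): "idle",
        }
        # persona-specific non-verbal action library per state
        self.actions = {
            "cheerful": {"greeting": ["wave_big", "smile"], "engaged": ["nod", "lean_in"]},
            "grumpy":   {"greeting": ["glance"],            "engaged": ["shrug"]},
        }

    def step(self, event):
        self.state = self.transitions.get((self.state, event), self.state)
        choices = self.actions[self.persona].get(self.state, ["stay_still"])
        return random.choice(choices)       # behavior command sent to the robot

robot = PersonaFSM("cheerful")
print(robot.step("person_detected"))        # e.g. "wave_big"
```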
|
|
11:20-11:25, Paper TuBT11.2 | Add to My Program |
Simultaneous Dialogue Services Using Multiple Semiautonomous Robots in Multiple Locations by a Single Operator: A Field Trial on Souvenir Recommendation |
|
Sakai, Kazuki | Osaka University |
Kawata, Megumi | Osaka University |
Meneses, Alexis | Osaka University |
Ishiguro, Hiroshi | Osaka University |
Yoshikawa, Yuichiro | Osaka University |
Keywords: Social HRI, Human-Robot Collaboration
Abstract: Recently, systems have emerged enabling a single operator to engage with users across multiple locations simultaneously. However, under such systems, a potential challenge exists where the operator, upon switching locations, may need to join ongoing conversations without a complete understanding of their history. Consequently, a seamless transition and the development of high-quality conversations may be impeded. This study directs its attention to the utilization of multiple robots, aiming to create a semiautonomous teleoperation system. This system enables an operator to switch between twin robots at each location as needed, thereby facilitating the provision of higher-quality dialogue services simultaneously. As an initial phase, a field experiment was conducted to assess user satisfaction with recommendations made by the operator using twin robots. Results collected from 391 participants over 13 days revealed heightened user satisfaction when the operator intervened and provided recommendations through multiple robots compared with autonomous recommendations by the robots. These findings contribute significantly to the formulation of a teleoperation system that allows a single operator to deliver multipoint conversational services.
|
|
11:25-11:30, Paper TuBT11.3 | Add to My Program |
Safety and Naturalness Perceptions of Robot-To-Human Handovers Performed by Data-Driven Robotic Mimicry of Human Givers |
|
Megyeri, Ava | Wright State University |
Wiederhold, Noah | Clarkson University |
Liu, Yu | Clarkson University |
Banerjee, Sean | Wright State University |
Banerjee, Natasha Kholgade | Wright State University |
Keywords: Human-Centered Robotics, Human-Robot Collaboration, Physical Human-Robot Interaction
Abstract: We study human perceptions of a robot that performs robot-to-human (R2H) handovers controlled to grasp, transport, and transfer 34 objects by mimicking human givers in human-human (H2H) handover data. Recognizing the importance of human-like robotic behavior for successful collaboration, R2H studies use models of human behavior or observations of H2H data to plan robot giver motion. However, R2H studies have been limited in object counts. In this work, we use the Human-Object-Human (HOH) dataset, consisting of H2H interactions performed by 20 giver-receiver pairs with 136 objects, to conduct an R2H study with 34 objects. We teleoperate a Kinova Gen3 manipulator to grip an object as grasped by an HOH human giver, and program it to automatically transport and orient the object to a participant by mimicking the HOH giver's trajectory and transfer pose. We survey participants on safety, naturalness, and preferred choice over linear trajectory and random orientation baselines. We find that transfer pose influences perceptions of naturalness, with HOH poses showing higher naturalness ratings. Participants prefer handovers with HOH end poses when asked to pick their preferred interaction.
|
|
11:30-11:35, Paper TuBT11.4 | Add to My Program |
Integrating Human-Robot Teaming Dynamics into Mission Planning Tools for Transparent Tactics in Multi-Robot Human Integrated Teams |
|
Aldridge, Audrey L. | Mississippi State University |
Errico, Tyler | United States Military Academy |
Morrell, Mitchell | United States Military Academy |
Bethel, Cindy L. | Mississippi State University |
James, John | United States Military Academy |
Chewar, Christa | United States Military Academy |
Novitzky, Michael | United States Military Academy |
Keywords: Human-Robot Teaming, Integrated Planning and Control, Human-Robot Collaboration
Abstract: This research aims to demonstrate how integrating human-robot teaming dynamics into mission planning tools impacts the abilities of robot operators as they coordinate multiple robot agents during a mission. This was investigated in a pilot study using two inter-robot collaboration modalities and interface tools, which required different human-robot interaction techniques to execute a mission with a team of four robots. In the first modality, the operator manually inserted waypoints for each robot, as they acted as individual agents. In the second modality, the operator used the Planning Execution to After-Action Review (PETAAR) toolset to plot a single waypoint for the team of robots, as the robots coordinated their movement as a group. One novel component of this study is the investigation of how human-robot teaming dynamics and the PETAAR toolset impacted robot operators' real-time situation awareness and perceived cognitive load as well as team performance. Although the teaming modalities differed greatly with respect to the level of operator input needed, the time required to complete the simulation, the participant’s perceived cognitive load, and interface usability were very similar for both modalities. In contrast, the results revealed statistically significant differences between the two teaming modalities related to participants’ abilities to maintain a wedge formation while remaining situationally aware. Results from this work will be used to guide development of PETAAR along with the design of future studies investigating more complex teaming scenarios and for creating a baseline for comparing future results.
|
|
11:35-11:40, Paper TuBT11.5 | Add to My Program |
XBG: End-To-End Imitation Learning for Autonomous Behaviour in Human-Robot Interaction and Collaboration |
|
Cardenas Perez, Carlos Andres | Italian Institute of Technology |
Romualdi, Giulio | Istituto Italiano Di Tecnologia |
Elobaid, Mohamed | Fondazione Istituto Italiano Di Tecnologia |
Dafarra, Stefano | Istituto Italiano Di Tecnologia |
L'Erario, Giuseppe | Istituto Italiano Di Tecnologia |
Traversaro, Silvio | Istituto Italiano Di Tecnologia |
Morerio, Pietro | Istituto Italiano Di Tecnologia |
Del Bue, Alessio | Istituto Italiano Di Tecnologia |
Pucci, Daniele | Italian Institute of Technology |
Keywords: AI-Enabled Robotics, Imitation Learning, Humanoid Robot Systems
Abstract: This paper presents XBG (eXteroceptive Behaviour Generation), a multimodal end-to-end Imitation Learning (IL) system for a whole-body autonomous humanoid robot used in real-world Human-Robot Interaction (HRI) scenarios. The main contribution is an architecture for learning HRI behaviours using a data-driven approach. A diverse dataset is collected via teleoperation, covering multiple HRI scenarios, such as handshaking, handwaving, payload reception, walking, and walking with a payload. After synchronizing, filtering, and transforming the data, Deep Neural Networks (DNN) are trained, integrating exteroceptive and proprioceptive information to help the robot understand both its environment and its actions. The robot takes in sequences of images (RGB and depth) and joint state information to react accordingly. By fusing multimodal signals over time, the model enables autonomous capabilities in a robotic platform. The models are evaluated based on the success rates in the mentioned HRI scenarios, and they are deployed on the ergoCub humanoid robot. XBG achieves success rates between 60% and 100% even when tested in unseen environments.
|
|
TuBT12 Regular Session, 315 |
Add to My Program |
Calibration 2 |
|
|
Chair: Yip, Michael C. | University of California, San Diego |
Co-Chair: Hwang, Hyoseok | Kyung Hee University |
|
11:15-11:20, Paper TuBT12.1 | Add to My Program |
CoL3D: Collaborative Learning of Single-View Depth and Camera Intrinsics for Metric 3D Shape Recovery |
|
Zhang, Chenghao | Alibaba Cloud |
Fan, Lubin | Alibaba Cloud |
Cao, Shen | Alibaba Cloud |
Wu, Bojian | Independent Researcher |
Ye, Jieping | Alibaba Cloud |
Keywords: Deep Learning for Visual Perception, RGB-D Perception, Calibration and Identification
Abstract: Recovering the metric 3D shape from a single image is particularly relevant for robotics and embodied intelligence applications, where accurate spatial understanding is crucial for navigation and interaction with environments. Usually, the mainstream approaches achieve it through monocular depth estimation. However, without camera intrinsics, the 3D metric shape can not be recovered from depth alone. In this study, we theoretically demonstrate that depth serves as a 3D prior constraint for estimating camera intrinsics and uncover the reciprocal relations between these two elements. Motivated by this, we propose a collaborative learning framework for jointly estimating depth and camera intrinsics, named CoL3D, to learn metric 3D shapes from single images. Specifically, CoL3D adopts a unified network and performs collaborative optimization at three levels: depth, camera intrinsics, and 3D point clouds. For camera intrinsics, we design a canonical incidence field mechanism as a prior that enables the model to learn the residual incident field for enhanced calibration. Additionally, we incorporate a shape similarity measurement loss in the point cloud space, which improves the quality of 3D shapes essential for robotic applications. As a result, when training and testing on a single dataset with in-domain settings, CoL3D delivers outstanding performance in both depth estimation and camera calibration across several indoor and outdoor benchmark datasets, which leads to remarkable 3D shape quality for the perception capabilities of robots.
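The dependence of metric shape on intrinsics can be seen directly in the standard pinhole back-projection, sketched below: converting a depth map into a 3D point cloud requires the focal lengths and principal point, so a wrong calibration rescales the recovered geometry. This is textbook geometry for context, not the CoL3D network itself; the numbers are illustrative.

```python
# Sketch: back-projecting a metric depth map into a point cloud needs intrinsics.
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """depth: (H, W) metric depth map -> (H*W, 3) points in the camera frame."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    X = (u - cx) / fx * depth
    Y = (v - cy) / fy * depth
    return np.stack([X, Y, depth], axis=-1).reshape(-1, 3)

depth = np.full((480, 640), 2.0)                         # a flat wall 2 m away
cloud = backproject(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
# a wrong focal length rescales X and Y, distorting the recovered 3D shape
wrong = backproject(depth, fx=400.0, fy=400.0, cx=319.5, cy=239.5)
```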
|
|
11:20-11:25, Paper TuBT12.2 | Add to My Program |
Non-Destructive 3D Root Structure Modeling |
|
Lu, Guoyu | University of Georgia |
Keywords: Deep Learning for Visual Perception, Visual Learning, Sensor Fusion
Abstract: Deep neural networks (DNNs) have gained significant attention in 3D object reconstruction. However, detecting and reconstructing hidden or buried objects underground remains a challenging task. Ground Penetrating Radar (GPR) has emerged as a cost-effective and non-destructive technology for subsurface object detection, including soil structures, pipelines, and plant roots. In this study, we present a deep convolutional neural network-based method for detecting target signals and performing curve parameter regression using multiple B-scans from GPR data. By leveraging the detection and regression outcomes, we further generate fitted curves that represent underground structures. To reconstruct a comprehensive and detailed 3D root structure, we design a shape reconstruction network that takes sparse sliced 3D points as input. The proposed approach is extensively trained and validated using synthetic 3D root datasets and simulated GPR data generated with gprMax. Additionally, the trained model demonstrates strong generalization capabilities when applied to real-world GPR data, ensuring its practical applicability.
|
|
11:25-11:30, Paper TuBT12.3 | Add to My Program |
PTZ-Calib: Robust Pan-Tilt-Zoom Camera Calibration |
|
Guo, Jinhui | Alibaba Cloud |
Fan, Lubin | Alibaba Cloud |
Wu, Bojian | Independent Researcher |
Gu, Jiaqi | Alibaba Cloud |
Cao, Shen | Alibaba Cloud |
Ye, Jieping | Alibaba Cloud |
Keywords: Surveillance Robotic Systems, Calibration and Identification, SLAM
Abstract: In this paper, we present PTZ-Calib, a robust two-stage PTZ camera calibration method, that efficiently and accurately estimates camera parameters for arbitrary viewpoints. Our method includes an offline and an online stage. In the offline stage, we first uniformly select a set of reference images that sufficiently overlap to encompass a complete 360° view. We then utilize the novel PTZ-IBA (PTZ Incremental Bundle Adjustment) algorithm to automatically calibrate the cameras within a local coordinate system. Additionally, for practical application, we can further optimize camera parameters and align them with the geographic coordinate system using extra global reference 3D information. In the online stage, we formulate the calibration of any new viewpoints as a relocalization problem. Our approach balances the accuracy and computational efficiency to meet real-world demands. Extensive evaluations demonstrate our robustness and superior performance over state-of-the-art methods on various real and synthetic datasets.
|
|
11:30-11:35, Paper TuBT12.4 | Add to My Program |
CtRNet-X: Camera-To-Robot Pose Estimation in Real-World Conditions Using a Single Camera |
|
Lu, Jingpei | University of California San Diego |
Liang, Zekai | Univeristy of California, San Diego |
Xie, Tristin | University of California San Diego |
Richter, Florian | University of California, San Diego |
Lin, Shan | Arizona State University |
Liu, Sainan | Intel |
Yip, Michael C. | University of California, San Diego |
Keywords: Visual Tracking, Perception for Grasping and Manipulation, Computer Vision for Automation
Abstract: Camera-to-robot calibration is crucial for vision-based robot control and requires effort to make it accurate. Recent advancements in markerless pose estimation methods have eliminated the need for time-consuming physical setups for camera-to-robot calibration. While the existing markerless pose estimation methods have demonstrated impressive accuracy without the need for cumbersome setups, they rely on the assumption that all the robot joints are visible within the camera's field of view. However, in practice robots usually move in and out of view and some portion of the robot may stay out-of-frame during the whole manipulation task due to real-world constraints, leading to a lack of sufficient visual features and subsequent failure of these approaches. To address this challenge and enhance the applicability to vision-based robot control, we propose a novel framework capable of estimating the robot pose with partially visible robot manipulators. Our approach leverages the Vision-Language Models for fine-grained robot components detection, and integrates it into a keypoint-based pose estimation network, which enables more robust performance in varied operational conditions. The framework is evaluated on both public robot datasets and self-collected partial-view datasets to demonstrate our robustness and generalizability. As a result, this method is effective for robot pose estimation in a wider range of real-world manipulation scenarios.
|
|
11:35-11:40, Paper TuBT12.5 | Add to My Program |
Foundation Feature-Driven Online End-Effector Pose Estimation: A Marker-Free and Learning-Free Approach |
|
Wu, Tianshu | Peking University |
Zhang, Jiyao | Peking University |
Liang, Sheldon | Carnegie Mellon University, Peking University |
Han, Zhengxiao | Northwestern University |
Dong, Hao | Peking University |
Keywords: Visual Servoing, Calibration and Identification, Visual Tracking
Abstract: Accurate transformation estimation between camera space and robot space is essential. Traditional methods using markers for hand-eye calibration require offline image collection, limiting their suitability for online self-calibration. Recent learning-based robot pose estimation methods, while advancing online calibration, struggle with cross-robot generalization and require the robot to be fully visible. This work proposes a Foundation feature-driven online End-Effector Pose Estimation (FEEPE) algorithm, characterized by its training-free and cross-end-effector generalization capabilities. Inspired by the zero-shot generalization capabilities of foundation models, FEEPE leverages pre-trained visual features to estimate 2D-3D correspondences derived from CAD models and reference images, enabling 6D pose estimation via the PnP algorithm. To resolve ambiguities from partial observations and symmetry, a multihistorical key frame enhanced pose optimization algorithm is introduced, utilizing temporal information for improved accuracy. Compared to traditional hand-eye calibration, FEEPE enables marker-free online calibration. Unlike robot pose estimation, it generalizes across robots and end-effectors in a training-free manner. Extensive experiments demonstrate its superior flexibility, generalization, and performance.
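The final pose-recovery step described above can be illustrated in isolation: given 2D-3D correspondences (however they were obtained), PnP yields the 6D pose. The sketch below uses synthetic correspondences and does not reproduce the foundation-feature matching stage; intrinsics and points are made up for the example.

```python
# Sketch: recover a 6D pose from 2D-3D correspondences with PnP (OpenCV).
import numpy as np
import cv2

K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0,   0.0,   1.0]])
# CAD-frame 3D points (here: corners of a small box, purely illustrative)
obj_pts = np.array([[0, 0, 0], [0.1, 0, 0], [0, 0.1, 0], [0, 0, 0.1],
                    [0.1, 0.1, 0], [0.1, 0, 0.1], [0, 0.1, 0.1], [0.1, 0.1, 0.1]],
                   dtype=np.float64)
# synthesize image observations from a known ground-truth pose
R_gt = cv2.Rodrigues(np.array([0.2, -0.1, 0.3]))[0]
t_gt = np.array([[0.05], [0.02], [0.6]])
proj = (K @ (R_gt @ obj_pts.T + t_gt)).T
img_pts = (proj[:, :2] / proj[:, 2:]).astype(np.float64)

ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, None)
print(ok, tvec.ravel())   # tvec recovers t_gt up to numerical precision
```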
|
|
11:40-11:45, Paper TuBT12.6 | Add to My Program |
Camera-LiDAR Extrinsic Calibration Using Constrained Optimization with Circle Placement |
|
Kim, Daeho | Kyung Hee University |
Shin, Seunghui | Kyung Hee University |
Hwang, Hyoseok | Kyung Hee University |
Keywords: Calibration and Identification, Sensor Fusion, Intelligent Transportation Systems
Abstract: Monocular camera-LiDAR data fusion has demonstrated remarkable environmental perception capabilities in various fields. The success of data fusion relies on the accurate matching of correspondence features from images and point clouds. In this letter, we propose a target-based camera-LiDAR extrinsic calibration method that matches correspondences in both modalities. Specifically, to extract accurate features from the point cloud, we propose a novel method that estimates the circle centers by optimizing their probability distribution from an initial position. This optimization involves generating the probability distribution of circle centers from circle edge points and using the Lagrangian multiplier method to estimate the optimal positions of the circle centers. We conduct two types of experiments: simulations for quantitative results and real-system evaluations for qualitative assessment. Our method demonstrates a 21% improvement in simulation calibration performance for 20 target poses with LiDAR noise of 0.03 m compared to existing methods, and also shows high visual quality when reprojecting the point cloud onto images in real-world scenarios. Codes are available at https://github.com/AIRLABkhu/SquareCalib.
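A simplified, 2D version of constrained circle-center estimation is sketched below: with the target radius known, the center is found by minimizing the radial residual of noisy edge points. This mirrors the spirit of the constrained optimization described above but is not the paper's exact probabilistic formulation; the noise level and geometry are illustrative.

```python
# Sketch: estimate a circle center from noisy LiDAR edge points with a known radius.
import numpy as np
from scipy.optimize import minimize

def fit_circle_center(edge_pts, radius, c0):
    """edge_pts: (N, 2) points on the circular target boundary; c0: initial guess."""
    def residual(c):
        d = np.linalg.norm(edge_pts - c, axis=1)
        return np.sum((d - radius) ** 2)
    return minimize(residual, c0, method="BFGS").x

rng = np.random.default_rng(0)
true_center, radius = np.array([1.2, -0.4]), 0.15
theta = rng.uniform(0, 2 * np.pi, 120)
pts = true_center + radius * np.stack([np.cos(theta), np.sin(theta)], axis=1)
pts += rng.normal(scale=0.03, size=pts.shape)            # range noise of ~0.03 m, as in the evaluation
center = fit_circle_center(pts, radius, c0=pts.mean(axis=0))
```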
|
|
TuBT13 Regular Session, 316 |
Add to My Program |
Assistive Robotics 2 |
|
|
Chair: Gregg, Robert D. | University of Michigan |
Co-Chair: Ha, Sehoon | Georgia Institute of Technology |
|
11:15-11:20, Paper TuBT13.1 | Add to My Program |
A Modeling and Control Strategy for the Gaze-Guided Teleoperation of Robotic Manipulators Via Smart Glasses |
|
Lawson, Andrew | University of North Carolina Wilmington |
Saeidi, Hamed | University of North Carolina Wilmington |
Keywords: Human-Centered Robotics, Human Performance Augmentation, Telerobotics and Teleoperation
Abstract: Object manipulation is a high-frequency task required in assistive robotic systems in order to aid the elderly or those with disabilities that impact motor control. In the instance where arms cannot be used to command a robot, gaze-tracking via smart glasses is a suitable candidate. In this work, we develop a modeling method and model-based filtering and control strategy for direct gaze-guided teleoperation of robotic manipulators. We demonstrate the feasibility of this control strategy in an object manipulation case study with six participants. The results indicate that a model-based gaze filtering and control strategy produces smooth commands for the robot that are easy for the participants to use. These methods can reduce the perceived workload of the user by 37.51% and lower the gripper positioning error by 39.09% compared to using unfiltered gaze data.
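The sketch below shows the kind of filtering such an interface needs: raw gaze fixations are noisy, so the gaze target is low-pass filtered and the commanded gripper velocity is saturated to produce smooth motion. This is a generic, hedged illustration; the gains, limits, and filter structure are assumptions, not the paper's model.

```python
# Sketch: smooth a noisy gaze target and convert it into a rate-limited gripper command.
import numpy as np

def filter_gaze(raw_gaze, state, alpha=0.1):
    """First-order low-pass filter on the 3D gaze target."""
    return (1 - alpha) * state + alpha * raw_gaze

def gripper_command(target, gripper_pos, kp=1.0, v_max=0.05, dt=0.02):
    """Proportional command toward the filtered target, with a velocity limit."""
    v = kp * (target - gripper_pos)
    speed = np.linalg.norm(v)
    if speed > v_max:
        v *= v_max / speed
    return gripper_pos + v * dt

target = np.zeros(3)
gripper = np.array([0.4, 0.0, 0.2])
for _ in range(500):                                      # 10 s at 50 Hz
    raw = np.array([0.6, 0.1, 0.3]) + np.random.normal(scale=0.02, size=3)  # noisy gaze point
    target = filter_gaze(raw, target)
    gripper = gripper_command(target, gripper)
```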
|
|
11:20-11:25, Paper TuBT13.2 | Add to My Program |
Unlocking Potential: Gaze-Based Interfaces in Assistive Robotics for Users with Severe Speech and Motor Impairment |
|
Vishwakarma, Himanshu | Indian Institute of Science |
Mitra, Mukund | IISc Bangalore |
Vinay Krishna Sharma, Vinay Krishna | Indian Institute of Science |
Sulthan, Jabeen | IIT Kanpur |
Atulkar, Aniruddha | Indian Institute of Science Bangalore |
Bhathad, Dinesh | Indiana University |
Biswas, Pradipta | Indian Institute of Science |
Keywords: Virtual Reality and Interfaces, Human-Robot Collaboration, Product Design, Development and Prototyping
Abstract: Individuals with Severe Speech and Motor Impairment (SSMI) struggle to interact with their surroundings due to physical and communicative limitations. To address these challenges, this paper presents a gaze-controlled robotic system that helps SSMI users perform stamp printing tasks. The system includes gaze-controlled interfaces and a robotic arm with a gripper, designed specifically for SSMI users to enhance accessibility and interaction. User studies with gaze-controlled interfaces such as video see-through (VST), video pass-through (VPT), and optical see-through (OST) displays demonstrated the system's effectiveness. Results showed that VST had an average stamping time of 28.45 s (SD = 15.44 s) and an average stamp count of 7.36 (SD = 3.83), outperforming VPT and OST.
|
|
11:25-11:30, Paper TuBT13.3 | Add to My Program |
Do Looks Matter? Exploring Functional and Aesthetic Design Preferences for a Robotic Guide Dog |
|
Cohav, Aviv | Georgia Institute of Technology |
Gong, Xinran | Georgia Institute of Technology |
Kim, Joanne Taery | Georgia Institute of Technology |
Zeagler, Clint | Georgia Tech |
Ha, Sehoon | Georgia Institute of Technology |
Walker, Bruce | Georgia Tech |
Keywords: Design and Human Factors, Human-Centered Robotics, Physically Assistive Devices
Abstract: Dog guides offer an effective mobility solution for blind or visually impaired (BVI) individuals, but conventional dog guides have limitations including the need for care, potential distractions, societal prejudice, high costs, and limited availability. To address these challenges, we seek to develop a robot dog guide capable of performing the tasks of a conventional dog guide, enhanced with additional features. In this work, we focus on design research to identify functional and aesthetic design concepts to implement into a quadrupedal robot. The aesthetic design remains relevant even for BVI users due to their sensitivity toward societal perceptions and the need for smooth integration into society. We collected data through interviews and surveys to answer specific design questions pertaining to the appearance, texture, features, and method of controlling and communicating with the robot. Our study identified essential and preferred features for a future robot dog guide, which are supported by relevant statistics aligning with each suggestion. These findings will inform the future development of user-centered designs to effectively meet the needs of BVI individuals.
|
|
11:30-11:35, Paper TuBT13.4 | Add to My Program |
Comparison of Three Interface Approaches for Gaze Control of Assistive Robots for Individuals with Tetraplegia |
|
Nunez Sardinha, Emanuel | Bristol Robotics Lab, University of the West of England |
Zook, Nancy | University of the West of England |
Ruiz Garate, Virginia | University of Mondragon |
Western, David | University of Bristol |
Munera, Marcela | University of West England |
Keywords: Physically Assistive Devices, Grasping, Telerobotics and Teleoperation
Abstract: Individuals with tetraplegia have their independence and quality of life severely affected. Assistive robotic arms can enhance their autonomy, but effective control interfaces are essential for optimizing their usability and performance. This study aims to evaluate the performance and user experience of three control interfaces for an assistive robotic arm: a Graphical User Interface (GUI), an Embedded Interface, and Directional Gaze. Thirty-three able-bodied participants were recruited to control an assistive robotic arm through the three different interfaces in a between-subjects experiment. Performance was measured using the Yale-CMU-Berkeley (YCB) Block Pick and Place Protocol. Usability (SUS) and task workload (NASA-TLX) were measured through subjective questionnaires. Additionally, we report saccades per minute and fixation duration. The results revealed statistically significant differences showing that the Embedded and GUI interfaces, when compared to the Directional Gaze interface, can lead to lower workloads and higher performance in pick-up tasks.
|
|
11:35-11:40, Paper TuBT13.5 | Add to My Program |
A Laser-Guided Interaction Interface for Providing Effective Robot Assistance to People with Upper Limbs Impairments |
|
Torielli, Davide | Humanoids and Human Centered Mechatronics (HHCM), Istituto Italiano Di Tecnologia |
Bertoni, Liana | Italian Institute of Technology |
Muratore, Luca | Istituto Italiano Di Tecnologia |
Tsagarakis, Nikos | Istituto Italiano Di Tecnologia |
Keywords: Physically Assistive Devices, Human-Robot Collaboration, Visual Servoing
Abstract: Robotics has shown significant potential in assisting people with disabilities to enhance their independence and involvement in daily activities. Indeed, a long-term societal impact is expected in home-care assistance with the deployment of intelligent robotic interfaces. This work presents a human-robot interface developed to help people with upper limb impairments, such as those affected by stroke injuries, in activities of everyday life. The proposed interface leverages a visual servoing guidance component, which utilizes an inexpensive but effective laser emitter device. By projecting the laser on a surface within the workspace of the robot, the user is able to guide the robotic manipulator to desired locations in order to reach, grasp, and manipulate objects. Considering the targeted users, the laser emitter is worn on the head, enabling the user to intuitively control the robot motions with head movements that point the laser in the environment, whose projection is detected with a neural-network-based perception module. The interface implements two control modalities: the first allows the user to select specific locations directly, commanding the robot to reach those points; the second employs a paper keyboard with buttons that can be virtually pressed by pointing the laser at them. These buttons enable more direct control of the Cartesian velocity of the end-effector and provide additional functionalities such as commanding the action of the gripper. The proposed interface is evaluated in a series of manipulation tasks involving a 6-DOF assistive robot manipulator equipped with a 1-DOF beak-like gripper. The two interface modalities are combined to successfully accomplish tasks requiring bimanual capacity, which is usually affected in people with upper limb impairments.
|
|
TuBT14 Regular Session, 402 |
Add to My Program |
Wearable Robotics 1 |
|
|
Chair: Masia, Lorenzo | Technische Universität München (TUM) |
Co-Chair: Zhang, Haohan | University of Utah |
|
11:15-11:20, Paper TuBT14.1 | Add to My Program |
Gravity Compensation Method for Whole Body-Mounted Robot with Contact Force Distribution Sensor |
|
Masaoka, Shinichi | Nagoya University |
Funabora, Yuki | Nagoya University |
Doki, Shinji | Nagoya University |
Keywords: Wearable Robotics, Force Control, Physically Assistive Devices
Abstract: The emergence of sheet-type force distribution sensors has allowed direct measurement of contact force. We developed a wearable assistive robot that can directly measure contact force and investigated the gravity compensation effect of contact-force-based control. For conventional robots that do not measure the force acting between the robot and the human body (contact force) directly, a precise robot model is required for gravity compensation, which is difficult to implement in software. In the first experiment, we examined a method of gravity compensation using only joint sensors in torque-based control, which is a common conventional method, and assessed the difficulty of this method. In the next experiment, which involved one healthy subject, we confirmed that contact-force-based control has a significant gravity compensation effect without requiring a rigorous robot model. Experiments with two additional healthy subjects using the same parameters revealed that even rough parameter tuning can produce a gravity compensation effect. This study not only proposes a simplified gravity compensator for wearable assistive robots but also demonstrates the robustness of parameter tuning in contact-force-based control under static conditions. Based on the findings of this study, we will further study the possibility of other kinds of disturbance compensation and dynamic conditions in the future.
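The contrast drawn above can be sketched with a toy single-joint example: model-based compensation needs the limb mass and geometry, while contact-force-based compensation only servos the measured contact force toward zero. The 1-DoF limb model, gains, and numbers below are illustrative assumptions, not the robot or controller in the paper.

```python
# Toy sketch: model-based vs. contact-force-based gravity compensation on one joint.
import numpy as np

def model_based_tau(q, mass_est, com_len_est, g=9.81):
    """Needs an (error-prone) estimate of limb mass and center-of-mass length."""
    return mass_est * g * com_len_est * np.cos(q)

def contact_force_tau(tau_prev, f_contact_sum, ki=0.5, dt=0.002):
    """Integral law on the summed sheet-sensor force: drive the residual load to zero."""
    return tau_prev + ki * f_contact_sum * dt

# simulate one elbow-like joint: true gravity load vs. what each controller applies
mass_true, com_true, q = 1.8, 0.17, np.deg2rad(30)
tau_gravity = mass_true * 9.81 * com_true * np.cos(q)
tau_model = model_based_tau(q, mass_est=1.5, com_len_est=0.20)   # deliberately wrong model estimates
tau_cf = 0.0
for _ in range(5000):
    f_residual = (tau_gravity - tau_cf) / com_true               # unsupported load felt at the sensor
    tau_cf = contact_force_tau(tau_cf, f_residual)
# tau_cf converges to tau_gravity without knowing mass_true or com_true
```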
|
|
11:20-11:25, Paper TuBT14.2 | Add to My Program |
Unsupervised Domain Adaptation for Gait State Estimation |
|
Medrano, Roberto | University of Michigan |
Thomas, Gray | Texas A&M University |
Rouse, Elliott | University of Michigan |
Keywords: Prosthetics and Exoskeletons, Machine Learning for Robot Control, Sensor-based Control
Abstract: Exoskeleton controllers have recently employed machine learning (ML) techniques to provide appropriate assistance throughout the terrains of the real world. One successful approach has been to learn a mapping between an exoskeleton wearer's kinematic measurements and a gait state vector that encodes how the wearer is currently walking (i.e. gait phase, speed), and then dynamically update the assistance based on the gait state. However, these methods require paired datasets of input kinematics to output gait states, which usually involves manual, time-consuming labeling of data from participants wearing specific exoskeletons and thus limits the scalability of these ML methods. A prior solution to this challenge---leveraging large pre-labeled datasets of normative human walking---introduces another problem, in that networks trained on these datasets learn only normative locomotion patterns, and thus may deteriorate when the data are changed by wearing the exoskeleton itself. In this context, we present an unsupervised-learning-based approach to both bypass the requirement of labeled data for gait state prediction and address the difficulty of domain adaptation from normative to exoskeleton-assisted walking. We validate our method in a set of walking simulations that featured exoskeleton data from 14 participants. This model showed significant improvements in state estimation relative to a model trained solely on pre-labeled normative walking, while also not requiring ground truth labels. This work presents a foundation that demonstrates labeled, device-specific data may not be required for predicting walking behavior in real time.
|
|
11:25-11:30, Paper TuBT14.3 | Add to My Program |
Anti-Sensing: Defense against Unauthorized Radar-Based Human Vital Sign Sensing with Physically Realizable Wearable Oscillators |
|
Tasnim Oshim, Md Farhan | University of Massachusetts Amherst |
Doering, Nigel | University of California San Diego |
Islam, Bashima | Worcester Polytechnic Institute |
Weng, Tsui-Wei | UCSD |
Rahman, Tauhidur | University of California San Diego |
Keywords: Wearable Robotics, Physically Assistive Devices, Human-Centered Robotics
Abstract: Recent advancements in Ultra-Wideband (UWB) radar technology have enabled contactless, non-line-of-sight vital sign monitoring, making it a valuable tool for healthcare. However, UWB radar's ability to capture sensitive physiological data, even through walls, raises significant privacy concerns, particularly in human-robot interactions and autonomous systems that rely on radar for sensing human presence and physiological functions. In this paper, we present Anti-Sensing, a novel defense mechanism designed to prevent unauthorized radar-based sensing. Our approach introduces physically realizable perturbations, such as oscillatory motion from wearable devices, to disrupt radar sensing by mimicking natural cardiac motion, thereby misleading heart rate (HR) estimations. We develop a gradient-based algorithm to optimize the frequency and spatial amplitude of these oscillations for maximal disruption while ensuring physiological plausibility. Through both simulations and real-world experiments with radar data and neural network-based HR sensing models, we demonstrate the effectiveness of Anti-Sensing in significantly degrading model accuracy, offering a practical solution for privacy preservation.
|
|
11:30-11:35, Paper TuBT14.4 | Add to My Program |
Vision-Based Fuzzy Control System with Intention Detection for Smart Walkers: Enhancing Usability for Stroke Survivors with Unilateral Upper Limb Impairments |
|
Abdollah Chalaki, Mahdi | University of Alberta |
Zakerimanesh, Amir | University of Alberta |
Soleymani, Abed | University of Alberta |
Mushahwar, Vivian K. | University of Alberta |
Tavakoli, Mahdi | University of Alberta |
Keywords: Rehabilitation Robotics, Medical Robots and Systems, Physical Human-Robot Interaction
Abstract: Mobility impairments, particularly those caused by stroke-induced hemiparesis, significantly impact independence and quality of life. Current smart walker controllers operate by using input forces from the user to control linear motion and input torques to dictate rotational movement; however, because they predominantly rely on user-applied torque exerted on the device handle as an indicator of user intent to turn, they fail to adequately accommodate users with unilateral upper limb impairments. This leads to increased physical strain and cognitive load. This paper introduces a novel smart walker equipped with a fuzzy control algorithm that leverages shoulder abduction angles to intuitively interpret user intentions using just one functional hand. By integrating a force sensor and stereo camera, the system enhances walker responsiveness, usability, and safety. Experimental evaluations with five participants demonstrated that the fuzzy controller significantly reduced wrist torque and improved user comfort compared to traditional admittance controllers. Results confirmed a strong correlation between shoulder abduction angles and directional intent, with users reporting decreased effort and enhanced ease of use. This study contributes to assistive robotics by providing an adaptable control mechanism for smart walkers, suggesting a pathway towards enhancing mobility and independence for individuals with mobility impairments. Project page: https://tbs-ualberta.github.io/fuzzy-sw/
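As a rough illustration of how a shoulder abduction angle could be mapped to a turning command through fuzzy rules, the sketch below implements a generic Mamdani-style rule base with weighted-average defuzzification; the membership ranges and output centroids are made-up values, not the calibrated rules from the paper.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function."""
    return max(min((x - a) / (b - a + 1e-9), (c - x) / (c - b + 1e-9)), 0.0)

def fuzzy_turn_rate(abduction_deg):
    """Map a shoulder abduction angle (deg) to a turn-rate command (rad/s).

    Membership ranges and output centroids are illustrative guesses, not the
    values used in the paper.
    """
    mu = {
        "neutral": tri(abduction_deg, -10, 0, 10),
        "small":   tri(abduction_deg, 5, 20, 35),
        "large":   tri(abduction_deg, 25, 50, 90),
    }
    centroids = {"neutral": 0.0, "small": 0.3, "large": 0.8}  # rad/s
    num = sum(mu[k] * centroids[k] for k in mu)
    den = sum(mu.values()) + 1e-9
    return num / den  # weighted-average defuzzification

print(fuzzy_turn_rate(30.0))
```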
|
|
11:35-11:40, Paper TuBT14.5 | Add to My Program |
A Lower Limb Wearable Exosuit for Improved Sitting, Standing, and Walking Efficiency |
|
Zhang, Xiaohui | Heidelberg University |
Tricomi, Enrica | Heidelberg University |
Ma, Xunju | Beijing Institute of Technology |
Gomez-Correa, Manuela | Instituto Politecnico Nacional |
Ciaramella, Alessandro | Università Di Pisa |
Missiroli, Francesco | Heidelberg University |
Miskovic, Luka | Jožef Stefan Institute |
Su, Huimin | Heidelberg University |
Masia, Lorenzo | Technische Universität München (TUM) |
Keywords: Wearable Robots, Modeling, Control, and Learning for Soft Robots, Human Performance Augmentation, Adaptive Lower Limb Assistance Control
Abstract: Sitting, standing, and walking are fundamental activities crucial for maintaining independence in daily life. However, aging or lower limb injuries can impede these activities, posing obstacles to individuals' autonomy. In response to this challenge, we developed the LM-Ease, a compact and soft wearable robot designed to provide hip assistance. Its purpose is to aid users in carrying out essential daily activities such as sitting, standing, and walking. The LM-Ease features a fully-actuated tendon-driven system that seamlessly transitions between assistance actuation profiles tailored for sitting, standing, and walking movements. This device provides the user with gravity support during stand-to-sit, and offers a hip-extension assistive pulling force during sit-to-stand and walking. Our preliminary results show that with the LM-Ease, healthy young adults (n = 8) had significantly lower muscle activation: an average reduction of 15.6% during stand-to-sit and 17.8% during sit-to-stand. Furthermore, with LM-Ease, participants demonstrated a 12.7% reduction in metabolic cost during ground walking.
|
|
11:40-11:45, Paper TuBT14.6 | Add to My Program |
Towards Shape-Adaptive Attachment Design for Wearable Devices Using Granular Jamming |
|
Brignone, Joseph | University of Utah |
Lancaster, Logan | University of Utah |
Battaglia, Edoardo | University of Utah |
Zhang, Haohan | University of Utah |
Keywords: Wearable Robotics, Soft Robot Applications
Abstract: Attaching a wearable device to the user's body comfortably and functionally while accommodating differences and changes in body shape is often a challenge. In this paper, we propose an approach that addresses this problem through granular jamming, where a granule-filled membrane stiffens by rapidly decreasing the internal air pressure (e.g., vacuum), causing the granule material to be jammed together due to friction. In the soft state, this structure conforms to complex shapes of the human body, while jamming the granules via vacuum switches it to the rigid state required for proper robot function. We performed an experiment to systematically investigate the effect of multiple design parameters on the ability of such jamming-based interfaces to hold against a lateral force. Specifically, we developed a bench prototype where modular granular-jamming structures are attached to objects of different sizes and shapes via a downward suspension force. Our data showed that jamming is necessary, increasing the overall structural stability by 1.73 to 2.16 N. Furthermore, three modules, a high suspension force, and a low membrane infill (~25%) also contribute to high resistance to lateral force. Our results lay a foundation for future implementation of wearable attachments using granular-jamming structures.
|
|
TuBT15 Regular Session, 403 |
Add to My Program |
Robot Mapping 1 |
|
|
Chair: Mangelson, Joshua | Brigham Young University |
Co-Chair: Henderson, Thomas C. | University of Utah |
|
11:15-11:20, Paper TuBT15.1 | Add to My Program |
EPRecon: An Efficient Framework for Real-Time Panoptic 3D Reconstruction from Monocular Video |
|
Zhou, Zhen | Chinese Academy of Sciences Institute of Automation |
Ma, Yunkai | Institute of Automation, Chinese Academy of Sciences |
Fan, Junfeng | Institute of Automation, Chinese Academy of Sciences |
Zhang, Shaolin | Institute of Automation, Chinese Academy of Sciences |
Jing, Fengshui | Institute of Automation, CAS |
Tan, Min | Institute of Automation, Chinese Academy of Sciences |
Keywords: Mapping, Semantic Scene Understanding, Deep Learning for Visual Perception
Abstract: Panoptic 3D reconstruction from a monocular video is a fundamental perceptual task in robotic scene understanding. However, existing efforts suffer from inefficiency in terms of inference speed and accuracy, limiting their practical applicability. We present EPRecon, an efficient real-time panoptic 3D reconstruction framework. Current volumetric-based reconstruction methods usually utilize multi-view depth map fusion to obtain scene depth priors, which is time-consuming and poses challenges to real-time scene reconstruction. To address this issue, we propose a lightweight module to directly estimate scene depth priors in a 3D volume for reconstruction quality improvement by generating occupancy probabilities of all voxels. In addition, compared with existing panoptic segmentation methods, EPRecon extracts panoptic features from both voxel features and corresponding image features, obtaining more detailed and comprehensive instance-level semantic information and achieving more accurate segmentation results. Experimental results on the ScanNetV2 dataset demonstrate the superiority of EPRecon over current state-of-the-art methods in terms of both panoptic 3D reconstruction quality and real-time inference. Code is available at https://github.com/zhen6618/EPRecon.
|
|
11:20-11:25, Paper TuBT15.2 | Add to My Program |
Neural Surface Reconstruction and Rendering for LiDAR-Visual Systems |
|
Liu, Jianheng | The University of Hong Kong |
Zheng, Chunran | The University of Hong Kong |
Wan, YunFei | The University of Hong Kong |
Wang, Bowen | University of Hong Kong |
Cai, Yixi | KTH Royal Institute of Technology |
Zhang, Fu | University of Hong Kong |
Keywords: Mapping, Visual Learning, RGB-D Perception
Abstract: This paper presents a unified surface reconstruction and rendering framework for LiDAR-visual systems, integrating Neural Radiance Fields (NeRF) and Neural Distance Fields (NDF) to recover both appearance and structural information from posed images and point clouds. We address the structural visible gap between NeRF and NDF by utilizing a visible-aware occupancy map to classify space into the free, occupied, visible unknown, and background regions. This classification facilitates the recovery of a complete appearance and structure of the scene. We unify the training of the NDF and NeRF using a spatially-varying-scale SDF-to-density transformation for levels of detail for both structure and appearance. The proposed method leverages the learned NDF for structure-aware NeRF training by an adaptive sphere tracing sampling strategy for accurate structure rendering. In return, NeRF further helps the NDF recover missing or fuzzy structures. Extensive experiments demonstrate the superior quality and versatility of the proposed method across various scenarios. To benefit the community, the codes will be released at https://github.com/hku-mars/M2Mapping.
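The spatially-varying-scale SDF-to-density transformation mentioned above is in the spirit of common SDF-based volume rendering conversions; below is a minimal sketch using an assumed Laplace-CDF form (as in VolSDF) with a per-point scale, which may differ from the exact transformation used in the paper.

```python
import torch

def sdf_to_density(sdf, scale):
    """Convert signed distance to volume density via a Laplace CDF (VolSDF-style).

    sdf:   (N,) signed distances queried from the neural distance field
    scale: (N,) spatially-varying sharpness; smaller values concentrate density
           around the zero level set. The per-point scale is an illustrative
           reading of the paper's level-of-detail idea.
    """
    alpha = 1.0 / scale
    density = torch.where(
        sdf >= 0,
        0.5 * torch.exp(-sdf / scale),
        1.0 - 0.5 * torch.exp(sdf / scale),
    ) * alpha
    return density

sdf = torch.linspace(-0.1, 0.1, 5)
print(sdf_to_density(sdf, torch.full_like(sdf, 0.02)))
```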
|
|
11:25-11:30, Paper TuBT15.3 | Add to My Program |
LVBA: LiDAR-Visual Bundle Adjustment for RGB Point Cloud Mapping |
|
Li, Rundong | University of Hong Kong |
Liu, Xiyuan | The University of Hong Kong |
Li, Haotian | The University of Hong Kong |
Liu, Zheng | University of Hong Kong |
Lin, Jiarong | The University of Hong Kong |
Cai, Yixi | KTH Royal Institute of Technology |
Zhang, Fu | University of Hong Kong |
Keywords: Mapping, SLAM
Abstract: Point cloud maps with accurate color are crucial in robotics and mapping applications. Existing approaches for producing RGB-colorized maps are primarily based on real-time localization using filter-based estimation or sliding window optimization, which may lack accuracy and global consistency. In this work, we introduce a novel global LiDAR-Visual bundle adjustment (BA) named LVBA to improve the quality of RGB point cloud mapping beyond existing baselines. LVBA first optimizes LiDAR poses via a global LiDAR BA, followed by a photometric visual BA incorporating planar features from the LiDAR point cloud for camera pose optimization. Additionally, to address the challenge of map point occlusions in constructing optimization problems, we implement a novel LiDAR-assisted global visibility algorithm in LVBA. To evaluate the effectiveness of LVBA, we conducted extensive experiments by comparing its mapping quality against existing state-of-the-art baselines (i.e., R3LIVE and FAST-LIVO). Our results prove that LVBA can proficiently reconstruct high-fidelity, accurate RGB point cloud maps, outperforming its counterparts.
|
|
11:30-11:35, Paper TuBT15.4 | Add to My Program |
LiDAR-Enhanced 3D Gaussian Splatting Mapping |
|
Shen, Jian | WuHan University |
Yu, Huai | Wuhan University |
Wu, Ji | Wuhan University |
Yang, Wen | Wuhan University |
Xia, Gui-Song | Wuhan University |
Keywords: Mapping, SLAM
Abstract: This paper introduces LiGSM, a novel LiDAR-enhanced 3D Gaussian Splatting (3DGS) mapping framework that improves the accuracy and robustness of 3D scene mapping by integrating LiDAR data. LiGSM constructs a joint loss from images and LiDAR point clouds to estimate the poses and optimize their extrinsic parameters, enabling dynamic adaptation to variations in sensor alignment. Furthermore, it leverages LiDAR point clouds to initialize 3DGS, providing denser and more reliable starting points compared to sparse SfM points. In scene rendering, the framework augments standard image-based supervision with depth maps generated from LiDAR projections, ensuring an accurate scene representation in both geometry and photometry. Experiments on public and self-collected datasets demonstrate that LiGSM outperforms comparative methods in pose tracking and scene rendering.
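A minimal sketch of the kind of joint supervision described above (a photometric term plus a LiDAR-projected depth term) is given below; the plain L1 terms, the weighting, and the zero-as-invalid depth convention are illustrative assumptions rather than the paper's exact loss.

```python
import torch

def joint_rgb_depth_loss(rendered_rgb, gt_rgb, rendered_depth, lidar_depth, lam=0.5):
    """Joint image + LiDAR-depth supervision in the spirit of the abstract.

    lidar_depth is a depth map obtained by projecting the LiDAR point cloud into the
    camera; zeros mark pixels without a LiDAR return. The weight `lam` and the use of
    plain L1 terms are illustrative assumptions.
    """
    photo = (rendered_rgb - gt_rgb).abs().mean()
    valid = lidar_depth > 0
    depth = (rendered_depth[valid] - lidar_depth[valid]).abs().mean()
    return photo + lam * depth
```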
|
|
11:35-11:40, Paper TuBT15.5 | Add to My Program |
Depth-Visual-Inertial (DVI) Mapping System for Robust Indoor 3D Reconstruction |
|
Hamesse, Charles | Royal Military Academy |
Vlaminck, Michiel | Ghent University |
Luong, Hiep | Ghent University |
Haelterman, Rob | Royal Military Academy |
Keywords: RGB-D Perception, Mapping, Search and Rescue Robots
Abstract: We propose the Depth-Visual-Inertial (DVI) Mapper: a robust multi-sensor fusion framework for dense 3D mapping using time-of-flight cameras equipped with RGB and IMU sensors. Inspired by recent developments in real-time LiDAR-based odometry and mapping, our system uses an error-state iterative Kalman filter for state estimation: it processes the inertial sensor's data for state propagation, followed by a state update first using visual-inertial odometry, then depth-based odometry. This sensor fusion scheme makes our system robust to degenerate scenarios (e.g. lack of visual or geometrical features, fast rotations) and to noisy sensor data, like those that can be obtained with off-the-shelf time-of-flight DVI sensors. For evaluation, we propose the new Bunker DVI Dataset, featuring data from multiple DVI sensors recorded in challenging conditions reflecting search-and-rescue operations. We show the superior robustness and precision of our method against previous work. Following the open science principle, we make both our source code and dataset publicly available.
|
|
11:40-11:45, Paper TuBT15.6 | Add to My Program |
MOSE: Monocular Semantic Reconstruction Using NeRF-Lifted Noisy Priors |
|
Du, Zhenhua | National University of Defense Technology |
Xu, Binbin | University of Toronto |
Zhang, Haoyu | National University of Defense Technology |
Huo, Kai | National University of Defense Technology |
Zhi, Shuaifeng | National University of Defense Technology |
Keywords: Semantic Scene Understanding, Representation Learning, Deep Learning for Visual Perception
Abstract: Accurately reconstructing dense and semantically annotated 3D meshes from monocular images remains a challenging task due to the lack of geometry guidance and imperfect view-dependent 2D priors. Though we have witnessed recent advancements in implicit neural scene representations enabling precise 2D rendering simply from multi-view images, there have been few works addressing 3D scene understanding with monocular priors alone. In this paper, we propose MOSE, a neural field semantic reconstruction approach to lift inferred image-level noisy priors to 3D, producing accurate semantics and geometry in both 3D and 2D space. The key motivation for our method is to leverage generic class-agnostic segment masks as guidance to promote local consistency of rendered semantics during training. With the help of semantics, we further apply a smoothness regularization to texture-less regions for better geometric quality, thus achieving mutual benefits of geometry and semantics. Experiments on the ScanNet dataset show that our MOSE outperforms relevant baselines across all metrics on tasks of 3D semantic segmentation, 2D semantic segmentation and 3D surface reconstruction.
|
|
TuBT16 Regular Session, 404 |
Add to My Program |
Manipulation 2 |
|
|
Chair: Zhou, Jianshu | University of California, Berkeley |
Co-Chair: Tiomkin, Stas | Texas Tech University |
|
11:15-11:20, Paper TuBT16.1 | Add to My Program |
Pushing in the Dark: A Reactive Pushing Strategy for Mobile Robots Using Tactile Feedback |
|
Ozdamar, Idil | HRI2 Lab., Istituto Italiano Di Tecnologia. Dept. of Informatics |
Sirintuna, Doganay | HRI2 Lab., Istituto Italiano Di Tecnologia. Dept. of Informatics |
Arbaud, Robin | HRI2 Lab., Istituto Italiano Di Tecnologia ; Dept. of Informatic |
Ajoudani, Arash | Istituto Italiano Di Tecnologia |
Keywords: Mobile Manipulation, Manipulation Planning
Abstract: For mobile robots, navigating cluttered or dynamic environments often necessitates non-prehensile manipulation, particularly when faced with objects that are too large, irregular, or fragile to grasp. The unpredictable behavior and varying physical properties of these objects significantly complicate manipulation tasks. To address this challenge, this manuscript proposes a novel Reactive Pushing Strategy. This strategy allows a mobile robot to dynamically adjust its base movements in real-time to achieve successful pushing maneuvers towards a target location. Notably, our strategy adapts the robot's motion based on changes in contact location obtained through the tactile sensor covering the base, avoiding dependence on object-related assumptions or a model of the object's behavior. The effectiveness of the Reactive Pushing Strategy was initially evaluated in the simulation environment, where it significantly outperformed the compared baseline approaches. Following this, we validated the proposed strategy through real-world experiments, demonstrating the robot's capability to push objects to target points located anywhere in its vicinity. In both simulation and real-world experiments, the object-specific properties (shape, mass, friction, inertia) were altered along with the changes in target locations to assess the robustness of the proposed method comprehensively.
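To illustrate the general idea of adapting base motion from tactile contact location, here is a toy reactive law; the structure (push toward the target, rotate to keep the contact aligned with the pushing direction) and the gains are guesses for illustration, not the controller proposed in the paper.

```python
import numpy as np

def reactive_push_cmd(contact_angle, target_dir, v_push=0.15, k_align=1.0):
    """Toy reactive pushing law driven by tactile contact location.

    contact_angle: bearing (rad) of the contact point on the base, robot frame
    target_dir:    bearing (rad) of the target location, robot frame
    Returns (vx, vy, wz). This is an illustrative guess at a reactive strategy,
    not the method proposed in the paper.
    """
    # Translate so the object is pushed toward the target.
    vx = v_push * np.cos(target_dir)
    vy = v_push * np.sin(target_dir)
    # Rotate the base so the contact location drifts toward the pushing direction.
    err = np.arctan2(np.sin(target_dir - contact_angle),
                     np.cos(target_dir - contact_angle))
    wz = k_align * err
    return vx, vy, wz
```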
|
|
11:20-11:25, Paper TuBT16.2 | Add to My Program |
Foresee and Act Ahead: Task Prediction and Pre-Scheduling Enabled Efficient Robotic Warehousing |
|
Cao, Bo | MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai J |
Liu, Zhe | Shanghai Jiao Tong University |
Han, Xingyao | Shanghai Jiao Tong University |
Zhou, Shunbo | Huawei |
Zhang, Heng | Huawei |
Han, Lijun | Shanghai Jiao Tong University |
Wang, Lin | Shanghai Jiao Tong University |
Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: Manipulation Planning, Intelligent Transportation Systems
Abstract: In warehousing systems, to enhance efficiency amid surging demand volumes, much attention has been placed on how to reasonably allocate delivery tasks to robots. However, robot labor is still inevitably wasted to some extent. In this paper, we propose a pre-scheduling enhanced warehousing framework aiming to foresee and act in advance, which consists of task flow prediction and hybrid task allocation. For task prediction, we design the spatio-temporal representations of the task flow and introduce a periodicity-decoupled mechanism tailored for the generation patterns of aggregated orders, and then further extract spatial features of task distribution with a novel combination of graph structures. In hybrid task allocation, we consider the known tasks and predicted future tasks simultaneously to optimize the task allocation. In addition, we consider factors such as predicted task uncertainty and sector-level efficiency to realize more balanced and rational allocations. We validate our task prediction model across datasets derived from factories, achieving SOTA performance. Furthermore, we implement our system in a real-world robotic warehouse, demonstrating more than 30% improvements in efficiency.
|
|
11:25-11:30, Paper TuBT16.3 | Add to My Program |
Embodiment-Agnostic Action Planning Via Object-Part Scene Flow |
|
Tang, Weiliang | The Chinese University of Hong Kong |
Pan, Jia-Hui | The Chinese University of Hong Kong |
Zhan, Wei | Univeristy of California, Berkeley |
Zhou, Jianshu | University of California, Berkeley |
Yao, Huaxiu | UNC-Chapel Hill |
Liu, Yunhui | Chinese University of Hong Kong |
Tomizuka, Masayoshi | University of California |
Ding, Mingyu | UC Berkeley |
Fu, Chi-Wing | The Chinese University of Hong Kong |
Keywords: AI-Based Methods, Deep Learning in Grasping and Manipulation, Manipulation Planning
Abstract: Observing that the key for robotic action planning is to understand the target-object motion when its associated part is manipulated by the end effector, we propose to generate the 3D object-part scene flow and extract its transformations to solve the action trajectories for diverse embodiments. The advantage of our approach is that it derives the robot action explicitly from object motion prediction, yielding a more robust policy by understanding the object motions. Also, beyond policies trained on embodiment-centric data, our method is embodiment-agnostic, generalizable across diverse embodiments, and able to learn from human demonstrations. Our method comprises three components: an object-part predictor to locate the part for the end effector to manipulate, an RGBD video generator to predict future RGBD videos, and a trajectory planner to extract embodiment-agnostic transformation sequences and solve the trajectory for diverse embodiments. Trained on videos even without trajectory data, our method still outperforms existing works significantly by 27.7% and 26.2% on the prevailing virtual environments MetaWorld and Franka-Kitchen, respectively. Furthermore, we conducted real-world experiments, showing that our policy, trained only with human demonstrations, can be deployed to various embodiments.
|
|
11:30-11:35, Paper TuBT16.4 | Add to My Program |
Acoustic Wave Manipulation through Sparse Robotic Actuation |
|
Shah, Tristan | San Jose State University |
Smilovich, Noam | San Jose State University |
Amirkulova, Feruza | San Jose State University |
Gerges, Samer | San Jose State University |
Tiomkin, Stas | Texas Tech University |
Keywords: Manipulation Planning, Model Learning for Control
Abstract: Recent advancements in robotics, control, and machine learning have facilitated progress in the challenging area of object manipulation. These advancements include, among others, the use of deep neural networks to represent dynamics that are partially observed by robot sensors, as well as effective control using sparse control signals. In this work, we explore a more general problem: the manipulation of acoustic waves, which are partially observed by a robot capable of influencing the waves through spatially sparse actuators. This problem holds great potential for the design of new artificial materials, ultrasonic cutting tools, energy harvesting, and other applications. We develop an efficient data-driven method for robot learning that is applicable to either focusing scattered acoustic energy in a designated region or suppressing it, depending on the desired task. The proposed method is better in terms of solution quality and computational complexity compared to a state-of-the-art learning-based method for manipulation of dynamical systems governed by partial differential equations. Furthermore, our proposed method is competitive with a classical semi-analytical method in acoustics research on the demonstrated tasks. We have made the project code publicly available, along with a web page featuring video demonstrations: https://gladisor.github.io/waves/
|
|
11:35-11:40, Paper TuBT16.5 | Add to My Program |
Integrating Model-Based Control and RL for Sim2Real Transfer of Tight Insertion Policies |
|
Marougkas, Isidoros | Rutgers University |
Metha Ramesh, Dhruv | Rutgers University |
Doerr, Joe | Rutgers University |
Granados, Edgar | Rutgers University |
Sivaramakrishnan, Aravind | Amazon Fulfillment Technology & Robotics |
Boularias, Abdeslam | Rutgers University |
Bekris, Kostas E. | Rutgers, the State University of New Jersey |
Keywords: Integrated Planning and Learning, Reinforcement Learning, Manipulation Planning
Abstract: Object insertion under tight tolerances (< 1 mm) is an important but challenging assembly task as even small errors can result in undesirable contacts. Recent efforts focused on Reinforcement Learning (RL) often depend on careful definition of dense reward functions. This work proposes an effective strategy for such tasks that integrates traditional model-based control with RL to achieve improved insertion accuracy. The policy is trained exclusively in simulation and is zero-shot transferred to the real system. It employs a potential field-based controller to acquire a model-based policy for inserting a plug into a socket given full observability in simulation. This policy is then integrated with residual RL, which is trained in simulation given only a sparse, goal-reaching reward. A curriculum scheme over observation noise and action magnitude is used for training the residual RL policy. Both policy components use as input the SE(3) poses of both the plug and the socket and return the plug’s SE(3) pose transform, which is executed by a robotic arm using a controller. The integrated policy is deployed on the real system without further training or fine-tuning, given a visual SE(3) object tracker. The proposed solution and alternatives are evaluated across a variety of objects and conditions in simulation and reality. The proposed approach outperforms recent RL-based methods in this domain and prior efforts with hybrid policies. Ablations highlight the impact of each component of the approach. For more information, please refer to the corresponding website.
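The composition of the model-based and residual components can be pictured as in the sketch below, which handles only the translational part of the SE(3) action for brevity; the gain, residual scale, and observation layout are hypothetical, and `residual_policy` stands in for the trained residual RL component.

```python
import numpy as np

def combined_action(plug_pose, socket_pose, residual_policy, k_p=1.0, res_scale=0.005):
    """Compose a potential-field base action with a learned residual correction.

    Poses are 7-D (x, y, z, qx, qy, qz, qw); only the translational part is handled
    here, whereas the real system acts on full SE(3). `residual_policy` is assumed
    to return a 3-vector; the gain and residual scale are hypothetical values.
    """
    # Model-based term: attract the plug toward the socket (potential-field style).
    base = k_p * (socket_pose[:3] - plug_pose[:3])
    # Learned residual, bounded to a small magnitude around the model-based action.
    obs = np.concatenate([plug_pose, socket_pose])
    residual = res_scale * np.clip(residual_policy(obs), -1.0, 1.0)
    return base + residual
```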
|
|
11:40-11:45, Paper TuBT16.6 | Add to My Program |
Generative Graphical Inverse Kinematics |
|
Limoyo, Oliver | University of Toronto |
Maric, Filip | University of Toronto Institute for Aerospace Studies |
Giamou, Matthew | McMaster University |
Alexson, Petra | University of Toronto |
Petrovic, Ivan | University of Zagreb |
Kelly, Jonathan | University of Toronto |
Keywords: Deep Learning in Robotics and Automation, Kinematics, Manipulation Planning, Redundant Robots
Abstract: Quickly and reliably finding accurate inverse kinematics (IK) solutions remains a challenging problem for many robot manipulators. Existing numerical solvers are broadly applicable but typically only produce a single solution and rely on local search techniques to minimize nonconvex objective functions. More recent learning-based approaches that approximate the entire feasible set of solutions have shown promise as a means to generate multiple fast and accurate IK results in parallel. However, existing learning-based techniques have a significant drawback: each robot of interest requires a specialized model that must be trained from scratch. To address this key shortcoming, we propose a novel distance-geometric robot representation coupled with a graph structure that allows us to leverage the sample efficiency of Euclidean equivariant functions and the generalizability of graph neural networks (GNNs). Our approach is generative graphical inverse kinematics (GGIK), the first learned IK solver able to accurately and efficiently produce a large number of diverse solutions in parallel while also displaying the ability to generalize—a single learned model can be used to produce IK solutions for a variety of different robots. When compared to several other learned IK methods, GGIK provides more accurate solutions with the same amount of data. GGIK can generalize reasonably well to robot manipulators unseen during training. Additionally, GGIK can learn a constrained distribution that encodes joint limits and scales efficiently to larger robots and a high number of sampled solutions. Finally, GGIK can be used to complement local IK solvers by providing reliable initializations for a local optimization process.
|
|
TuBT17 Regular Session, 405 |
Add to My Program |
Localization 1 |
|
|
Chair: Urbann, Oliver | Fraunhofer IML |
Co-Chair: Dümbgen, Frederike | ENS, PSL University |
|
11:15-11:20, Paper TuBT17.1 | Add to My Program |
GNSS/Multi-Sensor Fusion Using Continuous-Time Factor Graph Optimization for Robust Localization |
|
Zhang, Haoming | RWTH Aachen University |
Chen, Chih-Chun | RWTH Aachen University |
Vallery, Heike | TU Delft |
Barfoot, Timothy | University of Toronto |
Keywords: Sensor Fusion, Localization, Autonomous Vehicle Navigation, Factor Graph Optimization
Abstract: Accurate and robust vehicle localization in highly urbanized areas is challenging. Sensors are often corrupted in those complicated and large-scale environments. This paper introduces GNSS-FGO, an online trajectory estimator that fuses GNSS observations alongside multiple sensor measurements for robust vehicle localization. In GNSS-FGO, we fuse asynchronous sensor measurements into the graph with a continuous-time trajectory representation. This enables querying states at arbitrary timestamps so that sensor observations are fused without strict state and measurement synchronization. We employed datasets from measurement campaigns in Aachen, Düsseldorf, and Cologne in experimental studies and presented comprehensive discussions on sensor observations, smoother types, and hyperparameter tuning. Our results show that the proposed approach enables robust trajectory estimation in dense urban areas, where the classic multi-sensor fusion method fails. In a test sequence containing a 17km route through Aachen, our method results in a mean 2-D positioning error of 0.48m while fusing raw GNSS observations with lidar odometry in tight coupling.
|
|
11:20-11:25, Paper TuBT17.2 | Add to My Program |
Equivariant Filter for Tightly Coupled LiDAR-Inertial Odometry |
|
Tao, Anbo | Wuhan University |
Luo, Yarong | Wuhan University |
Xia, Chunxi | Wuhan University |
Guo, Chi | Wuhan University |
Li, Xingxing | Wuhan University |
Keywords: Localization, SLAM
Abstract: Pose estimation is a crucial problem in simultaneous localization and mapping (SLAM). However, developing a robust and consistent state estimator remains a significant challenge, as the traditional extended Kalman filter (EKF) struggles to handle the model nonlinearity, especially for inertial measurement unit (IMU) and light detection and ranging (LiDAR). To provide a consistent and efficient solution for pose estimation, we propose Eq-LIO, a robust state estimator for tightly coupled LIO systems based on an equivariant filter (EqF). Compared with the invariant Kalman filter based on the SE_2(3) group structure, the EqF uses the symmetry of the semi-direct product group to couple the system state including IMU bias, navigation state, and LiDAR extrinsic calibration state, thereby suppressing linearization error and improving the behavior of the estimator in the event of unexpected state changes. The proposed Eq-LIO offers natural consistency and higher robustness, which is theoretically proven with mathematical derivation and experimentally verified through a series of tests on both public and private datasets.
|
|
11:25-11:30, Paper TuBT17.3 | Add to My Program |
Monocular Visual Place Recognition in LiDAR Maps Via Cross-Modal State Space Model and Multi-View Matching |
|
Yao, Gongxin | Zhejiang University |
Li, Xinyang | Zhejiang University |
Fu, Luowei | Zhejiang University |
Pan, Yu | Zhejiang University |
Keywords: Localization, Deep Learning for Visual Perception, Recognition
Abstract: Achieving monocular camera localization within pre-built LiDAR maps can bypass the simultaneous mapping process of visual SLAM systems, potentially reducing the computational overhead of autonomous localization. To this end, one of the key challenges is cross-modal place recognition, which involves retrieving 3D scenes (point clouds) from a LiDAR map according to online RGB images. In this paper, we introduce an efficient framework to learn descriptors for both RGB images and point clouds. It takes visual state space model (VMamba) as the backbone and employs a pixel-view-scene joint training strategy for cross-modal contrastive learning. To address the field-of-view differences, independent descriptors are generated from multiple evenly distributed viewpoints for point clouds. A visible 3D points overlap strategy is then designed to quantify the similarity between point cloud views and RGB images for multi-view supervision. Additionally, when generating descriptors from pixel-level features using NetVLAD, we compensate for the loss of geometric information, and introduce an efficient scheme for multi-view generation. Experimental results on the KITTI and KITTI-360 datasets demonstrate the effectiveness and generalization of our method. The code is available at https://github.com/y2w-oc/I2P-CMPR.
|
|
11:30-11:35, Paper TuBT17.4 | Add to My Program |
Learning IMU Bias with Diffusion Model |
|
Zhou, Shenghao | University of Delaware |
Katragadda, Saimouli | University of Delaware |
Huang, Guoquan (Paul) | University of Delaware |
Keywords: Visual-Inertial SLAM, SLAM, Localization
Abstract: Motion sensing and tracking with IMU data is essential for spatial intelligence, but it is challenging due to the presence of time-varying stochastic bias. IMU bias is affected by various factors such as temperature and vibration, making it highly complex and difficult to model analytically. Recent data-driven approaches using deep learning have shown promise in predicting bias from IMU readings. However, these methods often treat the task as a regression problem, overlooking the stochastic nature of bias. In contrast, we model bias, conditioned on IMU readings, as a probabilistic distribution and design a conditional diffusion model to approximate this distribution. Through this approach, we achieve improved performance and make predictions that align more closely with the known behavior of bias.
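A conditional diffusion model of this kind is typically sampled with a reverse denoising loop; the sketch below shows a generic DDPM-style sampler conditioned on a window of IMU readings, where `eps_model`, the linear beta schedule, and the 6-D bias dimension (gyro plus accelerometer) are assumptions for illustration, not the paper's exact configuration.

```python
import torch

@torch.no_grad()
def sample_bias(eps_model, imu_window, steps=50, dim=6):
    """DDPM-style sampling of an IMU bias conditioned on a window of IMU readings.

    eps_model(b_t, t, cond) is assumed to be the trained noise predictor.
    """
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    b = torch.randn(1, dim)                          # start from Gaussian noise
    for t in reversed(range(steps)):
        eps = eps_model(b, torch.tensor([t]), imu_window)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (b - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(b) if t > 0 else torch.zeros_like(b)
        b = mean + torch.sqrt(betas[t]) * noise      # one reverse denoising step
    return b                                          # one posterior sample of the bias
```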
|
|
11:35-11:40, Paper TuBT17.5 | Add to My Program |
On Semidefinite Relaxations for Matrix-Weighted State-Estimation Problems in Robotics |
|
Holmes, Connor | University of Toronto |
Dümbgen, Frederike | ENS, PSL University |
Barfoot, Timothy | University of Toronto |
Keywords: SLAM, Localization, Optimization and Optimal Control, Certifiable
Abstract: In recent years, there has been remarkable progress in the development of so-called certifiable perception methods, which leverage semidefinite, convex relaxations to find global optima of perception problems in robotics. However, many of these relaxations rely on simplifying assumptions that facilitate the problem formulation, such as an isotropic measurement noise distribution. In this paper, we explore the tightness of the semidefinite relaxations of matrix-weighted (anisotropic) state-estimation problems and reveal the limitations lurking therein: matrix-weighted factors can cause convex relaxations to lose tightness. In particular, we show that the semidefinite relaxations of localization problems with matrix weights may be tight only for low noise levels. To better understand this issue, we introduce a theoretical connection between the posterior uncertainty of the state estimate and the certificate matrix obtained via convex relaxation. With this connection in mind, we empirically explore the factors that contribute to this loss of tightness and demonstrate that redundant constraints can be used to regain it. As a second technical contribution of this paper, we show
|
|
11:40-11:45, Paper TuBT17.6 | Add to My Program |
Drift-Free Visual SLAM Using Digital Twins |
|
Merat, Roxane | ETH Zurich |
Cioffi, Giovanni | University of Zurich |
Bauersfeld, Leonard | University of Zurich (UZH) |
Scaramuzza, Davide | University of Zurich |
Keywords: SLAM, Localization
Abstract: Globally-consistent localization in urban environments is crucial for autonomous systems such as self-driving vehicles and drones, as well as assistive technologies for visually impaired people. Traditional Visual-Inertial Odometry (VIO) and Visual Simultaneous Localization and Mapping (VSLAM) methods, though adequate for local pose estimation, suffer from drift in the long term due to reliance on local sensor data. While GPS counteracts this drift, it is unavailable indoors and often unreliable in urban areas. An alternative is to localize the camera to an existing 3D map using visual-feature matching. This can provide centimeter-level accurate localization but is limited by the visual similarities between the current view and the map. This paper introduces a novel approach that achieves accurate and globally-consistent localization by aligning the sparse 3D point cloud generated by the VIO/VSLAM system to a digital twin using point-to-plane matching; no visual data association is needed. The proposed method provides a 6-DoF global measurement tightly integrated into the VIO/VSLAM system. Experiments run on a high-fidelity GPS simulator and real-world data collected from a drone demonstrate that our approach outperforms state-of-the-art VIO-GPS systems and offers superior robustness against viewpoint changes compared to the state-of-the-art Visual SLAM systems.
|
|
TuBT18 Regular Session, 406 |
Add to My Program |
Place Recognition 1 |
|
|
Chair: Bogoslavskyi, Igor | Robotics and AI Institute |
Co-Chair: Malone, Connor | Queensland University of Technology |
|
11:15-11:20, Paper TuBT18.1 | Add to My Program |
TDFANet: Encoding Sequential 4D Radar Point Clouds Using Trajectory-Guided Deformable Feature Aggregation for Place Recognition |
|
Lu, Shouyi | Tongji University |
Zhuo, Guirong | Tongji University, Shanghai |
Wang, Haitao | The Shanghai Geometrical Perception and Learning Co., Ltd |
Zhou, Quan | Tongji University |
Zhou, Huanyu | Tongji University |
Huang, Renbo | Tongji University |
Huang, Minqing | Tongji University |
Zheng, Lianqing | TONGJI University |
Shu, Qiang | The Shanghai Tongyu Automotive Technology Co., Ltd |
Keywords: Localization, Autonomous Vehicle Navigation
Abstract: Place recognition is essential for achieving closed-loop or global positioning in autonomous vehicles and mobile robots. Despite recent advancements in place recognition using 2D cameras or 3D LiDAR, it remains to be seen how to use 4D radar for place recognition - an increasingly popular sensor for its robustness against adverse weather and lighting conditions. Compared to LiDAR point clouds, radar data are drastically sparser, noisier and in much lower resolution, which hampers their ability to effectively represent scenes, posing significant challenges for 4D radar-based place recognition. This work addresses these challenges by leveraging multi-modal information from sequential 4D radar scans and effectively extracting and aggregating spatio-temporal features. Our approach follows a principled pipeline that comprises (1) dynamic points removal and ego-velocity estimation from velocity property, (2) bird's eye view (BEV) feature encoding on the refined point cloud, (3) feature alignment using BEV feature map motion trajectory calculated by ego-velocity, (4) multi-scale spatio-temporal features of the aligned BEV feature maps are extracted and aggregated. Real-world experimental results validate the feasibility of the proposed method and demonstrate its robustness in handling dynamic environments. Source codes are available.
|
|
11:20-11:25, Paper TuBT18.2 | Add to My Program |
HeLiOS: Heterogeneous LiDAR Place Recognition Via Overlap-Based Learning and Local Spherical Transformer |
|
Jung, Minwoo | Seoul National University |
Jung, Sangwoo | Seoul National University |
Gil, Hyeonjae | SNU |
Kim, Ayoung | Seoul National University |
Keywords: Localization, Range Sensing, SLAM
Abstract: LiDAR place recognition is a crucial module in localization that matches the current location with previously observed environments. Most existing approaches in LiDAR place recognition dominantly focus on the spinning type LiDAR to exploit its large FOV for matching. However, with the recent emergence of various LiDAR types, the importance of matching data across different LiDAR types has grown significantly—a challenge that has been largely overlooked for many years. To address these challenges, we introduce HeLiOS, a deep network tailored for heterogeneous LiDAR place recognition, which utilizes small local windows with spherical transformers and optimal transport-based cluster assignment for robust global descriptors. Our overlap-based data mining and guided-triplet loss overcome the limitations of traditional distance-based mining and discrete class constraints. HeLiOS is validated on public datasets, demonstrating performance in heterogeneous LiDAR place recognition while including an evaluation for long-term recognition, showcasing its ability to handle unseen LiDAR types. We release the HeLiOS code as an open source for the robotics community at https://github.com/minwoo0611/HeLiOS.
|
|
11:25-11:30, Paper TuBT18.3 | Add to My Program |
InsCMPR: Efficient Cross-Modal Place Recognition Via Instance-Aware Hybrid Mamba-Transformer |
|
Jiao, Shuaifeng | National University of Defense Technology |
Su, Zhuoqun | National University of Defense Technology |
Luo, Lun | Zhejiang University |
Yu, Hongshan | Hunan University |
Zhou, Zongtan | National University of Defense Technology |
Lu, Huimin | National University of Defense Technology |
Chen, Xieyuanli | National University of Defense Technology |
Keywords: Localization, Deep Learning for Visual Perception, Visual Learning
Abstract: Place recognition is an important technique for autonomous mobile robotic applications. While single-modal sensor-based approaches have shown satisfactory performance, cross-modal place recognition remains underexplored due to the challenge of bridging the cross-modal heterogeneity gap. In this work, we introduce an instance-aware cross-modal place recognition approach, named InsCMPR. We design a novel instance-aware modality alignment module, which aligns multi-modal data at both pixel-level and instance-level by leveraging a pre-trained vision foundation model SAM. Then a novel dual-branch hybrid Mamba-Transformer network is proposed to efficiently enhance the distinctiveness of the produced descriptors by integrating global features with local instance features. Experimental results on the KITTI, NCLT, and HAOMO datasets show that our proposed methods achieve state-of-the-art performance while operating in real time. We will open source the implementation of our method at: https://github.com/nubot-nudt/InsCMPR.
|
|
11:30-11:35, Paper TuBT18.4 | Add to My Program |
Adaptive Thresholding for Sequence-Based Place Recognition |
|
Vysotska, Olga | ETH Zurich |
Bogoslavskyi, Igor | Magic Leap |
Hutter, Marco | ETH Zurich |
Stachniss, Cyrill | University of Bonn |
Keywords: SLAM, Localization, Mapping
Abstract: Robots need to know where they are in the world to operate effectively without human support. One common first step for precise robot localization is visual place recognition. It is a challenging problem, especially when the output is required in an online fashion, and the current state-of-the-art approaches that tackle it usually require either large amounts of labeled training data or rely on parameters that need to be tuned manually, often per dataset. One such parameter often used for sequence-based place recognition is the image similarity threshold that allows to differentiate between pairs of images that represent the same place even in the presence of severe environmental and structural changes, and those that represent different places even if they share a similar appearance. Currently, selecting this threshold is a manual procedure and requires human expertise. We propose an automatic similarity threshold selection technique and integrate it into a complete sequence-based place recognition system. The experiments on a broad range of real-world and simulated data show that our approach is capable of matching image sequences under various illumination, viewpoint and underlying structural changes, runs online, and requires no manual parameter tuning while yielding performance comparable to a manual, dataset-specific parameter tuning. Thus, this paper substantially increases the ease of use of visual place recognition in real-world settings.
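The abstract does not spell out the selection rule, so purely as an illustration of automating the choice, the sketch below picks a similarity threshold from the score distribution itself with an Otsu-style criterion; this is a generic stand-in, not the method proposed in the paper.

```python
import numpy as np

def auto_similarity_threshold(scores, bins=256):
    """Pick a similarity threshold from the score distribution (Otsu-style).

    `scores` are image-pair similarity values collected online. Separating the
    "same place" and "different place" modes by maximizing between-class variance
    is only one generic way to automate the choice; the paper's own rule may differ.
    """
    hist, edges = np.histogram(scores, bins=bins)
    p = hist.astype(float) / max(hist.sum(), 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        w0, w1 = p[:i].sum(), p[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (p[:i] * centers[:i]).sum() / w0
        m1 = (p[i:] * centers[i:]).sum() / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, centers[i]
    return best_t
```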
|
|
11:35-11:40, Paper TuBT18.5 | Add to My Program |
RE-TRIP : Reflectivity Instance Augmented Triangle Descriptor for 3D Place Recognition |
|
Park, Yechan | Yonsei University |
Pak, Gyuhyeon | Yonsei University |
Kim, Euntai | Yonsei University |
Keywords: SLAM, Localization, Mapping
Abstract: While most people associate LiDAR primarily with its ability to measure distances and provide geometric information about the environment (via point clouds), LiDAR also captures additional data, including reflectivity or intensity values. Unfortunately, when LiDAR is applied to Place Recognition (PR) in mobile robotics, most previous works on LiDAR-based PR rely only on geometric measurements, neglecting the additional reflectivity information that LiDAR provides. In this paper, we propose a novel descriptor for 3D PR, named RE-TRIP (REflectivity-instance augmented TRIangle descriPtor). This new descriptor leverages both geometric measurements and reflectivity to enhance robustness in challenging scenarios such as geometric degeneracy, high geometric similarity, and the presence of dynamic objects. To implement RE-TRIP in real-world applications, we further propose (1) a keypoint extraction method, (2) a key instance segmentation method, (3) a RE-TRIP matching method, and (4) a reflectivity-combined loop verification method. Finally, we conduct a series of experiments to demonstrate the effectiveness of RE-TRIP. Applied to public datasets (i.e., HELIPR, FusionPortable) containing diverse scenarios—including long corridors, bridges, large-scale urban areas, and highly dynamic environments—our experimental results show that the proposed method outperforms existing state-of-the-art methods, including Scan Context, Intensity Scan Context, and STD. Our code is available at: https://github.com/pyc5714/RE-TRIP.
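One way to picture a reflectivity-augmented triangle descriptor is sketched below: three keypoints contribute sorted side lengths (a pose-invariant geometric signature) together with sorted reflectivity values; this layout is an illustrative reading of the abstract, not the exact RE-TRIP encoding.

```python
import numpy as np

def reflectivity_triangle_descriptor(p1, p2, p3, r1, r2, r3):
    """Build a triangle descriptor from three keypoints plus their reflectivities.

    p1..p3 are 3D keypoint coordinates, r1..r3 their (instance-level) reflectivity
    values. Sorting makes the signature invariant to vertex ordering and rigid motion.
    """
    sides = sorted([
        np.linalg.norm(p1 - p2),
        np.linalg.norm(p2 - p3),
        np.linalg.norm(p3 - p1),
    ])
    refl = sorted([r1, r2, r3])
    return np.array(sides + refl)   # 6-D signature: geometry first, reflectivity second
```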
|
|
11:40-11:45, Paper TuBT18.6 | Add to My Program |
Context Graph-Based Visual-Language Place Recognition |
|
Woo, Soojin | Seoul National University |
Kim, Seong-Woo | Seoul National University |
Keywords: Localization, AI-Enabled Robotics, Object Detection, Segmentation and Categorization
Abstract: In vision-based robot localization and SLAM, Visual Place Recognition (VPR) is essential. This paper addresses the problem of VPR, which involves accurately recognizing the location corresponding to a given query image. A popular approach to vision-based place recognition relies on low-level visual features. Despite significant progress in recent years, place recognition based on low-level visual features remains challenging in scenarios with changes in scene appearance. To address this, end-to-end training approaches have been proposed to overcome the limitations of hand-crafted features. However, these approaches still fail under drastic changes and require large amounts of labeled data for model training, presenting a significant limitation. To handle variations in appearance, methods that leverage high-level semantic information, such as objects or categories, have been introduced. In this paper, we introduce a novel VPR approach that does not require additional training and remains robust to scene changes. Our method constructs semantic image descriptors by extracting pixel-level embeddings from a zero-shot, language-driven semantic segmentation model. We validate our approach in challenging place recognition scenarios using a real-world public dataset. The experiments demonstrate that our method outperforms non-learned image representation techniques and off-the-shelf convolutional neural network (CNN) descriptors. Our code is available at https://github.com/woo-soojin/context-based-vlpr.git.
|
|
TuBT19 Regular Session, 407 |
Add to My Program |
Tactile Sensing 1 |
|
|
Chair: Chin, Lillian | UT Austin |
Co-Chair: Li, Monica | Yale University |
|
11:15-11:20, Paper TuBT19.1 | Add to My Program |
Marker or Markerless? Mode-Switchable Optical Tactile Sensing for Diverse Robot Tasks |
|
Ou, Ni | Beijing Institute of Technology |
Chen, Zhuo | King's College London |
Luo, Shan | King's College London |
Keywords: Force and Tactile Sensing, Grasping
Abstract: Optical tactile sensors play a pivotal role in robot perception and manipulation tasks. The membrane of these sensors can be painted with markers or remain markerless, enabling them to function in either marker or markerless mode. However, this uni-modal selection means the sensor is only suitable for either manipulation or perception tasks. While markers are vital for manipulation, they can also obstruct the camera, thereby impeding perception. The dilemma of selecting between marker and markerless modes presents a significant obstacle. To address this issue, we propose a novel mode-switchable optical tactile sensing approach that facilitates transitions between the two modes. The marker-to-markerless transition is achieved through a generative model, whereas its inverse transition is realized using a sparsely supervised regressive model. Our approach allows a single-mode optical sensor to operate effectively in both marker and markerless modes without the need for additional hardware, making it well-suited for both perception and manipulation tasks. Extensive experiments validate the effectiveness of our method. For perception tasks, our approach decreases the number of categories that include misclassified samples by 2 and improves contact area segmentation IoU by 3.53%. For manipulation tasks, our method attains a high success rate of 92.59% in slip detection. Code, dataset and demo videos are available at the project website https://gitouni.github.io/Marker-Markerless-Transition/
|
|
11:20-11:25, Paper TuBT19.2 | Add to My Program |
Self-Mixing Laser Interferometry for Robotic Tactile Sensing |
|
Proesmans, Remko | Ghent University |
Ward, Goossens | Ghent University |
Van den Stockt, Lowiek | OTIV |
Christiaen, Lowie | Ugent |
Wyffels, Francis | Ghent University |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Embedded Systems for Robotic and Automation
Abstract: Self-mixing interferometry (SMI) has been lauded for its sensitivity in detecting microvibrations, while requiring no physical contact with its target. In robotics, microvibrations have traditionally been interpreted as a marker for object slip, and recently as a salient indicator of extrinsic contact. We present the first-ever robotic fingertip making use of SMI for slip and extrinsic contact sensing. The design is validated through measurement of controlled vibration sources, both before and after encasing the readout circuit in its fingertip package. Then, the SMI fingertip is compared to acoustic sensing through four experiments. The results are distilled into a technology decision map. SMI was found to be more sensitive to subtle slip events and significantly more resilient against ambient noise. We conclude that the integration of SMI in robotic fingertips offers a new, promising branch of tactile sensing in robotics. Design and data files are available at https://github.com/RemkoPr/icra2025-SMI-tactile-sensing.
|
|
11:25-11:30, Paper TuBT19.3 | Add to My Program |
Estimating High-Resolution Neural Stiffness Fields Using Visuotactile Sensors |
|
Han, Jiaheng | University of Illinois Urbana-Champaign |
Yao, Shaoxiong | University of Illinois Urbana-Champaign |
Hauser, Kris | University of Illinois at Urbana-Champaign |
Keywords: Force and Tactile Sensing
Abstract: High-resolution visuotactile sensors provide detailed contact information that is promising to infer the physical properties of objects in contact. This paper introduces a novel technique for high-resolution stiffness estimation of heterogeneous deformable objects using the Punyo bubble sensor. We developed an observation model for dense contact forces to estimate object stiffness using a visuotactile sensor and a dense force estimator. Additionally, we propose a neural volumetric stiffness field (VSF) formulation that represents stiffness as a continuous function, which allows dynamic point sampling at visuotactile sensor observation resolution. The neural VSF significantly reduces artifacts commonly associated with traditional point-based methods, particularly in stiff inclusion estimation and heterogeneous stiffness estimation. We further apply our method in a blind localization task, where objects within opaque bags are accurately modeled and localized, demonstrating the superior performance of neural VSF compared to existing techniques.
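A neural stiffness field can be pictured as a small coordinate MLP with a non-negative output, fitted so that Hooke-like point forces match the dense force estimate from the visuotactile sensor; the sketch below follows that simplified reading, and the network size and observation model are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

class StiffnessField(nn.Module):
    """Continuous volumetric stiffness field k(x) >= 0, represented by a small MLP."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),   # stiffness must be non-negative
        )

    def forward(self, points):
        return self.net(points).squeeze(-1)        # (N,) stiffness values

def stiffness_fit_loss(field, contact_points, displacements, observed_forces):
    """Fit the field so predicted point forces match the sensor's dense force estimate.

    The Hooke-like per-point force model f_i = k(x_i) * d_i is a simplification
    used here only for illustration.
    """
    k = field(contact_points)                      # (N,)
    predicted = k.unsqueeze(-1) * displacements    # (N, 3)
    return ((predicted - observed_forces) ** 2).mean()
```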
|
|
11:30-11:35, Paper TuBT19.4 | Add to My Program |
High-Resolution Reconstruction of Non-Planar Tactile Patterns from Low-Resolution Taxel-Based Tactile Sensors |
|
Zhou, Chen | The University of Hong Kong |
Zhao, He | Dalian University of Technology |
Liu, Qian | Dalian University of Technology |
Keywords: Force and Tactile Sensing, Contact Modeling
Abstract: Over the past decades, tactile sensors have gained increasing attention and have gradually become fundamental devices for robots. Especially in today's context where human-robot interaction demands are growing and the requirements for tactile perception are becoming stricter, how to enable robots to better perceive their environment has become a topic worth discussing. Tactile sensors, after years of development, have evolved into two main types: taxel-based and vision-based sensors, where the former provides relatively low-resolution (LR) tactile patterns compared with the latter. Both of them have seen significant enhancements in their tactile perception capabilities on flat and regular surfaces. However, as application scenarios expand, current flat tactile perception can no longer meet the robots' needs for multi-dimensional and complex perception capabilities. Therefore, we investigate the high-resolution (HR) reconstruction of non-planar tactile patterns captured by LR taxel-based sensors in this paper. We first develop a new dataset, where the ground-truth non-planar tactile patterns are obtained with a vision-based GelSight Mini tactile sensor, and the LR data are collected via a commercial taxel-based Xela sensor. In addition, we propose to adapt the state-of-the-art CNN- and GAN-based tactile super-resolution model of flat/planar surfaces to the non-planar scenario, and also develop a diffusion-based model for the non-planar HR reconstruction. Experimental results confirm the efficiency of the proposed models.
|
|
11:35-11:40, Paper TuBT19.5 | Add to My Program |
Blind Tactile Exploration for Surface Reconstruction |
|
Sinha, Yashaswi | Indian Institute of Science, Bengaluru |
Bhattacharya, Soumojit | Indian Institute of Technology Kharagpur |
Sahu, Yash Kumar | Indian Institute of Science, Bengaluru |
Biswas, Pradipta | Indian Institute of Science |
Keywords: Force and Tactile Sensing, Manipulation Planning
Abstract: Accurate 3D reconstruction capturing the fine details of an object’s shape is essential for tasks such as automated assembly, inspection, and quality control. Monocular cameras provide broad visual structure but often miss critical surface details and depth accuracy in underexposed or occluded environments. Tactile sensors offer precise, localized depth information, capturing fine textures, yet exploring varied curvature surfaces with only tactile input remains challenging. To address this, the paper proposes a blind surface exploration method for convex objects using a set of sequential controllers to efficiently guide the manipulator's interaction with surfaces featuring sharp edge changes. This approach ensures precise tactile exploration, leading to highly detailed surface reconstruction. With the controller employed, the algorithm was able to move along the surface while maintaining contact along the surface normal and reconstruct the object with IoU as high as 91% for objects with sharp edges.
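As a toy picture of contact-following exploration, the sketch below computes the next probe direction by sliding along the local tangent while biasing slightly into the surface to keep contact; the actual sequential controllers handling sharp edge changes are not reproduced here, and the step size and bias factor are illustrative.

```python
import numpy as np

def next_probe_direction(surface_normal, travel_dir, step=0.002, press=0.1):
    """One step of a simple contact-following exploration (illustrative only).

    surface_normal: estimated outward surface normal at the current contact
    travel_dir:     desired direction of travel along the object
    Returns a small displacement that slides along the tangent plane while
    pressing slightly into the surface to maintain contact.
    """
    n = surface_normal / np.linalg.norm(surface_normal)
    t = travel_dir - np.dot(travel_dir, n) * n      # project onto the tangent plane
    t = t / (np.linalg.norm(t) + 1e-9)
    return step * (t - press * n)                    # mostly tangential, slightly inward
```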
|
|
11:40-11:45, Paper TuBT19.6 | Add to My Program |
Graph-Structured Super-Resolution for Geometry-Generalized Tomographic Tactile Sensing: Application to Humanoid Faces |
|
Park, Hyunkyu | Samsung Advanced Institute of Technology |
Kim, Woojong | KAIST |
Jeon, Sangha | Korea Advanced Institute of Science and Technology(KAIST) |
Na, Youngjin | Sookmyung Women's University |
Kim, Jung | KAIST |
Keywords: Force and Tactile Sensing, Deep Learning in Robotics and Automation, Sensor-based Control, Tomographic Reconstruction
Abstract: Electrical impedance tomographic (EIT) tactile sensing holds great promise for whole-body coverage of contact-rich robotic systems, offering extensive flexibility in sensor geometry. However, low spatial resolution restricts its practical use, despite the existing deep-learning-based reconstruction methods. This study introduces EIT-GNN, a graph-structured data-driven EIT reconstruction framework that achieves super-resolution in large-area tactile perception on unbounded form factors of robots. EIT-GNN represents the arbitrary sensor shape as mesh connections, then employs a two-fold architecture of a transformer encoder and a graph convolutional neural network to best exploit this geometric prior knowledge, resulting in an accurate, generalized, and parameter-efficient reconstruction procedure. As a proof-of-concept, we demonstrate its application using large-area face-shaped sensor hardware, which represents one of the most complex geometries in human/humanoid anatomy. An extensive set of experiments, including a simulation study, ablation analysis, a single-touch indentation test, and latent feature analysis, confirms its superiority over alternative models. The beneficial features of the approach are demonstrated through its application in active tactile-servo control of humanoid head motion, paving a new way for integrating tactile sensors with intricate designs into robotic systems.
|
|
TuBT20 Regular Session, 408 |
Add to My Program |
Design and Robust Control |
|
|
Chair: Isler, Volkan | University of Minnesota |
Co-Chair: Tan, Xiaobo | Michigan State University |
|
11:15-11:20, Paper TuBT20.1 | Add to My Program |
Control Reallocation Using Deep Reinforcement Learning for Actuator Fault Recovery of an Autonomous Underwater Vehicle |
|
Lagattu, Katell | ENSTA |
Artusi, Eva | Naval Group |
Santos, Paulo E. | Priori Analytica |
Sammut, Karl | Flinders University |
Le Chenadec, Gilles | ENSTA |
Clement, Benoit | ENSTA, Institut Polytechnique De Paris |
Keywords: Robust/Adaptive Control, Reinforcement Learning, Autonomous Agents
Abstract: Actuator faults in dynamic systems pose significant challenges, particularly for robotic systems operating in hostile environments such as Autonomous Underwater Vehicles (AUVs), risking loss of stability and performance degradation. Fault Tolerant Control (FTC) strategies, including Control Reallocation (CR), have been developed to mitigate such risks. However, these strategies extensively depend on explicit fault diagnosis, which may present challenges regarding computational demands and efficiency, particularly when dealing with unknown faults. This paper presents a novel method that performs CR with Deep Reinforcement Learning (DRL) for actuator fault recovery without explicit fault diagnosis. The approach is implemented on a BlueROV2 underwater vehicle and demonstrates improved performance for fault recovery compared to a standard Proportional-Integral-Derivative (PID) controller and a variable gain PID controller, both in simulation and in real-world conditions. The DRL-based CR method demonstrates generalisability by successfully handling faults not encountered during training, highlighting its adaptability to unforeseen circumstances.
|
|
11:20-11:25, Paper TuBT20.2 | Add to My Program |
A New Framework for Repetitive Control of Robot Manipulators Via Operator-Theoretic Robust Stabilization |
|
Song, Geun Il | Postech |
Kim, Jung Hoon | Pohang University of Science and Technology |
Keywords: Robust/Adaptive Control, Motion Control, Industrial Robots
Abstract: This paper establishes a new framework for repetitive control of uncertain robot manipulators via operator-theoretic robust stabilization. After applying the inverse dynamics approach to robot manipulators, by which the relevant nonlinear input/output behavior is converted to a linear time-invariant (LTI) equation, we take the repetitive control approach. Even though such a repetitive controller is known to achieve high performance for periodic reference inputs, it is quite difficult to derive the stability analysis of the resulting closed-loop systems in a rigorous fashion. To resolve this difficulty, we construct an operator-theoretic approach to the repetitive control treatment and show that the closed-loop systems are exponentially stable if and only if the spectral radius of the relevant monodromy operator is less than 1. Based on this necessary and sufficient condition, we develop a guideline for selecting the relevant control parameters. Finally, experimental results are given to demonstrate the overall arguments developed in this paper.
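For readers unfamiliar with the stability test mentioned in the abstract, the following minimal sketch shows how, in a finite-dimensional approximation, the exponential-stability condition reduces to checking that the spectral radius of the state-transition map over one reference period is below 1. The closed-loop matrix and period length are placeholders, not values from the paper.

```python
import numpy as np

def spectral_radius(M):
    # Largest eigenvalue magnitude of a (finite-dimensional) monodromy approximation.
    return float(np.max(np.abs(np.linalg.eigvals(M))))

# Hypothetical discretized closed-loop step map and reference-period length.
A_cl = np.array([[0.9, 0.1],
                 [0.0, 0.8]])
N = 50                                          # samples per reference period
monodromy = np.linalg.matrix_power(A_cl, N)     # state transition over one period

rho = spectral_radius(monodromy)
print(f"spectral radius over one period: {rho:.3e} -> exponentially stable: {rho < 1.0}")
```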
|
|
11:25-11:30, Paper TuBT20.3 | Add to My Program |
Learning Robust Policies Via Interpretable Hamilton-Jacobi Reachability-Guided Disturbances |
|
Hu, Hanyang | Simon Fraser University |
Zhang, Xilun | Carnegie Mellon University |
Lyu, Xubo | Simon Fraser University |
Chen, Mo | Simon Fraser University |
Keywords: Robust/Adaptive Control, Reinforcement Learning
Abstract: Deep Reinforcement Learning (RL) has shown remarkable success in robotics with complex and heterogeneous dynamics. However, its vulnerability to unknown disturbances and adversarial attacks remains a significant challenge. In this paper, we propose a robust policy training framework that integrates model-based control principles with adversarial RL training to improve robustness without the need for external black-box adversaries. Our approach introduces a novel Hamilton-Jacobi reachability-guided disturbance for adversarial RL training, where we use interpretable worst-case or near-worst-case disturbances as adversaries against the robust policy. We evaluated its effectiveness across three distinct tasks: a reach-avoid game in both simulation and real-world settings, and a highly dynamic quadrotor stabilization task in simulation. We validate that our learned critic network is consistent with the ground-truth HJ value function, while the policy network shows comparable performance with other learning-based methods.
|
|
11:30-11:35, Paper TuBT20.4 | Add to My Program |
Optimal Fault-Tolerant Control for Tugboats Robust Path Following in Nearshore |
|
Shi, Jiangteng | Hainan University |
Zhang, Jun | Graz University of Technology |
Chen, Yujing | HaiNan University |
Ren, Jia | Hainan University |
Keywords: Robust/Adaptive Control, Marine Robotics, Reinforcement Learning
Abstract: External ocean disturbances (EODs) and internal thruster loss-of-effectiveness faults (ITLEFs) are key factors influencing the accuracy of an autonomous tugboat's path following, as well as the stability and safety of the tugboat's hull during maritime operations. To achieve robust path following for the autonomous tugboat, this paper proposes an optimal fault-tolerant control scheme. First, we formulate the robust path following of the tugboat as an optimal fault-tolerant control problem. A matrixed error system for the control scheme is constructed to uniformly account for both EODs and ITLEFs. Second, considering the time and economic costs associated with algorithm deployment and tuning on real-world tugboats, we present an adaptive dynamic programming algorithm, characterized by ease of tuning, to solve the proposed optimal fault-tolerant control problem. The stability of the control system is then proven based on the Lyapunov criterion. Finally, the proposed control scheme is evaluated under practical conditions with EODs and ITLEFs. Comparative results against a backstepping-based control scheme demonstrate that the proposed scheme exhibits greater robustness for path following under EODs and ITLEFs.
|
|
11:35-11:40, Paper TuBT20.5 | Add to My Program |
Neural L1 Adaptive Control of Vehicle Lateral Dynamics |
|
Mukherjee, Pratik | Florida Atlantic University |
Gonultas, Burak M | University of Minnesota |
Poyrazoglu, Oguzhan Goktug | University of Minnesota |
Isler, Volkan | University of Minnesota |
Keywords: Robust/Adaptive Control, Machine Learning for Robot Control, Autonomous Vehicle Navigation
Abstract: We address the problem of stable and robust control of vehicles with lateral error dynamics for the application of lane keeping. Lane departure is the primary reason for half of the fatalities in road accidents, making the development of stable, adaptive and robust controllers a necessity. Traditional linear feedback controllers achieve satisfactory tracking performance; however, they exhibit unstable behavior when uncertainties are induced into the system. Any disturbance or uncertainty introduced to the steering-angle input can be catastrophic for the vehicle. Therefore, controllers must be developed to actively handle such uncertainties. In this work, we introduce a Neural L1 Adaptive controller which learns the uncertainties in the lateral error dynamics of a front-steered Ackermann vehicle and guarantees stability and robustness. Our contributions are threefold: i) we extend the theoretical results for guaranteed stability and robustness of conventional L1 Adaptive controllers to the Neural L1 Adaptive controller; ii) we implement a Neural L1 Adaptive controller for the lane keeping application which accurately learns uncertainties in the dynamics; iii) we evaluate the performance of the Neural L1 Adaptive controller on a physics-based simulator, PyBullet, and conduct extensive real-world experiments with the F1TENTH platform to demonstrate its superior reference trajectory tracking performance compared to other state-of-the-art controllers in the presence of uncertainties.
|
|
11:40-11:45, Paper TuBT20.6 | Add to My Program |
Mechanical Design and Data-Enabled Predictive Control of a Planar Soft Robot |
|
Wang, Huanqing | Michigan State University |
Zhang, Kaixiang | Michigan State University |
Lee, Kyungjoon | University of California Riverside |
Mei, Yu | Michigan State University |
Zhu, Keyi | Michigan State University |
Srivastava, Vaibhav | Michigan State University |
Sheng, Jun | University of California Riverside |
Li, Zhaojian | Michigan State University |
Keywords: Modeling, Control, and Learning for Soft Robots, Optimization and Optimal Control, Soft Sensors and Actuators
Abstract: Soft robots offer a unique combination of flexibility, adaptability, and safety, making them well-suited for a diverse range of applications. However, the inherent complexity of soft robots poses great challenges in their modeling and control. In this paper, we present the mechanical design and data-driven control of a pneumatic-driven soft planar robot. Specifically, we employ a data-enabled predictive control (DeePC) strategy that directly utilizes system input/output data to achieve safe and optimal control, eliminating the need for tedious system identification or modeling. In addition, a dimension reduction technique is introduced into the DeePC framework, resulting in significantly enhanced computational efficiency with minimal to no degradation in control performance. Comparative experiments are conducted to validate the efficacy of DeePC in the control of the fabricated soft robot.
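As background for the data-enabled predictive control (DeePC) strategy mentioned in the abstract, the sketch below shows the standard construction of block-Hankel matrices from recorded input/output data, which is the data structure DeePC optimizes over in place of a model. The dimensions, names (block_hankel, T_ini, N_pred), and random data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def block_hankel(w, L):
    """Block-Hankel matrix with L block rows from a data sequence w of shape (T, m)."""
    T, m = w.shape
    cols = T - L + 1
    H = np.zeros((L * m, cols))
    for i in range(L):
        H[i * m:(i + 1) * m, :] = w[i:i + cols, :].T
    return H

# Hypothetical recorded input/output data from the soft robot (T samples).
T, n_u, n_y = 200, 3, 2
u_data = np.random.randn(T, n_u)
y_data = np.random.randn(T, n_y)

T_ini, N_pred = 5, 20                        # past window and prediction horizon
U = block_hankel(u_data, T_ini + N_pred)     # rows split into "past" and "future" blocks
Y = block_hankel(y_data, T_ini + N_pred)
Up, Uf = U[:T_ini * n_u], U[T_ini * n_u:]
Yp, Yf = Y[:T_ini * n_y], Y[T_ini * n_y:]
# DeePC then optimizes a vector g so that [Up; Yp; Uf] g matches the measured
# initial trajectory and the planned inputs, and predicts outputs as Yf @ g.
```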
|
|
TuBT21 Regular Session, 410 |
Add to My Program |
Reinforcement Learning 2 |
|
|
Chair: Qi, Xinda | Michigan State University |
Co-Chair: Jia, Yunyi | Clemson University |
|
11:15-11:20, Paper TuBT21.1 | Add to My Program |
Efficient Imitation without Demonstrations Via Value-Penalized Auxiliary Control from Examples |
|
Ablett, Trevor | University of Toronto |
Chan, Bryan | University of Alberta |
Wang, Jayce Haoran | University of Toronto |
Kelly, Jonathan | University of Toronto |
Keywords: Reinforcement Learning, Imitation Learning, Learning from Demonstration
Abstract: Common approaches to providing feedback in reinforcement learning are the use of hand-crafted rewards or full-trajectory expert demonstrations. Alternatively, one can use examples of completed tasks, but such an approach can be extremely sample inefficient. We introduce value-penalized auxiliary control from examples (VPACE), an algorithm that significantly improves exploration in example-based control by adding examples of simple auxiliary tasks and an above-success level value penalty. Across both simulated and real robotic environments, we show that our approach substantially improves learning efficiency for challenging tasks, while maintaining bounded value estimates. Preliminary results also suggest that VPACE may learn more efficiently than the more common approaches of using full trajectories or true sparse rewards. Project site: https://papers.starslab.ca/vpace/.
|
|
11:20-11:25, Paper TuBT21.2 | Add to My Program |
QuasiNav: Asymmetric Cost-Aware Navigation Planning with Constrained Quasimetric Reinforcement Learning |
|
Hossain, Jumman | University of Maryland Baltimore County |
Faridee, Abu-Zaher | University of Maryland Baltimore County, USA |
Asher, Derrik | DEVCOM Army Research Lab, USA |
Freeman, Jade | DEVCOM Army Research Lab, USA |
Gregory, Timothy | DEVCOM Army Research Lab, USA |
Trout, Theron T. | Stormfish Scientific Corp |
Roy, Nirmalya | University of Maryland Baltimore County, USA |
Keywords: Reinforcement Learning, Autonomous Vehicle Navigation, Motion and Path Planning
Abstract: Autonomous navigation in unstructured outdoor environments is inherently challenging due to the presence of asymmetric traversal costs, such as varying energy expenditures for uphill versus downhill movement. Traditional reinforcement learning methods often assume symmetric costs, which can lead to suboptimal navigation paths and increased safety risks in real-world scenarios. In this paper, we introduce QuasiNav, a novel reinforcement learning framework that integrates quasimetric embeddings to explicitly model asymmetric costs and guide efficient, safe navigation. QuasiNav formulates the navigation problem as a constrained Markov decision process (CMDP) and employs quasimetric embeddings to capture directionally dependent costs, allowing for a more accurate representation of the terrain. We combine this approach with adaptive constraint tightening. This ensures that safety constraints are dynamically enforced during learning. We validate QuasiNav on a Clearpath Jackal robot in three challenging navigation scenarios—undulating terrains, asymmetric hill traversal, and directionally dependent terrain traversal—demonstrating its effectiveness in both simulated and real-world environments. Experimental results show that QuasiNav significantly outperforms conventional methods, achieving higher success rates, improved energy efficiency (13.6% reduction in energy consumption compared to baseline methods), and better adherence to safety constraints.
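A quasimetric, as used in the abstract above, is a distance that keeps the triangle inequality but drops symmetry, so uphill and downhill traversal can be priced differently. The snippet below is a minimal illustration of one standard way to build such an asymmetric distance from embeddings; the embedding values are invented for illustration and are not taken from QuasiNav.

```python
import numpy as np

def quasimetric(fx, fy):
    # d(x, y) = sum_i max(0, f_i(x) - f_i(y)) is zero on the diagonal and satisfies
    # the triangle inequality, but is not symmetric: d(x, y) != d(y, x) in general.
    return float(np.maximum(fx - fy, 0.0).sum())

f_hill = np.array([1.0, 0.2])    # hypothetical embedding of a hilltop state
f_valley = np.array([0.3, 0.1])  # hypothetical embedding of a valley state
print(quasimetric(f_hill, f_valley), quasimetric(f_valley, f_hill))  # asymmetric costs
```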
|
|
11:25-11:30, Paper TuBT21.3 | Add to My Program |
Learning a High-Quality Robotic Wiping Policy Using Systematic Reward Analysis and Visual-Language Model Based Curriculum |
|
Liu, Yihong | Georgia Institute of Technology |
Kang, Dongyeop | ETRI |
Ha, Sehoon | Georgia Institute of Technology |
Keywords: Reinforcement Learning, AI-Enabled Robotics, Force Control
Abstract: Autonomous robotic wiping is an important task in various industries, ranging from industrial manufacturing to sanitization in healthcare. Deep reinforcement learning (Deep RL) has emerged as a promising approach; however, it often suffers from a high demand for repetitive reward engineering. Instead of relying on manual tuning, we first analyze the convergence of quality-critical robotic wiping, which requires both high-quality wiping and fast task completion, show the poor convergence of the problem, and propose a new bounded reward formulation to make the problem feasible. Then, we further improve the learning process by proposing a novel visual-language model (VLM) based curriculum, which actively monitors progress and suggests hyperparameter tuning. We demonstrate that the combined method can find a desirable wiping policy on surfaces with various curvatures, frictions, and waypoints, which cannot be learned with the baseline formulation. The demo of this project can be found at: https://sites.google.com/view/highqualitywiping
|
|
11:30-11:35, Paper TuBT21.4 | Add to My Program |
Actor-Critic Cooperative Compensation to Model Predictive Control for Off-Road Autonomous Vehicles under Unknown Dynamics |
|
Gupta, Prakhar | Clemson University |
Smereka, Jonathon M. | U.S. Army TARDEC |
Jia, Yunyi | Clemson University |
Keywords: Machine Learning for Robot Control, Motion Control, Autonomous Vehicle Navigation
Abstract: This study presents an Actor-Critic Cooperative Compensated Model Predictive Controller (AC3MPC) designed to address unknown system dynamics. To avoid the difficulty of modeling highly complex dynamics and to ensure real-time control feasibility and performance, this work uses deep reinforcement learning with a model predictive controller in a cooperative framework to handle unknown dynamics. The model-based controller takes on the primary role, and both controllers are provided with predictive information about the other. This improves tracking performance and retains the inherent robustness of the model predictive controller. We evaluate this framework for off-road autonomous driving on unknown deformable terrains that represent sandy deformable soil, sandy and rocky soil, and cohesive clay-like deformable soil. Our findings demonstrate that our controller statistically outperforms standalone model-based and learning-based controllers by up to 29.2% and 10.2%, respectively. The framework generalizes well over varied and previously unseen terrain characteristics to track longitudinal reference speeds with lower errors. Furthermore, it requires significantly less training data than a purely learning-based controller, while delivering better performance even when under-trained.
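The cooperative structure described above can be summarized schematically as a model-based action plus a learned compensation term. The fragment below is only a schematic of that composition with assumed interfaces (mpc_solve, rl_policy); it is not the authors' code.

```python
def cooperative_control(x, ref, mpc_solve, rl_policy):
    # The model-based controller is primary; both controllers see predictive
    # information about the other, and the learned term compensates for the
    # unmodeled (e.g., deformable-terrain) dynamics.
    u_mpc, predicted_traj = mpc_solve(x, ref)           # nominal model-based action
    u_comp = rl_policy(x, ref, u_mpc, predicted_traj)   # learned compensation term
    return u_mpc + u_comp
```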
|
|
11:35-11:40, Paper TuBT21.5 | Add to My Program |
Soft Actor-Critic-Based Control Barrier Adaptation for Robust Autonomous Navigation in Unknown Environments |
|
Mohammad, Nicholas | University of Virginia |
Bezzo, Nicola | University of Virginia |
Keywords: Machine Learning for Robot Control, Motion and Path Planning, Collision Avoidance
Abstract: Motion planning failures during autonomous navigation often occur when safety constraints are either too conservative, leading to deadlocks, or too liberal, resulting in collisions. To improve robustness, a robot must dynamically adapt its safety constraints to ensure it reaches its goal while balancing safety and performance measures. To this end, we propose a Soft Actor-Critic (SAC)-based model for adapting Control Barrier Function (CBF) constraint parameters at runtime, ensuring safe yet non-conservative motion. The proposed approach is designed for a general high-level motion planner, low-level controller, and target system model, and is trained in simulation only. Through extensive simulations and physical experiments, we demonstrate that our framework effectively adapts CBF constraints, enabling the robot to reach its final goal without compromising safety.
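As context for the control barrier function (CBF) parameters that the SAC agent adapts in the abstract above, the following toy example shows a one-constraint CBF safety filter for a single-integrator robot and a circular obstacle. The class-K gain alpha is the kind of parameter such an adaptation scheme would tune (larger alpha is less conservative near the obstacle, smaller alpha is safer but risks deadlock). The scenario and numbers are invented for illustration, not the paper's setup.

```python
import numpy as np

def cbf_filter(x, u_nom, x_obs, r, alpha):
    # Single-integrator robot, circular obstacle: h(x) = ||x - x_obs||^2 - r^2 >= 0.
    h = float(np.dot(x - x_obs, x - x_obs)) - r ** 2
    grad_h = 2.0 * (x - x_obs)
    # Enforce grad_h . u + alpha * h >= 0 with the smallest change to u_nom
    # (closed-form solution of the one-constraint CBF quadratic program).
    slack = float(np.dot(grad_h, u_nom)) + alpha * h
    if slack >= 0.0:
        return u_nom
    return u_nom - slack * grad_h / float(np.dot(grad_h, grad_h))

x = np.array([1.0, 0.0])                                  # robot position
u_nom = np.array([-1.0, 0.0])                             # nominal command toward obstacle
u_safe = cbf_filter(x, u_nom, np.array([0.0, 0.0]), 0.5, alpha=1.0)
print(u_safe)                                             # filtered, less aggressive command
```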
|
|
11:40-11:45, Paper TuBT21.6 | Add to My Program |
Multi-Task Reinforcement Learning for Quadrotors |
|
Xing, Jiaxu | University of Zurich |
Geles, Ismail | Robotics and Perception Group, University of Zurich |
Song, Yunlong | University of Zurich |
Aljalbout, Elie | University of Zurich |
Scaramuzza, Davide | University of Zurich |
Keywords: Reinforcement Learning, Machine Learning for Robot Control, Aerial Systems: Perception and Autonomy
Abstract: Reinforcement learning (RL) has shown great effectiveness in quadrotor control, enabling specialized policies to develop outstanding, even human-level, performance in single-task scenarios. However, these specialized policies often struggle with novel tasks, requiring a complete retraining of the policy from scratch. This limitation is particularly challenging in various real-world applications such as search and rescue or infrastructure inspection, where quick and efficient adaptation to diverse tasks is crucial. To address this limitation, we propose a novel multi-task reinforcement learning (MTRL) framework for multiple quadrotor control tasks. Quadrotor control tasks have fundamental similarities based on the consistent physical properties and dynamics of the platform itself. We leverage these similarities and propose an MTRL approach based on an efficient knowledge-sharing framework. Our approach significantly improves the sample efficiency compared to learning tasks individually without compromising task performance. As a result, our approach produces a single high-performance policy capable of executing complex maneuvers such as stabilizing from high speed, velocity tracking, and autonomous racing. Our experimental results, validated both in simulation and real-world scenarios, demonstrate that our framework outperforms baseline approaches in terms of sample efficiency and overall task performance.
|
|
TuBT22 Regular Session, 411 |
Add to My Program |
Learning for Navigation |
|
|
Chair: Song, Daeun | George Mason University |
Co-Chair: Kuipers, Benjamin | University of Michigan |
|
11:15-11:20, Paper TuBT22.1 | Add to My Program |
Watch Your STEPP: Semantic Traversability Estimation Using Pose Projected Features |
|
Aegidius, Sebastian | University College London |
Hadjivelichkov, Denis | University College London |
Jiao, Jianhao | University College London |
Embley-Riches, Jonathan | University College London |
Kanoulas, Dimitrios | University College London |
Keywords: Vision-Based Navigation, Motion and Path Planning, Legged Robots
Abstract: Understanding the traversability of terrain is essential for autonomous robot navigation, particularly in unstructured environments such as natural landscapes. Although traditional methods, such as occupancy mapping, provide a basic framework, they often fail to account for the complex mobility capabilities of some platforms such as legged robots. In this work, we propose a method for estimating terrain traversability by learning from demonstrations of human walking. Our approach leverages dense, pixel-wise feature embeddings generated using the DINOv2 vision Transformer model, which are processed through an encoder-decoder MLP architecture to analyze terrain segments. The averaged feature vectors, extracted from the masked regions of interest, are used to train the model in a reconstruction-based framework. By minimizing reconstruction loss, the network distinguishes between familiar terrain with a low reconstruction error and unfamiliar or hazardous terrain with a higher reconstruction error. This approach facilitates the detection of anomalies, allowing a legged robot to navigate more effectively through challenging terrain. We run real-world experiments on the ANYmal legged robot, both indoors and outdoors, to validate our proposed method. The code is open-source, while video demonstrations can be found on our website: https://rpl-cs-ucl.github.io/STEPP
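The reconstruction-based scoring described above boils down to a reconstruction-error test on a feature vector. The snippet below is a toy illustration of that idea, with an identity placeholder standing in for the trained encoder-decoder MLP and an assumed 384-dimensional feature; it is not the authors' model.

```python
import numpy as np

def reconstruction_error(feature, reconstruct):
    # `reconstruct` stands for the encoder-decoder MLP trained only on terrain the
    # human demonstrator actually walked on; familiar terrain reconstructs well,
    # so a low error indicates traversable terrain and a high error an anomaly.
    recon = reconstruct(feature)
    return float(np.mean((feature - recon) ** 2))

familiar = np.ones(384)                                  # placeholder DINOv2-style feature
print(reconstruction_error(familiar, lambda f: f))       # identity model -> zero error
```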
|
|
11:20-11:25, Paper TuBT22.2 | Add to My Program |
GND: Global Navigation Dataset with Multi-Modal Perception and Multi-Category Traversability in Outdoor Campus Environments |
|
Liang, Jing | University of Maryland |
Das, Dibyendu | George Mason University |
Song, Daeun | George Mason University |
Shuvo, Md Nahid Hasan | George Mason University |
Durrani, Mohammad | University of Maryland |
Taranath, Karthik | University of Maryland |
Penskiy, Ivan | University of Maryland, College Park |
Manocha, Dinesh | University of Maryland |
Xiao, Xuesu | George Mason University |
Keywords: Data Sets for Robot Learning, Motion and Path Planning, Integrated Planning and Learning
Abstract: Navigating large-scale outdoor environments, e.g., university campuses, requires complex reasoning in terms of geometric structures, environmental semantics, and terrain characteristics using onboard sensors like LiDARs and cameras. Although existing mobile robots can navigate such environments using pre-defined, high-precision maps based on hand-crafted rules catered for every environment, they lack commonsense reasoning capabilities that most humans possess when navigating unknown outdoor spaces. To equip robots with such capabilities, we propose a large-scale Global Navigation Dataset, GND, which incorporates multi-modal sensory data (3D LiDAR point clouds and RGB and 360° images) and multi-category traversability maps (pedestrian walkways, vehicle roadways, stairs, off-road terrain, and obstacles) from ten university campuses. We also present a set of novel use cases of GND to showcase its utility to enable global robot navigation. GND's website can be found at https://cs.gmu.edu/~xiao/Research/GND/.
|
|
11:25-11:30, Paper TuBT22.3 | Add to My Program |
VLM-GroNav: Robot Navigation Using Physically Grounded Vision-Language Models in Outdoor Environments |
|
Elnoor, Mohamed | University of Maryland |
Kulathun Mudiyanselage, Kasun Weerakoon | University of Maryland, College Park |
Seneviratne, Gershom Devake | University of Maryland, College Park |
Xian, Ruiqi | University of Maryland-College Park |
Guan, Tianrui | University of Maryland |
M Jaffar, Mohamed Khalid | University of Maryland, College Park |
Rajagopal, Vignesh | University of Maryland, College Park |
Manocha, Dinesh | University of Maryland |
Keywords: Vision-Based Navigation, Motion and Path Planning, Perception-Action Coupling
Abstract: We present a novel autonomous robot navigation algorithm for outdoor environments that is capable of handling diverse terrain traversability conditions. Our approach, VLM-GroNav, uses vision-language models (VLMs) and integrates them with physical grounding that is used to assess intrinsic terrain properties such as deformability and slipperiness. We use proprioceptive-based sensing, which provides direct measurements of these physical properties, and enhances the overall semantic understanding of the terrains. Our formulation uses in-context learning to ground the VLM’s semantic understanding with proprioceptive data to allow dynamic updates of traversability estimates based on the robot’s real-time physical interactions with the environment. We use the updated traversability estimations to inform both the local and global planners for real-time trajectory replanning. We validate our method on a legged robot (Ghost Vision 60) and a wheeled robot (Clearpath Husky), in diverse real-world outdoor environments with different deformable and slippery terrains. In practice, we observe significant improvements over state-of-the-art methods by up to 50% increase in navigation success rate.
|
|
11:30-11:35, Paper TuBT22.4 | Add to My Program |
TANGO: Traversability-Aware Navigation with Local Metric Control for Topological Goals |
|
Podgorski, Stefan | University of Adelaide |
Garg, Sourav | University of Adelaide |
Hosseinzadeh, Mehdi | The Australian Institute for Machine Learning (AIML) -- the Univ |
Mares, Lachlan | University of Adelaide |
Dayoub, Feras | The University of Adelaide |
Reid, Ian | University of Adelaide |
Keywords: Learning from Demonstration, Machine Learning for Robot Control, Vision-Based Navigation
Abstract: Visual navigation in robotics traditionally relies on globally-consistent 3D maps or learned controllers, which can be computationally expensive and difficult to generalize across diverse environments. In this work, we present a novel RGB-only, object-level topometric navigation pipeline that enables zero-shot, long-horizon robot navigation without requiring 3D maps or pre-trained controllers. Our approach integrates global topological path planning with local metric trajectory control, allowing the robot to navigate towards object-level subgoals while avoiding obstacles. We address key limitations of previous methods by continuously predicting local trajectory rollout using monocular depth and traversability estimation, and incorporating an auto-switching mechanism that falls back to a baseline controller when necessary. The system operates using foundational models, ensuring open-set applicability without the need for domain-specific fine-tuning. We demonstrate the effectiveness of our method in both simulated environments and real-world tests, highlighting its robustness and deployability. Our approach outperforms existing state-of-the-art methods, offering a more adaptable and effective solution for visual navigation in open-set environments. The source code is made publicly available: https://github.com/podgorki/TANGO
|
|
11:35-11:40, Paper TuBT22.5 | Add to My Program |
NavFormer: A Transformer Architecture for Robot Target-Driven Navigation in Unknown and Dynamic Environments |
|
Wang, Haitong | University of Toronto |
Tan, Aaron Hao | University of Toronto |
Nejat, Goldie | University of Toronto |
Keywords: Vision-Based Navigation, Search and Rescue Robots, AI-Enabled Robotics
Abstract: In unknown cluttered and dynamic environments such as disaster scenes, mobile robots need to perform target-driven navigation in order to find people or objects of interest, where the only information provided about these targets are images of the individual targets. In this paper, we introduce NavFormer, a novel end-to-end transformer architecture developed for robot target-driven navigation in unknown and dynamic environments. NavFormer leverages the strengths of both 1) transformers for sequential data processing and 2) self-supervised learning (SSL) for visual representation to reason about spatial layouts and to perform collision-avoidance in dynamic settings. The architecture uniquely combines dual-visual encoders consisting of a static encoder for extracting invariant environment features for spatial reasoning, and a general encoder for dynamic obstacle avoidance. The primary robot navigation task is decomposed into two sub-tasks for training: single robot exploration and multi-robot collision avoidance. We perform cross-task training to enable the transfer of learned skills to the complex primary navigation task. Simulated experiments demonstrate that NavFormer can effectively navigate a mobile robot in diverse unknown environments, outperforming existing state-of-the-art methods. A comprehensive ablation study is performed to evaluate the impact of the main design choices of NavFormer. Furthermore, real-world experiments validate the generalizability of NavFormer.
|
|
11:40-11:45, Paper TuBT22.6 | Add to My Program |
Learning Semantic Traversability with Egocentric Video and Automated Annotation Strategy |
|
Kim, Yunho | Neuromeka |
Lee, Jeong Hyun | Korea Advanced Institute of Science & Technology (KAIST) |
Lee, Choongin | KAIST |
Mun, Juhyeok | Korea Advanced Institute of Science and Technology |
Youm, Donghoon | Korea Advanced Institute of Science and Technology |
Park, Jeongsoo | KAIST |
Hwangbo, Jemin | Korean Advanced Institute of Science and Technology |
Keywords: Vision-Based Navigation, Semantic Scene Understanding, Deep Learning for Visual Perception
Abstract: For reliable autonomous robot navigation in urban settings, the robot must have the ability to identify semantically traversable terrains in the image based on the semantic understanding of the scene. This reasoning ability is based on semantic traversability, which is frequently achieved using semantic segmentation models fine-tuned on the testing domain. This fine-tuning process often involves manual data collection with the target robot and annotation by human labelers which is prohibitively expensive and unscalable. In this work, we present an effective methodology for training a semantic traversability estimator using egocentric videos and an automated annotation process. Egocentric videos are collected from a camera mounted on a pedestrian's chest. The dataset for training the semantic traversability estimator is then automatically generated by extracting semantically traversable regions in each video frame using a recent foundation model in image segmentation and its prompting technique. Extensive experiments with videos taken across several countries and cities, covering diverse urban scenarios, demonstrate the high scalability and generalizability of the proposed annotation method. Furthermore, performance analysis and real-world deployment for autonomous robot navigation showcase that the trained semantic traversability estimator is highly accurate, able to handle diverse camera viewpoints, computationally light, and real-world applicable. The summary video is available at https://youtu.be/EUVoH-wA-lA.
|
|
TuBT23 Regular Session, 412 |
Add to My Program |
Autonomous Vehicle Navigation 2 |
|
|
Chair: Zhou, Lifeng | Drexel University |
Co-Chair: Yang, Yi | Beijing Institute of Technology |
|
11:15-11:20, Paper TuBT23.1 | Add to My Program |
Categorical Traffic Transformer: Interpretable and Diverse Behavior Prediction with Tokenized Latent |
|
Chen, Yuxiao | Nvidia Research |
Tonkens, Sander | University of California - San Diego |
Pavone, Marco | Stanford University |
Keywords: Autonomous Vehicle Navigation, AI-Based Methods, Intelligent Transportation Systems
Abstract: Adept traffic models are critical to both real-time prediction/planning and closed-loop simulation for autonomous vehicles (AV). Key design objectives include accuracy, diverse multimodal behaviors, interpretability, and compatibility with other modules in the autonomy stack, e.g., the downstream planner. We present Categorical Traffic Transformer (CTT), a traffic model that outputs both continuous trajectory predictions and categorical predictions with clear semantic meanings (lane modes, homotopies, etc.). The most outstanding feature of CTT is its fully interpretable latent space, which enables direct supervision of the latent variables from the ground truth during training and avoids mode collapse completely. As a result, CTT can generate diverse behaviors conditioned on different semantic modes while significantly beating SOTA on prediction accuracy. In addition, CTT's ability to input and output tokens enables direct integration with semantic-heavy modules such as behavior planners and language models, bridging the tokenized representation and the continuous trajectory space.
|
|
11:20-11:25, Paper TuBT23.2 | Add to My Program |
LACNS: Language-Assisted Continuous Navigation in Structured Spaces |
|
Peng, RuTong | Beijing Institute of Technology |
Zhang, Yiqing | Beijing Institute of Technology |
Yang, Yi | Beijing Institute of Technology |
Fu, Mengyin | Beijing Institute of Technology |
Keywords: Autonomous Vehicle Navigation, Intelligent Transportation Systems
Abstract: Current autonomous driving technology typically relies on high-definition (HD) maps to ensure safe, reliable, and accurate navigation in urban environments. While these maps provide essential road information, their creation and maintenance are costly, limiting their widespread application. To mitigate this reliance, we propose a novel system, Language-Assisted Continuous Navigation in Structured Spaces (LACNS). LACNS facilitates autonomous driving without the need for HD maps by integrating vehicle-centric local perception with real-time language commands from map software or human navigators. LACNS begins by generating a BEV map using the vehicle's front-facing camera. Simultaneously, a pre-trained Visual Language Model (VLM) detects intersections from the camera images, assigning a score to each. Road elements are then extracted from the BEV map and combined with the intersection scores to identify potential navigation frontiers. Language instructions, processed by a pre-trained Large Language Model (LLM), are used to select the most suitable frontier. Finally, the chosen frontier and BEV map are employed to plan a safe route and control the vehicle's movement. We evaluated LACNS using the Carla simulator to validate its navigation capabilities in continuous spaces. Initial tests involved navigating through four intersections with varying directional commands, where LACNS demonstrated high and consistent success rates across multiple trials. Further simulations in real-time navigation scenarios revealed that LACNS consistently maintained a high success rate across three progressively challenging routes. These results highlight the effectiveness of our novel HD map-independent autonomous driving navigation method.
|
|
11:25-11:30, Paper TuBT23.3 | Add to My Program |
Decentralized Vehicle Coordination: The Berkeley DeepDrive Drone Dataset and Consensus-Based Models |
|
Wu, Fangyu | UC Berkeley |
Wang, Dequan | UC Berkeley |
Hwang, Minjune | Stanford University |
Hao, Chenhui | UC Berkeley |
Lu, Jiawei | UC Berkeley |
Zhang, Jiamu | UC Berkeley |
Chou, Christopher | UC Berkeley |
Darrell, Trevor | UC Berkeley |
Bayen, Alexandre | UC Berkeley |
Keywords: Autonomous Vehicle Navigation, Collision Avoidance, Distributed Robot Systems
Abstract: A significant portion of roads, particularly in densely populated developing countries, lacks explicitly defined right-of-way rules. These understructured roads pose substantial challenges for autonomous vehicle motion planning, where efficient and safe navigation relies on understanding decentralized human coordination for collision avoidance. This coordination, often termed "social driving etiquette," remains underexplored due to limited open-source empirical data and suitable modeling frameworks. In this paper, we present a novel dataset and modeling framework designed to study motion planning in these understructured environments. The dataset includes 20 aerial videos of representative scenarios, an image dataset for training vehicle detection models, and a development kit for vehicle trajectory estimation. We demonstrate that a consensus-based modeling approach can effectively explain the emergence of priority orders observed in our dataset, and is therefore a viable framework for decentralized collision avoidance planning.
|
|
11:30-11:35, Paper TuBT23.4 | Add to My Program |
CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction |
|
Suhwan, Choi | Maum.AI |
Cho, Yongjun | MaumAI |
Kim, Minchan | Seoul National University |
Jung, Jaeyoon | MAUM.AI |
Joe, Myunchul | MAUM.AI, Inc |
Park, Yu Been | MaumAI |
Kim, Minseo | Yonsei University |
Kim, Sungwoong | Yonsei University |
Lee, Sungjae | Yonsei University |
Park, Whiseong | Maumai |
Chung, Jiwan | Yonsei University |
Yu, Youngjae | Yonsei University |
Keywords: Autonomous Vehicle Navigation, Human-Robot Collaboration, Imitation Learning
Abstract: Real-life robot navigation involves more than just reaching a destination; it requires optimizing movements while addressing scenario-specific goals. An intuitive way for humans to express these goals is through abstract cues like verbal commands or rough sketches. Such human guidance may lack details or be noisy. Nonetheless, we expect robots to navigate as intended. For robots to interpret and execute these abstract instructions in line with human expectations, they must share a common understanding of basic navigation concepts with humans. To this end, we introduce CANVAS, a novel framework that combines visual and linguistic instructions for commonsense-aware navigation. Its success is driven by imitation learning, enabling the robot to learn from human navigation behavior. We present COMMAND, a comprehensive dataset with human-annotated navigation results, spanning over 48 hours and 219 km, designed to train commonsense-aware navigation systems in simulated environments. Our experiments show that CANVAS outperforms the strong rule-based system ROS NavStack across all environments, demonstrating superior performance with noisy instructions. Notably, in the orchard environment, where ROS NavStack records a 0% total success rate, CANVAS achieves a total success rate of 67%. CANVAS also closely aligns with human demonstrations and commonsense constraints, even in unseen environments. Furthermore, real-world deployment of CANVAS showcases impressive Sim2Real transfer with a total success rate of 69%, highlighting the potential of learning from human demonstrations in simulated environments for real-world applications.
|
|
11:35-11:40, Paper TuBT23.5 | Add to My Program |
BETTY Dataset: A Multi-Modal Dataset for Full-Stack Autonomy |
|
Nye, Micah | Carnegie Mellon University |
Raji, Ayoub | University of Modena and Reggio Emilia |
Saba, Andrew | Carnegie Mellon University |
Erlich, Eidan | University of Waterloo |
Exley, Robert | University of Pittsburgh |
Goyal, Aragya | University of Pittsburgh |
Matros, Alexander | University of Waterloo |
Misra, Ritesh | University of Pittsburgh |
Sivaprakasam, Matthew | Carnegie Mellon University |
Marko, Bertogna | Unimore |
Ramanan, Deva | Carnegie Mellon University |
Scherer, Sebastian | Carnegie Mellon University |
Keywords: Autonomous Vehicle Navigation, Data Sets for Robot Learning, Dynamics
Abstract: We present the BETTY dataset, a large-scale, multi-modal dataset collected on several autonomous racing vehicles, targeting supervised and self-supervised state estimation, dynamics modeling, motion forecasting, perception, and more. Existing large-scale datasets, especially autonomous vehicle datasets, focus primarily on supervised perception, planning, and motion forecasting tasks. Our work enables multi-modal, data-driven methods by including all sensor inputs and the outputs from the software stack, along with semantic metadata and ground truth information. The dataset encompasses 4 years of data, currently comprising over 13 hours and 32 TB, collected on autonomous racing vehicle platforms. This data spans 6 diverse racing environments, including high-speed oval courses, for single and multi-agent algorithm evaluation in feature-sparse scenarios, as well as high-speed road courses with high longitudinal and lateral accelerations and tight, GPS-denied environments. It captures highly dynamic states, such as 63 m/s crashes, loss of tire traction, and operation at the limit of stability. By offering a large breadth of cross-modal and dynamic data, the BETTY dataset enables the training and testing of full autonomy stack pipelines, pushing the performance of all algorithms to the limits. The current dataset is available at https://pitt-mit-iac.github.io/betty-dataset/.
|
|
11:40-11:45, Paper TuBT23.6 | Add to My Program |
LiCS: Navigation Using Learned-Imitation on Cluttered Space |
|
Damanik, Joshua Julian | KAIST |
Jung, Jaewon | KAIST |
Deresa, Chala Adane | KAIST |
Choi, Han-Lim | KAIST |
Keywords: Imitation Learning, Constrained Motion Planning, Autonomous Vehicle Navigation
Abstract: This work proposes a robust and fast navigation system for UGVs (Unmanned Ground Vehicles) in narrow indoor environments using 2D LiDAR. We use behavior cloning with a Transformer neural network to learn an optimization-based baseline algorithm. We inject Gaussian noise during expert demonstration to increase the robustness of the learned policy and evaluate the performance of LiCS using both simulation and hardware experiments. It outperforms all other baselines in terms of navigation performance, achieving a 100% success rate in highly cluttered simulated environments. During the hardware experiments, LiCS maintains safe navigation at a maximum speed of 1.5 m/s.
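The noise-injection trick mentioned above (perturb the expert's applied command, but label each observation with the clean expert action) can be sketched as follows. The gym-style environment interface and parameter values are assumptions for illustration, not the authors' setup.

```python
import numpy as np

def collect_noisy_demo(env, expert, sigma=0.1, steps=500):
    # Roll out the optimization-based expert while perturbing the applied command,
    # so the dataset also covers states slightly off the expert's trajectory and
    # the cloned policy learns how to recover from them.
    data = []
    obs = env.reset()
    for _ in range(steps):
        u_expert = expert(obs)
        u_applied = u_expert + np.random.normal(0.0, sigma, size=u_expert.shape)
        data.append((obs, u_expert))               # label with the clean action
        obs, _, done, _ = env.step(u_applied)      # gym-style interface assumed
        if done:
            obs = env.reset()
    return data
```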
|
|
TuBT24 Regular Session, 401 |
Add to My Program |
Rehabiliation and Ergonomics |
|
|
Chair: Clark, Janelle | UMBC |
Co-Chair: Mimnaugh, Katherine J. | University of Oulu |
|
11:15-11:20, Paper TuBT24.1 | Add to My Program |
Towards Industry 5.0 - a Neuroergonomic Workstation for Human-Centered Cobot-Supported Manual Assembly Process |
|
Knezevic, Nikola | University of Belgrade - School of Electrical Engineering |
Savić, Andrej | University of Belgrade, School of Electrical Engineering |
Gordić, Zaviša | University of Belgrade, School of Electrical Engineering |
Ajoudani, Arash | Istituto Italiano Di Tecnologia |
Jovanovic, Kosta | University of Belgrade |
Keywords: Human-Centered Automation, Human-Robot Collaboration, Assembly
Abstract: This paper introduces the concept of a neuroergonomic workcell with its essential components (psychological and physical assessment, and non-physical, physical, and strategic support) for improving the well-being and productivity of workers at their workplaces. A proof-of-concept neuroergonomic, human-centered workstation is demonstrated in a real factory environment for a typical laborious industrial task: assembly. The pilot workstation introduces a fully portable, non-invasive EEG-based assessment of users' mental workload, a non-obtrusive human-machine interface, illustrative graphical assembly guidelines, a cobot assistant, and an intelligent task scheduler. The subjects' performance and workload were assessed using a NASA-TLX questionnaire, three EEG workload indices, hand gesture detection accuracy, the number of errors, and task duration. We identified a notable correlation between multiple EEG workload indices and NASA-TLX scores. The new workstation boosts productivity with better performance and fewer errors on the assembly line while reducing mental demand. Its modular design ensures easy integration and adaptation into factory settings, optimizing manual assembly processes.
|
|
11:20-11:25, Paper TuBT24.2 | Add to My Program |
Remote Extended Reality with Markerless Motion Tracking for Sitting Posture Training |
|
Ai, Xupeng | Columbia University in the City of New York |
Agrawal, Sunil | Columbia University |
Keywords: Virtual Reality and Interfaces, Human Performance Augmentation, Rehabilitation Robotics
Abstract: Dynamic postural control during sitting is essential for functional mobility and daily activities. Extended reality (XR) presents a promising solution for posture training in addressing conventional training limitations related to patient accessibility and ecological validity. We developed a remote XR rehabilitation system with markerless motion tracking for sitting posture training. Forty-two healthy subjects participated in this proof-of-concept pilot study. Each subject completed 24 rounds of multi-directional reach tasks using the system and 24 rounds without it. Motion data were collected via online meetings using the built-in camera of the user's laptop. Functional reach test scores were analyzed to assess the impact of the system on motor performance. Four standard questionnaires were used to assess the effects of this system on presence, simulator sickness, engagement, and enjoyment. Our results indicate that the remote XR training system significantly improved functional reach performance and proved highly effective for telerehabilitation. XR interaction also enhanced training engagement and enjoyment. By bridging the spatial gap between patients and therapists, this system enables personalized and engaging home-based intervention. Additionally, it facilitates more natural movements by eliminating body marker constraints and laboratory limitations. This study should serve as a stepping stone to advancing novel remote XR rehabilitation systems.
|
|
11:25-11:30, Paper TuBT24.3 | Add to My Program |
Error-Subspace Transform Kalman Filter Based Real-Time Gait Prediction for Rehabilitation Exoskeletons |
|
Zeng, Haozhou | Zhejiang University |
Li, Jiaxing | Zhejiang University |
Gu, Yu | Zhejiang University |
Yi, Jingang | Rutgers University |
Ouyang, Xiaoping | Zhejiang University |
Liu, Tao | Zhejiang University |
Keywords: Prosthetics and Exoskeletons, Physical Human-Robot Interaction, Rehabilitation Robotics
Abstract: With the rapid development of rehabilitation robotics, there is a pressing need for efficient and accurate gait prediction methods. However, due to the complexity and variability of individual gait characteristics and external disturbances, accurately predicting gait in real time remains a significant challenge. This paper proposes an innovative Bayesian-inference-based method for real-time gait prediction while a subject walks with a lower-limb exoskeleton. Periodic gait information is represented using von Mises basis functions, and the weight parameters serve as real-time updated state variables. The error-subspace transform Kalman filter (ESTKF) is applied for gait trajectory prediction. A fully connected neural network (FCNN) is used to estimate the walking speeds in real time based on predicted trajectories. Comparative experiments based on an open-source database demonstrate the advantages of the ESTKF compared with other Bayesian filters. Walking experiments are conducted to estimate phase and speed in real time, and to predict the joint angle, total joint torque, and lower-limb muscle surface electromyography (sEMG) values. Experimental results validate the method’s prediction performance across different speeds and demonstrate its resilience to external interference.
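To make the representation concrete, the sketch below writes a periodic joint trajectory as a weighted sum of von Mises basis functions of the gait phase, with the weight vector playing the role of the state that a filter such as the ESTKF would update. The number of basis functions, the concentration parameter kappa, and the weight values are illustrative assumptions, not those of the paper.

```python
import numpy as np
from scipy.special import i0   # modified Bessel function of the first kind, order 0

def von_mises_basis(phase, centers, kappa=10.0):
    # Periodic basis functions of the gait phase; a trajectory is their weighted sum.
    return np.exp(kappa * np.cos(phase - centers)) / (2.0 * np.pi * i0(kappa))

centers = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)  # 8 basis centers over one cycle
weights = np.zeros(8)                      # weight vector = the filter's state variables
weights[2] = 0.6                           # placeholder weight values

phase = 0.3 * 2.0 * np.pi                  # current gait phase in radians
joint_angle = float(weights @ von_mises_basis(phase, centers))
print(joint_angle)                         # predicted joint angle at this phase
```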
|
|
11:30-11:35, Paper TuBT24.4 | Add to My Program |
A Comparative Study of Pulley and Bowden Transmissions in a Novel Cable-Driven Exosuit, the Stillsuit |
|
Jammot, Matthias | ETH Zürich |
Esser, Adrian | ETH Zurich |
Wolf, Peter | ETH Zurich, Institute of Robotics and Intelligent Systems |
Riener, Robert | ETH Zurich |
Basla, Chiara | ETH Zurich |
Keywords: Prosthetics and Exoskeletons, Tendon/Wire Mechanism, Mechanism Design
Abstract: Cable-driven exosuits assist users in ambulatory activities by transmitting assistive torques from motors to the actuated joints. State-of-the-art exosuits typically use Bowden cable transmissions, despite their limited efficiencies (40–60%) and non-linear response along curved paths. This paper evaluates the efficiency and responsiveness of a new cable-pulley transmission compared to a Bowden transmission, using both steel and Dyneema cables. The analysis includes three experiments: a test bench simulating a curved transmission path, followed by a static and dynamic experiment where six unimpaired participants donned an exosuit featuring both transmissions across the hips and knees. Our findings demonstrate that the pulley transmission consistently outperformed the Bowden’s efficiency by absolute margins of 18.77 ± 7.29% using a steel cable and by 40.60 ± 6.76% using a Dyneema cable across all experiments. Additionally, the steel cable was on average 19.19 ± 5.29% more efficient than the Dyneema cable in the pulley transmission and 41.02 ± 6.34% in the Bowden tube. These results led to the development of the Stillsuit, a novel lower-limb cable-driven exosuit that uses a pulley transmission and steel cable. The Stillsuit sets a new benchmark for exosuits with 87.56 ± 3.92% transmission efficiency, generating similar biological torques to those found in literature (16.4% and 19.0% of the biological knee and hip torques, respectively) while using smaller motors, resulting in a lighter actuation unit (1.92 kg).
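Transmission efficiency as compared above is the ratio of mechanical work delivered at the output (joint-side) cable to the work supplied at the input (motor-side) cable. The helper below states that ratio explicitly; the variable names and the assumption of a fixed sample period dt are illustrative, not taken from the paper.

```python
import numpy as np

def transmission_efficiency(f_out, v_out, f_in, v_in, dt):
    # Ratio of work delivered at the joint-side cable to work supplied at the
    # motor-side cable over one assistance cycle, from sampled force/velocity data.
    work_out = float(np.sum(f_out * v_out)) * dt
    work_in = float(np.sum(f_in * v_in)) * dt
    return work_out / work_in
```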
|
|
11:35-11:40, Paper TuBT24.5 | Add to My Program |
Rapid Online Learning of Hip Exoskeleton Assistance Preferences |
|
Ramella, Giulia | EPFL |
Ijspeert, Auke | EPFL |
Bouri, Mohamed | EPFL |
Keywords: Prosthetics and Exoskeletons, Wearable Robotics, Physically Assistive Devices
Abstract: Hip exoskeletons are increasing in popularity due to their effectiveness across various scenarios and their ability to adapt to different users. However, personalizing the assistance often requires lengthy tuning procedures and computationally intensive algorithms, and most existing methods do not incorporate user feedback. In this work, we propose a novel approach for rapidly learning users' preferences for hip exoskeleton assistance. We perform pairwise comparisons of distinct randomly generated assistive profiles, and collect participants' preferences through active querying. Users' feedback is integrated into a preference-learning algorithm that updates its belief, learns a user-dependent reward function, and changes the assistive torque profiles accordingly. Results from eight healthy subjects display distinct preferred torque profiles, and users' choices remain consistent when compared to a perturbed profile. A comprehensive evaluation of users' preferences reveals a close relationship with individual walking strategies. The tested torque profiles do not disrupt kinematic joint synergies, and participants favor assistive torques that are synchronized with their movements, resulting in lower negative power from the device. This straightforward approach enables the rapid learning of users' preferences and rewards, grounding future studies on reward-based human-exoskeleton interaction.
|
|
11:40-11:45, Paper TuBT24.6 | Add to My Program |
A Human-In-The-Loop Simulation Framework for Evaluating Control Strategies in Gait Assistive Robots |
|
Wang, Yifan | Nanyang Technological University |
Chan, Sherwin Stephen | Nanyang Technological University |
Lei, Mingyuan | Nanyang Technological University |
Lim, Lek Syn | Nanyang Technological University |
Johan, Henry | Nanyang Technological University |
Zuo, Bingran | Nanyang Technological University |
Ang, Wei Tech | Nanyang Technological University |
Keywords: Physical Human-Robot Interaction, Human Factors and Human-in-the-Loop, Simulation and Animation
Abstract: As the global population ages, effective rehabilitation and mobility aids will become increasingly critical. Gait assistive robots are promising solutions, but designing adaptable controllers for various impairments poses a significant challenge. This paper presents a Human-In-The-Loop (HITL) simulation framework tailored specifically for gait-assistive robots, addressing unique challenges posed by passive support systems. We incorporate a realistic physical human-robot interaction (pHRI) model to enable a quantitative evaluation of control strategies, highlighting the performance of a speed-adaptive controller compared to a conventional PID controller in maintaining compliance and reducing gait distortion. We assess the accuracy of the simulated interactions against that of the real-world data and reveal discrepancies in the adaptation strategies taken and their effect on the human's gait. This work provides valuable insights into optimizing and evaluating system parameters, emphasizing the potential of HITL simulation as a versatile tool for developing and fine-tuning personalized control policies for various users.
|
|
TuLB2R Poster Session, Hall A1/A2 |
Add to My Program |
Late Breaking Results 2 |
|
|
|
14:45-15:15, Paper TuLB2R.1 | Add to My Program |
Voice Control in Mobile Home Assistive Robots for Older Adults with MCI |
|
Fayzullin, Timofey | University of Massachusetts Lowell |
Reig, Samantha | University of Massachusetts Lowell |
Cabrera, Maria Eugenia | University of Massachusetts Lowell |
Keywords: Human-Centered Automation, Domestic Robotics, Human-Robot Collaboration
Abstract: Assistive technologies for home environments are a growing field. The wide availability of voice assistants that allow people to complete tasks in a “hands-busy, eyes-busy” manner is especially useful for people who have trouble using computers and smartphones. As robotic platforms advance and become more affordable, they are expected to make a similar impact, supporting people with tasks that require a physical embodiment. While both of these technological components show great potential to alleviate the burden of certain at-home tasks, there are technical and usability-related challenges that need to be explored in order to develop and deploy them so that the populations that need the most assistance can benefit on the shortest timelines. Older adults, particularly those with mild cognitive impairment (MCI), may be more likely to benefit from voice as an input and from the physical capabilities of home robots, but less able or willing to learn and reliably use the input structures required to successfully interact with them. In this research, we equip a Hello Robot Stretch with a voice control interface and explore older adults’ responses to the robot’s completion of various tasks. We analyze the results of a formative survey involving videos of the interface and discuss future plans for a between-subject study to compare different voice command structures.
|
|
14:45-15:15, Paper TuLB2R.2 | Add to My Program |
Late Breaking Results on Generation of Metric-Semantic Scene Graphs with Factors Based on Graph Neural Networks |
|
Millan Romera, Jose Andres | University of Luxembourg |
Bavle, Hriday | University of Luxembourg |
Shaheer, Muhammad | University of Luxembourg |
Oswald, Martin R. | University of Amsterdam |
Sanchez-Lopez, Jose Luis | University of Luxembourg |
Voos, Holger | University of Luxembourg |
Keywords: SLAM, Learning Categories and Concepts, Deep Learning for Visual Perception
Abstract: Robots can improve their environmental reasoning by identifying high-level spatial concepts like rooms and walls from observable planes. These concepts enhance:
- geometrical consistency in mapping [1],
- situational awareness in complex environments [2,3],
- explainability through human-aligned knowledge [4],
- richer input for tasks like planning [5].
Current SLAM systems rely on ad-hoc methods for detecting and defining spatial concepts, limiting the recognition of complex structures. The proposed approach uses two GNNs for the generation of:
- a semantic graph: edge classification + stable community detection + epistemic noise,
- a geometric graph: inference of geometric attributes (centroids),
- a factor graph: uncertainty-aware geometric factors,
- real-time complex concept integration in SLAM.
|
|
14:45-15:15, Paper TuLB2R.3 | Add to My Program |
Beyond Violation Types: The Influence of Dispositional Trust on Human-Robot Collaboration after Trust Violations |
|
Mélot-Chesnel, Joséphine | Utrecht University |
Nagy, Timea | Utrecht University |
de Graaf, Maartje | Utrecht University |
Keywords: Acceptability and Trust, Design and Human Factors, Human-Robot Collaboration
Abstract: Robots, like humans, make mistakes that undermine trust, potentially threatening successful collaboration in human-robot teams. Our lab study evaluates three trust repair strategies -- apology, denial, and compensation -- following two types of trust violations -- competence-based and integrity-based. Results show that integrity-based violations cause greater damage to trust and social perception of the robot than competence-based violations, leading to reduced collaboration. This underlines the importance of prioritizing the resolution of integrity-based violations in a robot’s programming, as competence-based violations may be more easily overlooked by users. Building on findings from our previous research, dispositional trust significantly influenced the effectiveness of repair strategies, whereas trust violation types did not. Notably, denial was the most effective strategy for repairing performance trust among individuals with high dispositional trust, while apologies were most effective for repairing honesty trust in individuals with low dispositional trust. These combined results highlight the importance of adaptive trust repair approaches that account for individual human differences, rather than violation types, to foster continued successful human-robot collaboration.
|
|
14:45-15:15, Paper TuLB2R.4 | Add to My Program |
Compute Reliability Maps and Optimize Reliable Workspace for Redundant Robots Experiencing Locked Joint Failures |
|
Xing, Yuchen | University of Kentucky |
Xie, Biyun | University of Kentucky |
Keywords: Redundant Robots, Kinematics, Failure Detection and Recovery
Abstract: A robot’s failure-tolerant workspace is defined as the reachable workspace both before and after an arbitrary joint is locked at an arbitrary angle during a motion. Once the task is located within the failure-tolerant workspace, task completion can be easily guaranteed, and only a simple failure-recovery strategy is needed to complete the remaining path. The existing methods for computing and optimizing failure-tolerant workspace have the following limitations. First, these methods can only be used to compute the 3D positional failure-tolerant workspace, but not the 3D orientational failure-tolerant workspace. Second, all the existing studies optimized the failure-tolerant workspace only considering its volume, while other morphological features of the failure-tolerant workspace are completely ignored. To address these limitations, this work develops a method to compute the 6D reliability map based on the joint failure probabilities and then optimize the reliable workspace of a robot considering its shape and size. The main contributions of this work are as follows: (1) The concept of reliability maps is introduced to show the reliability of various regions within the workspace. (2) A computationally efficient method is developed to compute the 6D reliability map, including both position and orientation workspace. (3) A new metric is proposed to evaluate and optimize the computed reliable workspace, considering both its volume and connectivity.
|
|
14:45-15:15, Paper TuLB2R.5 | Add to My Program |
FACETS: Efficient Constrained Iterative NAS for Object Detection in Robotic Perception |
|
Tran, Tony | University of Houston |
Hu, Bin | University of Houston |
Keywords: Computer Vision for Automation, Embedded Systems for Robotic and Automation, Deep Learning for Visual Perception
Abstract: Neural Architecture Search (NAS) for object detection frameworks typically involves multiple interdependent modules, contributing to a vast search space and high computational cost. Joint optimization across modules is both challenging and expensive, especially under device-specific constraints. We propose FACETS—eFficient Architecture Search for Constrained itEraTive Search—a unified NAS approach that refines all modules iteratively. FACETS alternates between fixing one module’s architecture and optimizing the others, leveraging feedback across cycles to reduce the search space while maintaining inter-module dependencies. This strategy allows FACETS to satisfy computational constraints and reach high-quality solutions more efficiently. In experiments, FACETS finds architectures with up to 4.75% higher accuracy twice as fast as progressive methods and yields a refined search space with candidates up to 27% more accurate than global search and 5% over progressive search via random sampling. Its generality and efficiency make FACETS especially promising for real-time, resource-constrained robotic perception systems across diverse platforms.
|
|
14:45-15:15, Paper TuLB2R.6 | Add to My Program |
Multiple Sensors Based Perception and Limitation for Autonomous Ships |
|
Choi, Hyun-Taek | Korea Research Institute of Ships and Oceans Engineering |
Park, Jeonghong | KRISO |
Choi, Jinwoo | KRISO, Korea Research Institute of Ships & Ocean Engineering |
Kang, Minju | Korea Research Institute of Ships & Ocean Engineering |
Choo, Ki-Beom | Korea Research Institute of Ships & Ocean Engineering(kriso) |
Kim, Jinwhan | KAIST |
Keywords: Intelligent Transportation Systems, Sensor Fusion, Vision-Based Navigation
Abstract: With advances in probabilistic inference, AI, and high-performance computing, autonomous navigation has made great strides. Yet, ships remain uniquely challenged by dynamic ocean environments and long offshore operations. These conditions place high demands on maritime situational awareness systems, which must ensure robust object detection and tracking while remaining cost-effective. This paper presents a multi-object tracking system specifically designed for autonomous ships, grounded in a deep understanding of their distinctive operational characteristics. The proposed architecture integrates multiple detection and navigation sensors with an AI-based perception algorithm and a probabilistic data fusion framework. With modularity and scalability at its core, the system processes sensor data through two specialized stages to ensure optimal performance. Its effectiveness is demonstrated through real-world trials under realistic maritime conditions. Beyond system performance, this study highlights the limitations of relying solely on local awareness systems in commercial maritime operations. It underscores the necessity for a standardized, systematic approach to generating and sharing positional data among vessels—via Next-Generation AIS and inter-ship communication. Drawing inspiration from structured environments in robotics, we suggest that such an approach offers a practical path to balancing operational safety with economic viability.
|
|
14:45-15:15, Paper TuLB2R.7 | Add to My Program |
UAV Path Planning for Multi-Target Wildlife Localization |
|
Mohammadi, Mahsa | Northern Arizona University |
Shafer, Michael | Northern Arizona University |
Keywords: Aerial Systems: Applications, Aerial Systems: Perception and Autonomy, Field Robots
Abstract: In recent years, Uncrewed Aerial Vehicles (UAVs) have become essential tools for the high-precision and rapid localization of radio-tagged wildlife, supporting ecological monitoring and conservation. This research presents an online planning algorithm that intelligently selects waypoints for UAVs tasked with localizing multiple animals equipped with Very High Frequency (VHF) tags. The UAV collects bearing measurements at each waypoint and predicts future waypoints to minimize localization uncertainty and total mission time. The algorithm accounts for constraints such as communication range and maintaining a safe distance from animals to reduce disturbance. Performance is evaluated using Monte Carlo simulations, measuring uncertainty reduction, localization accuracy, and mission time. Simulation results demonstrate the potential of our method for fast and precise multi-target tracking under field conditions. This work contributes to the field of autonomous UAV operations, offering valuable insights into efficient robotic path planning and optimization for wildlife tracking applications.
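For illustration only (not the authors' algorithm), the waypoint-selection idea above can be sketched as a greedy search that picks the candidate whose predicted bearing measurement most shrinks the target covariance while respecting a standoff distance; the function names, 2D target state, and noise level are our assumptions:

```python
# Illustrative sketch: greedy next-waypoint selection for bearing-only localization.
import numpy as np

def bearing_jacobian(waypoint, target_est):
    """Jacobian of the bearing measurement w.r.t. the 2D target position."""
    dx, dy = target_est - waypoint
    r2 = dx**2 + dy**2
    return np.array([[-dy / r2, dx / r2]])

def next_waypoint(candidates, target_est, P, sigma_bearing=np.deg2rad(5.0),
                  min_standoff=30.0):
    """Pick the candidate waypoint whose bearing measurement most reduces
    the trace of the target covariance P, subject to a standoff distance."""
    best, best_trace = None, np.inf
    for w in candidates:
        if np.linalg.norm(target_est - w) < min_standoff:
            continue  # keep a safe distance from the animal
        H = bearing_jacobian(w, target_est)
        S = H @ P @ H.T + sigma_bearing**2
        K = P @ H.T / S
        P_post = (np.eye(2) - K @ H) @ P  # EKF-style covariance update
        if np.trace(P_post) < best_trace:
            best, best_trace = w, np.trace(P_post)
    return best
```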
|
|
14:45-15:15, Paper TuLB2R.8 | Add to My Program |
Development of an Automated Jig-Assisted Calibration Framework for UAV IMU Alignment in 3D SLAM |
|
Jung, Eui-Jung | Korea Institute of Robot and Convergence |
Kim, Tae-Hwan | Korea Institute of Robotics & Technology Convergence |
Jeon, Kwang Woo | Korea Institute of Robotics and Technology Convergence |
Chung, Hyun-Joon | Korea Institute of Robotics and Technology Convergence |
Keywords: Calibration and Identification, Kinematics, Engineering for Robotic Systems
Abstract: Accurate coordinate alignment of Inertial Measurement Units (IMUs) is essential for ensuring robust and reliable 3D Simultaneous Localization and Mapping (SLAM) performance in Unmanned Aerial Vehicles (UAVs). However, manual calibration methods are often time-consuming, error-prone, and lack repeatability. This study presents a jig-assisted calibration strategy that enables precise and reproducible alignment of IMU coordinate systems to enhance 3D SLAM accuracy in UAVs. The proposed method involves the design and implementation of a mechanical jig that constrains the UAV and reference sensor into a known geometric relationship. By leveraging rigid-body transformations and optimization techniques, the system effectively estimates and compensates for misalignments between the IMU and the UAV’s navigation frame. Experimental validation was conducted using simulated environments. The results demonstrate significant improvements in SLAM trajectory consistency, localization precision, and map reconstruction fidelity. This jig-based calibration strategy offers a practical, scalable, and low-cost solution for UAV systems requiring high-accuracy sensor integration in GPS-denied environments.
|
|
14:45-15:15, Paper TuLB2R.9 | Add to My Program |
A High-Payload Robotic Hopper Powered by Bidirectional Thrusters |
|
Li, Song | City University of Hong Kong |
Bai, Songnan | City University of Hong Kong |
Jia, Ruihan | City University of Hong Kong |
Cai, Yixi | KTH Royal Institute of Technology |
Ding, Runze | City University of Hongkong |
Shi, Yu | City University of Hong Kong |
Zhang, Fu | University of Hong Kong |
Chirarattananon, Pakpong | City University of Hong Kong |
Keywords: Aerial Systems: Mechanics and Control, Legged Robots, Biologically-Inspired Robots
Abstract: Mobile robots have revolutionized various fields, offering solutions for manipulation, environmental monitoring, and exploration. However, payload capacity remains a limitation. This paper presents a novel thrust-based robotic hopper capable of carrying payloads up to 9 times its own weight while maintaining agile mobility over less structured terrain. The 220-gram robot carries up to 2 kg while hopping—a capability that bridges the gap between high-payload ground robots and agile aerial platforms. Key advancements that enable this high-payload capacity include the integration of bidirectional thrusters, allowing for both upward and downward thrust generation to enhance energy management while hopping. Additionally, we present a refined model of dynamics that accounts for heavy payload conditions, particularly for large jumps. To address the increased computational demands, we employ a neural network compression technique, ensuring real-time onboard control. The robot's capabilities are demonstrated through a series of experiments, including autonomous navigation while carrying a 730-g LiDAR payload. This showcases the robot's potential for applications such as mobile sensing and mapping in challenging environments.
|
|
14:45-15:15, Paper TuLB2R.10 | Add to My Program |
SADRNN: Spatial Attention Distribution with Recurrent Neural Network for Imitation Learning |
|
Hara, Takumi | Kyoto University |
Sato, Takashi | Kyoto University |
Awano, Hiromitsu | Kyoto University |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation, Deep Learning for Visual Perception
Abstract: This paper proposes Spatial Attention Distribution with Recurrent Neural Network (SADRNN), a novel attention mechanism for neural networks in imitation learning, which models spatial attention using a two-dimensional Gaussian distribution. Unlike conventional methods that rely on discrete spatial attention points, SADRNN generates richer spatial attention distributions by extending the attention mechanism to continuous 2D Gaussian distributions. Simulation results using a Franka Panda robotic arm demonstrate that SADRNN achieves a success rate of 88.2% in a lifting task with fewer attention distributions than existing methods. Furthermore, SADRNN reduces the number of MAC (Multiply–Accumulate) operations during inference by 28.9% compared to baseline approaches. Owing to its use of flexible 2D Gaussian distributions, SADRNN is also capable of producing spatial attention with varying widths and orientations, including diagonal patterns, at inference time.
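As a rough illustration of the continuous attention described above (a sketch under our own assumptions, not the SADRNN implementation), a single oriented 2D Gaussian can be rendered into a normalized attention map as follows:

```python
# Minimal sketch of an oriented 2D Gaussian spatial-attention map.
import numpy as np

def gaussian_attention_map(h, w, mu, sigmas, theta):
    """Attention weights over an h x w feature map from one 2D Gaussian.

    mu     : (2,) center in pixel coordinates (x, y)
    sigmas : (2,) standard deviations along the Gaussian's principal axes
    theta  : rotation of the principal axes (radians), enabling diagonal patterns
    """
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    cov = R @ np.diag(np.square(sigmas)) @ R.T
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.stack([xs - mu[0], ys - mu[1]], axis=-1)             # (h, w, 2)
    m = np.einsum('hwi,ij,hwj->hw', d, np.linalg.inv(cov), d)   # squared Mahalanobis distance
    att = np.exp(-0.5 * m)
    return att / att.sum()  # normalized attention weights
```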
|
|
14:45-15:15, Paper TuLB2R.11 | Add to My Program |
Collaborative Large Language Models for Task Allocation in Construction Robots |
|
Prieto, Samuel A. | New York University Abu Dhabi |
Garcia de Soto, Borja | New York University Abu Dhabi |
Keywords: Robotics and Automation in Construction, Building Automation, AI-Enabled Robotics
Abstract: The construction industry, with its dynamic environments and resource constraints, poses unique challenges for task allocation and management, particularly in integrating robotic automation. This paper introduces a multi-agent framework leveraging Large Language Models (LLMs) to address these challenges. The proposed approach utilizes two collaborative LLM agents, a Planner agent responsible for producing task instructions and a Supervisor agent that validates these instructions for feasibility, aiming to reduce hallucinations and planning mistakes. Through simulated experiments, we demonstrate the system’s effectiveness in coordinating multiple robots for a high-level construction task, with essential considerations for robot battery management and logical task sequencing. Our results indicate that the multi-agent LLM system improves task reliability and accuracy compared to a single-agent configuration. This study advances the application of LLMs in construction robotics and underscores the potential for multi-agent LLM architectures to enhance task allocation in complex, automated workflows.
|
|
14:45-15:15, Paper TuLB2R.12 | Add to My Program |
Gait Optimization for Legged Systems through Mixed Distribution Cross-Entropy Optimization |
|
Tsikelis, Ioannis | Inria |
Chatzilygeroudis, Konstantinos | University of Patras |
Keywords: Legged Robots, Multi-Contact Whole-Body Motion Planning and Control
Abstract: Legged robots are well-suited for real-world tasks, offering strong load-bearing capabilities, autonomy, and mobility over rough terrain. They strike an effective balance between agility and payload capacity, performing well in diverse environments. However, planning and optimizing gaits and contact sequences is highly challenging due to the complexity of their dynamics and the many optimization variables involved. Traditional trajectory optimization methods address this by minimizing cost functions and discovering contact sequences automatically. Yet, they often struggle with the resulting nonlinear and hard-to-solve problems. To overcome these issues, we propose CrEGOpt, a bi-level optimization method that merges trajectory optimization with black-box optimization. At the higher level, CrEGOpt uses the Mixed Distribution Cross-Entropy Method to optimize gait sequences and phase durations, easing the lower-level trajectory optimization. This approach enables fast solutions for complex gait problems. Experiments in simulation show that CrEGOpt can optimize gaits for bipedal, quadrupedal, and hexapod robots in under 10 seconds. This novel bi-level framework opens new possibilities for efficient, automatic contact scheduling in legged robotics.
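A minimal sketch of a mixed-distribution cross-entropy loop in the spirit of the higher level described above; the cost function, duration bounds, and parameterization are placeholders rather than CrEGOpt's actual design:

```python
# Illustrative cross-entropy method over discrete contact modes and continuous durations.
import numpy as np

def cem_mixed(cost, n_phases, n_contacts, iters=20, pop=64, n_elite=8, seed=0):
    rng = np.random.default_rng(seed)
    probs = np.full((n_phases, n_contacts), 1.0 / n_contacts)   # categorical per phase
    mu, sigma = np.full(n_phases, 0.3), np.full(n_phases, 0.1)  # phase durations [s]
    for _ in range(iters):
        gaits = np.stack([[rng.choice(n_contacts, p=probs[i]) for i in range(n_phases)]
                          for _ in range(pop)])                  # (pop, n_phases) contact modes
        durs = np.clip(rng.normal(mu, sigma, size=(pop, n_phases)), 0.05, 1.0)
        costs = np.array([cost(g, d) for g, d in zip(gaits, durs)])
        elite = np.argsort(costs)[:n_elite]                      # lowest-cost samples
        for i in range(n_phases):                                # refit the categoricals
            counts = np.bincount(gaits[elite, i], minlength=n_contacts) + 1e-3
            probs[i] = counts / counts.sum()
        mu = durs[elite].mean(axis=0)                            # refit the Gaussians
        sigma = durs[elite].std(axis=0) + 1e-3
    return gaits[elite[0]], durs[elite[0]]                       # best gait and durations
```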
|
|
14:45-15:15, Paper TuLB2R.13 | Add to My Program |
Model-Structured Neural Networks to Model and Control Robots |
|
Piccinini, Mattia | Technical University of Munich |
Mungiello, Aniello | University of Naples Federico II |
Betz, Johannes | Technical University of Munich |
Papini, Gastone Pietro | University of Trento |
Keywords: AI-Enabled Robotics, Machine Learning for Robot Control, Neural and Fuzzy Control
Abstract: This poster presents model-structured neural networks (MS-NNs), an emerging approach to model and control robotic systems by embedding physical laws directly into neural architectures. MS-NNs extend traditional physics-based models through targeted neuralization, preserving fundamental principles such as the superposition of forces while enhancing the learning potential. We introduce nnodely, an open-source PyTorch library designed to streamline the development of MS-NNs, while supporting other emerging paradigms like physics-informed neural networks (PINNs) and neural ordinary differential equations, making these tools accessible even to non-experts. The library consolidates our work in MS-NNs for autonomous driving, and allows for rapid prototyping of interpretable, domain-informed models. We demonstrate the effectiveness of MS-NNs through two applications: learning the longitudinal dynamics of a RoboRacer car and controlling the steering of a full-scale autonomous race car. In both cases, MS-NNs outperform conventional neural networks in generalization to unseen scenarios with limited training data, showcasing their potential for real-world applications in robotics.
|
|
14:45-15:15, Paper TuLB2R.14 | Add to My Program |
Towards Efficient Navigation in Dense Forests Using Reinforcement Learning and 3D LiDAR |
|
Cancelliere, Francesco | University of Catania |
Keywords: Field Robots, Reinforcement Learning, Vision-Based Navigation
Abstract: In search and rescue operations, a fleet of small ground robots could cover more area than one larger robot, at the expense of battery autonomy. To mitigate this issue, we developed a compact end-to-end navigation network that efficiently navigates dense forest environments. The network takes relatively sparse 3D LiDAR data (100x20 points) as input and is trained with reinforcement learning in MIDGARD, a photorealistic simulation environment, exploiting curriculum learning to improve training efficiency.
|
|
14:45-15:15, Paper TuLB2R.15 | Add to My Program |
LEGS-POMDP: A Language and Gesture Conditioned Framework for Robot Object Search |
|
He, Ivy | Brown University |
Tellex, Stefanie | Brown |
Liu, Jason Xinyu | Brown University |
Keywords: Gesture, Posture and Facial Expressions, Multi-Modal Perception for HRI, Social HRI
Abstract: Natural human-robot interaction depends on accurately interpreting multimodal instructions, particularly those combining language and gesture. However, existing approaches often assume constrained settings and struggle with ambiguity in human intent. We present LEGS-POMDP, a probabilistic framework for robot object search that integrates multimodal instructions into a joint observation model within a Partially Observable Markov Decision Process. The robot maintains a belief over both object pose and identity, updating it through visual grounding, language input, and a robust gesture cone representation that fuses multiple pose vectors to infer pointing direction. In simulation across environments of varying ambiguity, fusing gesture and language achieved a 76.6% task success rate, significantly outperforming language-only (53.3%), gesture-only (46.7%), and no-instruction (20.0%) baselines. These results demonstrate that interpreting joint observations enables more accurate belief updates, reduces ambiguity, and improves decision-making for goal-directed search. We also deploy the system on the Spot robot in a 20×20 ft real-world indoor environment, where the robot successfully interprets human instructions to navigate, identify, and localize target objects among distractors—confirming the system’s robustness and generalizability beyond tabletop settings.
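To make the joint observation model concrete, here is a hedged sketch (ours, not the LEGS-POMDP code) of a Bayes belief update that fuses a language likelihood with a soft gesture-cone likelihood over candidate object locations:

```python
# Minimal sketch of fusing language and gesture observations into one belief update.
import numpy as np

def update_belief(belief, lang_like, gesture_like):
    """Bayes update; belief and both per-cell likelihood arrays share one shape.
    lang_like    : P(utterance | object at cell)
    gesture_like : P(pointing cone covers cell | object at cell)
    """
    posterior = belief * lang_like * gesture_like  # joint observation model
    return posterior / posterior.sum()

def gesture_cone_likelihood(cells, origin, direction, half_angle):
    """Soft likelihood that each 2D cell lies inside the pointing cone."""
    v = cells - origin
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    d = direction / np.linalg.norm(direction)
    ang = np.arccos(np.clip(v @ d, -1.0, 1.0))
    return np.exp(-0.5 * (ang / half_angle) ** 2)
```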
|
|
14:45-15:15, Paper TuLB2R.16 | Add to My Program |
Learning by Teaching: Enhancing Mathematical Skills by Teaching a Robot |
|
Boucenna, Sofiane | CNRS - Cergy-Pontoise University |
Keywords: Education Robotics, Human-Robot Collaboration
Abstract: In this study, we explore the use of human-robot interaction to strengthen students' mathematical skills. We develop an active pedagogical approach based on the principle of learning by teaching, where the student teaches a robot to solve mathematical exercises. By guiding the robot through the solutions, the student consolidates their own knowledge and reinforces their learning. The experiment was conducted with around ten participants and evaluated using pre- and post-tests as well as a questionnaire measuring the quality of the interaction. Preliminary results indicate an improvement in students' performance on the concepts covered, as well as overall positive feedback regarding the robot-assisted learning experience.
|
|
14:45-15:15, Paper TuLB2R.17 | Add to My Program |
Unreal Engine Based SONAR Simulator for Marine Robot Research |
|
Kweon, Heekyu | Kyungpook National University |
Joe, Hangil | Kyungpook National University |
Keywords: Marine Robotics, Field Robots, Software, Middleware and Programming Environments
Abstract: There is a consensus that using simulations can reduce the costs associated with developing sensing technology in marine robotics. Recently developed marine simulators have focused on enhancing sensor simulation quality by increasing the rendering quality of environmental details. In this context, we developed a SONAR simulator based on Unreal Engine 5 (UE5), which supports high-fidelity environments and sensor data. UE5 offers high-quality rendered images and facilitates the implementation of large virtual environments. The robust built-in functions in UE5 also facilitated the development of a custom sensor module. We developed a Forward-Looking SONAR (FLS) sensor module using a render-based approach, which exploits high-level rendering quality. For the FLS, we capture the depth image and normal map simultaneously and convert them into range, azimuth, and intensity values to generate a sonar image. To improve the quality of the simulated sensor data, appropriate noise models were applied to the sensor module. To validate the sensor modules, we compared the simulated sensor data with real-world data. We also conducted an experiment and presented an application using ROS. For future work, we are developing additional sensor modules and robot models, aiming to utilize them for various field robot research purposes.
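The render-based conversion described above can be sketched roughly as follows; the binning scheme, intensity model, and speckle noise here are illustrative assumptions, not the simulator's actual implementation:

```python
# Illustrative conversion of a depth image and normal map into a range/azimuth sonar image.
import numpy as np

def depth_normal_to_sonar(depth, normals, rays, r_max=30.0, n_r=512, n_az=256,
                          fov_az=np.deg2rad(60), seed=0):
    """depth: (H, W) metres; normals: (H, W, 3) unit; rays: (H, W, 3) unit view rays."""
    rng = np.random.default_rng(seed)
    az = np.arctan2(rays[..., 0], rays[..., 2])                    # horizontal angle per pixel
    r_idx = np.clip((depth / r_max * n_r).astype(int), 0, n_r - 1)
    a_idx = np.clip(((az / fov_az + 0.5) * n_az).astype(int), 0, n_az - 1)
    # Lambertian-like return: stronger when the surface faces the sensor
    intensity = np.clip(-np.einsum('hwc,hwc->hw', normals, rays), 0.0, 1.0)
    img = np.zeros((n_r, n_az))
    np.add.at(img, (r_idx.ravel(), a_idx.ravel()), intensity.ravel())
    img *= rng.rayleigh(scale=1.0, size=img.shape) * 0.2 + 0.9     # simple multiplicative speckle
    return img / max(img.max(), 1e-6)
```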
|
|
14:45-15:15, Paper TuLB2R.18 | Add to My Program |
Unsupervised Identification of Motion Primitives Based on Physical Constraints in Contact-Rich Tasks |
|
Oishi, Ryoga | Saitama University |
Sakaino, Sho | University of Tsukuba |
Tsuji, Toshiaki | Saitama University |
Keywords: Learning from Demonstration, Task and Motion Planning, Motion and Path Planning
Abstract: This work proposes an automatic method for identifying motion primitives from human motion data by focusing on physical constraints. Motion primitives—fundamental units of movement in robotics—are particularly useful for dividing contact-rich tasks according to physical constraints. The challenge lies in detecting these constraints as they change direction and presence over time. The solution identifies coordinate systems that represent these continuously changing constraints, extracts features along these axes, and clusters them using BOCPD with PCA and GMM. Applied to human insertion tasks, this approach achieved over 90% identification accuracy across different insertion directions, advancing imitation learning for complex robotic tasks.
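A minimal sketch (ours, under stated assumptions) of the clustering back-end mentioned above, reducing per-window features with PCA and labeling them with a GMM; the BOCPD change-point stage and the constraint-aligned coordinate systems are omitted:

```python
# Illustrative PCA + GMM labeling of per-window contact features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def label_primitives(features, n_components=3, n_primitives=4, seed=0):
    """features: (n_windows, n_dims) force/velocity statistics per time window."""
    z = PCA(n_components=n_components, random_state=seed).fit_transform(features)
    gmm = GaussianMixture(n_components=n_primitives, random_state=seed).fit(z)
    return gmm.predict(z)  # one candidate motion-primitive label per window
```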
|
|
14:45-15:15, Paper TuLB2R.19 | Add to My Program |
Design and Control of a Bench-Top Left Ventricular Simulator Reproducing Physiological Pressure-Volume Dynamics |
|
Lee, Junheon | Seoul National University |
Kim, Jiyeop | Seoul National University |
Han, Amy Kyungwon | Seoul National University |
Keywords: Medical Robots and Systems, Mechanism Design, Assembly
Abstract: Cardiovascular disease accounts for nearly 25% of all U.S. deaths and was responsible for 18.6 million global fatalities in 2019. In response, many cardiac assistive devices have been developed, yet their reliance on animal testing raises ethical concerns, and is costly, time-consuming, and frequently faces translational failures. Consequently, ventricular simulators have emerged as promising alternatives. However, due to their method of direct fluid injection into the heart phantom, simultaneously increasing volume and pressure, traditional simulators fail to replicate physiological pressure–volume (PV) dynamics of a functioning heart. We introduce a novel ventricular simulator designed with precise actuation control mechanisms to accurately mimic in vivo PV loops by controlling both pressure and fluid flow externally to the heart phantom. This integrated system features a silicone heart phantom, a syringe motor pump for controlled fluid actuation, a compliance chamber to modulate pressure dynamics, and solenoid valves for directional flow control. The optimized system design facilitates precise parameter adjustments, replicating PV loops of both normal and pathological heart conditions. Our simulator offers a promising alternative for cardiac function evaluation and medical device testing, potentially reducing animal testing reliance and enhancing clinical outcomes.
|
|
14:45-15:15, Paper TuLB2R.20 | Add to My Program |
GLASS: A Global and Local Adaptive Segmentation System for Robust Transparent Object Segmentation |
|
Cheng, Yanchun | National University of Singapore |
Wang, Rundong | National University of Singapore |
Lee, Christina Dao Wen | National University of Singapore |
Ang Jr, Marcelo H | National University of Singapore |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception, Recognition
Abstract: Transparent object segmentation is a critical yet challenging task in computer vision and autonomous perception systems, primarily due to the absence of distinct textures and ambiguous boundaries. To address these difficulties, we present GLASS, a Global and Local Adaptive Segmentation System that effectively integrates convolutional and Transformer-based representations through a hierarchical dual-path architecture. This design enables the model to simultaneously capture fine-grained local cues and long-range contextual dependencies. To further enhance the model’s discriminative capability around complex contours and transparent edges, we introduce two key components: the Multi-Scale Detail Enhancement Module (MS-DEM), which enriches local structural features at multiple resolutions, and the Bi-directional Adaptive Feature Fusion Module (Bi-AFFM), which dynamically balances global and local representations through a content-aware fusion mechanism. Extensive experiments on two challenging benchmarks, Trans10K-V2 and GSD, demonstrate that GLASS not only achieves state-of-the-art segmentation performance but also exhibits strong robustness to various real-world scenes and efficiency in inference, making it a practical solution for downstream tasks involving transparent object understanding in robotics and augmented reality.
|
|
14:45-15:15, Paper TuLB2R.21 | Add to My Program |
Long-Horizon Locomotion and Manipulation on a Quadrupedal Robot with Multiple LLM Agents |
|
Ouyang, Yutao | Tsinghua University |
Li, Jinhan | Tsinghua University |
Li, Yunfei | Tsinghua University |
Li, Zhongyu | University of California, Berkeley |
Yu, Chao | Tsinghua University |
Sreenath, Koushil | University of California, Berkeley |
Wu, Yi | Tsinghua University |
Keywords: Agent-Based Systems, AI-Enabled Robotics, Deep Learning Methods
Abstract: We present a large language model (LLM) based system to empower quadrupedal robots with problem-solving abilities for long-horizon tasks beyond short-term motions. Long-horizon tasks for quadrupeds are challenging since they require both a high-level understanding of the semantics of the problem for task planning and a broad range of locomotion and manipulation skills to interact with the environment. Our system builds a high-level reasoning layer with large language models, which generates hybrid discrete-continuous plans as robot code from task descriptions. It comprises multiple LLM agents: a semantic planner that sketches a plan, a parameter calculator that predicts arguments in the plan, a code generator that converts the plan into executable robot code, and a replanner that handles execution failures or human interventions. At the low level, we adopt reinforcement learning to train a set of motion planning and control skills to unleash the flexibility of quadrupeds for rich environment interactions. Our system is tested on long-horizon tasks that are infeasible to complete with one single skill. Simulation and real-world experiments show that it successfully figures out multi-step strategies and demonstrates non-trivial behaviors, including building tools or notifying a human for help.
|
|
14:45-15:15, Paper TuLB2R.22 | Add to My Program |
Optimal Sonar Image Selection through Information Analysis for AUV Applications |
|
Sim, Hyeonmin | Kyungpook National University |
Joe, Hangil | Kyungpook National University |
Keywords: Marine Robotics, Field Robots, Software, Middleware and Programming Environments
Abstract: This paper proposes an optimal image selection method for forward-looking sonar (FLS) images aimed at improving odometry estimation for AUV applications. Selecting optimal sonar images is crucial for odometry estimation and seafloor mapping in autonomous underwater vehicles (AUVs) using image registration. However, the inherent characteristics of sonar imagery—such as low signal-to-noise ratio (SNR), shadow effects, and perceptual ambiguity—limit the effectiveness of conventional, feature-based keyframe extraction techniques commonly used in optical imagery. To address these challenges, we propose representing sonar images as information matrices by dividing each image into grids, calculating the entropy of each grid region, arranging the entropy values row-wise, and constructing the information matrices using the Kullback-Leibler (KL) divergence. The change between frames is quantified using cosine similarity between the information matrices, and frames are selected as optimal images when the similarity falls below a predefined threshold. The proposed method was validated through indoor water tank experiments using three objects with different shapes and materials, which demonstrated a clear decreasing trend in similarity. Additionally, both simulation-based evaluations and real-world seafloor data were used to estimate odometry, confirming the practical applicability and effectiveness of the proposed method.
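For orientation, the selection rule described above can be sketched as per-grid entropy followed by a cosine-similarity threshold test; the grid size, histogram bins, threshold, and the omission of the KL-divergence construction are our simplifications, not the paper's settings:

```python
# Illustrative sketch: per-grid entropy matrix and cosine-similarity keyframe test.
import numpy as np

def grid_entropy(img, rows=8, cols=8, bins=32):
    """Shannon entropy of pixel intensities in each grid cell (img assumed in [0, 1])."""
    h, w = img.shape
    ent = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            patch = img[i*h//rows:(i+1)*h//rows, j*w//cols:(j+1)*w//cols]
            p, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
            p = p / max(p.sum(), 1)
            p = p[p > 0]
            ent[i, j] = -(p * np.log(p)).sum()
    return ent

def is_keyframe(info_prev, info_curr, threshold=0.95):
    """Select the current frame when cosine similarity to the last keyframe drops."""
    a, b = info_prev.ravel(), info_curr.ravel()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return cos < threshold
```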
|
|
14:45-15:15, Paper TuLB2R.23 | Add to My Program |
Approximating Human Joint Motion Using a Simple Four-Bar Linkage Mechanism |
|
Kang, Jihun | UST Graduate School |
Park, Seungtae | Korea National University of Science and Technology |
Shin, Wonseok | Korea Institute of Industrial Technology(KITECH) |
Kwon, Suncheol | KITECH |
Ahn, Bummo | Korea Institute of Industrial Technology |
Keywords: Legged Robots, Actuation and Joint Mechanisms, Biologically-Inspired Robots
Abstract: Recent studies have explored human-like walking to achieve natural, efficient, and adaptive motion in robots. Among various mechanical approaches, linkage-based mechanisms have been utilized to approximate human joint trajectories due to their advantages in efficient force transmission, reduced control complexity, and high repeatability under external disturbances. Notable examples include Hoeken’s linkage, the Stephenson III six-bar linkage, and the Theo Jansen mechanism. Although multi-bar linkage systems with more than four bars generate human-like joint trajectories, their integration of multiple joints—such as the hip and knee or the knee and ankle—into a single structure complicates independent control and optimization of each joint. Therefore, this study proposes the four-bar linkage, the simplest form of a linkage mechanism, to approximate each human joint motion individually. Four-bar link lengths were optimized to align the joint trajectory with the target angular range. To improve alignment with the desired trajectory, the slope of the joint angle profile was adjusted during simulation. The effectiveness of this adjustment was then verified through experimental comparison with the simulated motion. This study demonstrates the potential of the proposed four-bar linkage system as a structurally efficient solution for approximating human gait, with applicability to robotic lower-limb systems including assistive devices and bipedal robots.
|
|
14:45-15:15, Paper TuLB2R.24 | Add to My Program |
Adaptive Gaits of TARS: A Sci-Fi Inspired Multimodal Robot |
|
Sripada, Aditya | Nimlbe.ai |
Keywords: Legged Robots, Humanoid and Bipedal Locomotion, Actuation and Joint Mechanisms
Abstract: Science fiction has long inspired robotic design, often depicting unconventional locomotion strategies that challenge real-world feasibility. One such example is TARS, the quadrilateral robot from Interstellar, which demonstrates three distinct gaits: (1) a bipedal gait where the inner and outer legs move together, (2) an upright quadrupedal gait, and (3) a high-speed rolling gait. This study explores how these three gaits could enhance mobility in space exploration and whether a robotic system can be designed to transition between them seamlessly. Two robotic designs have been developed to investigate the feasibility of these gaits: TARS v1.0, a 1:20 scale prototype that successfully implements a passive dynamic-inspired bipedal gait, and TARS v2.0, a four-legged version capable of transitioning between all three gaits. Although conventional legged systems such as quadrupeds and bipeds have often been touted as the primary solution for unmanned space exploration that overcomes the limitations of wheeled systems, this narrow perspective might overlook alternative robotic forms and locomotion strategies. This study explores the feasibility of unconventional gaits through rapid prototyping (e.g., 3D printing). Preliminary results confirm the viability of the bipedal gait, with ongoing work refining the quadrupedal and rolling modes. This research advances multimodal locomotion and expands robotic design beyond conventional paradigms.
|
|
14:45-15:15, Paper TuLB2R.25 | Add to My Program |
Shape Memory Alloy Based Implantable Device for Heart Failure |
|
Kim, Yongjin | Seoul National University |
Lee, Jeonghyeon | Seoul National University |
Kim, Jeongwon | Seoul National University College of Medicine, SMG-SNU Boramae M |
Cho, Domin | Seoul National University |
Oh, Se Jin | Seoul National University College of Medicine, SMG-SNU Boramae M |
Han, Amy Kyungwon | Seoul National University |
Keywords: Medical Robots and Systems, Soft Robot Applications, Health Care Management
Abstract: Heart failure (HF), particularly acute decompensated heart failure (ADHF), remains a major clinical burden with limited long-term therapeutic strategies. Pharmacological treatments primarily aim to reduce congestion, yet many patients are discharged with residual symptoms, resulting in poor prognosis. Existing device-based therapies, such as balloon catheters, offer temporary relief but require hospitalization, involve direct blood contact, and are unsuitable for long-term use. This study introduces a Shape-Memory-Alloy (SMA) based implantable device designed to externally compress the superior vena cava (SVC) to modulate venous return and alleviate ADHF symptoms. The biocompatible SMA actuators dynamically adjust the cross-sectional area of the SVC without direct blood contact, thereby minimizing thrombosis and endothelial injury. Benchtop testing demonstrated effective blood flow regulation with a 68% reduction in flow rate while maintaining a safe surface temperature (<40 °C). In-vivo validation using a porcine HF model showed a significant reduction in left ventricular preload (18.6% decrease in end-diastolic volume), confirmed by a leftward shift in the pressure-volume loop. These results highlight the potential of this SMA-based device as a minimally invasive, long-term solution for treating ADHF.
|
|
14:45-15:15, Paper TuLB2R.26 | Add to My Program |
Pose-Free Dynamic Reconstruction from Sparse Multi-Camera Sequences |
|
Noh, Jeongho | Seoul National University |
Kim, Ayoung | Seoul National University |
Keywords: Deep Learning for Visual Perception, Multi-Robot SLAM, Semantic Scene Understanding
Abstract: Accurate 3D reconstruction and camera pose tracking are critical for robotic applications such as navigation and manipulation. However, existing methods often degrade in dynamic environments where moving objects introduce pose estimation errors or rely on restrictive assumptions such as fixed extrinsic calibration. While some recent approaches address dynamic scenes, most rely on monocular inputs, which can lead to geometric ambiguity. To address these challenges, we propose a pose-free, vision-based pipeline for reconstructing dynamic scenes from sparse multi-camera sequences. Our approach leverages the foundation model MASt3R to align RGB images across time and viewpoints without requiring ground-truth camera parameters or depth measurements. To address dynamic environments, we utilize the discrepancy between ego-motion flow and optical flow to segment moving regions and decouple the point cloud into static and dynamic components. We then refine camera poses for static regions using ICP, improving the overall alignment. We validate our pipeline on a custom indoor dataset, where time-synchronized frames from dual cameras are selectively sampled to form sparse sequences reflecting real-world motion. The results demonstrate accurate camera pose estimation and clean scene reconstruction, even in the presence of motion blur and occlusions. This pipeline offers practical feasibility for multi-agent systems in real-world settings where calibration and synchronization are limited.
|
|
14:45-15:15, Paper TuLB2R.27 | Add to My Program |
Embodied Artificial Intelligence for Autonomous Perception, Planning and Control of Soft Grippers |
|
Rao, Varshith | Singapore University of Technology and Design |
Truong, Van Tien | School of Materials Science and Engineering, Nanyang Technologic |
Tan, John | SUTD |
Stalin, Thileepan | Singapore University of Technology and Design |
Kanhere, Elgar | Singapore University of Technology and Design |
Aurobindo, Aaditya | Singapore University of Technology and Design |
Valdivia y Alvarado, Pablo | Singapore University of Technology and Design, MIT |
Keywords: Soft Robot Applications, AI-Enabled Robotics, Perception for Grasping and Manipulation
Abstract: Integrating AI with soft robotics offers a novel approach to enhancing robotic planning and execution. This work explores embodied AI for generalized planning in manipulation tasks using soft grippers. By leveraging large language models (LLMs) and open-set object detection, robots can autonomously generate high-level plans from natural language input. The system features two robotic arms, each with a soft gripper and RGBD camera. Built upon the ROS 1 framework, it includes modules for perception, planning, and control. The perception model performs 2D object detection and 3D localization using image grounding. Multistep perception expands the workspace and improves accuracy, while depth post-processing like Navier-Stokes inpainting preserves edge continuity. User input is processed via NLP to generate a concise prompt. This, along with an annotated image, is sent to a remote vision-enabled LLM, which returns high-level parameterized plans. These are translated into 6-axis joint trajectories using inverse kinematics and Cartesian path planning with MoveIt. Each joint position trajectory is sent to the corresponding joint controller through the robotic arm’s ROS driver. By linking high-level AI planning with real-world execution via soft robotics, this approach minimizes the need for extensive low-level motion planning, boosts automation in dynamic environments, and offers a more user-friendly experience.
|
|
14:45-15:15, Paper TuLB2R.28 | Add to My Program |
Perching and Grasping by a Dual-Purpose Hybrid Gripper for Aerial Robots |
|
Han, Ziyin | University of Illinois at Urbana Champaign |
Cheng, Sheng | University of Illinois Urbana-Champaign |
Gao, Junjie | University of Illinois Urbana-Champaign |
Pham, Tien Hung | Japan Advanced Institute of Science and Technology |
Ho, Van | Japan Advanced Institute of Science and Technology |
Hovakimyan, Naira | University of Illinois at Urbana-Champaign |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications, Soft Sensors and Actuators
Abstract: This poster presents a dual-function hybrid gripper that enables aerial robots to have both perching and grasping functionalities, addressing the major limitations of previous solutions that focused on either functionality exclusively. Traditional perching end-effectors lack manipulation capabilities, while aerial grasping platforms often struggle with stable perching. Our approach integrates both functionalities, allowing the drone to land on various shapes of poles and irregular landing spots using a grasped connector. The hybrid gripper preserves the compliance of soft material for adaptive contact while incorporating a rigid shell to ensure sufficient load-bearing capacity. Experimental validation demonstrates successful perching on complex geometries and securely grasping diverse objects, highlighting the advantages of a unified perching-grasping mechanism.
|
|
14:45-15:15, Paper TuLB2R.29 | Add to My Program |
Development of SAM^3: A Cable-Suspended Aerial Manipulator with a Moving Mass |
|
Yun, Tae Ho | KAIST |
Han, Wonjun | Korea Advanced Institute of Science and Technology |
Kim, Min Jun | KAIST |
Keywords: Aerial Systems: Mechanics and Control, Mechanism Design, Aerial Systems: Applications
Abstract: Cable-suspended aerial manipulators (SAMs), whose bases are supported by tensioned cables and whose manipulators perform designated tasks, overcome the limitations of many aerial manipulators, such as restricted operation time and limited payload capacity, and enable aerial manipulation tasks. However, most SAM designs are confined to rotor-based actuation for oscillation damping and attitude control, which generates loud noise and strong airflow during operation, hindering integration into daily environments. Moreover, rotor-based systems require continuous base moments depending on attitude, posing challenges in terms of energy efficiency. This study proposes a SAM system that incorporates internal actuation mechanisms, namely moving masses and reaction wheels, to mitigate the drawbacks of rotor-based designs and reduce the energy required for attitude control.
|
|
14:45-15:15, Paper TuLB2R.30 | Add to My Program |
Enhanced TEB Planner for Narrow Passages |
|
Lee, Hahjin | Ewha Womans University |
Kim, Young J. | Ewha Womans University |
Keywords: Motion and Path Planning
Abstract: Traditional path-planning methods often struggle to find optimal trajectories in narrow passages, leading to inefficient or failed maneuvers. In this study, we propose an enhanced Timed-Elastic-Band (TEB) algorithm that significantly improves obstacle avoidance in constrained environments. Our pipeline begins by randomly sampling robot configurations, which are filtered based on their proximity to the global path, generated using the A* algorithm. Next, we evaluate the clearance of each sampled configuration from nearby obstacles. If the configuration has low clearance from obstacles, it proceeds to the Medial Axis Searcher; otherwise, it is directly optimized through hyper-graph optimization. For configurations located near obstacles, a grid search is conducted to identify the local maximum in the distance map, effectively locating a sample on the medial axis of the free space. The medial-axis configurations are then used as positional constraints in a hyper-graph optimization framework for the final local path. We evaluate our pipeline in a simulated environment across multiple narrow passage scenarios. Results show that the proposed algorithm improves the success rate by 27.5% with only an 8.1% increase in average planning time compared to TEB, demonstrating its effectiveness.
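A small sketch (assumptions ours) of the Medial Axis Searcher step described above: from a low-clearance sample, grid-search the local maximum of the obstacle distance map to land on the medial axis of the free space:

```python
# Illustrative grid search for a locally maximal-clearance cell around a sample.
import numpy as np

def snap_to_medial_axis(dist_map, start, radius=10):
    """dist_map: 2D array of distances to the nearest obstacle (cells);
    start: (row, col) sample; returns the cell with locally maximal clearance."""
    r0, c0 = start
    h, w = dist_map.shape
    rs = slice(max(r0 - radius, 0), min(r0 + radius + 1, h))
    cs = slice(max(c0 - radius, 0), min(c0 + radius + 1, w))
    window = dist_map[rs, cs]
    dr, dc = np.unravel_index(np.argmax(window), window.shape)
    return rs.start + dr, cs.start + dc
```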
|
|
14:45-15:15, Paper TuLB2R.31 | Add to My Program |
Assistive Grasp: Smart Object Tracking for Robotic Manipulator |
|
Shahria, Md Tanzil | University of Wisconsin-Milwaukee |
Rahman, Mohammad | University of Wisconsin-Milwaukee |
Keywords: AI-Based Methods, Human-Centered Automation, Computer Vision for Automation
Abstract: In the U.S., over 6.8 million people live with mobility-related disabilities, and nearly one-third of wheelchair users rely on daily assistance for essential tasks. With caregivers typically needed for 4–6 hours a day, families face growing physical and financial strain — all while the caregiver workforce continues to shrink. As a result, assistive technologies have emerged as alternatives, aiming to restore independence for individuals with disabilities. However, many solutions remain underutilized due to their operational complexity. For instance, wheelchair-mounted assistive robotic manipulators offer promising potential for tasks like object grasping. Yet, they demand significant cognitive and physical effort from users, which can be a frustrating and time-consuming process. To address this gap, we introduce a vision-guided system to make this easier. The robot detects and tracks the target object, aligns itself, and leaves the user with a simple task: command the gripper to grasp. Users can still fine-tune the position. The system also adapts to target motion mid-task. The system combines YOLOv8m for detection, DeepSORT for tracking, RealSense depth data for 3D localization, and PID-based planning for smooth control. It achieved an 85% success rate in trials, with a 40% reduction in user control time compared to manual operation. This work contributes toward making assistive manipulation more accessible, efficient, and user-friendly for wheelchair-bound individuals.
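As a hedged illustration of the PID-based planning stage mentioned above (not the authors' controller; gains and interfaces are placeholders), a 3D PID loop driving the end-effector toward the tracked object's position might look like this:

```python
# Illustrative 3D PID controller producing a velocity command toward the target.
import numpy as np

class PID3D:
    def __init__(self, kp=1.2, ki=0.0, kd=0.1, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = np.zeros(3)
        self.prev_err = np.zeros(3)

    def step(self, target_xyz, ee_xyz):
        err = np.asarray(target_xyz) - np.asarray(ee_xyz)   # 3D position error
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv  # velocity command
```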
|
|
TuCT1 Regular Session, 302 |
Add to My Program |
Award Finalists 3 |
|
|
Chair: Vasudevan, Ram | University of Michigan |
Co-Chair: Tapia, Lydia | University of New Mexico |
|
15:15-15:20, Paper TuCT1.1 | Add to My Program |
Individual and Collective Behaviors in Soft Robot Worms Inspired by Living Worm Blobs |
|
Kaeser, Carina | Student |
Kwon, Junghan | Pusan National University |
Challita, Elio | Harvard University |
Tuazon, Harry | Georgia Institute of Technology |
Wood, Robert | Harvard University |
Bhamla, Saad | Georgia Institute of Technology |
Werfel, Justin | Harvard University |
Keywords: Swarm Robotics, Biologically-Inspired Robots, Soft Robot Applications
Abstract: California blackworms constitute a recently identified animal system exhibiting unusual collective behaviors, in which dozens to thousands of worms entangle to form a "blob" capable of actions like locomotion as an aggregate. In this paper we describe a system of pneumatic soft robots inspired by the blackworms, intended for the study of collective behaviors enabled and mediated by such physical entanglement. Both the robots and worms have high aspect ratio (≳1:50), intertwine in complex 3D configurations, operate both in air and underwater, and can locomote both individually and as a collective. We demonstrate and characterize locomotion for both individual robots and entangled blobs, explore the tunability of entanglement strength, and compare these to the analogous versions in living worms. The robots provide a testbed for studying mechanisms underlying behaviors observed in worm blobs, as well as serving as a platform for studies of novel collective behaviors based on physical entanglement.
|
|
15:20-15:25, Paper TuCT1.2 | Add to My Program |
Informed Repurposing of Quadruped Legs for New Tasks |
|
Chen, Fuchen | Arizona State University |
Aukes, Daniel | Arizona State University |
Keywords: Mechanism Design, Legged Robots, Compliant Joints and Mechanisms
Abstract: Redesigning and remanufacturing robots are infeasible for resource-constrained environments like space or undersea. This work thus studies how to evaluate and repurpose existing, complementary, quadruped legs for new tasks. We implement this approach on 15 robot designs generated from combining six pre-selected leg designs. The performance maps for force-based locomotion tasks like pulling, pushing, and carrying objects are constructed via a learned policy that works across all designs and adapts to the limits of each. Performance predictions agree well with real-world validation results. The robot can locomote at 0.5 body lengths per second while exerting a force that is almost 60% of its weight.
|
|
15:25-15:30, Paper TuCT1.3 | Add to My Program |
Intelligent Self-Healing Artificial Muscle: Mechanisms for Damage Detection and Autonomous Repair of Puncture Damage in Soft Robotics |
|
Krings, Ethan | University of Nebraska-Lincoln |
McManigal, Patrick | University of Nebraska-Lincoln |
Markvicka, Eric | University of Nebraska-Lincoln |
Keywords: Soft Robot Materials and Design, Soft Sensors and Actuators, Soft Robot Applications
Abstract: Soft robotics are characterized by their high deformability, mechanical robustness, and inherent resistance to damage. These unique properties present exciting new opportunities to enhance both emerging and existing fields such as healthcare, manufacturing, and exploration. However, to function effectively in unstructured environments, these technologies must be able to withstand the same real-world conditions that human skin and other soft biological materials are typically subjected to. Here, we present a novel soft material architecture designed for active detection of material damage and autonomous repair in soft robotic actuators. By integrating liquid metal (LM) microdroplets within a silicone elastomer, the system can detect and localize damage through the formation of conductive pathways that arise from extreme pressure or puncture events. These newly formed conductive networks function as in situ Joule heating elements, facilitating the reprocessing and healing of the material. The architecture allows for the reconfiguration of the newly formed electrical network using high current densities, employing electromigration and thermal mechanisms to restore functionality without manual intervention. This innovative approach not only enhances the resilience and performance of soft materials but also supports a wide range of applications in soft robotics and wearable technologies, where adaptive and autonomous systems are crucial for operation in dynamic and unpredictable environments.
|
|
15:30-15:35, Paper TuCT1.4 | Add to My Program |
SELP: Generating Safe and Efficient Task Plans for Robot Agents with Large Language Models |
|
Wu, Yi | Purdue University |
Xiong, Zikang | Deeproute.ai |
Hu, Yiran | Purdue University |
Iyengar, Shreyash Sridhar | Purdue University |
Jiang, Nan | Purdue University |
Bera, Aniket | Purdue University |
Tan, Lin | Purdue University |
Jagannathan, Suresh | Purdue University |
Keywords: AI-Based Methods, Autonomous Agents, Agent-Based Systems
Abstract: Despite significant advancements in large language models (LLMs) that enhance robot agents’ understanding and execution of natural language (NL) commands, ensuring the agents adhere to user-specified constraints remains challenging, particularly for complex commands and long-horizon tasks. To address this challenge, we present three key insights, equivalence voting, constrained decoding, and domain-specific fine-tuning, which significantly enhance LLM planners’ capability in handling complex tasks. Equivalence voting ensures consistency by generating and sampling multiple Linear Temporal Logic (LTL) formulas from NL commands, grouping equivalent LTL formulas, and selecting the majority group of formulas as the final LTL formula. Constrained decoding then uses the generated LTL formula to enforce the autoregressive inference of plans, ensuring the generated plans conform to the LTL. Domain-specific fine-tuning customizes LLMs to produce safe and efficient plans within specific task domains. Our approach, Safe Efficient LLM Planner (SELP), combines these insights to create LLM planners to generate plans adhering to user commands with high confidence. We demonstrate the effectiveness and generalizability of SELP across different robot agents and tasks, including drone navigation and robot manipulation. For drone navigation tasks, SELP outperforms state-of-the-art planners by 10.8% in safety rate (i.e., finishing tasks conforming to NL commands) and by 19.8% in plan efficiency. For robot manipulation tasks, SELP achieves 20.4% improvement in safety rate. Our datasets for evaluating NL-to-LTL and robot task planning will be released in github.com/lt-asset/selp.
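The equivalence-voting step can be illustrated with a short sketch; the equivalence check is a placeholder callback, and nothing here reproduces SELP's actual grouping procedure:

```python
# Illustrative majority voting over equivalence classes of sampled LTL formulas.
def equivalence_vote(formulas, equivalent):
    """formulas: list of LTL strings; equivalent(f, g) -> bool (assumed given,
    e.g. backed by an LTL equivalence checker)."""
    groups = []                      # list of lists of mutually equivalent formulas
    for f in formulas:
        for g in groups:
            if equivalent(f, g[0]):
                g.append(f)
                break
        else:
            groups.append([f])
    best = max(groups, key=len)      # majority equivalence class
    return best[0], len(best) / len(formulas)   # representative formula + vote share
```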
|
|
15:35-15:40, Paper TuCT1.5 | Add to My Program |
Marginalizing and Conditioning Gaussians Onto Linear Approximations of Smooth Manifolds with Applications in Robotics |
|
Guo, Zi Cong | University of Toronto |
Forbes, James Richard | McGill University |
Barfoot, Timothy | University of Toronto |
Keywords: Probability and Statistical Methods, SLAM, Probabilistic Inference
Abstract: We present closed-form expressions for marginalizing and conditioning Gaussians onto linear manifolds, and demonstrate how to apply these expressions to smooth nonlinear manifolds through linearization. Although marginalization and conditioning onto axis-aligned manifolds are well-established procedures, doing so onto non-axis-aligned manifolds is not as well understood. We demonstrate the utility of our expressions through three applications: 1) approximation of the projected normal distribution, where the quality of our linearized approximation increases as problem nonlinearity decreases; 2) covariance extraction in Koopman SLAM, where our covariances are shown to be consistent on a real-world dataset; and 3) covariance extraction in constrained GTSAM, where our covariances are shown to be consistent in simulation.
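For orientation, the textbook closed forms for a linear manifold $\{x : Ax = b\}$, with $A$ full row rank and $x \sim \mathcal{N}(\mu, \Sigma)$, are reproduced below; the paper's expressions for general non-axis-aligned manifolds and their linearized extension to smooth manifolds may be stated differently:

$$
y = A x \sim \mathcal{N}\!\left(A\mu,\; A \Sigma A^{\top}\right) \quad \text{(marginalization onto the manifold coordinates)},
$$
$$
\mu_{x \mid Ax=b} = \mu + \Sigma A^{\top}\!\left(A \Sigma A^{\top}\right)^{-1}\!\left(b - A\mu\right), \qquad
\Sigma_{x \mid Ax=b} = \Sigma - \Sigma A^{\top}\!\left(A \Sigma A^{\top}\right)^{-1}\! A \Sigma .
$$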
|
|
15:40-15:45, Paper TuCT1.6 | Add to My Program |
Dynamic Tube MPC: Learning Error Dynamics with Massively Parallel Simulation for Robust Safety in Practice |
|
Compton, William | California Institute of Technology |
Csomay-Shanklin, Noel | California Institute of Technology |
Johnson, Cole | Georgia Institute of Technology |
Ames, Aaron | California Institute of Technology |
Keywords: Deep Learning Methods, Motion and Path Planning, Robot Safety
Abstract: Safe navigation of cluttered environments is a critical challenge in robotics. It is typically approached by separating the planning and tracking problems, with planning executed on a reduced order model to generate reference trajectories, and control techniques used to track these trajectories on the full order dynamics. Inevitable tracking error necessitates robustification of the nominal plan to ensure safety; in many cases, this is accomplished via worst-case bounding, which ignores the fact that some trajectories of the planning model may be easier to track than others. In this work, we present a novel method leveraging massively parallel simulation to learn a dynamic tube representation, which characterizes tracking performance as a function of actions taken by the planning model. Planning model trajectories are then optimized such that the dynamic tube lies in the free space, allowing a balance between performance and safety to be traded off in real time. The resulting Dynamic Tube MPC is applied to the 3D hopping robot ARCHER, enabling agile and performant navigation of cluttered environments, and safe collision-free traversal of narrow corridors.
|
|
TuCT2 Regular Session, 301 |
Add to My Program |
SLAM 2 |
|
|
Chair: Wang, Chen | University at Buffalo |
Co-Chair: Fallon, Maurice | University of Oxford |
|
15:15-15:20, Paper TuCT2.1 | Add to My Program |
ISLAM: Imperative SLAM |
|
Fu, Taimeng | University at Buffalo |
Su, Shaoshu | State University of New York at Buffalo |
Lu, Yiren | Case Western Reserve University |
Wang, Chen | University at Buffalo |
Keywords: SLAM, Deep Learning Methods
Abstract: Simultaneous Localization and Mapping (SLAM) stands as one of the critical challenges in robot navigation. A SLAM system often consists of a front-end component for motion estimation and a back-end system for eliminating estimation drifts. Recent advancements suggest that data-driven methods are highly effective for front-end tasks, while geometry-based methods continue to be essential in the back-end processes. However, such a decoupled paradigm between the data-driven front-end and geometry-based back-end can lead to sub-optimal performance, consequently reducing the system's capabilities and generalization potential. To solve this problem, we propose a novel self-supervised imperative learning framework, named imperative SLAM (iSLAM), which fosters reciprocal correction between the front-end and back-end, thus enhancing performance without necessitating any external supervision. Specifically, we formulate the SLAM problem as a bilevel optimization so that the front-end and back-end are bidirectionally connected. As a result, the front-end model can learn global geometric knowledge obtained through pose graph optimization by back-propagating the residuals from the back-end component. We showcase the effectiveness of this new framework through an application of stereo-inertial SLAM. The experiments show that the iSLAM training strategy achieves an accuracy improvement of 22% on average over a baseline model. To the best of our knowledge, iSLAM is the first SLAM system showing that the front-end and back-end components can mutually correct each other in a self-supervised manner.
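A conceptual sketch (ours, not the iSLAM code) of the bilevel coupling described above: a learned front-end predicts relative poses, a differentiable geometric back-end returns a consistency residual, and that residual is back-propagated to the front-end with no external labels:

```python
# Illustrative self-supervised training step coupling a learned front-end
# with a differentiable geometric back-end residual.
import torch

def self_supervised_step(front_end, back_end_residual, batch, optimizer):
    """front_end: torch.nn.Module mapping sensor data -> relative pose estimates.
    back_end_residual: differentiable function of those estimates (e.g. a
    pose-graph consistency term), returning a scalar loss."""
    optimizer.zero_grad()
    rel_poses = front_end(batch)           # upper level: learned motion estimation
    loss = back_end_residual(rel_poses)    # lower level: geometric optimization residual
    loss.backward()                        # reciprocal correction via gradients
    optimizer.step()
    return loss.item()
```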
|
|
15:20-15:25, Paper TuCT2.2 | Add to My Program |
DGS-SLAM: A Visual Dense SLAM Based on Gaussian Splatting in Dynamic Environments |
|
Chen, Yushi | Beijing University of Posts and Telecommunications |
Liu, Haosong | Beijing University of Posts and Telecommunications |
Zhao, Fang | Beijing University of Posts and Telecommunications |
Hong, Yunhan | Beijing University of Posts and Telecommunications |
Yan, Jiaquan | Beijing University of Posts and Telecommunications |
Luo, Haiyong | Institute of Computing Technology, Chinese Academy of Sciences |
Keywords: SLAM, RGB-D Perception, Mapping
Abstract: Visual dense SLAM can facilitate pose estimation and map reconstruction for sensor carriers in unknown environments. However, in uncontrolled environments such as offices, shopping malls, and train stations, people frequently walk back and forth and objects are temporarily moved within the scene. Most existing visual dense SLAM systems do not account for these dynamic factors, leading to localization drift and map distortion. In this paper, we propose DGS-SLAM, a system capable of achieving robust localization and high-fidelity static map reconstruction in dynamic environments. We utilize semantic 3D Gaussians for scene representation, effectively eliminating interference from dynamic objects and refining the reconstruction of the static background. We enhance the tracking accuracy and mapping quality of dense SLAM by using a distance distribution-based Gaussian pruning algorithm and implementing a coarse-to-fine tracking strategy with bundle adjustment and differentiable rendering. We perform qualitative and quantitative evaluations on two publicly available dynamic environment datasets. The results indicate that our method effectively reduces the interference caused by dynamic objects, enabling visual dense SLAM to maintain competitive tracking accuracy and mapping performance in dynamic environments.
|
|
15:25-15:30, Paper TuCT2.3 | Add to My Program |
ARS-SLAM: Accurate Robust Spinning LiDAR SLAM for a Quadruped Robot in Large-Scale Scenario |
|
Li, Jiehao | South China University of Technology |
Li, Chenglin | South China Agricultural University |
Chen, Hongkai | South China Agricultural University |
Guo, Haijun | South China Agricultural University |
Luo, Xiwen | South China Agricultural University |
Chen, C. L. Philip | South China University of Technology |
Yang, Chenguang | University of Liverpool |
Keywords: SLAM, Legged Robots, Mapping
Abstract: Employing a quadruped robot for real-time mapping and positioning over large-scale scenes is challenging. Strong vibration and instability during locomotion, together with the heavy computation required to represent large, dense scenes, degrade both mapping accuracy and real-time performance. Therefore, we propose an accurate and robust spinning LiDAR SLAM (ARS-SLAM) algorithm for a quadruped robot in large-scale scenes. The tightly coupled iterated Kalman filter from FAST-LIO2 is introduced into the front end of the Cartographer framework to improve the accuracy and robustness of robot pose estimation. To reduce the computational complexity of the original Cartographer framework, a pose threshold optimization algorithm is introduced to effectively remove redundant information from loop detection and improve computational efficiency and real-time performance. We tested the system against state-of-the-art point-cloud-based methods, LIO-SAM and FAST-LIO2, on a large dataset covering science parks and underground parking lots, and the results show that the proposed system achieves equal or better accuracy and real-time performance.
|
|
15:30-15:35, Paper TuCT2.4 | Add to My Program |
Tightly Coupled Range Inertial Odometry and Mapping with Exact Point Cloud Downsampling |
|
Koide, Kenji | National Institute of Advanced Industrial Science and Technology |
Takanose, Aoki | National Institute of Advanced Industrial Science and Technology |
Oishi, Shuji | National Institute of Advanced Industrial Science and Technology |
Yokozuka, Masashi | Nat. Inst. of Advanced Industrial Science and Technology |
Keywords: SLAM, Localization, Mapping
Abstract: In this work, to facilitate the real-time processing of multi-scan registration error minimization on factor graphs, we devise a point cloud downsampling algorithm based on coreset extraction. This algorithm extracts a subset of the residuals of input points such that the subset yields exactly the same quadratic error function as that of the original set for a given pose. This enables a significant reduction in the number of residuals to be evaluated without approximation errors at the sampling point. Using this algorithm, we devise a complete SLAM framework that consists of odometry estimation based on sliding window optimization and global trajectory optimization based on registration error minimization over the entire map, both of which can run in real time on a standard CPU. The experimental results demonstrate that the proposed framework outperforms state-of-the-art CPU-based SLAM frameworks without the use of GPU acceleration.
|
|
15:35-15:40, Paper TuCT2.5 | Add to My Program |
Scalable Multi-Session Visual SLAM in Large-Scale Scenes with Subgraph Optimization |
|
Pan, Xiaokun | Zhejiang University |
Li, Zhenzhe | Zhejiang University |
Fan, Tianxing | Zhejiang University |
Zhai, Hongjia | Zhejiang University |
Bao, Hujun | Zhejiang University |
Zhang, Guofeng | Zhejiang University |
Keywords: SLAM, Localization, Mapping
Abstract: Multi-session visual SLAM systems enable 6-DoF camera localization along with long-term maintenance and expansion of the global map by utilizing image data from different sessions. However, in large-scale environments, these systems often suffer from severe scale drift. While modern SLAM systems attempt to maintain global map consistency through loop detection and correction, they still face challenges in terms of convergence and accuracy. In this paper, we propose a robust large-scale multi-session SLAM system for long-term localization and mapping that achieves global consistency. Furthermore, to address the backend optimization problem in large-scale environments, we introduce a hierarchical optimization strategy based on the graph structure. More specifically, a subgraph structure is introduced to reduce the size of the problem while effectively propagating scale correction information. In addition, a hierarchical strategy enables coarse-to-fine updates of the graph states. Experimental results not only demonstrate that our method efficiently optimizes the pose graph and maintains map consistency in large-scale environments, but also highlight the effectiveness and scalability of the proposed approach.
|
|
15:40-15:45, Paper TuCT2.6 | Add to My Program |
Accurate and Rapidly-Convergent GNSS/INS/LiDAR Tightly-Coupled Integration Via Invariant EKF Based on Two-Frame Group (I) |
|
Xia, Chunxi | Wuhan University |
Li, Xingxing | Wuhan University |
He, Feiyang | China Ship Development and Design Center |
Li, Shengyu | Wuhan University |
Zhou, Yuxuan | Wuhan University |
Keywords: SLAM, Autonomous Vehicle Navigation
Abstract: Nowadays, increasing attention has been directed toward the integration of the global navigation satellite system (GNSS), inertial navigation system (INS), and light detection and ranging (LiDAR) for intelligent system navigation. However, the existing systems, which generally adopt estimators of the extended Kalman filter (EKF) or factor graph optimization (FGO), still face challenges regarding consistency and convergence. Such methods could provide optimal navigation solutions only if the initial guess of the state is sufficiently close to the true trajectory; otherwise, the systems might undergo accuracy loss or, even worse, divergence. To address this issue, we derive an invariant extended Kalman filter (IEKF) based on the two-frame group (TFG) in the left-invariant form, and integrate raw GNSS double-differenced observations, inertial measurements, and LiDAR plane features within this framework. By designing a unified group structure that simultaneously maintains both the navigation states and inertial measurement unit (IMU) biases, TFG contributes to the approximate log-linearity and invariance of the system dynamics model, expected to effectively resolve the convergence issue. A set of real-world experiments was conducted to evaluate the system, with results indicating its potential to achieve submeter to centimeter-level positioning accuracy, surpassing state-of-the-art methods in terms of accuracy, availability, and convergence.
|
|
TuCT3 Regular Session, 303 |
Add to My Program |
3D Content Capture and Generation 2 |
|
|
Chair: Elghazaly, Gamal | University of Luxembourg |
Co-Chair: Bano, Sophia | University College London |
|
15:15-15:20, Paper TuCT3.1 | Add to My Program |
Winding Number-Guided Edge-Preserving Implicit Neural Representation of CAD Surfaces |
|
Cheng, Yuhang | Southwest University |
Wang, Zhiyuan | Southwest University |
He, Jialan | Southwest University |
Wang, Xiaogang | Southwest University |
Keywords: Computer Vision for Manufacturing, Semantic Scene Understanding
Abstract: Implicit neural representations have emerged as a powerful tool for 3D reconstruction due to their excellent performance. However, existing methods cannot achieve ideal results on CAD models, mainly because such models are usually constructed from piecewise smooth surfaces and have sharp edge structures. To this end, we propose a winding number-guided implicit surface reconstruction method, which mainly consists of a winding number-guided regularizer and a dynamic edge sampling strategy. The winding number-guided regularizer can effectively constrain the global normal consistency of the input raw data, as well as improve the unsatisfactory implicit surface reconstruction results caused by the unavailability of normal information. Meanwhile, in order to reduce excessive smoothing at the sharp edges of the implicit surface, we propose a dynamic edge sampling strategy that samples near the sharp-edge regions of the 3D shape, which effectively prevents the regularizer from smoothing all regions. Finally, we combine them with a simple data term for robust implicit surface reconstruction. Experimental results show that our method significantly improves the quality of 3D reconstruction compared with state-of-the-art methods.
|
|
15:20-15:25, Paper TuCT3.2 | Add to My Program |
A Refined 3D Gaussian Representation for High-Quality Dynamic Scene Reconstruction |
|
Zhang, Bin | Guangdong University of Technology |
Zeng, Bi | Guangdong University of Technology |
Peng, Zexin | Guangdong University of Technology |
Keywords: Visual Learning
Abstract: In recent years, Neural Radiance Fields (NeRF) has revolutionized three-dimensional (3D) reconstruction with its implicit representation. Building upon NeRF, 3D Gaussian Splatting (3D-GS) has departed from the implicit representation of neural networks and instead directly represents scenes as point clouds with Gaussian-shaped distributions. While this shift has notably elevated the rendering quality and speed of radiance fields, it has inevitably led to a significant increase in memory usage. Additionally, effectively rendering dynamic scenes in 3D-GS has emerged as a pressing challenge. To address these concerns, this paper proposes a refined 3D Gaussian representation for high-quality dynamic scene reconstruction. Firstly, we use a deformable multi-layer perceptron (MLP) network to capture the dynamic offset of Gaussian points and express the color features of points through hash encoding and a tiny MLP to reduce storage requirements. Subsequently, we introduce a learnable denoising mask coupled with a denoising loss to eliminate noise points from the scene, thereby further compressing the 3D Gaussian model. Finally, motion noise of points is mitigated through static constraints and motion consistency constraints. Experimental results demonstrate that our method surpasses existing approaches in rendering quality and speed, while significantly reducing the memory usage associated with 3D-GS, making it highly suitable for various tasks such as novel view synthesis and dynamic mapping.
|
|
15:25-15:30, Paper TuCT3.3 | Add to My Program |
GaussianRoom: Improving 3D Gaussian Splatting with SDF Guidance and Monocular Cues for Indoor Scene Reconstruction |
|
Xiang, Haodong | Tsinghua University |
Li, Xinghui | Tsinghua University |
Cheng, Kai | USTC |
Lai, Xiansong | Tsinghua University |
Zhang, Wanting | Tsinghua University |
Liao, Zhichao | Tsinghua University |
Zeng, Long | Tsinghua University |
Liu, Xueping | Tsinghua University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Visual Learning
Abstract: Embodied intelligence requires precise reconstruction and rendering to simulate large-scale real-world data. Although 3D Gaussian Splatting (3DGS) has recently demonstrated high-quality results with real-time performance, it still faces challenges in indoor scenes with large, textureless regions, resulting in incomplete and noisy reconstructions due to poor point cloud initialization and underconstrained optimization. Inspired by the continuity of signed distance field (SDF), which naturally has advantages in modeling surfaces, we propose a unified optimization framework that integrates neural signed distance fields (SDFs) with 3DGS for accurate geometry reconstruction and real-time rendering. This framework incorporates a neural SDF field to guide the densification and pruning of Gaussians, enabling Gaussians to model scenes accurately even with poor initialized point clouds. Simultaneously, the geometry represented by Gaussians improves the efficiency of the SDF field by piloting its point sampling. Additionally, we introduce two regularization terms based on normal and edge priors to resolve geometric ambiguities in textureless areas and enhance detail accuracy. Extensive experiments in ScanNet and ScanNet++ show that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis.
|
|
15:30-15:35, Paper TuCT3.4 | Add to My Program |
Hide-In-Motion: Embedding Steganographic Copyright Information into 4D Gaussian Splatting Assets |
|
Liu, Hengyu | The Chinese University of Hong Kong |
Li, Chenxin | Chinese University of Hong Kong |
Pan, Wentao | The Chinese University of Hong Kong |
Yang, Zhiqin | The Chinese University of Hong Kong |
Yang, Yifeng | Shanghai Jiao Tong University |
Liu, Yifan | Chinese University of Hong Kong |
Li, Wuyang | City University of Hong Kong |
Yuan, Yixuan | Chinese University of Hong Kong |
Keywords: Visual Learning, Deep Learning for Visual Perception, RGB-D Perception
Abstract: As 4D extensions of 3D Gaussian Splatting (4D-GS) emerge as groundbreaking techniques for dynamic scene reconstruction and novel view synthesis in robotics and computer vision, ensuring the security and trustworthiness of these assets becomes crucial. While steganography has advanced significantly in 2D and 3D media, existing methods are inadequate for the complex, dynamic nature of 4D-GS representations. To address this gap, we propose Hide-in-Motion, a novel 4D steganography method for hiding information through deformation in Gaussian splatting. Our approach introduces a composite attribute and a Decouple Feature Field for coarse-to-fine deformation modeling and embedding implicit information, along with an Opacity-Guided Adaptive strategy. Hide-in-Motion overcomes the limitations of previous techniques, enhancing both the robustness of embedded information and the quality of 4D reconstruction. Extensive evaluations demonstrate that our method successfully embeds and recovers implicit information across various modalities while maintaining high rendering quality in dynamic scenes. This work not only advances copyright protection and secure data transmission for 4D assets but also paves the way for enhancing the security and integrity of 4D digital assets. Code is available at https://github.com/CUHK-AIM-Group/Hide-in-Motion.
|
|
15:35-15:40, Paper TuCT3.5 | Add to My Program |
DENSER: 3D Gaussian Splatting for Scene Reconstruction of Dynamic Urban Environments |
|
Mohamad, Mahmud Ali | University of Luxembourg |
Elghazaly, Gamal | University of Luxembourg |
Hubert, Arthur | University of Luxembourg |
Frank, Raphael | University of Luxembourg |
Keywords: Intelligent Transportation Systems, Computer Vision for Transportation
Abstract: This paper presents DENSER, a framework leveraging 3D Gaussian splatting (3DGS) for the reconstruction of dynamic urban environments. While several methods for photorealistic scene representations, both implicitly using neural radiance fields (NeRF) and explicitly using 3DGS, have shown promising results in scene reconstruction of relatively complex dynamic scenes, modeling the dynamic appearance of foreground objects tends to be challenging, limiting the applicability of these methods to capture subtleties and details of the scenes, especially for dynamic objects. To this end, we propose DENSER, a framework that significantly enhances the representation of dynamic objects and accurately models the appearance of dynamic objects in the driving scene. Instead of directly using Spherical Harmonics (SH) to model the appearance of dynamic objects, we introduce and integrate a new method aiming at dynamically estimating SH bases using wavelets, resulting in a better representation of dynamic objects' appearance in both space and time. Besides object appearance, DENSER enhances object shape representation through densification of its point cloud across multiple scene frames, resulting in faster convergence of model training. Extensive evaluations on the KITTI dataset show that the proposed approach outperforms state-of-the-art methods by a wide margin. Source codes and models will be uploaded to this repository: https://github.com/sntubix/denser
|
|
15:40-15:45, Paper TuCT3.6 | Add to My Program |
Factorized Multi-Resolution HashGrid for Efficient Neural Radiance Fields: Execution on Edge-Devices |
|
Jun-Seong, Kim | POSTECH |
Kim, Mingyu | UBC |
Kim, GeonU | POSTECH |
Oh, Tae-Hyun | POSTECH |
Kim, Jin-Hwa | NAVER Cloud |
Keywords: Computational Geometry, Mapping
Abstract: We introduce Fact-Hash, a novel parameter-encoding method for training on-device neural radiance fields. Neural Radiance Fields (NeRF) have proven pivotal in 3D representations, but their applications are limited by large computational requirements. On-device training can open up broad application fields, providing resilience to communication limitations and privacy concerns, and fast adaptation to frequently changing scenes. However, challenges such as limited resources (GPU memory, storage, and power) impede their deployment. To handle this, we introduce Fact-Hash, a novel parameter encoding that merges tensor factorization and hash-encoding techniques. This integration offers two benefits: the use of rich high-resolution features and few-shot robustness. In Fact-Hash, we project 3D coordinates into multiple lower-dimensional forms (2D or 1D) before applying the hash function and then aggregate them into a single feature. Comparative evaluations against state-of-the-art methods demonstrate Fact-Hash's superior memory efficiency, preserving quality and rendering speed. Fact-Hash reduces memory usage by over one-third while maintaining PSNR values compared to previous encoding methods. The on-device experiment validates the superiority of Fact-Hash compared to alternative positional encoding methods in computational efficiency and energy consumption. These findings highlight Fact-Hash as a promising solution to improve feature grid representation, address memory constraints, and improve quality in various applications.
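A rough sketch of the projection-then-hash idea described above, assuming a nearest-vertex lookup instead of interpolation; the table size, grid resolution, feature width, and hash primes are illustrative choices, not values from the paper.

```python
import numpy as np

# Sketch of a factorized hash encoding: 3-D points are projected onto the
# xy / yz / xz planes, each 2-D projection indexes a small hash table of
# learnable features, and the per-plane features are summed into one vector.
rng = np.random.default_rng(0)

TABLE_SIZE = 2 ** 14          # hash table entries per plane (illustrative)
FEAT_DIM = 4                  # feature channels per entry (illustrative)
RESOLUTION = 64               # grid resolution per axis (illustrative)
PRIMES = np.array([1, 2654435761], dtype=np.uint64)   # spatial-hash primes

tables = rng.normal(0, 1e-2, size=(3, TABLE_SIZE, FEAT_DIM))  # one table per plane

def hash2d(ij):
    """Hash integer 2-D grid coordinates into a table index."""
    ij = ij.astype(np.uint64)
    return int((ij[0] * PRIMES[0]) ^ (ij[1] * PRIMES[1])) % TABLE_SIZE

def encode(xyz):
    """Encode a point in [0,1]^3 by summing features from the three planes."""
    grid = np.floor(np.asarray(xyz) * (RESOLUTION - 1)).astype(np.int64)
    planes = [grid[[0, 1]], grid[[1, 2]], grid[[0, 2]]]        # xy, yz, xz projections
    feature = np.zeros(FEAT_DIM)
    for p, proj in enumerate(planes):
        feature += tables[p, hash2d(proj)]
    return feature

print(encode([0.2, 0.7, 0.4]))   # one aggregated feature vector for the query point
```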
|
|
15:45-15:50, Paper TuCT3.7 | Add to My Program |
3D Uncertain Implicit Surface Mapping Using GMM and GP |
|
Zou, Qianqian | Leibniz University Hannover |
Sester, Monika | Leibniz University Hannover, Institute of Cartography and Geoinf |
Keywords: Mapping, Probability and Statistical Methods, Range Sensing
Abstract: In this study, we address the challenge of constructing continuous three-dimensional (3D) models that accurately represent uncertain surfaces, derived from noisy and incomplete LiDAR scanning data. Building upon our prior work, which utilized the Gaussian Process (GP) and Gaussian Mixture Model (GMM) for structured building models, we introduce a more generalized approach tailored for complex surfaces in urban scenes, where GMM Regression and GP with derivative observations are applied. A Hierarchical GMM (HGMM) is employed to optimize the number of GMM components and speed up the GMM training. With the prior map obtained from HGMM, GP inference is followed for the refinement of the final map. Our approach models the implicit surface of the geo-object and enables the inference of the regions that are not completely covered by measurements. The integration of GMM and GP yields well-calibrated uncertainty estimates alongside the surface model, enhancing both accuracy and reliability. The proposed method is evaluated on real data collected by a mobile mapping system. Compared to the performance in mapping accuracy and uncertainty quantification of other methods, such as Gaussian Process Implicit Surface map (GPIS) and log-Gaussian Process Implicit Surface map (Log-GPIS), the proposed method achieves lower RMSEs, higher log-likelihood values and lower computational costs for the evaluated datasets.
|
|
TuCT4 Regular Session, 304 |
Add to My Program |
Object Detection 1 |
|
|
Chair: Brandt, Laura Eileen | Massachusetts Institute of Technology |
Co-Chair: Martinson, Eric | Lawrence Technological University |
|
15:15-15:20, Paper TuCT4.1 | Add to My Program |
Mono-Camera-Only Target Chasing for a Drone in a Dense Environment by Cross-Modal Learning |
|
Yoo, Seungyeon | Seoul National University |
Jung, Seungwoo | Seoul National University |
Lee, Yunwoo | Seoul National University |
Shim, Dongseok | Seoul National University |
Kim, H. Jin | Seoul National University |
Keywords: Visual Learning, Deep Learning Methods, Vision-Based Navigation
Abstract: Chasing a dynamic target in a dense environment is one of the challenging applications of autonomous drones. The task requires multi-modal data, such as RGB and depth, to accomplish safe and robust maneuvers. However, using different types of modalities can be difficult due to the limited capacity of drones in terms of hardware complexity and sensor cost. Our framework resolves such restrictions in the target chasing task by using only a monocular camera instead of multiple sensor inputs. From an RGB input, the perception module can extract a cross-modal representation containing information from multiple data modalities. To learn cross-modal representations at training time, we employ variational auto-encoder (VAE) structures and the joint objective function across heterogeneous data. Subsequently, using latent vectors acquired from the pre-trained perception module, the planning module generates a proper next-time-step waypoint by imitation learning of the expert, which performs a numerical optimization using the privileged RGB-D data. Furthermore, the planning module considers temporal information of the target to improve tracking performance through consecutive cross-modal representations. Ultimately, we demonstrate the effectiveness of our framework through the reconstruction results of the perception module, the target chasing performance of the planning module, and the zero-shot sim-to-real deployment of a drone.
|
|
15:20-15:25, Paper TuCT4.2 | Add to My Program |
CoopDETR: A Unified Cooperative Perception Framework for 3D Detection Via Object Query |
|
Wang, Zhe | Institute for AI Industry Research, Tsinghua University |
Xu, Shaocong | Xiamen University |
Xucai, Zhuang | Institute for Al Industry Research, Tsinghua University |
Xu, Tongda | Tsinghua University |
Wang, Yan | Tsinghua University |
Liu, Jingjing | Institute for AI Industry Research (AIR), Tsinghua University |
Chen, Yilun | Tsinghua University |
Zhang, Ya-Qin | Institute for AI Industry Research(AIR), Tsinghua University |
Keywords: Computer Vision for Transportation, Deep Learning for Visual Perception, Autonomous Agents
Abstract: Cooperative perception enhances the individual perception capabilities of autonomous vehicles (AVs) by providing a comprehensive view of the environment. However, balancing perception performance and transmission costs remains a significant challenge. Current approaches that transmit region-level features across agents are limited in interpretability and demand substantial bandwidth, making them unsuitable for practical applications. In this work, we propose CoopDETR, a novel cooperative perception framework that introduces object-level feature cooperation via object query. Our framework consists of two key modules: single-agent query generation, which efficiently encodes raw sensor data into object queries, reducing transmission cost while preserving essential information for detection; and cross-agent query fusion, which includes Spatial Query Matching (SQM) and Object Query Aggregation (OQA) to enable effective interaction between queries. Our experiments on the OPV2V and V2XSet datasets demonstrate that CoopDETR achieves state-of-the-art performance and significantly reduces transmission costs to 1/782 of previous methods.
|
|
15:25-15:30, Paper TuCT4.3 | Add to My Program |
Learning Better Representations for Crowded Pedestrians in Offboard LiDAR-Camera 3D Tracking-By-Detection |
|
Li, Shichao | Zhuoyu Technology |
Li, Peiliang | HKUST, Robotics Institute |
Lian, Qing | HKUST |
Yun, Peng | The Hong Kong University of Science and Technology |
Chen, Xiaozhi | DJI |
Keywords: Computer Vision for Transportation, Intelligent Transportation Systems, Representation Learning
Abstract: Perceiving pedestrians in highly crowded urban environments is a difficult long-tail problem for learning-based autonomous perception. Speeding up 3D ground truth generation for such challenging scenes is performance-critical yet very challenging. The difficulties include the sparsity of the captured pedestrian point cloud and a lack of suitable benchmarks for a specific system design study. To tackle the challenges, we first collect a new multi-view LiDAR-camera 3D multiple-object-tracking benchmark of highly crowded pedestrians for in-depth analysis. We then build an offboard auto-labeling system that reconstructs pedestrian trajectories from LiDAR point cloud and multi-view images. To improve the generalization power for crowded scenes and the performance for small objects, we propose to learn high-resolution representations that are density-aware and relationship-aware. Extensive experiments validate that our approach significantly improves the 3D pedestrian tracking performance towards higher auto-labeling efficiency. The code will be publicly available at this HTTP URL.
|
|
15:30-15:35, Paper TuCT4.4 | Add to My Program |
Bi-Stream Knowledge Transfer for Semi-Supervised 3D Point Cloud Object Detection |
|
Zheng, Jilai | Shanghai Jiao Tong University |
Tang, Pin | Shanghai Jiao Tong University, China |
Ren, Xiangxuan | Shanghai Jiao Tong University |
Wang, Zhongdao | Noah's Ark Laboratory |
Ma, Chao | Shanghai Jiao Tong University |
Keywords: Computer Vision for Transportation, Deep Learning for Visual Perception, Autonomous Vehicle Navigation
Abstract: 3D point cloud object detection plays an important role in autonomous driving. However, labeling 3D object boxes is expensive and time-consuming, limiting the number of annotated point clouds used in fully-supervised training. This has led to a rise in semi-supervised 3D object detection research, which aims to improve model performance by leveraging both labeled and unlabeled point clouds. Existing methods typically rely on the Mean Teacher (MT) paradigm, which uses unlabeled instances discovered by the teacher with confidence scores higher than certain thresholds to train the student. However, this leads to a loss of information as it overlooks ambiguous instances from the teacher that could also contain valuable knowledge. To address this issue, we propose a Bi-Stream Knowledge Transfer (BiKT) framework that fully exploits and transfers knowledge from both confident and ambiguous instances to the student network. Specifically, all pseudo labels are allocated into two knowledge streams, the deterministic stream and the noisy stream, and then subsequently guide the student network through bi-level supervision. We also introduce a Dynamic Stream Switching (DSS) algorithm that sets the stream boundary tailored for the current learning status. To further improve the quality of pseudo labels in the knowledge streams, we propose a Diffusive Label Denoising (DLD) module, which is trained by explicitly generating noised instances and then learning to denoise them, as in diffusion models. Experiments show the state-of-the-art performance of our BiKT on the ONCE validation and testing sets, as well as the robust generalization capability when confronted with diverse base detectors, increased amount of unlabeled data, and distinct datasets (e.g., Waymo), unveiling the power of semi-supervised learning in 3D object detection.
|
|
15:35-15:40, Paper TuCT4.5 | Add to My Program |
Semantic-Supervised Spatial-Temporal Fusion for LiDAR-Based 3D Object Detection |
|
Wang, Chaoqun | The Chinese University of Hong Kong, Shenzhen |
Hong, Xiaobin | Nanjing University |
Li, Wenzhong | Nanjing University |
Zhang, Ruimao | The Chinese University of Hong Kong (Shenzhen) |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception, Sensor Fusion
Abstract: LiDAR-based 3D object detection presents significant challenges due to the inherent sparsity of LiDAR points. A common solution involves long-term temporal LiDAR data to densify the inputs. However, efficiently leveraging spatial-temporal information remains an open problem. In this paper, we propose a novel Semantic-Supervised Spatial-Temporal Fusion (ST-Fusion) method, which introduces a novel fusion module to relieve the spatial misalignment caused by object motion over time and a feature-level semantic supervision to sufficiently unlock the capacity of the proposed fusion module. Specifically, the ST-Fusion consists of a Spatial Aggregation (SA) module and a Temporal Merging (TM) module. The SA module employs a convolutional layer with progressively expanding receptive fields to aggregate the object features from the local regions to alleviate the spatial misalignment, while the TM module dynamically extracts object features from the preceding frames based on the attention mechanism for a comprehensive sequential representation. Besides, in the semantic supervision, we propose a Semantic Injection method to enrich the sparse LiDAR data by injecting point-wise semantic labels, using it for training a teacher model and providing a reconstruction target at the feature level supervised by the proposed object-aware loss. Extensive experiments on various LiDAR-based detectors demonstrate the effectiveness and universality of our proposal, yielding an improvement of approximately +2.8% in NDS on the nuScenes benchmark.
|
|
15:40-15:45, Paper TuCT4.6 | Add to My Program |
OoDIS: Anomaly Instance Segmentation and Detection Benchmark |
|
Nekrasov, Alexey | RWTH Aachen University |
Zhou, Rui | Rheinisch-Westfälische Technische Hochschule |
Ackermann, Miriam | Ruhr Universität Bochum |
Hermans, Alexander | RWTH Aachen University |
Leibe, Bastian | RWTH Aachen University |
Rottmann, Matthias | University of Wuppertal |
Keywords: Data Sets for Robotic Vision, Failure Detection and Recovery, Object Detection, Segmentation and Categorization
Abstract: Safe navigation of self-driving cars and robots requires a precise understanding of their environment. Training data for perception systems cannot cover the wide variety of objects that may appear during deployment. Thus, reliable identification of unknown objects, such as wild animals and untypical obstacles, is critical due to their potential to cause serious accidents. Significant progress in semantic segmentation of anomalies has been facilitated by the availability of out-of-distribution (OOD) benchmarks. However, a comprehensive understanding of scene dynamics requires the segmentation of individual objects, and thus the segmentation of instances is essential. Development in this area has been lagging, largely due to the lack of dedicated benchmarks. The situation is similar in object detection. While there is interest in detecting and potentially tracking every anomalous object, the availability of dedicated benchmarks is clearly limited. To address this gap, this work extends some commonly used anomaly segmentation benchmarks to include the instance segmentation and object detection tasks. Our evaluation of anomaly instance segmentation and object detection methods shows that both of these challenges remain unsolved problems. We provide a competition and benchmark website under https://vision.rwth-aachen.de/oodis.
|
|
15:45-15:50, Paper TuCT4.7 | Add to My Program |
Hybrid Attention for Robust RGB-T Pedestrian Detection in Real-World Conditions |
|
Rathinam, Arunkumar | University of Luxembourg |
Pauly, Leo | City, University of London |
Shabayek, Abd El Rahman | SnT, University of Luxembourg, Luxembourg |
Rharbaoui, Wassim | University of Le Mans |
Kacem, Anis | University of Luxembourg |
Gaudillière, Vincent | Inria, CNRS, Université De Lorraine |
Aouada, Djamila | SnT, University of Luxembourg |
Keywords: Object Detection, Segmentation and Categorization, Human Detection and Tracking, Deep Learning for Visual Perception
Abstract: Multispectral pedestrian detection has gained significant attention in recent years, particularly in autonomous driving applications. To address the challenges posed by adversarial illumination conditions, the combination of thermal and visible images has demonstrated its advantages. However, existing fusion methods rely on the critical assumption that the RGB-Thermal (RGB-T) image pairs are fully overlapping. These assumptions often do not hold in real-world applications, where only partial overlap between images can occur due to the sensor configuration. Moreover, sensor failure can cause loss of information in one modality. In this paper, we propose a novel module called the Hybrid Attention (HA) mechanism as our main contribution to mitigate performance degradation caused by partial overlap and sensor failure, i.e. when at least part of the scene is acquired by a single sensor. We propose an improved RGB-T fusion algorithm, robust against partial overlap and sensor failure encountered during inference in real-world applications. We also leverage a mobile-friendly backbone to cope with resource constraints in embedded systems. We conducted experiments by simulating various partial overlap and sensor failure scenarios to evaluate the performance of our proposed method. The results demonstrate that our approach outperforms state-of-the-art methods, showcasing its superiority in handling real-world challenges.
|
|
TuCT5 Regular Session, 305 |
Add to My Program |
Aerial Robots: Trajectory Planning and Control |
|
|
Chair: Cheng, Sheng | University of Illinois Urbana-Champaign |
Co-Chair: Zhang, Hong | SUSTech |
|
15:15-15:20, Paper TuCT5.1 | Add to My Program |
Safe Interval Motion Planning for Quadrotors in Dynamic Environments |
|
Huang, Songhao | University of Pennsylvania |
Wu, Yuwei | University of Pennsylvania |
Tao, Yuezhan | University of Pennsylvania |
Kumar, Vijay | University of Pennsylvania |
Keywords: Aerial Systems: Applications, Motion and Path Planning, Constrained Motion Planning
Abstract: Trajectory generation in dynamic environments presents a significant challenge for quadrotors, particularly due to the non-convexity in the spatial-temporal domain. Many existing methods either assume simplified static environments or struggle to produce optimal solutions in real-time. In this work, we propose an efficient safe interval motion planning framework for navigation in dynamic environments. A safe interval refers to a time window during which a specific configuration is safe. Our approach addresses trajectory generation through a two-stage process: a front-end graph search step followed by a back-end gradient-based optimization. We ensure completeness and optimality by constructing a dynamic connected visibility graph and incorporating low-order dynamic bounds within safe intervals and temporal corridors. To avoid local minima, we propose a Uniform Temporal Visibility Deformation (UTVD) for the complete evaluation of spatial-temporal topological equivalence. We represent trajectories with B-Spline curves and apply gradient-based optimization to navigate around static and moving obstacles within spatial-temporal corridors. Through simulation and real-world experiments, we show that our method can achieve a success rate of over 95% in environments with different density levels, exceeding the performance of other approaches, demonstrating its potential for practical deployment in highly dynamic environments.
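A minimal sketch of how a safe interval can be computed for a single configuration, assuming obstacle sweeps are given as discrete blocked time stamps; the discretization, blocking duration, and horizon are illustrative, not the paper's construction.

```python
def safe_intervals(occupied_times, horizon, dt=0.1):
    """Compute maximal safe time intervals for one configuration.

    occupied_times: time stamps at which the configuration is predicted to be
    blocked by a moving obstacle (illustrative discretization: each blocking
    lasts one dt step). Returns a list of (t_start, t_end) collision-free windows.
    """
    intervals, start = [], 0.0
    for t in sorted(occupied_times):
        if t - start >= dt:              # a non-empty free window before this block
            intervals.append((start, t))
        start = t + dt                   # configuration is blocked for one step
    if start < horizon:
        intervals.append((start, horizon))
    return intervals

# Example: a cell is swept by obstacles at t = 1.2 s and t = 3.0 s over a 5 s horizon;
# the free windows are roughly (0, 1.2), (1.3, 3.0), and (3.1, 5.0).
print(safe_intervals([1.2, 3.0], horizon=5.0))
```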
|
|
15:20-15:25, Paper TuCT5.2 | Add to My Program |
Towards Safe and Energy-Efficient Real-Time Motion Planning in Windy Urban Environments |
|
Folk, Spencer | University of Pennsylvania |
Melton, John | NASA Ames Research Center |
Margolis, Benjamin W. L. | NASA Ames Research Center |
Yim, Mark | University of Pennsylvania |
Kumar, Vijay | University of Pennsylvania |
Keywords: Aerial Systems: Perception and Autonomy, Energy and Environment-Aware Automation, Autonomous Vehicle Navigation
Abstract: Urban winds are a serious hazard for low-altitude autonomous aerial operations in urban airspaces. Previous methods for motion planning in urban winds require global knowledge of the obstacles and flow field and do not lend themselves to real-time application. In this paper, a planning and control framework is proposed for safe and energy-efficient navigation through urban flow fields that strictly relies on onboard sensing. The algorithm incorporates predictions of local wind flow fields into a receding horizon optimal controller, balancing energy consumption with obstacle avoidance on the fly to reach a goal destination. Simulation studies on a procedurally generated urban map with diverse wind conditions demonstrate that the energy-aware motion planner reduces energy consumption by as much as 30% and results in 32% fewer crashes on average compared to the wind-agnostic baseline. Comparisons to a global wind-aware planner indicate only minor trade-offs associated with planning on a local horizon.
|
|
15:25-15:30, Paper TuCT5.3 | Add to My Program |
Dynamic Perception-Enhanced Motion Planning and Control for UAVs Flights in Challenging Dynamic Environments |
|
Liu, Luyao | Southern University of Science and Technology |
Xu, Jiarui | Southern University of Science and Technology |
Zhang, Hong | SUSTech |
Keywords: Aerial Systems: Perception and Autonomy, Collision Avoidance, Motion and Path Planning
Abstract: The autonomous flights of unmanned aerial vehicles (UAVs) in unknown environments have garnered significant attention. However, most existing methods only achieve safe navigation in static environments or spacious scenes with few moving obstacles. Motivated by this open problem, this paper presents a complete system for safe and autonomous UAV flights in unknown clustered environments with multiple dynamic obstacles. To properly represent complex dynamic environments, we develop a 3D dynamic Euclidean Signed Distance Field (ESDF) mapping method that initially segments and tracks dynamic obstacles using a novel feature-based association strategy, while fusing the remaining static obstacles into the ESDF map. Then, we propose a joint trajectory planning and motion control framework for safely avoiding surrounding obstacles. Specifically, the gradient-based B-spline trajectory optimization algorithm is employed to generate a collision-free static trajectory with respect to static obstacles. To avoid dynamic obstacles while adaptively tracking the static trajectory, we utilize time-adaptive model predictive control combined with Dynamic Control Barrier Function (D-CBF), which maps the collision avoidance constraints of dynamic obstacles onto the control inputs. Extensive simulated and real-world experiments confirm that our proposed method outperforms previous approaches for UAV flights in challenging dynamic environments.
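A tiny sketch of a discrete-time dynamic control barrier function check of the kind referenced above, assuming a simple distance-based barrier; the decay rate, geometry, and numbers are illustrative and do not reproduce the paper's formulation.

```python
import numpy as np

def dcbf_satisfied(p_robot, p_robot_next, p_obs, p_obs_next, safe_radius, gamma=0.3):
    """Check a discrete-time dynamic CBF condition for one moving obstacle.

    Barrier: h(x, t) = ||p_robot - p_obs|| - safe_radius.
    Condition: h_next >= (1 - gamma) * h_now, i.e. the barrier may not decay
    too fast even while the obstacle itself moves; gamma in (0, 1] is illustrative.
    """
    h_now = np.linalg.norm(np.asarray(p_robot) - np.asarray(p_obs)) - safe_radius
    h_next = np.linalg.norm(np.asarray(p_robot_next) - np.asarray(p_obs_next)) - safe_radius
    return h_next >= (1.0 - gamma) * h_now

# A candidate control step that lets an approaching obstacle close in too quickly
# violates the condition (prints False), so the MPC must choose a different input.
print(dcbf_satisfied(p_robot=(0, 0, 1), p_robot_next=(0.2, 0, 1),
                     p_obs=(2, 0, 1), p_obs_next=(1.4, 0, 1), safe_radius=0.5))
```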
|
|
15:30-15:35, Paper TuCT5.4 | Add to My Program |
Real-Time Sampling-Based Online Planning for Drone Interception |
|
Ryou, Gilhyun | Massachusetts Institute of Technology |
Lao Beyer, Lukas | Massachusetts Institute of Technology |
Karaman, Sertac | Massachusetts Institute of Technology |
Keywords: Aerial Systems: Perception and Autonomy, Planning under Uncertainty, AI-Based Methods
Abstract: This paper studies high-speed online planning in dynamic environments. The problem requires finding time-optimal trajectories that conform to system dynamics, meeting computational constraints for real-time adaptation, and accounting for uncertainty from environmental changes. To address these challenges, we propose a sampling-based online planning algorithm that leverages neural network inference to replace time-consuming nonlinear trajectory optimization, enabling rapid exploration of multiple trajectory options under uncertainty. The proposed method is applied to the drone interception problem, where a defense drone must intercept a target while avoiding collisions and handling imperfect target predictions. The algorithm efficiently generates trajectories toward multiple potential target drone positions in parallel. It then assesses trajectory reachability by comparing traversal times with the target drone's predicted arrival time, ultimately selecting the minimum-time reachable trajectory. Through extensive validation in both simulated and real-world environments, we demonstrate our method's capability for high-rate online planning and its adaptability to unpredictable movements in unstructured settings.
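A simple sketch of the reachability-and-selection step outlined above, with the learned time-of-flight predictor replaced by a straight-line placeholder; the candidate positions, arrival times, speed, and safety margin are hypothetical values for illustration.

```python
import numpy as np

def predicted_traversal_time(start, goal, avg_speed=8.0):
    """Placeholder for a learned time-of-flight predictor; a straight-line
    estimate stands in for the network inference described in the abstract."""
    return np.linalg.norm(np.asarray(goal) - np.asarray(start)) / avg_speed

def select_interception(start, candidate_positions, target_arrival_times, margin=0.2):
    """Pick the candidate the defender can reach before the target, minimizing time."""
    best = None
    for pos, t_target in zip(candidate_positions, target_arrival_times):
        t_ours = predicted_traversal_time(start, pos)
        if t_ours + margin <= t_target:              # reachable before the target arrives
            if best is None or t_ours < best[1]:
                best = (pos, t_ours)
    return best                                       # None if no candidate is reachable

# Example: three sampled future target positions with predicted arrival times;
# only the farthest (latest) one is reachable in time and is therefore selected.
candidates = [(20.0, 0.0, 5.0), (28.0, 4.0, 5.0), (35.0, 8.0, 5.0)]
arrivals = [2.0, 3.5, 5.0]
print(select_interception((0.0, 0.0, 2.0), candidates, arrivals))
```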
|
|
15:35-15:40, Paper TuCT5.5 | Add to My Program |
Optimal Trajectory Planning for Cooperative Manipulation with Multiple Quadrotors Using Control Barrier Functions |
|
Pallar, Arpan | Ap7538@nyu.edu |
Li, Guanrui | Worcester Polytechnic Institute |
Sarvaiya, Mrunal | Agile Robotics and Perception Lab, NYU |
Loianno, Giuseppe | New York University |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control, Multi-Robot Systems
Abstract: In this paper, we present a novel trajectory planning algorithm for cooperative manipulation with multiple quadrotors using control barrier functions (CBFs). Our approach addresses the complex dynamics of a system in which a team of quadrotors transports and manipulates a cable-suspended rigid-body payload in environments cluttered with obstacles. The proposed algorithm ensures obstacle avoidance for the entire system, including the quadrotors, cables, and the payload in all six degrees of freedom (DoF). We introduce the use of CBFs to enable safe and smooth maneuvers, effectively navigating through cluttered environments while accommodating the system’s nonlinear dynamics. To simplify complex constraints, the system components are modeled as convex polytopes, and the Duality theorem is employed to reduce the computational complexity of the optimization problem. We validate the performance of our planning approach both in simulation and real-world environments using multiple quadrotors. The results demonstrate the effectiveness of the proposed approach in achieving obstacle avoidance and safe trajectory generation for cooperative transportation tasks.
|
|
15:40-15:45, Paper TuCT5.6 | Add to My Program |
Servo Integrated Nonlinear Model Predictive Control for Overactuated Tiltable-Quadrotors |
|
Li, Jinjie | The University of Tokyo |
Sugihara, Junichiro | The University of Tokyo |
Zhao, Moju | The University of Tokyo |
Keywords: Aerial Systems: Mechanics and Control, Motion Control
Abstract: Utilizing a servo to tilt each rotor transforms quadrotors from underactuated to overactuated systems, allowing for independent control of both attitude and position, which provides advantages for aerial manipulation. However, this enhancement also introduces model nonlinearity, sluggish servo response, and limited operational range into the system, posing challenges to dynamic control. In this study, we propose a control approach for tiltable-quadrotors based on nonlinear model predictive control (NMPC). Unlike conventional cascade methods, our approach preserves the full dynamics without simplification. It directly uses rotor thrust and servo angle as control inputs, where their limited working ranges are considered input constraints. Notably, we incorporate a first-order servo model within the NMPC framework. Simulation reveals that integrating the servo dynamics is not only an enhancement to control performance but also a critical factor for optimization convergence. To evaluate the effectiveness of our approach, we fabricate a tiltable-quadrotor and deploy the algorithm onboard at 100Hz. Extensive real-world experiments demonstrate rapid, robust, and smooth pose-tracking performance.
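A small sketch of the state augmentation implied by a first-order servo model, shown on a planar toy system rather than the full tiltable-quadrotor dynamics; the mass, time constant, and Euler integration are assumptions for illustration only.

```python
import numpy as np

def servo_augmented_step(state, u_cmd, dt=0.01, tau_servo=0.1):
    """One Euler step of a planar toy model with a first-order servo lag.

    state = [x, vx, alpha]       -- position, velocity, actual tilt angle
    u_cmd = [thrust, alpha_cmd]  -- rotor thrust and commanded tilt angle
    The servo does not jump to alpha_cmd; it follows
        d(alpha)/dt = (alpha_cmd - alpha) / tau_servo,
    which is the kind of first-order response an NMPC must reason about.
    """
    x, vx, alpha = state
    thrust, alpha_cmd = u_cmd
    mass = 1.5                                   # illustrative vehicle mass [kg]
    ax = thrust * np.sin(alpha) / mass           # lateral force from the tilted rotor
    alpha_dot = (alpha_cmd - alpha) / tau_servo  # first-order servo response
    return np.array([x + vx * dt,
                     vx + ax * dt,
                     alpha + alpha_dot * dt])

# Commanding a 0.5 rad tilt: the realized angle lags the command.
state = np.zeros(3)
for _ in range(20):                              # 0.2 s of simulated time
    state = servo_augmented_step(state, u_cmd=(15.0, 0.5))
print(state)   # realized tilt is still well below the 0.5 rad command (about 0.44 rad)
```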
|
|
15:45-15:50, Paper TuCT5.7 | Add to My Program |
Geometric Tracking Control of Omnidirectional Multirotors for Aggressive Maneuvers |
|
Lee, Hyungyu | University of Illinois Urbana-Champaign |
Cheng, Sheng | University of Illinois Urbana-Champaign |
Wu, Zhuohuan | University of Illinois Urbana-Champaign |
Lim, Jaeyoung | ETH Zurich |
Siegwart, Roland | ETH Zurich |
Hovakimyan, Naira | University of Illinois at Urbana-Champaign |
Keywords: Aerial Systems: Mechanics and Control, Robust/Adaptive Control
Abstract: An omnidirectional multirotor has the maneuverability of decoupled translational and rotational motions, superseding the traditional multirotors' motion capability. Such maneuverability is achieved due to the ability of the omnidirectional multirotor to frequently alter the thrust amplitude and direction. In doing so, the rotors' settling time, which is induced by inherent rotor dynamics, significantly affects the omnidirectional multirotor's tracking performance, especially in aggressive flights. To resolve this issue, we propose a novel tracking controller that takes the rotor dynamics into account and does not require additional rotor state measurement. This is achieved by integrating a linear rotor dynamics model into the vehicle's equations of motion and designing a PD controller to compensate for the effects introduced by rotor dynamics. We prove that the proposed controller yields almost global exponential stability. The proposed controller is validated in experiments, where we demonstrate significantly improved tracking performance in multiple aggressive maneuvers compared with a baseline geometric PD controller.
|
|
TuCT6 Regular Session, 307 |
Add to My Program |
Perception for Mobile Robots 3 |
|
|
Chair: Chiu, Han-Pang | SRI International |
Co-Chair: Xiao, Jing | Worcester Polytechnic Institute (WPI) |
|
15:15-15:20, Paper TuCT6.1 | Add to My Program |
Topology-Based Visual Active Room Segmentation |
|
Bao, Chenyu | The Chinese University of Hong Kong, Shenzhen |
Hu, Junjie | The Chinese University of Hong Kong, Shenzhen |
Zheng, Qiu | The Chinese University of HongKong, Shenzhen |
Lam, Tin Lun | The Chinese University of Hong Kong, Shenzhen |
Keywords: Semantic Scene Understanding, Perception-Action Coupling, Vision-Based Navigation
Abstract: Room segmentation plays a significant role in scene understanding, semantic mapping, and scene coverage for robots navigating in real-world indoor environments. However, most previous works adopt passive segmentation, which requires a complete and uncluttered grid map as input, often resulting in lower segmentation accuracy, and which cannot be deployed in unknown environments. In this paper, we propose an active room segmentation framework that enables a robot to incrementally and autonomously perform room segmentation in cluttered indoor environments. Our framework consists of three key components: i) a door extraction module in which a visual semantic feature, specifically doors, is extracted to better identify rooms in cluttered environments, ii) a within-room exploration module that detects frontiers within the room currently being explored, and iii) a topological module that represents connectivity between rooms and determines the next room for exploration. We show through experiments that the proposed method offers two distinct advantages over existing methods, in segmentation accuracy and autonomy. The code is available at https://github.com/FreeformRobotics/Active_room_segmentation.
|
|
15:20-15:25, Paper TuCT6.2 | Add to My Program |
ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation |
|
Anwar, Abrar | University of Southern California |
John, Welsh, John Bradford | NVIDIA |
Biswas, Joydeep | University of Texas at Austin |
Pouya, Soha | Stanford University |
Chang, Yan | Nvidia |
Keywords: AI-Enabled Robotics, Semantic Scene Understanding, Vision-Based Navigation
Abstract: Navigating and understanding complex environments over extended periods of time is a significant challenge for robots. People interacting with the robot may want to ask questions like where something happened, when it occurred, or how long ago it took place, which would require the robot to reason over a long history of their deployment. To address this problem, we introduce a Retrieval-augmented Memory for Embodied Robots, or ReMEmbR, a system designed for long-horizon video question answering for robot navigation. To evaluate ReMEmbR, we introduce the NaVQA dataset where we annotate spatial, temporal, and descriptive questions to long-horizon robot navigation videos. ReMEmbR employs a structured approach involving a memory building and a querying phase, leveraging temporal information, spatial information, and images to efficiently handle continuously growing robot histories. Our experiments demonstrate that ReMEmbR outperforms LLM and VLM baselines, allowing ReMEmbR to achieve effective long-horizon reasoning with low latency. Additionally, we deploy ReMEmbR on a robot and show that our approach can handle diverse queries. The dataset, code, videos, and other material can be found at the following link: https://nvidia-ai-iot.github.io/remembr.
|
|
15:25-15:30, Paper TuCT6.3 | Add to My Program |
Online Diffusion-Based 3D Occupancy Prediction at the Frontier with Probabilistic Map Reconciliation |
|
Reed, Alec | University of Colorado Boulder |
Achey, Lorin | University of Colorado Boulder |
Crowe, Brendan | University of Colorado Boulder |
Hayes, Bradley | University of Colorado Boulder |
Heckman, Christoffer | University of Colorado at Boulder |
Keywords: AI-Enabled Robotics, Deep Learning Methods, AI-Based Methods
Abstract: Autonomous navigation and exploration in unmapped environments remains a significant challenge in robotics due to the difficulty robots face in making commonsense inference of unobserved geometries. Recent advancements have demonstrated that generative modeling techniques, particularly diffusion models, can enable systems to infer these geometries from partial observation. In this work, we present implementation details and results for real-time, online occupancy prediction using a modified diffusion model. By removing attention-based visual conditioning and visual feature extraction components, we achieve a 73% reduction in runtime with minimal accuracy reduction. These modifications enable occupancy prediction across the entire map, rather than limiting it to the area around the robot where sensor data can be collected. We introduce a probabilistic update method for merging predicted occupancy data into running occupancy maps, resulting in a 71% improvement in predicting occupancy at map frontiers compared to previous methods. Finally, our code and a ROS node for on-robot operation can be found on our website: https://arpg.github.io/scenesense/.
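A brief sketch of one way predicted occupancy can be folded into a running log-odds map; the confidence weighting and clamping bounds are assumptions for this sketch, not the paper's exact update rule.

```python
import numpy as np

def logodds(p):
    """Convert an occupancy probability to log-odds."""
    return np.log(p / (1.0 - p))

def merge_predicted_occupancy(map_logodds, predicted_p, confidence=0.5,
                              l_min=-4.0, l_max=4.0):
    """Fuse a predicted occupancy probability grid into a running log-odds map.

    `confidence` down-weights generative predictions relative to direct sensor
    updates (an assumption for this sketch); log-odds are clamped to avoid saturation.
    """
    update = confidence * logodds(np.clip(predicted_p, 1e-3, 1.0 - 1e-3))
    return np.clip(map_logodds + update, l_min, l_max)

# Example: a frontier cell with no prior evidence (log-odds 0) receives a
# prediction of 0.8 occupancy from the generative model.
running_map = np.zeros((4, 4))
prediction = np.full((4, 4), 0.5)        # 0.5 means "no information", log-odds 0
prediction[1, 2] = 0.8
running_map = merge_predicted_occupancy(running_map, prediction)
print(1.0 / (1.0 + np.exp(-running_map[1, 2])))   # posterior occupancy, roughly 0.67
```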
|
|
15:30-15:35, Paper TuCT6.4 | Add to My Program |
Point2Graph: An End-To-End Point Cloud-Based 3D Open-Vocabulary Scene Graph for Robot Navigation |
|
Xu, Yifan | University of Michigan |
Luo, Ziming | University of Michigan |
Wang, Qianwei | University of Michigan, Ann Arbor |
Kamat, Vineet | University of Michigan |
Menassa, Carol | University of Michigan-Ann Arbor |
Keywords: Object Detection, Segmentation and Categorization, Human Factors and Human-in-the-Loop, Physically Assistive Devices
Abstract: Current open-vocabulary scene graph generation algorithms highly rely on both 3D scene point cloud data and posed RGB-D images and thus have limited applications in scenarios where RGB-D images or camera poses are not readily available. To solve this problem, we propose Point2Graph, a novel end-to-end point cloud-based 3D open-vocabulary scene graph generation framework in which the requirement of posed RGB-D image series is eliminated. This hierarchical framework contains room and object detection/segmentation and open-vocabulary classification. For the room layer, we leverage the advantage of merging the geometry-based border detection algorithm with the learning-based region detection to segment rooms and create a “Snap-Lookup” framework for open-vocabulary room classification. In addition, we create an end-to-end pipeline for the object layer to detect and classify 3D objects based solely on 3D point cloud data. Our evaluation results show that our framework can outperform the current state-of-the-art (SOTA) open-vocabulary object and room segmentation and classification algorithm on widely used real-scene datasets. We provide code and videos at: https://point2graph.github.io/
|
|
15:35-15:40, Paper TuCT6.5 | Add to My Program |
Estimating Commonsense Scene Composition on Belief Scene Graphs |
|
Valdes Saucedo, Mario Alberto | Lulea University of Technology |
Kottayam Viswanathan, Vignesh | Lulea University of Technology |
Kanellakis, Christoforos | LTU |
Nikolakopoulos, George | Luleå University of Technology |
Keywords: Semantic Scene Understanding, Learning Categories and Concepts, AI-Enabled Robotics
Abstract: This work establishes the concept of commonsense scene composition, with a focus on extending Belief Scene Graphs by estimating the spatial distribution of unseen objects. Specifically, the commonsense scene composition capability refers to the understanding of the spatial relationships among related objects in the scene, which in this article is modeled as a joint probability distribution for all possible locations of the semantic object class. The proposed framework includes two variants of a Correlation Information (CECI) model for learning probability distributions: (i) a baseline approach based on a Graph Convolutional Network, and (ii) a neuro-symbolic extension that integrates a spatial ontology based on Large Language Models (LLMs). Furthermore, this article provides a detailed description of the dataset generation process for such tasks. Finally, the framework has been validated through multiple runs on simulated data, as well as in a real-world indoor environment, demonstrating its ability to spatially interpret scenes across different room types. For a video of the article, showcasing the experimental demonstration, please refer to the following link: https://youtu.be/f0tqtPVFZ2A
|
|
15:40-15:45, Paper TuCT6.6 | Add to My Program |
VLM-Vac: Enhancing Smart Vacuums through VLM Knowledge Distillation and Language-Guided Experience Replay |
|
Mirjalili, Reihaneh | University of Technology Nuremberg |
Krawez, Michael | University of Technology Nuremberg |
Walter, Florian | University of Technology Nuremberg |
Burgard, Wolfram | University of Technology Nuremberg |
Keywords: AI-Enabled Robotics, Domestic Robotics, Learning from Experience
Abstract: In this paper, we propose VLM-Vac, a novel framework designed to enhance the autonomy of smart robot vacuum cleaners. Our approach integrates the zero-shot object detection capabilities of a Vision-Language Model (VLM) with a Knowledge Distillation (KD) strategy. By leveraging the VLM, the robot can categorize objects into actionable classes---either to avoid or to suck---across diverse backgrounds. However, frequently querying the VLM is computationally expensive and impractical for real-world deployment. To address this issue, we implement a KD process that gradually transfers the essential knowledge of the VLM to a smaller, more efficient model. Our real-world experiments demonstrate that this smaller model progressively learns from the VLM and requires significantly fewer queries over time. Additionally, we tackle the challenge of continual learning in dynamic home environments by exploiting a novel experience replay method based on language-guided sampling. Our results show that this approach not only reduces energy consumption by 53% compared to cumulative learning but also surpasses conventional vision-based clustering methods, particularly in detecting small objects across diverse backgrounds.
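A compact sketch of the distillation loop, with the VLM call, the student network, and the replay sampling all reduced to placeholders (uniform rather than language-guided sampling); it only illustrates how pseudo-labels from a teacher can supervise a small on-device classifier, not the authors' pipeline.

```python
import torch
import torch.nn as nn

def query_vlm(image_batch):
    """Placeholder for the expensive zero-shot VLM call: returns pseudo-labels
    0 ('avoid') or 1 ('suck'). In the paper the student gradually replaces it."""
    return torch.randint(0, 2, (image_batch.shape[0],))

# A deliberately tiny student classifier standing in for the distilled model.
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64),
                        nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

replay_buffer = []                       # (image, pseudo_label) pairs kept for replay
for step in range(100):
    images = torch.rand(8, 3, 32, 32)    # stand-in for camera crops of floor objects
    with torch.no_grad():
        pseudo = query_vlm(images)       # teacher supervision from the VLM
    replay_buffer.extend(zip(images, pseudo))

    # Mix fresh samples with replayed ones (language-guided sampling in the paper;
    # uniform sampling here for brevity).
    idx = torch.randint(0, len(replay_buffer), (8,)).tolist()
    old_imgs = torch.stack([replay_buffer[i][0] for i in idx])
    old_lbls = torch.stack([replay_buffer[i][1] for i in idx])

    logits = student(torch.cat([images, old_imgs]))
    loss = criterion(logits, torch.cat([pseudo, old_lbls]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```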
|
|
15:45-15:50, Paper TuCT6.7 | Add to My Program |
Simultaneously Search and Localize Semantic Objects in Unknown Environments |
|
Qian, Zhentian | WPI |
Fu, Jie | University of Florida |
Xiao, Jing | Worcester Polytechnic Institute (WPI) |
Keywords: Reactive and Sensor-Based Planning, SLAM, Planning under Uncertainty
Abstract: For a robot in an unknown environment to find a target semantic object, it must perform simultaneous localization and mapping (SLAM) at both geometric and semantic levels using its onboard sensors while planning and executing its motion based on the ever-updated SLAM results. In other words, the robot must simultaneously conduct localization, semantic mapping, motion planning, and execution online in the presence of sensing and motion uncertainty. This is an open problem as it intertwines semantic SLAM and adaptive online motion planning and execution under uncertainty based on perception. Moreover, the goals of the robot's motion change on the fly depending on whether and how the robot can detect the target object. We propose a novel approach to tackle the problem, leveraging semantic SLAM, Bayesian Networks, and online probabilistic motion planning. The results demonstrate our approach's effectiveness and efficiency.
|
|
TuCT7 Regular Session, 309 |
Add to My Program |
Legged Locomotion: Novel Platforms |
|
|
Chair: Ye, Keran | University of California, Riverside |
Co-Chair: Goldman, Daniel | Georgia Institute of Technology |
|
15:15-15:20, Paper TuCT7.1 | Add to My Program |
Integrated Barometric Pressure Sensors on Legged Robots for Enhanced Tactile Exploration of Edges |
|
Van Hauwermeiren, Thijs | Ghent University |
Sianov, Anatolii | University of Gent, EELAB |
Coene, Annelies | Ghent University |
Crevecoeur, Guillaume | Ghent University |
Keywords: Legged Robots, Soft Sensors and Actuators, Soft Robot Applications
Abstract: This paper presents a new tactile sensor that utilizes an array of barometric pressure sensors encased in a deformable rubber sphere. Designed as an end effector foot of the legged quadruped robot Unitree Go1, the proposed sensor is able to withstand repeated impacts of at least 40 N at the feet. Tactile sensing extends the utility of legged robots, specifically in the context of edge detection and exploration. The presented tactile contact framework processes the pressure data to classify the type of contact (no contact, flat, or edge) and the orientation of the edges relative to the robot base. To assess the performance of the sensors and their suitability for tactile edge exploration, we extensively test the legged robot under varied conditions, including different terrains, changing payloads, exposure to dynamic disturbances, and sloped edges. The edge detection is compared against the original scalar force sensors. Experiments demonstrate a mean absolute error of 2° for predicted edge orientation at a detection range of 14 mm and robust operation in realistic operating conditions for medium-sized quadruped robots. With this contribution, we aim to enhance the capabilities and safety of legged robots in various applications.
|
|
15:20-15:25, Paper TuCT7.2 | Add to My Program |
Development of a New Biped Robot with Adaptive Suction Modules for Climbing on Curved Surfaces |
|
Li, Zikang | University of Macau |
Zhang, Weijian | University of Macau |
Wu, Zehao | University of Macau |
Xu, Qingsong | University of Macau |
Keywords: Climbing Robots, Legged Robots, Robotics and Automation in Construction
Abstract: Regular cleaning and maintenance of high-altitude pipes and curved surfaces on high-rise buildings are high-risk tasks for human workers due to the difficulty of working on curved planes. To address this challenge, automated robots are widely used for cleaning buildings with flat walls, but they cannot climb on curved surfaces, limiting their practical applications. This paper proposes a novel biped curved-surface climbing robot (BCCR) with five-degree-of-freedom (5-DOF) motion. The BCCR features adaptive vacuum suction modules that can adhere to both curved and flat surfaces, allowing seamless movement of the BCCR across various surfaces. Each terminal suction module is composed of three small suction cups, which are capable of rotating in all directions to achieve adaptive adhesion on various surfaces. The 5-DOF structure enables the robot to cross obstacles and makes it highly versatile for various cleaning tasks on a wide range of surfaces, including large curved pipes. The mechanism design and analytical modeling of the BCCR are carried out, demonstrating its robust curved-surface climbing capabilities. Moreover, a prototype is fabricated for experimental investigation. The results indicate that the proposed 5-DOF BCCR can achieve stable climbing on curved surfaces.
|
|
15:25-15:30, Paper TuCT7.3 | Add to My Program |
Berkeley Humanoid: A Research Platform for Learning-Based Control |
|
Liao, Qiayuan | University of California, Berkeley |
Zhang, Bike | University of California, Berkeley |
Huang, Xuanyu | The Hong Kong University of Science and Technology (Guangzhou) |
Huang, Xiaoyu | Georgia Institute of Technology |
Li, Zhongyu | University of California, Berkeley |
Sreenath, Koushil | University of California, Berkeley |
Keywords: Humanoid and Bipedal Locomotion, Legged Robots, Humanoid Robot Systems
Abstract: We introduce Berkeley Humanoid, a reliable and low-cost mid-scale humanoid research platform for learning-based control. Our lightweight, in-house-built robot is designed specifically for learning algorithms with accurate simulation, low simulation complexity, anthropomorphic motion, and high reliability against falls. The narrow sim-to-real gap enables agile and robust locomotion across various terrains in outdoor environments, achieved with a simple reinforcement learning controller using light domain randomization. Furthermore, we demonstrate the robot traversing for hundreds of meters, walking on a steep unpaved trail, and hopping with single and double legs as a testimony to its high performance in dynamic walking. Capable of omnidirectional locomotion and withstanding large perturbations with a compact setup, our system aims for rapid sim-to-real deployment of learning-based humanoid systems. Please check our website https://berkeley-humanoid.com/ and code https://github.com/HybridRobotics/isaac_berkeley_humanoid/.
|
|
15:30-15:35, Paper TuCT7.4 | Add to My Program |
Zippy: The Smallest Power-Autonomous Bipedal Robot |
|
Man, Steven | Carnegie Mellon University |
Narita, Soma | Carnegie Mellon University |
Macera, Josef | Carnegie Mellon University |
Oke, Naomi | Carnegie Mellon University |
Johnson, Aaron M. | Carnegie Mellon University |
Bergbreiter, Sarah | Carnegie Mellon University |
Keywords: Humanoid and Bipedal Locomotion, Passive Walking, Underactuated Robots
Abstract: Miniaturizing legged robot platforms is challenging due to hardware limitations that constrain the number, power density, and precision of actuators at that size. By leveraging design principles of quasi-passive walking robots at any scale, stable locomotion and steering can be achieved with simple mechanisms and open-loop control. Here, we present the design and control of “Zippy”, the smallest self-contained bipedal walking robot at only 3.6 cm tall. Zippy has rounded feet, a single motor without feedback control, and is capable of turning, skipping, and ascending steps. At its fastest pace, the robot achieves a forward walking speed of 25 cm/s, which is 10 leg lengths per second, making it the fastest bipedal robot of any size by that metric. This work explores the design and performance of the robot and compares it to similar dynamic walking robots at larger scales.
|
|
15:35-15:40, Paper TuCT7.5 | Add to My Program |
Exploration and Analysis of Torso-Limb Coordination of Quadruped Walkers with Compliant Torso |
|
Xiang, Yuxuan | Japan Advanced Institute of Science and Technology |
Sedoguchi, Taiki | Japan Advanced Institute of Science and Technology |
Zheng, Yanqiu | Ritsumeikan University |
Asano, Fumihiko | Japan Advanced Institute of Science and Technology |
Keywords: Legged Robots, Underactuated Robots
Abstract: Quadrupeds exhibit remarkable locomotion performance through the coordination between their limbs and torso. From past biological knowledge, it is understood that during walking, the forelimbs primarily contribute to braking, while the hindlimbs are responsible for propulsion. However, in the field of quadruped robot dynamics, effectively leveraging this coordination remains a challenge. To investigate the torso-limb coordination, this study explores the walking performance of a quadruped walker with a compliant torso, driven by either the forelimbs or the hindlimbs. Through numerical simulations, we analyze the walking behavior under different control drive methods. The findings provide insights into the design of compliant-bodied robots and the optimal distribution of propulsion forces between the forelimbs and hindlimbs.
|
|
15:40-15:45, Paper TuCT7.6 | Add to My Program |
Effective Self-Righting Strategies for Elongate Multi-Legged Robots |
|
Teder, Erik | Hillsdale College |
Chong, Baxi | Georgia Institute of Technology |
He, Juntao | Georgia Institute of Technology |
Wang, Tianyu | Georgia Institute of Technology |
Iaschi, Massimiliano | Georgia Institute of Technology |
Soto, Daniel | Georgia Institute of Technology |
Goldman, Daniel | Georgia Institute of Technology |
Keywords: Legged Robots, Field Robots, Multi-Contact Whole-Body Motion Planning and Control
Abstract: Centipede-like robots offer an effective and robust solution to navigation over complex terrain with minimal sensing. However, when climbing over obstacles, such multi-legged robots often elevate their center-of-mass into unstable configurations, where even moderate terrain uncertainty can cause tipping over. Robust mechanisms for such elongate multi-legged robots to self-right remain unstudied. Here, we developed a comparative biological and robophysical approach to investigate self-righting strategies. We first released S. polymorpha upside down from a 10 cm height and recorded their self-righting behaviors using top and side view high-speed cameras. Using kinematic analysis, we hypothesize that these behaviors can be prescribed by two traveling waves superimposed in the body’s lateral and vertical planes, respectively. We tested our hypothesis on an elongate robot with static (non-actuated) limbs, and we successfully reconstructed these self-righting behaviors. We further evaluated how wave parameters affect self-righting effectiveness. We identified two key wave parameters: the spatial frequency, which characterizes the sequence of body-rolling, and the wave amplitude, which characterizes body curvature. By empirically obtaining a behavior diagram of spatial frequency and amplitude, we identify effective and versatile self-righting strategies for general elongate multi-legged robots, which greatly enhances these robots' mobility and robustness in practical applications such as agricultural terrain inspection and search-and-rescue.
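For readers unfamiliar with the two-wave idea mentioned above, the sketch below illustrates the general construction: lateral and vertical joint angles follow sinusoids that share a spatial frequency but are phase-shifted by a quarter period, producing a rolling sequence along the body. The joint count, amplitudes, and frequencies here are illustrative assumptions, not the parameters identified in the paper.

```python
import numpy as np

def two_wave_template(t, n_joints=8, amp_lat=0.4, amp_vert=0.3,
                      spatial_freq=1.0, omega=2.0 * np.pi * 0.5):
    """Joint angles (rad) for superimposed lateral and vertical traveling waves.

    amp_* are wave amplitudes, spatial_freq is the number of wave periods
    along the body, omega the temporal frequency.  All values here are
    illustrative, not identified robot parameters.
    """
    s = np.arange(n_joints) / n_joints                 # normalized position along the body
    phase = 2.0 * np.pi * spatial_freq * s - omega * t
    lateral = amp_lat * np.sin(phase)                  # wave in the lateral plane
    vertical = amp_vert * np.sin(phase + np.pi / 2)    # vertical wave, 90 deg shifted
    return lateral, vertical

# Example: sample joint targets over one second at 50 Hz.
for t in np.linspace(0.0, 1.0, 50):
    lat, vert = two_wave_template(t)
```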
|
|
15:45-15:50, Paper TuCT7.7 | Add to My Program |
Addition of a Peristaltic Wave Improves Multi-Legged Locomotion Performance on Complex Terrains |
|
Iaschi, Massimiliano | Georgia Institute of Technology |
Chong, Baxi | Georgia Institute of Technology |
Wang, Tianyu | Georgia Institute of Technology |
Lin, Jianfeng | Georgia Institute of Technology |
Xu, Zhaochen | Columbia University |
Soto, Daniel | Georgia Institute of Technology |
He, Juntao | Georgia Institute of Technology |
Goldman, Daniel | Georgia Institute of Technology |
Keywords: Legged Robots, Search and Rescue Robots, Biologically-Inspired Robots
Abstract: Characterized by their elongate bodies and relatively simple legs, multi-legged robots have the potential to locomote through complex terrains for applications such as search-and-rescue and terrain inspection. Prior work has developed effective and reliable locomotion strategies for multi-legged robots by propagating the two waves of lateral body undulation and leg stepping, which we will refer to as the two-wave template. However, these robots have limited capability to climb over obstacles with sizes comparable to their heights. We hypothesize that such limitations stem from the two-wave template that we used to prescribe the multi-legged locomotion. Seeking effective alternative waves for obstacle-climbing, we designed a five-segment robot with static (non-actuated) legs, where each cable-driven joint has a rotational degree-of-freedom (DoF) in the sagittal plane (vertical wave) and a linear DoF (peristaltic wave). We tested robot locomotion performance on a flat terrain and a rugose terrain. While the benefit of peristalsis on flat-ground locomotion is marginal, the inclusion of a peristaltic wave substantially improves the locomotion performance in rugose terrains: it not only enables obstacle-climbing capabilities with obstacles having a similar height as the robot, but it also significantly improves the traversing capabilities of the robot in such terrains. Our results demonstrate an alternative actuation mechanism for multi-legged robots, paving the way towards all-terrain multi-legged robots.
|
|
TuCT8 Regular Session, 311 |
Add to My Program |
Medical Robotics 3 |
|
|
Chair: Diaz-Mercado, Yancy | University of Maryland |
Co-Chair: Dogramadzi, Sanja | University of Sheffield |
|
15:15-15:20, Paper TuCT8.1 | Add to My Program |
In Vivo Feasibility Study: Evaluating Autonomous Data-Driven Robotic Needle Trajectory Correction in MRI-Guided Transperineal Procedures |
|
Bernardes, Mariana C. | Brigham and Women's Hospital / Harvard Medical School |
Moreira, Pedro | Brigham and Women's Hospital / Harvard Medical School |
Lezcano, Dimitri A. | Johns Hopkins University |
Foley, Lori | Brigham and Women's Hospital |
Tuncali, Kemal | BWH |
Tempany, Clare | Brigham & Women's Hospital, Harvard Medical School |
Kim, Jin Seob | Johns Hopkins University |
Hata, Nobuhiko | Brigham and Women's Hospital |
Iordachita, Ioan Iulian | Johns Hopkins University |
Tokuda, Junichi | Brigham and Women's Hospital and Harvard Medical School |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles
Abstract: This study addresses the targeting challenges in MRI-guided transperineal needle placement for prostate cancer diagnosis and treatment, a procedure where accuracy is crucial for effective outcomes. We introduce a parameter-agnostic trajectory correction approach incorporating a data-driven closed-loop strategy based on radial displacement and FBG-based shape sensing to enable autonomous needle steering. In an animal study designed to emulate clinical complexity and assess MRI compatibility through a mock biopsy procedure, our approach demonstrated a significant improvement in targeting accuracy (p < 0.05), with a mean target error of only 2.2 ± 1.9 mm on first insertion attempts, without needle reinsertions. To the best of our knowledge, this work represents the first in vivo evaluation of robotic needle steering with FBG-sensor feedback, marking a significant step towards its clinical translation.
|
|
15:20-15:25, Paper TuCT8.2 | Add to My Program |
Pre-Surgical Planner for Robot-Assisted Vitreoretinal Surgery: Integrating Eye Posture, Robot Position and Insertion Point |
|
Inagaki, Satoshi | NSK.Ltd |
Alikhani, Alireza | Augen Klinik Und Poliklinik, Klinikum Rechts Der Isar Der Techn |
Navab, Nassir | TU Munich |
Issa, Peter Charbel | Klinikum Rechts Der Isar, Technical University of Munich |
Nasseri, M. Ali | Technische Universitaet Muenchen |
Keywords: Medical Robots and Systems, Surgical Robotics: Planning
Abstract: Several robotic frameworks have been recently developed to assist ophthalmic surgeons in performing complex vitreoretinal procedures such as subretinal injection of advanced therapeutics. These surgical robots show promising capabilities; however, most of them have to limit their working volume to achieve maximum accuracy. Moreover, the visible area seen through the surgical microscope is limited and solely depends on the eye posture. If the eye posture, trocar position, and robot configuration are not correctly arranged, the instrument may not reach the target position, and the preparation will have to be redone. Therefore, this paper proposes an optimization framework for eye tilting and robot positioning to reach various target areas for different patients. Our method was validated with an adjustable phantom eye model, and the error of this workflow was 0.13 ± 1.65 deg (rotational joint around Y axis), -1.40 ± 1.13 deg (around X axis), and 1.80 ± 1.51 mm (depth, Z). The potential error sources are also analyzed in the discussion section.
|
|
15:25-15:30, Paper TuCT8.3 | Add to My Program |
Suture Thread Modeling Using Control Barrier Functions for Autonomous Surgery |
|
Forghani, Kimia | University of Maryland College Park |
Raval, Suraj | University of Maryland, College Park |
Mair, Lamar | Weinberg Medical Physics, Inc |
Krieger, Axel | Johns Hopkins University |
Diaz-Mercado, Yancy | University of Maryland |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Distributed Robot Systems, Collision Avoidance
Abstract: Automating surgical systems enhances precision and safety while reducing human involvement in high-risk environments. A major challenge in automating surgical procedures like suturing is accurately modeling the suture thread, a highly flexible and compliant component. Existing models either lack the accuracy needed for safety-critical procedures or are too computationally intensive for real-time execution. In this work, we introduce a novel approach for modeling suture thread dynamics using control barrier functions (CBFs), achieving both realism and computational efficiency. Thread-like behavior, collision avoidance, stiffness, and damping are all modeled within a unified CBF and control Lyapunov function (CLF) framework. Our approach eliminates the need to calculate complex forces or solve differential equations, significantly reducing computational overhead while maintaining a realistic model suitable for both automation and virtual-reality surgical training systems. The framework also allows visual cues to be provided based on the thread’s interaction with the environment, enhancing user experience when performing suture or ligation tasks. The proposed model is tested on the MagnetoSuture system, a minimally invasive robotic surgical platform that uses magnetic fields to manipulate suture needles, offering a less invasive solution for surgical procedures.
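As background on the CBF machinery the abstract refers to, the sketch below shows a generic CBF-style safety filter for a single thread node avoiding a spherical obstacle; with one affine constraint, the usual quadratic program has a closed-form half-space projection. Function names, gains, and geometry are illustrative assumptions, and this is not the authors' unified CBF/CLF thread model.

```python
import numpy as np

def cbf_filter_velocity(x, u_des, obstacle, radius, alpha=5.0):
    """Minimally modify a desired node velocity so the node keeps a safe
    distance from a spherical obstacle, with barrier h(x) = ||x - o|| - r >= 0.

    Enforces the standard CBF condition  grad_h(x) . u >= -alpha * h(x).
    With a single affine constraint, the QP  min ||u - u_des||^2  reduces to
    the closed-form half-space projection used below.
    """
    d = x - obstacle
    dist = np.linalg.norm(d)
    h = dist - radius
    grad_h = d / dist                      # gradient of h w.r.t. x
    slack = grad_h @ u_des + alpha * h     # constraint residual for the desired velocity
    if slack >= 0.0:                       # already safe: keep the desired velocity
        return u_des
    # Project u_des onto the half-space {u : grad_h . u >= -alpha * h}.
    return u_des + (-slack) * grad_h / (grad_h @ grad_h)

x = np.array([0.10, 0.00, 0.05])           # thread node position (m), illustrative
u = cbf_filter_velocity(x, np.array([-1.0, 0.0, 0.0]),
                        obstacle=np.zeros(3), radius=0.08)
```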
|
|
15:30-15:35, Paper TuCT8.4 | Add to My Program |
Robotic Colonoscopy: Can High Fidelity Simulation Optimize Robot Design and Validation? |
|
Evans, Michael | The University of Sheffield |
Du, Jiayang | The University of Sheffield |
Cao, Lin | University of Sheffield |
Dogramadzi, Sanja | University of Sheffield |
Keywords: Medical Robots and Systems, Simulation and Animation, Surgical Robotics: Steerable Catheters/Needles
Abstract: This paper presents the use of a simulation environment as an accurate, ethical and sustainable alternative to testing robotic prototypes in animal models and simplified phantom models. The simulation is specifically developed for robotic colonoscopy devices inside the human colon. A virtual simulation of the locomotion mechanism of a prototype robotic colonoscope and the colon was created in Ansys, and robot/colon experiments were conducted on different colon surfaces to validate simulation results. The successfully simulated propulsion force generated by the prototype produced an RMSE of 7% when compared at the optimal operating condition of the device, and 25-30% when compared to a full range of device velocities. The larger RMSE is due to physical phenomena that were not present in the simulation due to the constraints applied. The simulation, however, allowed evaluation of quantities that are difficult to measure in real settings, such as the normal interaction force between the device and the tissue wall and the stress distribution across the locomotion mechanism, as well as a phenomenon of oscillating propulsion force resulting from the device design. This work demonstrates the feasibility of using finite element simulation to shape the design and optimization of a robotic colonoscope and to understand its interaction with highly complex human anatomy.
|
|
15:35-15:40, Paper TuCT8.5 | Add to My Program |
Robotic Tissue Manipulation in Endoscopic Submucosal Dissection Via Visual Feedback |
|
Zhang, Tao | Arizona State University |
Jue, Terry | Mayo Clinic |
Marvi, Hamidreza | Arizona State University |
Keywords: Medical Robots and Systems
Abstract: Colorectal cancer is the third most commonly diagnosed cancer and the second leading cause of cancer-related deaths in the United States. Despite advancements in screening and treatment, there remains a critical need for more effective and minimally invasive methods to manage complex polyps and early-stage colorectal cancers. This study introduces a novel approach to magnetic tissue manipulation for Endoscopic Submucosal Dissection (ESD), leveraging visual feedback to enhance precision and control. We develop and evaluate the proposed system within a ROS Gazebo simulation environment, integrating a small magnetic endoscopic clip affixed to tissue, which is manipulated by an external large magnet mounted on a robotic arm. A key challenge in ESD is achieving adequate tissue exposure for precise cutting, particularly in the confined space of the colon where the endoscope is manually controlled. To address this, our system enables controlled manipulation of the magnetic clip to optimize tissue retraction. The robotic arm, guided by real-time visual feedback, dynamically adjusts the internal clip’s orientation. Multiple virtual cameras were used to validate the proposed method. The simulation results demonstrated that the robot arm successfully manipulated the internal magnetic clip to the desired tilt angle within an average of 8.4 seconds (range 5.3 to 15.2 s). Our findings suggest that robotic-assisted magnetic tissue manipulation has the potential to improve ESD success rates while reducing procedure time, paving the way for further advancements in minimally invasive endoscopic surgery.
|
|
15:40-15:45, Paper TuCT8.6 | Add to My Program |
Learning Based Estimation of Tool-Tissue Interaction Forces for Stationary and Moving Environments |
|
Nowakowski, Lukasz | Western University |
Patel, Rajnikant V. | The University of Western Ontario |
Keywords: Medical Robots and Systems, Deep Learning Methods, Haptics and Haptic Interfaces
Abstract: Accurately estimating tool-tissue interaction forces during robotics-assisted minimally invasive surgery is an important aspect of enabling haptics-based teleoperation. By collecting data regarding the state of a robot in a variety of configurations, neural networks can be trained to predict this interaction force. This paper extends existing work in this domain by collecting one of the largest known ground truth force datasets for stationary as well as moving phantoms that replicate tissue motions found in clinical procedures. Existing methods, and a new transformer-based architecture, are evaluated to demonstrate the domain gap between stationary and moving phantom tissue data and the impact that data scaling has on each architecture's ability to generalize the force estimation task. It was found that temporal networks were more sensitive to the moving domain than single-sample Feed Forward Networks (FFNs) that were trained on stationary tissue data. However, the transformer approach results in the lowest Root Mean Square Error (RMSE) when evaluating networks trained on examples of both stationary and moving phantom tissue samples. The results demonstrate the domain gap between stationary and moving surgical environments and the effectiveness of scaling datasets for increased accuracy of interaction force prediction.
|
|
15:45-15:50, Paper TuCT8.7 | Add to My Program |
RASEC: Rescaling Acquisition Strategy with Energy Constraints under Fusion Kernel for Active Incision Recommendation in Tracheotomy (I) |
|
Yue, Wenchao | The Chinese University of Hong Kong |
Bai, Fan | The Chinese University of Hong Kong |
Liu, Jianbang | The Chinese University of Hong Kong |
Ju, Feng | Nanjing University of Aeronautics and Astronautics |
Meng, Max Q.-H. | The Chinese University of Hong Kong |
Lim, Chwee Ming | National University of Singapore |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: Medical Robots and Systems, Surgical Robotics: Planning, Telerobotics and Teleoperation
Abstract: Tracheotomy is critical for patients needing prolonged intubation or airway management, where accurate incision placement is essential to avoid complications. Current techniques rely on palpating cartilage landmarks, which can be imprecise. This paper presents RASEC, an autonomous palpation-based strategy that enhances robot-assisted tracheotomy by interactively predicting acquisition points to maximize information and minimize palpation costs. We employ a Gaussian Process (GP) to model the distribution of tissue hardness, integrating anatomical data as prior input to guide palpation. A dynamic tactile sensor, based on resonant frequency, measures tissue hardness with millimeter-scale contact for secure interaction. We use kernel fusion, combining the Squared Exponential (SE) and Ornstein-Uhlenbeck (OU) kernels, to optimize Bayesian searches using laryngeal anatomical knowledge. The acquisition strategy also considers the tactile sensor’s movement distance and the robotic base link's rotation during incision localization. Simulations and physical experiments demonstrate a 53.1% reduction in sensor movement distance and a 75.2% improvement in rotation angle. The results yield an average precision of 0.932, recall of 0.973, and F1 score of 0.952, showcasing RASEC’s efficacy in exploration efficiency, cost awareness, and localization accuracy for tracheotomy procedures.
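To illustrate the kernel-fusion idea in isolation, the sketch below runs a plain GP regression whose covariance is a weighted sum of a Squared Exponential and an Ornstein-Uhlenbeck kernel. The length scales, weights, noise level, and 1-D palpation data are illustrative assumptions rather than the values or the acquisition logic used in RASEC.

```python
import numpy as np

def k_se(a, b, ell=0.02):
    """Squared Exponential kernel on 1-D inputs."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell ** 2)

def k_ou(a, b, ell=0.02):
    """Ornstein-Uhlenbeck (exponential) kernel on 1-D inputs."""
    d = np.abs(a[:, None] - b[None, :])
    return np.exp(-d / ell)

def fused_gp_posterior(x_train, y_train, x_query, w_se=0.5, w_ou=0.5, noise=1e-3):
    """GP posterior mean/variance under a weighted sum of SE and OU kernels."""
    def k(a, b):
        return w_se * k_se(a, b) + w_ou * k_ou(a, b)
    K = k(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = k(x_query, x_train)
    sol = np.linalg.solve(K, y_train)
    mean = Ks @ sol
    var = np.diag(k(x_query, x_query) - Ks @ np.linalg.solve(K, Ks.T))
    return mean, var

x = np.array([0.00, 0.01, 0.02, 0.035])   # palpated positions (m), illustrative
y = np.array([0.2, 0.8, 0.9, 0.3])        # measured tissue hardness (arbitrary units)
mu, sigma2 = fused_gp_posterior(x, y, np.linspace(0.0, 0.04, 50))
```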
|
|
TuCT9 Regular Session, 312 |
Add to My Program |
Motion Planning 3 |
|
|
Chair: Kallmann, Marcelo | Amazon Robotics |
Co-Chair: Vundurthy, Bhaskar | Carnegie Mellon University |
|
15:15-15:20, Paper TuCT9.1 | Add to My Program |
Locally Homotopic Paths: Ensuring Consistent Paths in Hierarchical Path Planning |
|
Wongpiromsarn, Tichakorn | Iowa State University |
Kallmann, Marcelo | Amazon Robotics |
Kolling, Andreas | Amazon |
Keywords: Motion and Path Planning, Integrated Planning and Control, Optimization and Optimal Control
Abstract: We consider a local planner that utilizes model predictive control to locally deviate from a prescribed global path in response to dynamic environments, taking into account the system dynamics. To ensure the consistency between the local and global paths, we introduce the concept of locally homotopic paths for paths with different origins and destinations. We then formulate a hard constraint to ensure that local paths are locally homotopic to a given global path. Additionally, we propose a cost function to penalize any violation of this requirement rather than completely prohibiting it. Experimental results show that both variants of our approach are more resilient to localization errors compared to existing methods that represent the homotopy class constraint as an envelope around the global path.
|
|
15:20-15:25, Paper TuCT9.2 | Add to My Program |
Multi-Covering a Point Set by M Disks with Minimum Total Area |
|
Guitouni, Mariem | University of Houston |
Loi, Chek-Manh | Technische Universität Braunschweig |
Perk, Michael | TU Braunschweig |
Fekete, Sándor | Technische Universität Braunschweig |
Becker, Aaron | University of Houston |
Keywords: Computational Geometry, Aerial Systems: Applications, Optimization and Optimal Control
Abstract: A common robotics sensing problem is to place sensors to robustly monitor a set of assets, where robustness is assured by requiring asset p to be monitored by at least κ(p) sensors. Given n assets that must be observed by m sensors, each with a disk-shaped sensing region, where should the sensors be placed to minimize the total area observed? We provide and analyze a fast heuristic for this problem. We then use the heuristic to initialize an exact Integer Programming solution. Subsequently, we enforce separation constraints between the sensors by modifying the integer program formulation and by changing the disk candidate set.
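For intuition about the multi-cover objective, the sketch below implements a naive greedy heuristic that repeatedly adds the candidate disk with the smallest area paid per unit of still-needed coverage until every point p is covered κ(p) times. The candidate set and data are toy assumptions; the paper's heuristic and exact integer program are more refined.

```python
import numpy as np

def greedy_multicover(points, kappa, candidates):
    """Greedy heuristic: pick the candidate disk (center, radius) whose area
    per unit of still-needed coverage is smallest, until every point p is
    covered by at least kappa[p] chosen disks.  A toy version of the idea only.
    """
    need = np.array(kappa, dtype=float)
    chosen = []
    while need.max() > 0:
        best, best_score = None, np.inf
        for c, r in candidates:
            covered = np.linalg.norm(points - c, axis=1) <= r
            gain = np.minimum(covered.astype(float), need).sum()
            if gain > 0:
                score = np.pi * r ** 2 / gain      # area paid per coverage gained
                if score < best_score:
                    best, best_score = (c, r), score
        if best is None:
            raise RuntimeError("candidate set cannot satisfy the coverage demands")
        c, r = best
        chosen.append(best)
        covered = np.linalg.norm(points - c, axis=1) <= r
        need = np.maximum(need - covered, 0)
    return chosen

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])      # assets to monitor
cands = [(p, r) for p in pts for r in (0.6, 1.2)]          # toy candidate disks
disks = greedy_multicover(pts, kappa=[2, 1, 1], candidates=cands)
```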
|
|
15:25-15:30, Paper TuCT9.3 | Add to My Program |
Non-Conservative Obstacle Avoidance for Multi-Body Systems Leveraging Convex Hulls and Predicted Closest Points |
|
Rassaerts, Lotte | Eindhoven University of Technology |
Suichies, Eke Janke | Eindhoven University of Technology |
van de Vrande, Bram | Philips |
Alonso, Marco | Company |
Meere, Bastiaan Guillermo Lorenzo | Eindhoven University of Technology |
Chong, Michelle S. | Eindhoven University of Technology |
Torta, Elena | Eindhoven University of Technology |
Keywords: Collision Avoidance, Constrained Motion Planning, Computational Geometry
Abstract: This paper introduces a novel approach that integrates future closest point predictions into the distance constraints of a collision avoidance controller, leveraging convex hulls with closest point distance calculations. By addressing abrupt shifts in closest points, this method effectively reduces collision risks and enhances controller performance. Applied to an Image Guided Therapy robot and validated through simulations and user experiments, the framework demonstrates improved distance prediction accuracy, smoother trajectories, and safer navigation near obstacles.
|
|
15:30-15:35, Paper TuCT9.4 | Add to My Program |
Adaptive Distance Functions Via Kelvin Transformation |
|
Cabral Muchacho, Rafael Ignacio | KTH Royal Institute of Technology |
Pokorny, Florian T. | KTH Royal Institute of Technology |
Keywords: Computational Geometry, Robot Safety
Abstract: The term safety in robotics is often understood as a synonym for avoidance. Although this perspective has led to progress in path planning and reactive control, a generalization of this perspective is necessary to include task semantics relevant to contact-rich manipulation tasks, especially during teleoperation and to ensure the safety of learned policies. We introduce the semantics-aware distance function and a corresponding computational method based on the Kelvin Transformation. This allows us to compute smooth distance approximations in an unbounded domain by instead solving a Laplace equation in a bounded domain. The semantics-aware distance generalizes signed distance functions by allowing the zero level set to lie inside of the object in regions where contact is allowed, effectively incorporating task semantics, such as object affordances, in an adaptive implicit representation of safe sets. In numerical experiments we show the computational viability of our method for real applications and visualize the computed function on a wrench with various semantic regions.
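As a reminder of the Kelvin transformation this abstract builds on: inversion about the unit sphere maps the unbounded exterior of a ball onto its bounded interior, and the transform of a harmonic function remains harmonic, which is what allows an unbounded Laplace problem to be solved in a bounded domain. The sketch below checks this numerically with a finite-difference Laplacian; the test function and evaluation points are arbitrary, and the sketch is unrelated to the paper's semantics-aware distance implementation.

```python
import numpy as np

def kelvin_point(x):
    """Inversion about the unit sphere: maps the unbounded exterior of the
    ball onto its (bounded) interior and vice versa."""
    return x / np.dot(x, x)

def kelvin_transform(u, x):
    """Kelvin transform (K u)(x) = |x|^{-1} u(x / |x|^2) in 3-D.
    If u is harmonic, K u is harmonic away from the origin."""
    return u(kelvin_point(x)) / np.linalg.norm(x)

def laplacian_fd(f, x, h=1e-3):
    """Finite-difference Laplacian, used only to sanity-check harmonicity."""
    val = -2.0 * len(x) * f(x)
    for i in range(len(x)):
        e = np.zeros(len(x)); e[i] = h
        val += f(x + e) + f(x - e)
    return val / h ** 2

u = lambda x: 1.0 / np.linalg.norm(x - np.array([3.0, 0.0, 0.0]))  # harmonic away from (3,0,0)
x0 = np.array([0.4, 0.2, 0.1])
print(laplacian_fd(lambda x: kelvin_transform(u, x), x0))          # approximately zero
```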
|
|
15:35-15:40, Paper TuCT9.5 | Add to My Program |
A Quantum Annealing Approach to Target Tracking |
|
Barbeau, Michel | Carleton University |
Janabi-Sharifi, Farrokh | Ryerson University |
Masnavi, Houman | Toronto Metropolitan University |
Keywords: Motion and Path Planning, Optimization and Optimal Control, Nonholonomic Motion Planning
Abstract: This paper delves into the fusion of quantum computing and robotics, focusing on motion planning in cluttered environments. Traditional algorithms struggle with complex problems where many constraints need to be satisfied. Hence, optimization-based approaches such as Constrained Quadratic Models (CQM) have become increasingly popular. Our work presents a 3D tracking algorithm based on CQM uniquely adapted for quantum computers to address computational challenges. With their parallel processing capabilities, quantum computers offer a groundbreaking approach to optimizing complex problems. We formulate the CQM problem for efficient resolution on the D-Wave quantum computer, showcasing its superiority over classical counterparts. Our application centers on real-time planning in a target-chaser tracking scenario, highlighting the quantum advantage in handling the computation complexity of constrained problems. This paper bridges the quantum-robotics gap and sets the stage for future research in quantum-enhanced robotic motion planning.
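For readers who have not met CQMs before, the sketch below assembles a toy constrained quadratic model with D-Wave's open-source dimod package: binary variables select one of a few candidate chaser waypoints, the objective is the squared distance to the target, and a single constraint enforces that exactly one waypoint is chosen. The variables, objective, and constraint are purely illustrative and are not the formulation in the paper; sampling on quantum hardware would additionally require a CQM-capable solver such as the Leap hybrid sampler.

```python
import dimod

# Binary variable x_k = 1 if the chaser moves to candidate waypoint k (toy data).
waypoints = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
target = (2.0, 0.8)
x = [dimod.Binary(f"wp_{k}") for k in range(len(waypoints))]

cqm = dimod.ConstrainedQuadraticModel()

# Objective: squared distance from the chosen waypoint to the target.
cqm.set_objective(sum(((wx - target[0]) ** 2 + (wy - target[1]) ** 2) * xk
                      for (wx, wy), xk in zip(waypoints, x)))

# Constraint: exactly one waypoint is selected in this planning step.
cqm.add_constraint(sum(x) == 1, label="one_waypoint")

# `cqm` can now be passed to a CQM-capable solver, e.g. dwave.system's
# LeapHybridCQMSampler (requires D-Wave Leap access).
```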
|
|
15:40-15:45, Paper TuCT9.6 | Add to My Program |
Provable Methods for Searching with an Imperfect Sensor |
|
Kasthurirangan, Prahlad Narasimhan | Stony Brook University |
Nguyen, Linh | Stony Brook University |
Perk, Michael | TU Braunschweig |
Chakraborty, Nilanjan | Stony Brook University |
Mitchell, Joseph | State University of New York at Stony Brook |
Keywords: Motion and Path Planning, Computational Geometry, Planning, Scheduling and Coordination
Abstract: Assume that a target is known to be present at an unknown point among a finite set of locations in the plane. We search for it using a mobile robot that has imperfect sensing capabilities. It takes time for the robot to move between locations and search a location; we have a total time budget within which to conduct the search. We study the problem of computing a search path/strategy for the robot that maximizes the probability of detection of the target. Considering non-uniform travel times between points (e.g., based on the distance between them) is crucial for search and rescue applications; such problems have been investigated to a limited extent due to their inherent complexity. In this paper, we describe fast algorithms with performance guarantees for this search problem and some variants, complement them with complexity results, and perform experiments to observe their performance.
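To make the objective concrete, the sketch below is a naive greedy baseline that repeatedly visits the location with the best expected detection probability gained per unit of travel-plus-search time until the budget runs out. Travel times, the detection model, and the greedy rule are toy assumptions; the paper's contribution is provable algorithms for this setting, not this heuristic.

```python
import numpy as np

def greedy_search_plan(locations, prior, p_detect, search_time, budget, start):
    """Pick the next location by expected detection probability gained per
    unit time (travel + search), until the time budget runs out."""
    pos, t_left, mass = np.array(start, float), budget, np.array(prior, float)
    plan, p_found = [], 0.0
    remaining = set(range(len(locations)))
    while remaining:
        best, best_rate = None, 0.0
        for i in remaining:
            cost = np.linalg.norm(np.array(locations[i]) - pos) + search_time
            if cost <= t_left:
                rate = mass[i] * p_detect / cost
                if rate > best_rate:
                    best, best_rate = i, rate
        if best is None:
            break
        cost = np.linalg.norm(np.array(locations[best]) - pos) + search_time
        p_found += mass[best] * p_detect
        mass[best] *= (1.0 - p_detect)   # posterior mass left after a missed look
        pos, t_left = np.array(locations[best], float), t_left - cost
        plan.append(best)
        remaining.discard(best)
    return plan, p_found

plan, p = greedy_search_plan([(0, 0), (3, 1), (1, 4)], prior=[0.5, 0.3, 0.2],
                             p_detect=0.8, search_time=1.0, budget=8.0, start=(0, 0))
```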
|
|
TuCT10 Regular Session, 313 |
Add to My Program |
Multi-Robot Swarms 2 |
|
|
Chair: Artemiadis, Panagiotis | University of Delaware |
Co-Chair: Pimenta, Luciano | Universidade Federal De Minas Gerais |
|
15:15-15:20, Paper TuCT10.1 | Add to My Program |
Emergence of Collective Behaviors for the Swarm Robotics through Visual Attention-Based Selective Interaction |
|
Zheng, Zhicheng | Northwestern Polytechnical University |
Zhou, Yongjian | Northwestern Polytechnical University |
Xiang, Yalun | Northwestern Polytechnical University |
Lei, Xiaokang | Northwestern Polytechnical University |
Peng, Xingguang | Northwestern Polytechnical University |
Keywords: Swarm Robotics, Biologically-Inspired Robots, Probability and Statistical Methods
Abstract: Numerous local interaction mechanisms have been proposed to achieve collective behaviors in swarm robotics. However, these mechanisms require robots to explicitly obtain the velocity of their neighbors as the sensory input to make motion decisions. This further poses great challenges in real-world applications of swarm robotics. In this letter, inspired by the chasing behavior in large-scale migrating locusts, we propose a visual attention-based model to achieve collective behaviors with positional interaction. Through numerical simulations, we find the emergence of three typical collective behaviors: flocking, milling and swarming. To gain deep insights into the newly proposed model, we investigate the impact of group size and sensory blindness on the emergence of collective behaviors. Moreover, by using the mean field analysis framework, we present the theoretical proof of the emergence of flocking and milling behavior. Furthermore, to validate the feasibility of our proposed model, we reproduce the flocking and milling behavior with up to 50 physical robots. Robotic experiments demonstrate the promising ability of the newly proposed model to achieve collective behaviors in the absence of velocity alignment.
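The flavor of a position-only (velocity-free) interaction rule can be illustrated with the minimal loop below: each agent attends to a single neighbor, here simply the nearest one inside its field of view, and turns toward it. The selection rule, gains, and parameters are assumptions for illustration and do not reproduce the published attention model.

```python
import numpy as np

def step(pos, heading, speed=0.05, fov=np.deg2rad(120), turn_gain=0.5, dt=0.1):
    """One update of a position-only 'chase the attended neighbor' rule.
    Each agent picks the nearest neighbor inside its field of view and turns
    toward it; no neighbor velocities are sensed or shared."""
    n = len(pos)
    new_heading = heading.copy()
    for i in range(n):
        rel = pos - pos[i]
        dist = np.linalg.norm(rel, axis=1)
        dist[i] = np.inf
        bearing = np.arctan2(rel[:, 1], rel[:, 0]) - heading[i]
        bearing = (bearing + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi]
        visible = np.abs(bearing) <= fov / 2
        visible[i] = False
        if visible.any():
            j = np.argmin(np.where(visible, dist, np.inf))  # attended neighbor
            new_heading[i] += turn_gain * bearing[j] * dt
    pos = pos + speed * dt * np.c_[np.cos(new_heading), np.sin(new_heading)]
    return pos, new_heading

rng = np.random.default_rng(0)
pos = rng.uniform(-1, 1, (20, 2))
heading = rng.uniform(-np.pi, np.pi, 20)
for _ in range(500):
    pos, heading = step(pos, heading)
```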
|
|
15:20-15:25, Paper TuCT10.2 | Add to My Program |
Safe Radial Segregation Algorithm for Swarms of Dubins-Like Robots |
|
Bernardes Ferreira Filho, Edson | Royal Holloway University of London |
Brochero Giraldo, David Felipe | Universidade Federal De Minas Gerais |
Dias Nunes, Arthur Henrique | Universidade Federal De Minas Gerais |
Pimenta, Luciano | Universidade Federal De Minas Gerais |
Keywords: Swarm Robotics, Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems
Abstract: This work addresses the problem of radially segregating heterogeneous robotic swarms. Such swarms are those composed of different groups of robots. Unlike other works on segregation in the literature, we propose a controller for Dubins-like robots, motivated by autonomous aerial, wheeled, and underwater vehicles. Our controller can drive the robots individually to converge to circles that are shared only by robots of the same group. We present a heuristic and a collision avoidance scheme in which the information required is locally acquired. We present several simulations widely varying the number of robots per group and the number of groups in which segregation is always reached and collisions between robots are always avoided.
|
|
15:25-15:30, Paper TuCT10.3 | Add to My Program |
Impossibility of Self-Organized Aggregation without Computation |
|
Steinberg, Roy | Technion--Israel Institute of Technology |
Solovey, Kiril | Technion--Israel Institute of Technology |
Keywords: Swarm Robotics, Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents
Abstract: In their seminal work, Gauci et al. (2014) studied the fundamental task of aggregation, wherein multiple robots need to gather without an a priori agreed-upon meeting location, using extremely limited hardware. That paper considered differential-drive robots that are memoryless and unable to compute. Moreover, the robots cannot communicate with one another and are only equipped with a simple sensor that determines whether another robot is directly in front of them. Despite those severe limitations, Gauci et al. introduced a controller and proved mathematically that it aggregates a system of two robots for any initial state. Unfortunately, for larger systems, the same controller aggregates empirically in many cases but not all. Thus, the question of whether there exists a controller that aggregates for any number of robots remains open. In this paper, we show that no such controller exists by investigating the geometric structure of controllers. In addition, we disprove the aggregation proof of the aforementioned paper for two robots and present an alternative controller alongside a simple and rigorous aggregation proof.
|
|
15:30-15:35, Paper TuCT10.4 | Add to My Program |
Learning Adversarial Policies for Swarm Leader Identification Using a Probing Agent |
|
Bachoumas, Stergios | University of Delaware |
Artemiadis, Panagiotis | University of Delaware |
Keywords: Swarm Robotics
Abstract: This study introduces a novel approach to swarm leader identification (SLI) in multi-agent robot systems by employing a physical adversary interacting with the swarm in the same environment. We develop a new simulation environment to study the SLI problem and train an adversary, which we term the prober, to solve the SLI problem using forceful interactions with the swarm as its guiding information source. The prober's policy is modeled using the simplified structure state space sequence (S5) model and trained with the Proximal Policy Optimization (PPO) algorithm. The prober only has access to the information on the relative positions of the other agents. We evaluate our approach through extensive simulations using two performance metrics and validate the sim-to-real transfer through robot experiments. Results on evaluating the performance in 10,000 different testing scenarios demonstrate that our method finds the leader's identity in the vast majority (95.7%) of the cases, regardless of the initial leader selection during training. The proposed system represents the first instance of learning-based automatic identification of leader agents in a swarm. This capability is crucial for enabling efficient and robust human-swarm interaction, understanding artificial swarm behaviors, and analyzing latent behaviors in biological swarms in nature.
|
|
15:35-15:40, Paper TuCT10.5 | Add to My Program |
Realizing Emergent Collective Behaviors through Robotic Swarmalators |
|
Beattie, Richard | MIT |
Ceron, Steven | Massachusetts Institute of Technology |
Rus, Daniela | MIT |
Keywords: Cellular and Modular Robots, Swarm Robotics, Multi-Robot Systems
Abstract: Swarmalators move as a function of their pairwise phase interactions, and control their phase as a function of their relative position or motion to other agents. This enables dual sync and swarm behaviors that mimic those exhibited by diverse natural and artificial swarms; these behaviors have almost entirely been explored only through computational simulations. Here, we realize through a 15-robot collective many of the predicted swarmalator behaviors when agents are chiral and non-chiral, when there is frequency coupling, and when the natural frequency distribution is homogeneous and heterogeneous. This work presents an experimental platform that can realize many theoretically predicted collective behaviors, it sheds light on the differences between the simulations and experiments, and it will serve in future studies to realize swarmalator and active matter collective behaviors.
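For context, swarmalator dynamics are commonly written with a phase-dependent spatial attraction and a distance-weighted phase coupling. The sketch below integrates one common form of these equations (in the style of the original swarmalator model) with illustrative parameters; it is a simulation sketch, not the robot controller used in this work.

```python
import numpy as np

def swarmalator_step(x, theta, omega, J=0.5, K=-0.2, A=1.0, B=1.0, dt=0.05):
    """One Euler step of a standard swarmalator model: positions attract and
    repel with a phase-dependent gain, while phases synchronize through a
    distance-weighted Kuramoto-like term."""
    n = len(x)
    dx = np.zeros_like(x)
    dtheta = np.zeros_like(theta)
    for i in range(n):
        rel = x - x[i]
        dist = np.linalg.norm(rel, axis=1)
        dist[i] = np.inf                                   # exclude self-interaction
        dphi = theta - theta[i]
        attract = (A + J * np.cos(dphi))[:, None] * rel / dist[:, None]
        repel = B * rel / (dist ** 2)[:, None]
        dx[i] = np.sum(attract - repel, axis=0) / n
        dtheta[i] = omega[i] + K / n * np.sum(np.sin(dphi) / dist)
    return x + dt * dx, theta + dt * dtheta

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, (15, 2))
theta = rng.uniform(-np.pi, np.pi, 15)
omega = np.zeros(15)                                       # homogeneous natural frequencies
for _ in range(2000):
    x, theta = swarmalator_step(x, theta, omega)
```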
|
|
15:40-15:45, Paper TuCT10.6 | Add to My Program |
Speed and Density Planning for a Speed-Constrained Robot Swarm through a Virtual Tube |
|
Song, Wenqi | Beihang University |
Gao, Yan | School of Automation Science and Electrical Engineering, Beihang |
Quan, Quan | Beihang University |
Keywords: Constrained Motion Planning, Motion Control, Multi-Robot Systems
Abstract: The planning and control of a robot swarm in a complex environment has attracted increasing attention. To this end, the concept of virtual tubes has been taken up in our previous work. Specifically, a virtual tube with varying widths has been planned to avoid collisions with obstacles in a complex environment. Based on the planned virtual tube for a large number of speed-constrained robots, the average forward speed and density along the virtual tube are further planned in this paper to improve safety and efficiency. Compared with the existing methods, the proposed method is founded upon global information and is applicable to traversing confined spaces with speed-constrained robot swarms. Numerical simulations and experiments are conducted to show that the safety and efficiency of the passing-through process are improved. A video about the simulations and experiments is available at https://youtu.be/F3Xg1vUcxws.
|
|
TuCT11 Regular Session, 314 |
Add to My Program |
Human-Robot Collaboration 1 |
|
|
Chair: Rocco, Paolo | Politecnico Di Milano |
Co-Chair: Stepputtis, Simon | Virginia Tech |
|
15:15-15:20, Paper TuCT11.1 | Add to My Program |
Let Me Help You! Neuro-Symbolic Short-Context Action Anticipation |
|
Bhagat, Sarthak | Scaled Foundations |
Li, Samuel | Carnegie Mellon University |
Campbell, Joseph | Purdue University |
Xie, Yaqi | Carnegie Mellon University |
Sycara, Katia | Carnegie Mellon University |
Stepputtis, Simon | Carnegie Mellon University |
Keywords: Intention Recognition, Human-Robot Collaboration, Visual Learning
Abstract: In an era where robots are becoming available to the general public, the applicability of assistive robotics extends across numerous aspects of daily life, including in-home robotics. This work presents a novel approach for such systems, leveraging long-horizon action anticipation from short-observation contexts. In an assistive cooking task, we demonstrate that predicting human intention leads to effective collaboration between humans and robots. Compared to prior approaches, our method halves the required observation time of human behavior before accurate future predictions can be made, thus allowing for quick and effective task support from short contexts. To provide sufficient context in such scenarios, our proposed method analyzes the human user and their interaction with surrounding scene objects by imbuing the system with additional domain knowledge, encoding the scene objects' affordances. We integrate this knowledge into a transformer-based action anticipation architecture, which alters the attention mechanism between different visual features by either boosting or attenuating the attention between them. Through this approach, we achieve an up to 9% improvement on two common action anticipation benchmarks, namely 50Salads and Breakfast. After predicting a sequence of future actions, our system selects an appropriate assistive action that is subsequently executed on a robot for a joint salad preparation task between a human and a robot.
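The boosting and attenuation of attention described above can be illustrated in a few lines: adding the log of an affordance-derived factor to the attention logits multiplies the softmax weights by that factor before renormalization. The sketch below uses hypothetical token roles and factors and is not the paper's architecture.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def affordance_biased_attention(Q, K, V, boost):
    """Scaled dot-product attention whose weights are boosted (>1) or
    attenuated (<1) by an affordance prior between query and key tokens.
    Adding log(boost) to the logits multiplies the softmax weights by
    `boost` before renormalization."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + np.log(boost)
    return softmax(logits) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 16)) for _ in range(3))   # 4 tokens, dim 16
boost = np.ones((4, 4))
boost[0, 2] = 3.0   # e.g. a hand token attends more to a graspable object token
boost[0, 1] = 0.3   # attenuate attention to an irrelevant object token
out = affordance_biased_attention(Q, K, V, boost)
```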
|
|
15:20-15:25, Paper TuCT11.2 | Add to My Program |
Evaluating Robotic Performative Autonomy in Collaborative Contexts Impacted by Latency |
|
Sousa Silva, Rafael | Colorado School of Mines |
Smith, Cailyn | Colorado School of Mines |
Ferreira Bezerra, Lara | Colorado School of Mines |
Williams, Tom | Colorado School of Mines |
Keywords: Human-Robot Collaboration, Space Robotics and Automation, Human-Centered Automation
Abstract: Maintaining Situational Awareness (SA) is critical in space exploration contexts, yet made particularly difficult due to the presence of communication latency. In order to increase human SA without inducing cognitive overload, researchers have proposed Performative Autonomy (PA), in which robots intentionally interact at a lower level of autonomy than they are capable of. While researchers have demonstrated positive impacts of PA on team performance even under high latency, previous work on PA has not examined how the benefits of PA might be mediated by latency. In this work, we thus evaluate the impact of latency and PA on trust, SA, and human perceptions of robot intelligence and autonomy. Our results suggest that lower performed autonomy leads to increased cognitive load, especially when robot communication happens frequently and latency is present. In addition, we observe no effect of the PA strategies used within our experimental paradigm on SA, and instead find evidence that operating under high latency leads to negative perceptions of robots regardless of choice of PA strategy.
|
|
15:25-15:30, Paper TuCT11.3 | Add to My Program |
SYNERGAI: Perception Alignment for Human-Robot Collaboration |
|
Chen, Yixin | National Key Laboratory of General Artificial Intelligence, BIGA |
Zhang, Guoxi | Beijing Institute of General Artificial Intelligence |
Zhang, Yaowei | BIGAI |
Xu, Hongming | Beijing Institute for General Artificial Intelligence |
Zhi, Peiyuan | Beijing Institute for General Artificial Intelligence |
Li, Qing | Beijing Institute for General Artificial Intelligence |
Huang, Siyuan | Beijing Institute for General Artificial Intelligence |
Keywords: Domestic Robotics
Abstract: Recently, large language models (LLMs) have shown strong potential in facilitating human-robotic interaction and collaboration. However, existing LLM-based systems often overlook the misalignment between human and robot perceptions, which hinders their effective communication and real-world robot deployment. To address this issue, we introduce SYNERGAI, a unified system designed to achieve both perceptual alignment and human-robot collaboration. At its core, SYNERGAI employs 3D Scene Graph (3DSG) as its explicit and innate representation. This enables the system to leverage LLM to break down complex tasks and allocate appropriate tools in intermediate steps to extract relevant information from the 3DSG, modify its structure, or generate responses. Importantly, SYNERGAI incorporates an automatic mechanism that enables perceptual misalignment correction with users by updating its 3DSG in real-time. In a zero-shot manner, SYNERGAI achieves comparable performance with the data-driven models in ScanQA. Through comprehensive experiments across 10 real-world scenes, SYNERGAI demonstrates its effectiveness in establishing common ground with humans, realizing a success rate of 61.9% in alignment tasks. It also significantly improves the success rate from 3.7% to 45.68% on novel tasks by transferring the knowledge acquired during alignment.
|
|
15:30-15:35, Paper TuCT11.4 | Add to My Program |
Digital Model-Driven Genetic Algorithm for Optimizing Layout and Task Allocation in Human-Robot Collaborative Assemblies |
|
Cella, Christian | Politecnico Di Milano |
Robin, Matteo Bruce | Politecnico Di Milano |
Faroni, Marco | Politecnico Di Milano |
Zanchettin, Andrea Maria | Politecnico Di Milano |
Rocco, Paolo | Politecnico Di Milano |
Keywords: Human-Robot Collaboration, Design and Human Factors
Abstract: This paper addresses the optimization of human-robot collaborative work-cells before their physical deployment. Most of the time, such environments are designed based on the experience of the system integrators, often leading to sub-optimal solutions. Accurate simulators of the robotic cell, accounting for the presence of the human as well, are available today and can be used in the pre-deployment phase. We propose an iterative optimization scheme where a digital model of the work-cell is updated based on a genetic algorithm. The methodology focuses on the layout optimization and task allocation, encoding both problems simultaneously in the design variables handled by the genetic algorithm, while the task scheduling problem depends on the result of the upper-level one. The final solution balances conflicting objectives in the fitness function and is validated to show the impact of the objectives with respect to a baseline, which represents possible initial choices selected based on human judgement.
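A compact way to picture the joint encoding is a chromosome that concatenates layout variables with task-allocation genes and is evolved by standard selection, crossover, and mutation. In the sketch below the fitness is a stand-in; in the actual framework this value would come from simulating the digital model of the work-cell. All variables and operators are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_TASKS, POP, GENS = 6, 40, 100

def random_individual():
    # chromosome = [robot base x, y (layout)] + [task allocation genes: 1 = robot, 0 = human]
    layout = rng.uniform(0.0, 1.5, size=2)
    alloc = rng.integers(0, 2, size=N_TASKS)
    return np.concatenate([layout, alloc])

def fitness(ind):
    """Stand-in objective; in the real framework this would come from the
    digital model of the work-cell (cycle time, ergonomics, reachability)."""
    layout, alloc = ind[:2], ind[2:]
    robot_load = alloc.sum()
    balance = abs(robot_load - (N_TASKS - robot_load))        # workload balance
    reach = np.linalg.norm(layout - np.array([0.8, 0.6]))     # distance to a workpiece
    return -(balance + 2.0 * reach)                           # higher is better

pop = [random_individual() for _ in range(POP)]
for _ in range(GENS):
    scores = np.array([fitness(ind) for ind in pop])
    parents = [pop[i] for i in np.argsort(scores)[-POP // 2:]]     # truncation selection
    children = []
    while len(children) < POP - len(parents):
        a, b = rng.choice(len(parents), 2, replace=False)
        cut = rng.integers(1, len(parents[a]))
        child = np.concatenate([parents[a][:cut], parents[b][cut:]])   # one-point crossover
        if rng.random() < 0.2:                                         # mutation
            child[:2] += rng.normal(0, 0.05, 2)
            flip = 2 + rng.integers(N_TASKS)
            child[flip] = 1 - child[flip]
        children.append(child)
    pop = parents + children
best = max(pop, key=fitness)
```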
|
|
15:35-15:40, Paper TuCT11.5 | Add to My Program |
Context-Aware Collaborative Pushing of Heavy Objects Using Skeleton-Based Intention Prediction |
|
Solak, Gokhan | Italian Institute of Technology, Genoa |
Giardini Lahr, Gustavo Jose | Hospital Israelita Albert Einstein |
Ozdamar, Idil | HRI2 Lab., Istituto Italiano Di Tecnologia. Dept. of Informatics |
Ajoudani, Arash | Istituto Italiano Di Tecnologia |
Keywords: Human-Robot Collaboration, Intention Recognition, Physical Human-Robot Interaction
Abstract: In physical human-robot interaction, force feedback has been the most common sensing modality to convey the human intention to the robot. It is widely used in admittance control to allow the human to direct the robot. However, it cannot be used in scenarios where direct force feedback is not available since manipulated objects are not always equipped with a force sensor. In this work, we study one such scenario: the collaborative pushing and pulling of heavy objects on frictional surfaces, a prevalent task in industrial settings. When humans do it, they communicate through verbal and non-verbal cues, where body poses, and movements often convey more than words. We propose a novel context-aware approach using Directed Graph Neural Networks to analyze spatio-temporal human posture data to predict human motion intention for non-verbal collaborative physical manipulation. Our experiments demonstrate that robot assistance significantly reduces human effort and improves task efficiency. The results indicate that incorporating posture-based context recognition, either together with or as an alternative to force sensing, enhances robot decision-making and control efficiency.
|
|
15:40-15:45, Paper TuCT11.6 | Add to My Program |
Interactive Distance Field Mapping and Planning to Enable Human-Robot Collaboration |
|
Ali, Usama | Technische Hochschule Würzburg Schweinfurt |
Wu, Lan | University of Technology Sydney |
Mueller, Adrian | Friedrich-Alexander-Universität Erlangen-Nürnberg |
Sukkar, Fouad | University of Technology Sydney |
Kaupp, Tobias | Technical University of Applied Sciences Würzburg-Schweinfurt |
Vidal-Calleja, Teresa A. | University of Technology Sydney |
Keywords: Mapping, Human-Robot Collaboration
Abstract: Human-robot collaborative applications require scene representations that are kept up-to-date and facilitate safe motions in dynamic scenes. In this letter, we present an interactive distance field mapping and planning (IDMP) framework that handles dynamic objects and collision avoidance through an efficient representation. We define interactive mapping and planning as the process of creating and updating the representation of the scene online while simultaneously planning and adapting the robot's actions based on that representation. The key aspect of this work is an efficient Gaussian Process field that performs incremental updates and handles dynamic objects reliably by identifying moving points via a simple and elegant formulation based on queries from a temporary latent model. In terms of mapping, IDMP is able to fuse point cloud data from single and multiple sensors, query the free space at any spatial resolution, and deal with moving objects without semantics. In terms of planning, IDMP allows seamless integration with gradient-based reactive planners facilitating dynamic obstacle avoidance for safe human-robot interactions. Our mapping performance is evaluated on both real and synthetic datasets. A comparison with similar state-of-the-art frameworks shows superior performance when handling dynamic objects and comparable or better performance in the accuracy of the computed distance and gradient field. Finally, we show how the framework can be used for fast motion planning in the presence of moving objects both in simulated and real-world scenes. An accompanying video, code, and datasets are made publicly available.
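One common way to obtain a distance field and its gradient from a Gaussian Process over surface points is the log-GPIS construction sketched below, where an exponential-kernel regression of an occupancy field is converted to distance via a log transform. The kernel choice, parameters, and finite-difference gradient are illustrative and not necessarily the exact IDMP formulation.

```python
import numpy as np

class LogGPDistanceField:
    """Log-GPIS-style distance field: regress an occupancy field over surface
    points with an exponential kernel and recover distance as
    d(x) ~= -log(mu(x)) / lam.  Gradients come from finite differences."""
    def __init__(self, surface_pts, lam=30.0, noise=1e-4):
        self.X, self.lam = surface_pts, lam
        K = self._k(surface_pts, surface_pts) + noise * np.eye(len(surface_pts))
        self.alpha = np.linalg.solve(K, np.ones(len(surface_pts)))

    def _k(self, a, b):
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        return np.exp(-self.lam * d)

    def distance(self, x):
        mu = (self._k(np.atleast_2d(x), self.X) @ self.alpha)[0]
        return -np.log(max(mu, 1e-12)) / self.lam

    def gradient(self, x, h=1e-3):
        g = np.zeros(len(x))
        for i in range(len(x)):
            e = np.zeros(len(x)); e[i] = h
            g[i] = (self.distance(x + e) - self.distance(x - e)) / (2 * h)
        return g

# Toy surface: a circle of points at height 0.5 m.
pts = np.array([[np.cos(a), np.sin(a), 0.5] for a in np.linspace(0, 2 * np.pi, 60)])
field = LogGPDistanceField(pts)
print(field.distance(np.array([0.0, 0.0, 0.5])), field.gradient(np.array([2.0, 0.0, 0.5])))
```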
|
|
TuCT12 Regular Session, 315 |
Add to My Program |
Calibration 3 |
|
|
Chair: Krovi, Venkat | Clemson University |
Co-Chair: Yuan, Shenghai | Nanyang Technological University |
|
15:15-15:20, Paper TuCT12.1 | Add to My Program |
A Stochastic Cloning Square-Root Information Filter with Accurate Feature Tracking for Visual-Inertial Odometry |
|
Hu, Deshun | Harbin Institute of Technology |
Keywords: Visual-Inertial SLAM, SLAM, Localization
Abstract: In this work, we introduce an enhanced square-root information filter for visual-inertial odometry. This filter utilizes stochastic cloning, implemented via Gaussian elimination, to facilitate time offset calibration and feature anchor changes. By using single-precision numbers within the filter, we significantly reduce computational load and memory requirements. In addition, we employ a fast Mahalanobis distance test and block Householder triangulation to accelerate the calculations. To mitigate feature drift from frame-to-frame optical flow, we create keyframes at regular intervals and refine long-tracked features between them. We use affine optical flow to compensate for patch deformations induced by possible large spatial transformations between keyframes. An analytical approach to computing the affine transformation is proposed. Experiments conducted on real-world data show that the proposed method achieves state-of-the-art performance at a much faster speed.
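The Mahalanobis gating mentioned above is the standard chi-square innovation test sketched below: a measurement is accepted only if its whitened residual norm falls under the quantile for the residual dimension. This is the textbook version, not the paper's accelerated implementation.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_gate(residual, S, confidence=0.95):
    """Accept a measurement iff  r^T S^{-1} r  is below the chi-square
    quantile for the residual dimension (standard innovation gating)."""
    d2 = residual @ np.linalg.solve(S, residual)
    return d2 < chi2.ppf(confidence, df=len(residual)), d2

r = np.array([0.8, -1.1])                       # reprojection residual (pixels), illustrative
S = np.array([[2.0, 0.1], [0.1, 2.5]])          # innovation covariance, illustrative
accept, d2 = mahalanobis_gate(r, S)
```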
|
|
15:20-15:25, Paper TuCT12.2 | Add to My Program |
Large-Scale UWB Anchor Calibration and One-Shot Localization Using Gaussian Process |
|
Yuan, Shenghai | Nanyang Technological University |
Lou, Boyang | Beijing University of Posts and Telecommunications |
Nguyen, Thien-Minh | Nanyang Technological University |
Yin, Pengyu | Nanyang Technological University |
Li, Jianping | Nanyang Technological University |
Xu, Xinhang | Nanyang Technological University |
Cao, Muqing | Carnegie Mellon University |
Xu, Jie | Harbin Institute of Technology |
Chen, Siyu | Nanyang Technological University |
Xie, Lihua | Nanyang Technological University |
Keywords: Range Sensing, Localization, Factory Automation
Abstract: Ultra-wideband (UWB) is gaining popularity with devices like AirTags for precise home item localization but faces significant challenges when scaled to large environments like seaports. The main challenges are calibration and localization in obstructed conditions, which are common in logistics environments. Traditional calibration methods, dependent on line-of-sight (LoS), are slow, costly, and unreliable in seaports and warehouses, making large-scale localization a significant pain point in the industry. To overcome these challenges, we propose a UWB-LiDAR fusion-based calibration and one-shot localization framework. Our method uses Gaussian Processes to estimate anchor positions from continuous-time LiDAR-inertial odometry with sampled UWB ranges. This approach ensures accurate and reliable calibration with just one round of sampling in large-scale areas, i.e., 600 x 450 m². Due to LoS issues, UWB-only localization can be problematic, even when anchor positions are known. We demonstrate that by applying a UWB-range filter, the search range for LiDAR loop closure descriptors is significantly reduced, improving both accuracy and speed. This concept can be applied to other loop closure detection methods, enabling cost-effective localization in large-scale warehouses and seaports. It significantly improves precision in challenging environments where UWB-only and LiDAR-inertial methods fall short. We will open-source our datasets and calibration codes for community use.
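To illustrate the geometry of anchor calibration from ranges, the sketch below runs a Gauss-Newton least-squares fit of a single anchor position given robot (tag) positions from an odometry trajectory and the corresponding UWB ranges. The paper's continuous-time, Gaussian-process formulation is richer; this is only a minimal stand-in with synthetic data.

```python
import numpy as np

def estimate_anchor(robot_positions, ranges, iters=20):
    """Gauss-Newton fit of a single UWB anchor position p minimizing
    sum_k (||x_k - p|| - r_k)^2, with x_k the tag positions along the
    trajectory and r_k the sampled UWB ranges."""
    p = robot_positions.mean(axis=0)             # crude initialization
    for _ in range(iters):
        diff = robot_positions - p
        dist = np.linalg.norm(diff, axis=1)
        res = dist - ranges                      # range residuals
        J = -diff / dist[:, None]                # Jacobian d(dist)/dp
        delta = np.linalg.lstsq(J, -res, rcond=None)[0]
        p = p + delta
    return p

rng = np.random.default_rng(0)
anchor = np.array([12.0, -4.0, 3.0])                               # synthetic ground truth
traj = rng.uniform([-20, -20, 0], [20, 20, 2], size=(200, 3))      # synthetic tag positions
meas = np.linalg.norm(traj - anchor, axis=1) + rng.normal(0, 0.1, 200)
print(estimate_anchor(traj, meas))                                 # close to the true anchor
```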
|
|
15:25-15:30, Paper TuCT12.3 | Add to My Program |
Online Identification of Skidding Modes with Interactive Multiple Model Estimation |
|
Salvi, Ameya | Clemson University |
Ala, Pardha Sai Krishna | Clemson University |
Smereka, Jonathon M. | U.S. Army TARDEC |
Brudnak, Mark | US Army DEVCOM-GVSC |
Gorsich, David | The U.S. Army Ground Vehicle Systems Center |
Schmid, Matthias | Clemson University |
Krovi, Venkat | Clemson University |
Keywords: Field Robots, Failure Detection and Recovery, Calibration and Identification
Abstract: Skid-steered wheel mobile robots (SSWMRs) operate in a variety of outdoor environments exhibiting motion behaviors dominated by the effects of complex wheel-ground interactions. Characterizing these interactions is crucial both from the immediate robot autonomy perspective (for motion prediction and control) as well as a long-term predictive maintenance and diagnostics perspective. An ideal solution entails capturing precise state measurements for decisions and controls, which is considerably difficult, especially in increasingly unstructured outdoor regimes of operations for these robots. In this milieu, a framework to identify pre-determined discrete modes of operation can considerably simplify the motion model identification process. To this end, we propose an interactive multiple model (IMM) based filtering framework to probabilistically identify predefined robot operation modes that could arise due to traversal in different terrains or loss of wheel traction.
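For readers unfamiliar with interactive multiple model (IMM) estimation, its core bookkeeping is the Markov mixing of mode probabilities followed by a likelihood-weighted update. The sketch below shows only that textbook mode-probability step, not the authors' full filter bank for skidding modes.

```python
import numpy as np

def imm_mode_probabilities(mu_prev, transition, likelihoods):
    """One IMM cycle of mode-probability bookkeeping.

    mu_prev     : (M,)   previous mode probabilities
    transition  : (M, M) Markov mode-transition matrix, rows sum to 1
    likelihoods : (M,)   measurement likelihood of each mode's filter
    Returns (mixing_weights, mu_new).
    """
    # Predicted probability of being in mode j before the measurement.
    c_j = transition.T @ mu_prev                                  # (M,)
    # Mixing weights w[i, j]: probability of having been in mode i given mode j now.
    mixing = (transition * mu_prev[:, None]) / c_j[None, :]
    # Posterior mode probabilities after incorporating the measurement.
    mu_new = likelihoods * c_j
    mu_new /= mu_new.sum()
    return mixing, mu_new
```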
|
|
15:30-15:35, Paper TuCT12.4 | Add to My Program |
RLCNet: A Novel Deep Feature-Matching-Based Method for Online Target-Free Radar-LiDAR Calibration |
|
Luan, Kai | Intelligence Science and Technology, National University of Defense Technology |
Shi, Chenghao | NUDT |
Chen, Xieyuanli | National University of Defense Technology |
Fan, Rui | Tongji University |
Zheng, Zhiqiang | National University of Defense Technology |
Lu, Huimin | National University of Defense Technology |
Keywords: Localization, Deep Learning for Visual Perception
Abstract: While millimeter-wave radars are widely used in robotics and autonomous driving, extrinsic calibration with other sensors remains challenging due to the sparsity and uncertainty of radar point clouds. In this paper, we propose a novel deep feature-matching-based online extrinsic calibration approach for a 4D millimeter-wave radar and 3D LiDAR system. We formulate the calibration problem as a cross-modal point cloud registration task, initiating with keypoint-level matching followed by dense matching refinement. Efficient yet powerful neural networks are employed to extract prior keypoint matches, which are then expanded to surrounding regions, establishing dense point correspondences. Our approach effectively leverages the majority of the information from millimeter-wave radar, mitigating the impact of radar point cloud sparsity. We evaluate our approach on two datasets, and experimental results demonstrate that it outperforms state-of-the-art baseline methods and achieves an average improvement of 66.96% in calibration success rate, while reducing translational error and rotational error by 23.84% and 30.31%, respectively. Our implementation will be made open-source at https://github.com/nubot-nudt/RLCNet.
|
|
15:35-15:40, Paper TuCT12.5 | Add to My Program |
Universal Online Temporal Calibration for Optimization-Based Visual-Inertial Navigation Systems |
|
Fan, Yunfei | ByteDance Inc |
Zhao, Tianyu | Bytedance |
Guo, Linan | China University of Mining and Technology (Beijing) |
Chen, Chen | Bytedance Inc |
Wang, Xin | Bytedance |
Zhou, Fengyi | ByteDance Inc |
Keywords: Visual-Inertial SLAM, Localization, Sensor Fusion
Abstract: 6-Degree of Freedom (6DoF) motion estimation with a combination of visual and inertial sensors is a growing area with numerous real-world applications. However, precise calibration of the time offset between these two sensor types is a prerequisite for accurate and robust tracking. To address this, we propose a universal online temporal calibration strategy for optimization-based visual-inertial navigation systems. Technically, we incorporate the time offset as a state parameter in the optimization residual model to align the IMU state to the corresponding image timestamp using the time offset, angular velocity, and translational velocity. This allows the temporal misalignment to be optimized alongside other tracking states during the process. As our method only modifies the structure of the residual model, it can be applied to various optimization-based frameworks with different tracking frontends. We evaluate our calibration method with both EuRoC and simulation data, and extensive experiments demonstrate that our approach provides more accurate time offset estimation and faster convergence, particularly in the presence of noisy sensor data. The experimental code is available at https://github.com/bytedance/Ts_Online_Optimization.
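The key alignment idea (shifting the IMU state to the image timestamp by the unknown offset using the current angular and translational velocity) can be written as a first-order correction. The snippet below is a generic sketch of that alignment, not the paper's residual model; the small-angle rotation update and all names are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def align_state_to_image(p, R, v, omega, t_d):
    """First-order alignment of an IMU state to the image timestamp.

    p, v   : position and translational velocity in the world frame, shape (3,)
    R      : body-to-world rotation (scipy Rotation)
    omega  : body angular velocity, shape (3,)
    t_d    : estimated camera-IMU time offset in seconds
    """
    p_cam_time = p + v * t_d                                  # translate along current velocity
    R_cam_time = R * Rotation.from_rotvec(omega * t_d)        # small rotation about omega
    return p_cam_time, R_cam_time
```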
|
|
15:40-15:45, Paper TuCT12.6 | Add to My Program |
Multi-Camera Hand-Eye Calibration for Human-Robot Collaboration in Industrial Robotic Workcells |
|
Allegro, Davide | University of Padova |
Terreran, Matteo | University of Padova |
Ghidoni, Stefano | University of Padova |
Keywords: Calibration and Identification, Sensor Networks, Human-Robot Collaboration
Abstract: In industrial scenarios, effective human-robot collaboration relies on multi-camera systems to robustly monitor human operators despite the occlusions that typically show up in a robotic workcell. In this scenario, precise localization of the person in the robot coordinate system is essential, making the hand-eye calibration of the camera network critical. This process presents significant challenges when high calibration accuracy must be achieved in a short time to minimize production downtime, and when dealing with extensive camera networks used for monitoring wide areas, such as industrial robotic workcells. Our paper introduces an innovative and robust multi-camera hand-eye calibration method, designed to optimize each camera’s pose relative to both the robot’s base and to each other camera. This optimization integrates two types of key constraints: i) a single board-to-end-effector transformation, and ii) the relative camera-to-camera transformations. We demonstrate the superior performance of our method through comprehensive experiments employing the METRIC dataset and real-world data collected in industrial scenarios, showing notable advancements over state-of-the-art techniques even when using fewer than 10 images. Additionally, we release an open-source version of our multi-camera hand-eye calibration algorithm at https://github.com/davidea97/Multi-Camera-Hand-Eye-Calibration.git.
|
|
15:45-15:50, Paper TuCT12.7 | Add to My Program |
EdgeCalib: Multi-Frame Weighted Edge Features for Automatic Targetless LiDAR-Camera Calibration |
|
Li, Xingchen | University of Science and Technology of China |
Duan, Yifan | University of Science and Technology of China |
Wang, Beibei | Hefei Comprehensive National Science Center |
Ren, Haojie | University of Science and Technology of China |
You, Guoliang | University of Science and Technology of China |
Sheng, Yu | University of Science and Technology of China |
Ji, Jianmin | University of Science and Technology of China |
Zhang, Yanyong | University of Science and Technology of China |
Keywords: Calibration and Identification, Sensor Fusion, Multi-Modal Perception for HRI
Abstract: In multimodal perception systems, achieving precise extrinsic calibration between LiDAR and camera is of critical importance. However, the pre-calibrated extrinsic parameters may gradually drift during operation, leading to a decrease in the accuracy of the perception system. It is challenging to address this issue using methods based on artificial targets. In this article, we introduce an edge-based approach for automatic targetless calibration of LiDARs and cameras in real-world scenarios. The edge features, which are prevalent in various environments, are used to establish reliable correspondences in images and point clouds. Specifically, we leverage the Segment Anything Model to facilitate the extraction of stable and reliable image edge features. Then a multi-frame weighting strategy is used for feature filtering while alleviating the dependence on the environment. Finally, we estimate accurate extrinsic parameters based on edge correspondence constraints. Our method achieves a mean rotation error of 0.069° and a mean translation error of 1.037 cm on the KITTI dataset, outperforming existing edge-based calibration methods and demonstrating strong robustness, accuracy, and generalization capabilities.
|
|
TuCT13 Regular Session, 316 |
Add to My Program |
Radiance Fields for Manipulation |
|
|
Chair: Fazeli, Nima | University of Michigan |
Co-Chair: Shkurti, Florian | University of Toronto |
|
15:15-15:20, Paper TuCT13.1 | Add to My Program |
Gaussian Splatting Visual MPC for Granular Media Manipulation |
|
Tseng, Wei-Cheng | University of Toronto |
Zhang, Ellina | University of Toronto |
Jatavallabhula, Krishna Murthy | MIT |
Shkurti, Florian | University of Toronto |
Keywords: Manipulation Planning, AI-Based Methods, AI-Enabled Robotics
Abstract: Recent advancements in learned 3D representations have enabled significant progress in solving complex robotic manipulation tasks, particularly for rigid-body objects. However, manipulating granular materials such as beans, nuts, and rice remains challenging due to the intricate physics of particle interactions, the high-dimensional and partially observable state, the inability to visually track individual particles in a pile, and the computational demands of accurate dynamics prediction. Current deep latent dynamics models often struggle to generalize in granular material manipulation due to a lack of inductive biases. In this work, we propose a novel approach that learns a visual dynamics model over Gaussian splatting representations of scenes and leverages this model for manipulating granular media via Model-Predictive Control. Our method enables efficient optimization for complex manipulation tasks on piles of granular media. We evaluate our approach in both simulated and real-world settings, demonstrating its ability to solve unseen planning tasks and generalize to new environments in a zero-shot transfer. We also show significant prediction and manipulation performance improvements compared to existing granular media manipulation methods.
|
|
15:20-15:25, Paper TuCT13.2 | Add to My Program |
LE-Object: Language Embedded Object-Level Neural Radiance Fields for Open-Vocabulary Scene |
|
Wang, Mengting | Northeastern University |
Zhang, Yunzhou | Northeastern University |
Wang, Xingshuo | Northeastern University |
Zhang, Zhiyao | Northeastern University |
Li, Zhiteng | Northeastern University |
Keywords: Semantic Scene Understanding, Deep Learning Methods, Mapping
Abstract: Recent advancements in Visual Language Models (VLMs) have significantly driven research in open-vocabulary 3D scene reconstruction, showcasing strong potential in open-set retrieval and semantic understanding. However, existing approaches face challenges in open-world environments: they either suffer from insufficient precision in semantic segmentation, leading to inadequate fine-grained scene understanding, or they are limited to object-level reconstruction, failing to capture intricate object details and lack applicability in open-world settings. To address these issues, we introduce LE-Object, an object-centric Neural Implicit Radiance Field (NeRF) method designed for open-world scenarios, aimed at achieving fine-grained scene understanding and high-fidelity object reconstruction. LE-Object integrates spatial features (SF) from object point clouds with visual features (VF) from VLMs to perform object association, ensuring spatiotemporal consistency in object mask segmentation, and extends VLM features from 2D images into 3D space, enabling precise open-world semantic inference and detailed object reconstruction. Experimental results demonstrate that LE-Object excels in zero-shot semantic segmentation and open-world object reconstruction, offering innovative solutions for global navigation and local object manipulation in open-world applications.
|
|
15:25-15:30, Paper TuCT13.3 | Add to My Program |
TranSplat: Surface Embedding-Guided 3D Gaussian Splatting for Transparent Object Manipulation |
|
Kim, Jeongyun | SNU |
Noh, Jeongho | Seoul National University |
Lee, DongGuw | Seoul National University (SNU) |
Kim, Ayoung | Seoul National University |
Keywords: Perception for Grasping and Manipulation, Deep Learning for Visual Perception, Deep Learning in Grasping and Manipulation
Abstract: Transparent object manipulation remains a significant challenge in robotics due to the difficulty of acquiring accurate and dense depth measurements. Conventional depth sensors often fail with transparent objects, resulting in incomplete or erroneous depth data. Existing depth completion methods struggle with interframe consistency and incorrectly model transparent objects as Lambertian surfaces, leading to poor depth reconstruction. To address these challenges, we propose TranSplat, a surface embedding-guided 3D Gaussian Splatting method tailored for transparent objects. TranSplat uses a latent diffusion model to generate surface embeddings that provide consistent and continuous representations, making it robust to changes in viewpoint and lighting. By integrating these surface embeddings with input RGB images, TranSplat effectively captures the complexities of transparent surfaces, enhancing the splatting of 3D Gaussians and improving depth completion. Evaluations on synthetic and real-world transparent object benchmarks, as well as robot grasping tasks, show that TranSplat achieves accurate and dense depth completion, demonstrating its effectiveness in practical applications. We open-source our synthetic dataset and model: https://github.com/jeongyun0609/TranSplat
|
|
15:30-15:35, Paper TuCT13.4 | Add to My Program |
NeuGrasp: Generalizable Neural Surface Reconstruction with Background Priors for Material-Agnostic Object Grasp Detection |
|
Fan, Qingyu | University of Chinese Academy of Sciences |
Cai, Yinghao | Institute of Automation, Chinese Academy of Sciences |
Li, Chao | Qiyuan Lab |
He, Wenzhe | Chongqing University |
Zheng, Xudong | Qiyuan Lab |
Lu, Tao | Institute of Automation, Chinese Academy of Sciences |
Liang, Bin | Qiyuan Lab |
Wang, Shuo | Chinese Academy of Sciences |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Grasping
Abstract: Robotic grasping in cluttered environments with diverse materials, including transparent and specular surfaces, poses challenges for conventional depth-sensing methods. We introduce NeuGrasp, a neural surface reconstruction method that leverages background priors for material-agnostic grasp detection. NeuGrasp integrates transformers and global prior volumes to aggregate multi-view features with spatial encoding, enabling robust surface reconstruction even in highly narrow and sparse viewing conditions. Our innovative use of background priors enhances focus on foreground objects via residual feature enhancement and refines spatial perception with an occupancy-prior volume, particularly for transparent and specular objects. Extensive experiments in both simulated and real-world settings show NeuGrasp significantly outperforms state-of-the-art methods in grasping while maintaining comparable reconstruction quality. Moreover, NeuGrasp-RA (Reality Augmentation), a fine-tuned variant with small-scale real-world data, demonstrates strong domain adaptation, proving its robustness in practical scenarios.
|
|
15:35-15:40, Paper TuCT13.5 | Add to My Program |
Next Best Sense: Guiding Vision and Touch with FisherRF for 3D Gaussian Splatting |
|
Strong, Matthew | University of Colorado Boulder |
Lei, Boshu | University of Pennsylvania |
Swann, Aiden | Stanford |
Jiang, Wen | University of Pennsylvania |
Daniilidis, Kostas | University of Pennsylvania |
Kennedy, Monroe | Stanford University |
Keywords: Perception for Grasping and Manipulation, Reactive and Sensor-Based Planning, Semantic Scene Understanding
Abstract: We propose a framework for active next best view and touch selection for robotic manipulators using 3D Gaussian Splatting (3DGS). 3DGS is emerging as a useful explicit 3D scene representation for robotics, as it has the ability to represent scenes in a both photorealistic and geometrically accurate manner. However, in real-world, online robotic scenes where the number of views is limited given efficiency requirements, random view selection for 3DGS becomes impractical as views are often overlapping and redundant. We address this issue by proposing an end-to-end online training and active view selection pipeline, which enhances the performance of 3DGS in few-view robotics settings. We first elevate the performance of few-shot 3DGS with a novel semantic depth alignment method using Segment Anything Model 2 (SAM2) that we supplement with Pearson depth and surface normal loss to improve color and depth reconstruction of real-world scenes. We then extend FisherRF, a next-best-view selection method for 3DGS, to select views and touch poses based on depth uncertainty. We perform online view selection on a real robot system during live 3DGS training. We motivate our improvements to few-shot GS scenes, and extend depth-based FisherRF to them, where we demonstrate both qualitative and quantitative improvements on challenging robot scenes. For more information, please see our project page at https://arm.stanford.edu/next-best-sense.
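The underlying selection rule (pick the candidate view or touch pose whose rendered depth is most uncertain) can be summarized in a few lines. This is a simplified stand-in for the FisherRF information-gain criterion described above; the function names are hypothetical.

```python
import numpy as np

def next_best_view(candidate_views, render_depth_variance):
    """Pick the candidate whose rendered depth map is most uncertain.

    candidate_views        : list of candidate camera (or touch) poses
    render_depth_variance  : callable mapping a pose to a per-pixel depth-variance map
    """
    scores = [render_depth_variance(view).sum() for view in candidate_views]
    return candidate_views[int(np.argmax(scores))]
```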
|
|
15:40-15:45, Paper TuCT13.6 | Add to My Program |
Persistent Object Gaussian Splat (POGS) for Tracking Human and Robot Manipulation of Irregularly Shaped Objects |
|
Yu, Justin | University of California Berkeley |
Hari, Kush | UC Berkeley |
El-Refai, Karim | University of California, Berkeley |
Dalal, Arnav | University of California - Berkeley |
Kerr, Justin | University of California, Berkeley |
Kim, Chung Min | University of California, Berkeley |
Cheng, Richard | California Institute of Technology |
Irshad, Muhammad Zubair | Toyota Research Institute |
Goldberg, Ken | UC Berkeley |
Keywords: Perception for Grasping and Manipulation, Visual Tracking, Visual Servoing
Abstract: Tracking and manipulating irregularly-shaped, previously unseen objects in dynamic environments is important for robotic applications in manufacturing, assembly, and logistics. Recently introduced Gaussian Splats efficiently model object geometry, but lack persistent state estimation for task-oriented manipulation. We present Persistent Object Gaussian Splat (POGS), a system that embeds semantics, self-supervised visual features, and object grouping features into a compact representation that can be continuously updated to estimate the pose of scanned objects. POGS updates object states without requiring expensive rescanning or prior CAD models of objects. After an initial multi-view scene capture and training phase, POGS uses a single stereo camera to integrate depth estimates along with self-supervised vision encoder features for object pose estimation. POGS supports grasping, reorientation, and natural language-driven manipulation by refining object pose estimates, facilitating sequential object reset operations with human-induced object perturbations and tool servoing, where robots recover tool pose despite tool perturbations of up to 30°. POGS achieves up to 12 consecutive successful object resets and recovers from 80% of in-grasp tool perturbations.
|
|
15:45-15:50, Paper TuCT13.7 | Add to My Program |
Tactile Functasets: Neural Implicit Representations of Tactile Datasets |
|
Li, Sikai | University of Michigan |
Rodriguez, Samanta | University of Michigan - Ann Arbor |
Dou, Yiming | University of Michigan |
Owens, Andrew | University of Michigan |
Fazeli, Nima | University of Michigan |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Force and Tactile Sensing
Abstract: Modern incarnations of tactile sensors produce raw, high-dimensional data such as images, making it challenging to efficiently process and generalize across sensors. In this paper, we introduce a novel representation for tactile sensor feedback based on neural implicit functions. Rather than directly using raw tactile images, we propose neural implicit functions trained to reconstruct the tactile dataset, producing compact neural representations that capture the underlying structure of the sensory inputs. These representations offer several advantages over their raw counterparts: they are compact, enable probabilistically interpretable inference, and facilitate generalization across different sensors. We demonstrate the efficacy of this representation on the downstream task of in-hand object pose estimation, achieving improved performance over image-based methods while simplifying downstream models.
|
|
TuCT14 Regular Session, 402 |
Add to My Program |
Tracking and Prediction 2 |
|
|
Chair: Saska, Martin | Czech Technical University in Prague |
Co-Chair: Kyriakopoulos, Kostas | New York University - Abu Dhabi |
|
15:15-15:20, Paper TuCT14.1 | Add to My Program |
I2D-Loc++: Camera Pose Tracking in LiDAR Maps with Multi-View Motion Flows |
|
Yu, Huai | Wuhan University |
Chen, Kuangyi | Graz University of Technology |
Yang, Wen | Wuhan University |
Scherer, Sebastian | Carnegie Mellon University |
Xia, Gui-Song | Wuhan University |
Keywords: Localization, SLAM
Abstract: Camera localization in LiDAR maps has become increasingly popular due to its promising ability to handle complex scenarios, surpassing the limitations of visual-only localization methods. However, existing approaches mostly focus on addressing the cross-modal 2D-3D gaps while overlooking the relationship between adjacent image frames, which results in fluctuations and unreliability of camera poses. To alleviate this, we introduce a novel camera pose tracking framework in LiDAR maps by coupling the 2D-3D correspondences with 2D-2D feature matching (I2D-Loc++), which establishes the multi-view geometric constraints to improve localization stability and trajectory smoothness. Specifically, the framework consists of a front-end hybrid flow estimation network and a non-linear least square pose optimization module. We further design a cross-modal consistency loss to integrate the multi-view motion flows for the network training and the back-end pose optimization. The pose tracking model is trained on the KITTI odometry dataset, and tested on the KITTI odometry, Argoverse, Waymo and Lyft5 datasets, which demonstrates that I2D-Loc++ has superior performance and good generalization ability in improving the accuracy and robustness of camera pose tracking. Our code, pre-trained models, and online demos are available at https://github.com/EasonChen99/2D3DPoseTracking
|
|
15:20-15:25, Paper TuCT14.2 | Add to My Program |
LoFSORT: Sample Online and Real-Time Tracking in Low Frame Rate Scenarios |
|
Wang, Jiabao | Korea Advanced Institute of Science & Technology |
Chang, Dong Eui | KAIST |
Keywords: Visual Tracking, Computer Vision for Automation
Abstract: We propose a novel motion-based tracker specifically designed for tracking multiple people in low frame rate scenarios. While previous studies have predominantly focused on scenarios with high frame rates (exceeding 10 frames per second), tracking in low frame rate conditions is significant for robotic platforms with limited computational resources. Our tracker optimizes the cost function, cascade structure and Kalman filter correction to better adapt to the characteristics of low frame rate environments. First, we enhance the cost function by incorporating stable variables through the introduction of height-based and displacement-based cost terms. Second, we prioritize handling occlusion among individuals during association, which reduces ambiguity in subsequent tracking processes. Third, we utilize the error-compensated observation to correct the Kalman filter, thereby improving tracking accuracy. Experimental results demonstrate that our proposed tracker, LoFSORT, outperforms other motion model-based trackers across various frame rate scenarios. Ablation studies further confirm that each component of our tracker enhances tracking performance in low frame rate scenarios.
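To make the cost-function idea concrete, a motion-based tracker typically builds a detection-to-track cost matrix from several terms and solves the assignment with the Hungarian algorithm. The sketch below loosely echoes the height-based and displacement-based terms mentioned in the abstract but is a generic example, not LoFSORT itself; weights and term definitions are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(tracks, detections, w_iou=1.0, w_h=0.5, w_d=0.5):
    """Build a combined cost matrix and solve the detection-to-track assignment."""
    cost = np.zeros((len(tracks), len(detections)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            h_t, h_d = t[3] - t[1], d[3] - d[1]
            height_cost = abs(h_t - h_d) / max(h_t, h_d)            # height-based term
            ct = np.array([(t[0] + t[2]) / 2, (t[1] + t[3]) / 2])
            cd = np.array([(d[0] + d[2]) / 2, (d[1] + d[3]) / 2])
            disp_cost = np.linalg.norm(ct - cd) / max(h_t, h_d)     # displacement-based term
            cost[i, j] = w_iou * (1 - iou(t, d)) + w_h * height_cost + w_d * disp_cost
    return linear_sum_assignment(cost)                              # (track_idx, det_idx) pairs
```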
|
|
15:25-15:30, Paper TuCT14.3 | Add to My Program |
Multirotor Target Tracking through Policy Iteration for Visual Servoing |
|
Aspragkathos, Sotiris | SingularLogic S.A |
Rousseas, Panagiotis | National Technical University of Athens |
Karras, George | University of Thessaly |
Kyriakopoulos, Kostas | New York University - Abu Dhabi |
Keywords: Visual Servoing, Visual Tracking, Optimization and Optimal Control
Abstract: This paper presents a novel vision-based approach for tracking deformable contour targets using Unmanned Aerial Vehicles (UAVs) by combining an image-moments descriptor with a Policy Iteration scheme, ensuring stability and generalization of knowledge to new tasks. This computationally efficient and optimal control scheme is suitable for diverse dynamic environments such as the surveillance and tracking of targets with evolving features. Because the control sequence is generated from an offline, successively approximated policy, the online optimization process is less challenging. The proposed methodology is validated through extensive simulations and real-world experiments on environmental target surveillance using an octorotor UAV.
|
|
15:30-15:35, Paper TuCT14.4 | Add to My Program |
BiTrack: Bidirectional Offline 3D Multi-Object Tracking Using Camera-LiDAR Data |
|
Huang, Kemiao | Southern University of Science and Technology |
Chen, Yinqi | Southern University of Science and Technology |
Zhang, Meiying | Southern University of Science and Technology |
Hao, Qi | Southern University of Science and Technology |
Keywords: Visual Tracking, Sensor Fusion
Abstract: Compared with real-time multi-object tracking (MOT), offline multi-object tracking (OMOT) has the advantages to perform 2D-3D detection fusion, erroneous link correction, and full track optimization but has to deal with the challenges from bounding box misalignment and track evaluation, editing, and refinement. This paper proposes BiTrack, a 3D OMOT framework that includes modules of 2D-3D detection fusion, initial trajectory generation, and bidirectional trajectory re-optimization to achieve optimal tracking results from camera-LiDAR data. The novelty of this paper is threefold: (1) development of a point-level object registration technique that employs a density-based similarity metric to achieve accurate fusion of 2D-3D detection results; (2) development of a set of data association and track management skills that utilizes a vertex-based similarity metric as well as false alarm rejection and track recovery mechanisms to generate reliable bidirectional object trajectories; (3) development of a trajectory re-optimization scheme that re-organizes track fragments of different fidelities in a greedy fashion, as well as refines each trajectory with completion and smoothing techniques. The experiment results on the KITTI dataset demonstrate that BiTrack achieves the state-of-the-art performance for 3D OMOT tasks in terms of accuracy and efficiency.
|
|
15:35-15:40, Paper TuCT14.5 | Add to My Program |
ConTrack3D: Contrastive Learning Contributes Concise 3D Multi-Object Tracking |
|
Du, Ruibin | Fudan University |
Ding, Ziheng | Fudan University |
Zhang, Xiaze | Fudan University |
Wang, Zhuoyao | Fudan University |
Cheng, Ying | Fudan University |
Feng, Rui | Fudan University |
Keywords: Visual Tracking, Deep Learning for Visual Perception, Computer Vision for Automation
Abstract: Online object detection and tracking are crucial for embodied intelligence systems, including autonomous vehicles and robotics. Traditional approaches employ a pipeline structure to perform detection and tracking separately, which can not fully leverage the information from the detector. Moreover, most prior tracking methods rely on motion models such as constant velocity for state updates, which can lead to incorrect associations when the velocity estimates are inaccurate. To address these limitations, we propose ConTrack3D, an end-to-end framework that jointly performs detection and tracking in a fully online manner. Specifically, ConTrack3D incorporates a Joint Encoder module to capture detection embeddings and a Temporal Extender module for data-driven state updates. By employing contrastive learning, ConTrack3D learns discriminative tracking representations for more accurate associations. ConTrack3D is evaluated on the nuScenes benchmark, and the experimental results demonstrate its significant improvements in tracking performance.
|
|
15:40-15:45, Paper TuCT14.6 | Add to My Program |
LMH-MOT : A Light Multiple Hypothesis Framework for 3D Multi-Object Tracking |
|
Yuan, Tanghu | Tongji University |
Yang, Mengxiang | Northeastern University |
Keywords: Visual Tracking
Abstract: 3D multi-object tracking (3D MOT) is a key area in the field of autonomous driving. In tracking-by-detection systems, the detection results of deep learning models inevitably contain FP (false positives) and FN (false negatives), and the detector cannot always continuously and accurately detect targets in the presence of obstacle occlusion and sensor blind spots. The task of 3D MOT is to combine the discrete and disordered target detection results in time sequence into continuous and reliable tracks for use by downstream planning modules. At present, multi-target tracking algorithms in the field of autonomous driving are all based on a single hypothesis. In crowded scenarios, both false negatives (FN) and false positives (FP) increase significantly, making it difficult for single-hypothesis tracking algorithms to accurately output tracks. Towards this end, we propose LMH-MOT, a light multiple hypothesis framework for 3D MOT. Specifically, LMH-MOT effectively handles complex data association problems in autonomous driving scenarios by generating and maintaining multiple sets of hypotheses. Recognizing the possibility of switching between different motion states of the object, we use multiple motion models to more accurately estimate the motion state of the same object at the same time, and select the best estimation result for output. Additionally, we introduce a data association method based on decision trees, making full use of various features of the track and greatly reducing false matches and missing matches. To ensure the real-time performance of the entire algorithm framework, we also use Gibbs sampling to significantly reduce the computation time. On the NuScenes dataset, our proposed method achieves state-of-the-art performance with 76.2% AMOTA.
|
|
15:45-15:50, Paper TuCT14.7 | Add to My Program |
Towards Safe Mid-Air Drone Interception: Strategies for Tracking & Capture |
|
Pliska, Michal | Czech Technical University in Prague, Faculty of Electrical Engineering |
Vrba, Matous | Faculty of Electrical Engineering, Czech Technical University in Prague |
Baca, Tomas | Ceske Vysoke Uceni Technicke V Praze, FEL |
Saska, Martin | Czech Technical University in Prague |
Keywords: Aerial Systems: Perception and Autonomy, Reactive and Sensor-Based Planning, Field Robots
Abstract: A unique approach for the mid-air autonomous aerial interception of non-cooperating Unmanned Aerial Vehicles by a flying robot equipped with a net is presented in this paper. A novel interception guidance method dubbed Fast Response Proportional Navigation (FRPN) is proposed, designed to catch agile maneuvering targets while relying on onboard state estimation and tracking. The proposed method is compared with state-of-the-art approaches in simulations using 100 different trajectories of the target with varying complexity comprising almost 14 hours of flight data, and FRPN demonstrates the shortest response time and the highest number of interceptions, which are key parameters of agile interception. To enable robust transfer from theory and simulation to a real-world implementation, we aim to avoid overfitting to specific assumptions about the target and to tackle interception of a target following an unknown general trajectory. Furthermore, we identify several often overlooked problems related to tracking and estimation of the target's state that can have a significant influence on the overall performance of the system. We propose the use of a novel state estimation filter based on the Interacting Multiple Model filter and a new measurement model. Simulated experiments show that the proposed solution provides significant improvements in estimation accuracy over the commonly employed Kalman Filter approaches when considering general trajectories. Based on these results, we employ the proposed filtering and guidance methods to implement a complete autonomous interception system, which is thoroughly evaluated in realistic simulations and tested in real-world experiments with a maneuvering target, going far beyond the performance of any state-of-the-art solution.
|
|
TuCT15 Regular Session, 403 |
Add to My Program |
Robot Mapping 2 |
|
|
Chair: Ghaffari, Maani | University of Michigan |
Co-Chair: Guo, Yuliang | Bosch Research North America |
|
15:15-15:20, Paper TuCT15.1 | Add to My Program |
RISED: Accurate and Efficient RGB-Colorized Mapping Using Image Selection and Point Cloud Densification |
|
Jiang, Changjian | Zhejiang University |
Wang, Lijie | Zhejiang University |
Wan, Zeyu | Zhejiang University |
Gao, Ruilan | Zhejiang University |
Wang, Yue | Zhejiang University |
Xiong, Rong | Zhejiang University |
Zhang, Yu | Zhejiang University |
Keywords: Mapping, SLAM, Sensor Fusion
Abstract: Recent advances in robotics have underscored the critical role of colorized point clouds in enhancing environmental perception accuracy. However, conventional multi-sensor fusion Simultaneous Localization and Mapping (SLAM) systems typically employ all available images indiscriminately for point cloud colorization, resulting in suboptimal outcomes with blurred textures. Notably, achieving precise texture-to-geometry alignment remains a challenge despite the availability of accurate pose estimation. This study introduces RISED, an advanced colorized mapping system that tackles this challenge from two perspectives: projection accuracy and distribution uniformity. For projection accuracy, we analyze the influence of camera poses on colorization and carefully select the optimal viewpoint to minimize errors. Regarding distribution uniformity, point cloud densification is applied to eliminate LiDAR scanning traces. Furthermore, a novel evaluation method is introduced to provide comprehensive assessment of colorized point clouds, filling a gap in this field. Experimental results show that our method outperforms traditional approaches in RGB-colorized mapping. Specifically, our method achieves notable improvements in projection accuracy (55.2%), geometric accuracy (63.1%), and surface coverage (30.8%).
|
|
15:20-15:25, Paper TuCT15.2 | Add to My Program |
Modeling Uncertainty in 3D Gaussian Splatting through Continuous Semantic Splatting |
|
Wilson, Joseph | University of Michigan |
Almeida, Marcelino | Amazon Lab126 |
Mahajan, Sachit | Amazon |
Sun, Min | National Tsing Hua University |
Ghaffari, Maani | University of Michigan |
Ewen, Parker | University of Michigan |
Ghasemalizadeh, Omid | Amazon Lab126 |
Kuo, Cheng-Hao | Amazon |
Sen, Arnab | Amazon |
Keywords: Mapping, Probabilistic Inference, Deep Learning for Visual Perception
Abstract: In this paper, we present a novel algorithm for probabilistically updating and rasterizing semantic maps within 3D Gaussian Splatting (3D-GS). Although previous methods have introduced algorithms which learn to rasterize features in 3D-GS for enhanced scene understanding, 3D-GS can fail without warning which presents a challenge for safety-critical robotic applications. To address this gap, we propose a method which advances the literature of continuous semantic mapping from voxels to ellipsoids, combining the precise structure of 3D-GS with the ability to quantify uncertainty of probabilistic robotic maps. Given a set of images, our algorithm performs a probabilistic semantic update directly on the 3D ellipsoids to obtain an expectation and variance through the use of conjugate priors. We also propose a probabilistic rasterization which returns per-pixel segmentation predictions with quantifiable uncertainty. We compare our method with similar probabilistic voxel-based methods to verify our extension to 3D ellipsoids, and perform ablation studies on uncertainty quantification and temporal smoothing.
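The conjugate-prior update referred to above is, in the standard categorical/Dirichlet case, simply an accumulation of per-class evidence followed by reading off the posterior mean and variance. Below is that textbook update in a few lines; it is not the paper's per-ellipsoid implementation.

```python
import numpy as np

def dirichlet_update(alpha, class_counts):
    """Conjugate update for a categorical semantic label with a Dirichlet prior.

    alpha        : (K,) current Dirichlet concentration parameters
    class_counts : (K,) new per-class evidence (e.g. soft segmentation scores)
    Returns (alpha_new, mean, variance) of the class probabilities.
    """
    alpha_new = alpha + class_counts
    a0 = alpha_new.sum()
    mean = alpha_new / a0
    variance = alpha_new * (a0 - alpha_new) / (a0**2 * (a0 + 1))
    return alpha_new, mean, variance
```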
|
|
15:25-15:30, Paper TuCT15.3 | Add to My Program |
OG-Gaussian: Occupancy Based Street Gaussians for Autonomous Driving |
|
Shen, Yedong | University of Science & Technology of China |
Zhang, Xinran | University of Science and Technology of China |
Duan, Yifan | University of Science and Technology of China |
Zhang, Shiqi | USTC |
Li, Heng | University of Science and Technology of China |
Wu, Yilong | University of Science and Technology of China |
Ji, Jianmin | University of Science and Technology of China |
Zhang, Yanyong | University of Science and Technology of China |
Jin, Huiqing | National Center of Engineering and Technology for Vehicle Drivin |
Keywords: Mapping, Computer Vision for Automation
Abstract: Accurate and realistic 3D scene reconstruction enables the lifelike creation of autonomous driving simulation environments. With advancements in 3D Gaussian Splatting (3DGS), previous studies have applied it to reconstruct complex dynamic driving scenes. These methods typically require expensive LiDAR sensors and pre-annotated datasets of dynamic objects. To address these challenges, we propose OG-Gaussian, a novel approach that replaces LiDAR point clouds with Occupancy Grids (OGs) generated from surround-view camera images using Occupancy Prediction Network (ONet). Our method leverages the semantic information in OGs to separate dynamic vehicles from static street background, converting these grids into two distinct sets of initial point clouds for reconstructing both static and dynamic objects. Additionally, we estimate the trajectories and poses of dynamic objects through a learning-based approach, eliminating the need for complex manual annotations. Experiments on Waymo Open dataset demonstrate that OG-Gaussian is on par with the current state-of-the-art in terms of reconstruction quality and rendering speed, achieving an average PSNR of 35.13 and a rendering speed of 143 FPS, while significantly reducing computational costs and economic overhead.
|
|
15:30-15:35, Paper TuCT15.4 | Add to My Program |
SMART: Advancing Scalable Map Priors for Driving Topology Reasoning |
|
Ye, Junjie | University of Southern California |
Paz, David | University of California, San Diego |
Zhang, Hengyuan | University of California, San Diego |
Guo, Yuliang | Bosch Research North America |
Huang, Xinyu | Robert Bosch LLC |
Christensen, Henrik Iskov | UC San Diego |
Wang, Yue | USC |
Ren, Liu | Robert Bosch North America Research Technology Center |
Keywords: Mapping, Computer Vision for Transportation
Abstract: Topology reasoning is crucial for autonomous driving as it enables comprehensive understanding of connectivity and relationships between lanes and traffic elements. While recent approaches have shown success in perceiving driving topology using vehicle-mounted sensors, their scalability is hindered by the reliance on training data captured by consistent sensor configurations. We identify that the key factor in scalable lane perception and topology reasoning is the elimination of this sensor-dependent feature. To address this, we propose SMART, a scalable solution that leverages easily available standard-definition (SD) and satellite maps to learn a map prior model, supervised by large-scale geo-referenced high-definition (HD) maps independent of sensor settings. Attributing to scaled training, SMART alone achieves superior offline lane topology understanding using only SD and satellite inputs. Extensive experiments further demonstrate that SMART can be seamlessly integrated into any online topology reasoning method, yielding significant improvements by up to 28% on the OpenLane-V2 benchmark. Project page: https://jay-ye.github.io/smart.
|
|
15:35-15:40, Paper TuCT15.5 | Add to My Program |
DynORecon: Dynamic Object Reconstruction for Navigation |
|
Wang, Yiduo | University of Sydney |
Morris, Jesse | University of Sydney |
Wu, Lan | University of Technology Sydney |
Vidal-Calleja, Teresa A. | University of Technology Sydney |
Ila, Viorela | The University of Sydney |
Keywords: Mapping, Vision-Based Navigation, Motion and Path Planning
Abstract: This paper presents DynORecon, a Dynamic Object Reconstruction system that leverages the information provided by Dynamic SLAM to simultaneously generate a volumetric map of observed moving entities while estimating free space to support navigation. By capitalising on the motion estimations provided by Dynamic SLAM, DynORecon continuously refines the representation of dynamic objects to eliminate residual artefacts from past observations and incrementally reconstructs each object, seamlessly integrating new observations to capture previously unseen structures. Our system is highly efficient (∼20 FPS) and produces accurate (∼10 cm) object reconstructions using simulated and real-world outdoor datasets.
|
|
15:40-15:45, Paper TuCT15.6 | Add to My Program |
Ephemerality Meets LiDAR-Based Lifelong Mapping |
|
Gil, Hyeonjae | SNU |
Lee, Dongjae | Seoul National University |
Kim, Giseop | DGIST (Daegu Gyeongbuk Institute of Science and Technology) |
Kim, Ayoung | Seoul National University |
Keywords: Mapping, SLAM, Range Sensing
Abstract: Lifelong mapping is crucial for the long-term deployment of robots in dynamic environments. In this paper, we present ELite, an ephemerality-aided LiDAR-based lifelong mapping framework which can seamlessly align multiple session data, remove dynamic objects, and update maps in an end-to-end fashion. Map elements are typically classified as static or dynamic, but cases like parked cars indicate the need for more detailed categories than binary. Central to our approach is the probabilistic modeling of the world into two-stage ephemerality, which represents the transiency of points in the map over two different time scales. By leveraging the spatiotemporal context encoded in ephemeralities, ELite can accurately infer transient map elements, maintain a reliable up-to-date static map, and improve robustness in aligning the new data in a more fine-grained manner. Extensive real-world experiments on long-term datasets demonstrate the robustness and effectiveness of our system. The source code is publicly available for the robotics community: https://github.com/dongjae0107/ELite.
|
|
15:45-15:50, Paper TuCT15.7 | Add to My Program |
Addressing Diverging Training Costs Using BEVRestore for High-Resolution Bird's Eye View Map Construction |
|
Kim, Minsu | KAIST |
Kim, Giseop | DGIST (Daegu Gyeongbuk Institute of Science and Technology) |
Choi, Sunwook | NAVER LABS Corp |
Keywords: Sensor Fusion, Mapping, Range Sensing
Abstract: Recent advancements in Bird’s Eye View (BEV) fusion for map construction have demonstrated remarkable mapping of urban environments. However, their deep and bulky architecture incurs substantial amounts of backpropagation memory and computing latency. Consequently, the problem poses an unavoidable bottleneck in constructing high-resolution (HR) BEV maps, as their large-sized features cause significant increases in costs including GPU memory consumption and computing latency, termed the diverging training costs issue. Affected by the problem, most existing methods adopt low-resolution (LR) BEV and struggle to estimate the precise locations of urban scene components like road lanes and sidewalks. As this imprecision leads to risky motion planning, e.g., in collision avoidance, the diverging training costs issue has to be resolved. In this paper, we address the issue with our novel BEVRestore mechanism. Specifically, our proposed model encodes the features of each sensor to LR BEV space and restores them to HR space to establish a memory-efficient map constructor. To this end, we introduce the BEV restoration strategy, which repairs aliasing and blocky artifacts in the up-scaled BEV features, and narrows down the width of the labels. Our extensive experiments show that the proposed mechanism provides a plug-and-play, memory-efficient pipeline, enabling HR map construction with a broad BEV scope. Our code is available at https://github.com/minshu-kim/BEVRestore.
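The encode-in-LR-then-restore-to-HR pattern the abstract describes can be illustrated with a toy module: process features cheaply at low BEV resolution, upsample, and apply a small restoration layer to suppress upsampling artifacts. This is an illustrative sketch, not the BEVRestore architecture; layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LRThenRestore(nn.Module):
    """Toy illustration: encode in low-resolution BEV, then restore to high resolution."""

    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.scale = scale
        self.encoder = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.restore = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, bev_lr):
        feat = F.relu(self.encoder(bev_lr))                      # cheap processing in LR space
        feat = F.interpolate(feat, scale_factor=self.scale,
                             mode="bilinear", align_corners=False)
        return self.restore(feat)                                # smooth out upsampling artifacts
```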
|
|
TuCT16 Regular Session, 404 |
Add to My Program |
Manipulation 3 |
|
|
Chair: Bhirangi, Raunaq Mahesh | New York University |
Co-Chair: Desingh, Karthik | University of Minnesota |
|
15:15-15:20, Paper TuCT16.1 | Add to My Program |
Smaller and Faster Robotic Grasp Detection Model Via Knowledge Distillation and Unequal Feature Encoding |
|
Nie, Hong | Shanxi University |
Zhao, Zhou | Central China Normal University |
Chen, Lu | Shanxi University |
Lu, Zhenyu | South China University of Technology |
Li, Zhuomao | Shanxi University |
Yang, Jing | Shanxi University |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation
Abstract: In order to achieve higher accuracy, the complexity of grasp detection networks increases accordingly, with complicated model structures and tremendous numbers of parameters. Although various light-weight strategies have been adopted, directly designing a compact network can be sub-optimal, making it difficult to strike a balance between accuracy and model size. To solve this problem, we explore a more efficient grasp detection model from two aspects: elaborately designing a light-weight network and performing knowledge distillation on the designed network. Specifically, based on the designed light-weight backbone, the features from RGB and D images with unequal effective grasping information rates are fully utilized, and information compensation strategies are adopted to make the model small enough while maintaining its accuracy. Then, the grasping features contained in the large teacher model are adaptively and effectively learned by our proposed method via knowledge distillation. Experimental results indicate that the proposed method achieves performance comparable to more complicated models (98.9%, 93.1%, 82.3%, and 90.0% on the Cornell, Jacquard, GraspNet, and MultiObj datasets, respectively) while reducing the parameters from MBs to KBs. A real-world robotic grasping experiment on an embedded AI computing device also proves the effectiveness of this approach.
|
|
15:20-15:25, Paper TuCT16.2 | Add to My Program |
ViViDex: Learning Vision-Based Dexterous Manipulation from Human Videos |
|
Chen, Zerui | ENS Paris, France |
Chen, Shizhe | Inria |
Arlaud, Etienne | INRIA |
Laptev, Ivan | INRIA |
Schmid, Cordelia | Inria |
Keywords: Dexterous Manipulation, Learning from Demonstration, Sensor-based Control
Abstract: In this work, we aim to learn a unified vision-based policy for multi-fingered robot hands to manipulate a variety of objects in diverse poses. Though prior work has shown benefits of using human videos for policy learning, performance gains have been limited by the noise in estimated trajectories. Moreover, reliance on privileged object information such as ground-truth object states further limits the applicability in realistic scenarios. To address these limitations, we propose a new framework ViViDex to improve vision-based policy learning from human videos. It first uses reinforcement learning with trajectory guided rewards to train state-based policies for each video, obtaining both visually natural and physically plausible trajectories from the video. We then rollout successful episodes from state-based policies and train a unified visual policy without using any privileged information. We propose coordinate transformation to further enhance the visual point cloud representation, and compare behavior cloning and diffusion policy for the visual policy training. Experiments both in simulation and on the real robot demonstrate that ViViDex outperforms state-of-the-art approaches on three dexterous manipulation tasks.
|
|
15:25-15:30, Paper TuCT16.3 | Add to My Program |
HuDOR: Bridging the Human to Robot Dexterity Gap through Object-Oriented Rewards |
|
Guzey, Irmak | New York University |
Dai, Yinlong | NYU |
Savva, Georgy | New York University |
Bhirangi, Raunaq Mahesh | New York University |
Pinto, Lerrel | New York University |
Keywords: Dexterous Manipulation, Imitation Learning, Reinforcement Learning
Abstract: Training robots directly from human videos is an emerging area in robotics and computer vision. While there has been notable progress with two-fingered grippers, learning autonomous tasks without teleoperation remains a difficult problem for multi-fingered robot hands. A key reason for this difficulty is that a policy trained on human hands may not directly transfer to a robot hand with a different morphology. In this work, we present HuDOR, a technique that enables online finetuning of the policy by constructing a reward function from the human video. Importantly, this reward function is built using object-oriented rewards derived from off-the-shelf point trackers, which allows for meaningful learning signals even when the robot hand is in the visual observation, while the human hand is used to construct the reward. Given a single video of a human solving a task, such as gently opening a music box, HuDOR allows our four-fingered Allegro hand to learn this task with just 30 minutes of online interaction. Our experiments across four tasks show that HuDOR outperforms alternatives with an average of 4x improvement. Code and videos are available on our website: https://object-rewards.github.io/.
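One simple way to realize an object-oriented reward from off-the-shelf point trackers is to penalize the distance between the object's tracked points in the robot rollout and in the human reference video. The snippet below is a simplified stand-in for that idea, not HuDOR's actual reward; the alignment of trajectories and all names are assumptions.

```python
import numpy as np

def object_track_reward(robot_tracks, human_tracks):
    """Per-timestep reward from object point tracks.

    robot_tracks, human_tracks : (T, N, 2) pixel trajectories of N tracked object points,
                                 assumed to be time-aligned between rollout and reference.
    Returns a (T,) array of rewards (negative mean point-to-point distance per frame).
    """
    per_frame = np.linalg.norm(robot_tracks - human_tracks, axis=-1).mean(axis=-1)
    return -per_frame
```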
|
|
15:30-15:35, Paper TuCT16.4 | Add to My Program |
Hand-Object Interaction Pre-Training from Videos |
|
Singh, Himanshu Gaurav | University of California Berkeley |
Loquercio, Antonio | University of Pennsylvania |
Sferrazza, Carmelo | UC Berkeley |
Wu, Jane | University of California, Berkeley |
Qi, Haozhi | UC Berkeley |
Abbeel, Pieter | UC Berkeley |
Malik, Jitendra | UC Berkeley |
Keywords: Representation Learning, Learning from Demonstration, Dexterous Manipulation
Abstract: We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework to use in-the-wild videos to generate sensorimotor robot trajectories. We do so by lifting both the human hand and the manipulated object in a shared 3D space and retargeting human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy. This policy captures a general yet flexible manipulation prior. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to alternate approaches.
|
|
15:35-15:40, Paper TuCT16.5 | Add to My Program |
SuperQ-GRASP: Superquadrics-Based Grasp Pose Estimation on Larger Objects for Mobile-Manipulation |
|
Tu, Xun | University of Minnesota, Twin Cities |
Desingh, Karthik | University of Minnesota |
Keywords: Perception for Grasping and Manipulation, Grasping, RGB-D Perception
Abstract: Grasp planning and estimation have been a long-standing research problem in robotics, with two main approaches to find graspable poses on the objects: 1) the geometric approach, which relies on 3D models of objects and the gripper to estimate valid grasp poses, and 2) the data-driven, learning-based approach, with models trained to identify grasp poses from raw sensor observations. The latter assumes comprehensive geometric coverage during the training phase. However, the data-driven approach is typically biased toward tabletop scenarios and struggles to generalize to out-of-distribution scenarios with larger objects (e.g. a chair). Additionally, raw sensor data (e.g. RGB-D data) from a single view of these larger objects is often incomplete and necessitates additional observations. In this paper, we take a geometric approach, leveraging advancements in object modeling (e.g. NeRF) to build an implicit model by taking RGB images from views around the target object. This model enables the extraction of an explicit mesh model while also capturing the visual appearance from novel viewpoints that is useful for perception tasks like object detection and pose estimation. We further decompose the NeRF-reconstructed 3D mesh into superquadrics (SQs) - parametric geometric primitives, each mapped to a set of precomputed grasp poses, allowing grasp composition on the target object based on these primitives. Our proposed pipeline addresses: a) noisy depth and incomplete views of the object, via the modeling step, and b) generalization to objects of any size. For more qualitative results, refer to the supplementary video and webpage https://bit.ly/3ZrOanU.
|
|
15:40-15:45, Paper TuCT16.6 | Add to My Program |
Collaborative Motion Planning for Multi-Manipulator Systems through Reinforcement Learning and Dynamic Movement Primitives |
|
Singh, Siddharth | University of Virginia |
Xu, Tian | University of Virginia |
Chang, Qing | University of Virginia |
Keywords: Dual Arm Manipulation, Multi-Robot Systems, Manipulation Planning
Abstract: Robotic tasks often require multiple manipulators to enhance task efficiency and speed, but this increases complexity in terms of collaboration, collision avoidance, and the expanded state-action space. To address these challenges, we propose a multi-level approach combining Reinforcement Learning (RL) and Dynamic Movement Primitives (DMP) to generate adaptive, real-time trajectories for new tasks in dynamic environments using a demonstration library. This method ensures collision-free trajectory generation and efficient collaborative motion planning. We validate the approach through experiments in the PyBullet simulation environment with UR5e robotic manipulators.
|
|
TuCT17 Regular Session, 405 |
Add to My Program |
Soft Actuators 1 |
|
|
Chair: Blumenschein, Laura | Purdue University |
Co-Chair: La, Hung | University of Nevada at Reno |
|
15:15-15:20, Paper TuCT17.1 | Add to My Program |
Soft Robot Employing a Series of Pneumatic Actuators and Distributed Balloons: Modeling, Evaluation, and Applications |
|
Ho, Van | Japan Advanced Institute of Science and Technology |
Nguyen, Tuan | Japan Advanced Institute of Science and Technology |
Nguyen, Dinh | VNU University of Engineering and Technology |
Keywords: Soft Robot Materials and Design, Soft Robot Applications, Mechanism Design, Modeling, Control, and Learning for Soft Robots
Abstract: Tasks involving exploration and inspection of narrow environments demand a robot to have a flexible body. Such a robot is especially preferred if the integrity of its surroundings is crucial, as in endoscopy procedures. We propose the design of a small, self-propelled soft robot that can operate in a constrained environment. By periodic activation of a series of pneumatic actuators fabricated using a casting technique, sinusoidal locomotion is achieved. The wave-like locomotive strategy with an additional support mechanism enabled movement in multiple scenarios, including traveling horizontally and vertically in environments of different characteristics. Two analytical models are presented to highlight the design characteristics. The first predicts the velocity of the robot in relation to the working conditions, while the second calculates the force that the robot body exerts on its surroundings. Its mobility was tested in simple and complex routes under rigid and elastic environments. The resulting percent errors for the predictions of velocity and lateral force are 7.89% and 16.86%, respectively. In terms of performance, the robot can move horizontally in rigid tubes even if
|
|
15:20-15:25, Paper TuCT17.2 | Add to My Program |
Compliance Control with Dynamic and Self-Sensing Hydraulic Artificial Muscles for Wearable Assistive Devices |
|
Bibhu, Sharma | UNSW Sydney |
Emanuele, Nicotra | UNSW Sydney |
Davies, James J. | University of New South Wales |
Nguyen, Chi Cong | University of New South Wales |
Phan, Phuoc Thien | University of New South Wales |
Ji, Adrienne | University of New South Wales |
Zhu, Kefan | UNSW Sydney |
Wan, Jingjing | University of New South Wales |
Ngo, Trung Dung | University of Prince Edward Island |
La, Hung | University of Nevada at Reno |
Ho, Van | Japan Advanced Institute of Science and Technology |
Lovell, Nigel Hamilton | University of New South Wales |
Do, Thanh Nho | University of New South Wales |
Keywords: Soft Robot Applications, Physically Assistive Devices, Wearable Robotics
Abstract: While wearable robots that utilize intrinsically soft materials for actuation offer enhanced safety and biological compatibility, the challenges of sensing and control significantly affect their performance. The control problem in such systems is inherently complex, and the inclusion of ‘softness’ introduces additional nonlinearities, hysteresis, and uncertainties. Furthermore, the effectiveness of control strategies is highly dependent on sensor selection and integration, which presents its own challenges. Most robotic systems require separate sensors for control purposes. In this study, a new sensing and control scheme is introduced for soft wearable robots, leveraging the intrinsic soft-sensing capability of fluidic filament actuators without adding computational complexity. This method enables simultaneous sensing and actuation with 96% position accuracy, even under physical disturbances. The approach is demonstrated with a soft assistive device for elbow flexion/extension, achieving 70.5% tracking accuracy and a 0.09 s response delay to human intention, ensuring the system provides minimal resistance when assistance is not needed, while delivering the required support when necessary.
|
|
15:25-15:30, Paper TuCT17.3 | Add to My Program |
Braided Artificial Muscle with Programmable Body Morphing and Its Application to Elbow Joint Flexion |
|
Wu, Changchun | The University of Hong Kong |
Liu, Hao | The University of Hong Kong |
Lin, Senyuan | The University of Hong Kong |
Yuan, Wenbo | The University of Hong Kong |
Li, Yunquan | South China University of Technology |
Lam, James | University of Hong Kong |
Xi, Ning | The University of Hong Kong |
Chen, Yonghua | The University of Hong Kong |
Keywords: Soft Sensors and Actuators, Soft Robot Materials and Design, Soft Robot Applications
Abstract: For pneumatic artificial muscles (PAMs), a larger maximum contraction ratio is usually considered better. For human joint assistance applications, however, PAMs with a configurable maximum contraction ratio are more suitable because of their advantages in safety and adaptability. This work proposes a PAM based on a planar-to-specific-wave body shape morph. Shape-morphing-based braided artificial muscles (SBAMs) offer programmability of initial elasticity and maximum contraction ratio, which suits the requirements of human joint assistance applications. The basic structure and contraction mechanism of SBAMs are explained, and their mathematical model is established. According to the experimental results, a SBAM prototype generates a force of more than 140 times its weight under an easily accessible pressure of 150 kPa. A mannequin wearing the SBAM can actively flex its elbow by over 120°.
|
|
15:30-15:35, Paper TuCT17.4 | Add to My Program |
Physics-Informed Hybrid Modeling of Pneumatic Artificial Muscles |
|
Wang, Genmeng | Institut National Des Sciences Appliquees De Lyon |
Chalard, Rémi | Université D'Évry Paris-Saclay |
Jenny Alexandra, Cifuentes | Comillas Pontifical University |
Pham, Minh Tu | INSA Lyon (Institut National Des Sciences Appliquees) |
Keywords: Modeling, Control, and Learning for Soft Robots, Model Learning for Control, Calibration and Identification
Abstract: Pneumatic Artificial Muscles (PAMs) are complex nonlinear systems characterized by hysteresis, making them challenging to model with classical system identification methods. While deep learning has emerged as a powerful tool for modeling nonlinear systems from data, purely neural network-based models often lack interpretability and are prone to overfitting. To address these challenges, this study explores several hybrid approaches that combine analytical models with neural networks to model PAM behavior more effectively. The results demonstrate that hybrid models significantly outperform both purely analytical and black-box neural network models, particularly in terms of generalization and dynamic accuracy. Among the approaches, the Physics-Informed Neural Network (PINN) unsupervised model shows the most robust performance, capturing complex PAM dynamics while maintaining computational efficiency. These findings suggest that hybrid modeling is a promising and scalable solution for accurately representing the intricate behavior of PAMs.
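As a rough illustration of one hybrid-modeling variant in the spirit of the abstract above (a neural network learning the residual on top of an analytical model), the sketch below combines a placeholder quasi-static PAM force expression with a learned correction; the analytical term, class name, and input dimensions are assumptions, not the paper's actual models.

```python
import torch
import torch.nn as nn

class HybridPAM(nn.Module):
    """Analytical PAM force model plus a learned residual correction."""
    def __init__(self):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

    def analytical_force(self, pressure, contraction):
        # Placeholder quasi-static model: force rises with pressure and
        # falls with contraction ratio (illustrative, not the paper's model).
        return pressure * (1.0 - 3.0 * contraction ** 2)

    def forward(self, pressure, contraction):
        x = torch.stack([pressure, contraction], dim=-1)
        return self.analytical_force(pressure, contraction) + self.residual(x).squeeze(-1)

# Training would fit only the residual to measured force data, keeping the
# analytical term as a physics prior that constrains extrapolation.
```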
|
|
15:35-15:40, Paper TuCT17.5 | Add to My Program |
Anisotropic Stiffness and Programmable Actuation for Soft Robots Enabled by an Inflated Rotational Joint |
|
Wang, Sicheng | Purdue University |
Frias-Miranda, Eugenio | Purdue University |
Alvarez Valdivia, Antonio | Purdue University |
Blumenschein, Laura | Purdue University |
Keywords: Soft Robot Materials and Design, Mechanism Design, Modeling, Control, and Learning for Soft Robots
Abstract: Soft robots are known for their ability to perform tasks with great adaptability, enabled by their distributed, non-uniform stiffness and actuation. Bending is the most fundamental motion in soft robot design, but creating robust, easy-to-fabricate soft bending joints with tunable properties remains an active research problem. In this work, we demonstrate an inflatable actuation module for soft robots with a defined bending plane enabled by forced partial wrinkling. This lowers the structural stiffness in the bending direction, with the final stiffness easily designed by the ratio of wrinkled and unwrinkled regions. We present models and experimental characterization showing the stiffness properties of the actuation module, as well as its ability to maintain the kinematic constraint over a large range of loading conditions. We demonstrate the potential for complex actuation in a soft continuum robot and for decoupling actuation force and efficiency from load capacity. The module provides a novel method for embedding intelligent actuation into soft pneumatic robots.
|
|
15:40-15:45, Paper TuCT17.6 | Add to My Program |
Enhancement of Thin McKibben Muscle Durability under Repetitive Actuation in a Bent State |
|
Kobayashi, Ryota | Tokyo Institute of Technology |
Nabae, Hiroyuki | Institute of Science Tokyo |
Mao, Zebing | Yamaguchi University |
Endo, Gen | Institute of Science Tokyo |
Suzumori, Koichi | Tokyo Institute of Technology |
Keywords: Soft Sensors and Actuators, Soft Robot Applications
Abstract: The McKibben muscle can produce a high force-to-mass ratio, beneficial for various applications in the soft mechatronics field. The thin McKibben muscle, which has a small diameter, offers a high force-to-mass ratio and sufficient flexibility for use in a bent state. This flexibility permits the realization of flexible mechatronics. However, the thin McKibben muscle is easily broken in a bent state, while it is very durable in a straight state. Over repetitive operations, the fibers within the sleeve gradually shift, causing the rubber tube inside to protrude and ultimately leading to cracking. This study investigates improvements in the durability of artificial muscles using adhesives to prevent this fiber-to-fiber misalignment. The durability test showed that the adhesive could provide durability up to 10,000 times greater than that of a normal artificial muscle in the best case. Using thin McKibben muscles with the proposed method, tensegrity modules were fabricated. The durability test revealed a 500-fold increase under an applied pressure of 0.5 MPa. Furthermore, the durability of the adhesive-applied artificial muscles was also confirmed to be enhanced during the dynamic movements of a soft tensegrity robot that throws a ball at 0.7 MPa.
|
|
TuCT18 Regular Session, 406 |
Add to My Program |
Intelligent Transportation Systems |
|
|
Chair: Li, Jiachen | University of California, Riverside |
Co-Chair: Likhachev, Maxim | Carnegie Mellon University |
|
15:15-15:20, Paper TuCT18.1 | Add to My Program |
Camera-Based Online Vectorized HD Map Construction with Incomplete Observation |
|
Liu, Hui | Shandong University |
Chang, Faliang | Shandong University |
Liu, Chunsheng | Shandong University |
Lu, Yansha | Shandong University |
Liu, Minhang | Shandong University |
Keywords: Intelligent Transportation Systems, Semantic Scene Understanding, Deep Learning for Visual Perception
Abstract: Camera-based online map construction focuses on learning map elements from surround-view images. In contrast to previous methods that rely on complete observations, we explore a new map construction problem under incomplete observations, where one or more perspectives of the surround view are missing due to camera damage or occlusion. Incomplete observations lead to inferior performance and may even result in failure. Map construction based on incomplete observations faces two challenges: supplementing missing perspective features and reducing the complexity of high-dimensional feature learning. To address these issues, we propose a novel Panoramic Observation Prior Network (POP-Net). Firstly, based on an observation switch training mechanism, we propose a Panoramic Learning Module (PL-Module). It establishes a learnable panoramic feature space, facilitating the extraction of panoramic features from incomplete observations and thus supplementing missing perspective features. Secondly, based on a feature decomposition mechanism, we design a Panoramic Decomposition-Aggregation Operation (PDA-Operation), which decomposes high-dimensional panoramic features into low-dimensional local scene features. This allows limited local scene features to represent diverse panoramic features, alleviating the computational and memory burdens of high-dimensional feature learning. Experimental results demonstrate that our method surpasses existing approaches under incomplete observation scenarios.
|
|
15:20-15:25, Paper TuCT18.2 | Add to My Program |
Online Aggregation of Trajectory Predictors |
|
Tong, Alex | Harvard University |
Sharma, Apoorva | NVIDIA |
Veer, Sushant | NVIDIA |
Pavone, Marco | Stanford University |
Yang, Heng | Harvard University |
Keywords: Intelligent Transportation Systems, Autonomous Agents, Continual Learning
Abstract: Trajectory prediction, the task of forecasting future agent behavior from past data, is central to safe and efficient autonomous driving. A diverse set of methods (e.g., rule-based or learned with different architectures and datasets) have been proposed, yet it is often the case that the performance of these methods is sensitive to the deployment environment (e.g., how well the design rules model the environment, or how accurately the test data match the training data). Building upon the principled theory of online convex optimization, but also going beyond convexity and stationarity, we present a lightweight and model-agnostic method to aggregate different trajectory predictors online. We propose to treat each single trajectory predictor as an “expert” and maintain a probability vector to mix the outputs of different experts. Then, the key technical approach lies in leveraging online data (the true agent behavior revealed at the next time step) to form a convex-or-nonconvex, stationary-or-dynamic loss function whose gradient steers the probability vector towards choosing the best mixture of experts. We instantiate this method to aggregate trajectory predictors trained on different cities in the nuScenes dataset and show that it performs just as well, if not better than, any singular model, even when deployed on the Lyft dataset.
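To make the mixture update above concrete, here is a minimal sketch of one online aggregation step using an exponentiated-gradient (multiplicative-weights) update on the expert probability vector; the loss definition, array shapes, and step size are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def aggregate_step(weights, expert_preds, observed, eta=0.5):
    """One online update of the mixture over trajectory predictors.

    weights      : (K,) probability vector over experts
    expert_preds : (K, H, 2) predicted future positions per expert
    observed     : (H, 2) agent positions revealed at the next step
    """
    losses = np.linalg.norm(expert_preds - observed, axis=-1).mean(axis=-1)  # (K,)
    weights = weights * np.exp(-eta * losses)       # exponentiated-gradient step
    weights /= weights.sum()
    mixed_pred = np.einsum('k,khd->hd', weights, expert_preds)
    return weights, mixed_pred
```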
|
|
15:25-15:30, Paper TuCT18.3 | Add to My Program |
Gen-Drive: Enhancing Diffusion Generative Driving Policies with Reward Modeling and Reinforcement Learning Fine-Tuning |
|
Huang, Zhiyu | Nanyang Technological University |
Weng, Xinshuo | NVIDIA Corporation |
Igl, Maximilian | Waymo LLC |
Chen, Yuxiao | Nvidia Research |
Cao, Yulong | NVIDIA |
Ivanovic, Boris | NVIDIA |
Pavone, Marco | Stanford University |
Lv, Chen | Nanyang Technological University |
Keywords: Intelligent Transportation Systems, Autonomous Agents, AI-Based Methods
Abstract: Autonomous driving necessitates the ability to reason about future interactions between traffic agents and to make informed evaluations for planning. This paper introduces the Gen-Drive framework, which shifts from the traditional prediction and deterministic planning framework to a generation-then-evaluation planning paradigm. The framework employs a behavior diffusion model as a scene generator to produce diverse possible future scenarios, thereby enhancing the capability for joint interaction reasoning. To facilitate decision-making, we propose a scene evaluator (reward) model, trained with pairwise preference data collected through VLM assistance, thereby reducing human workload and enhancing scalability. Furthermore, we utilize an RL fine-tuning framework to improve the generation quality of the diffusion model, rendering it more effective for planning tasks. We conduct training and closed-loop planning tests on the nuPlan dataset, and the results demonstrate that employing such a generation-then-evaluation strategy outperforms other learning-based approaches. Additionally, the fine-tuned generative driving policy shows significant enhancements in planning performance. We further demonstrate that utilizing our learned reward model for evaluation or RL fine-tuning leads to better planning performance compared to relying on human-designed rewards. Project website: https://mczhi.github.io/GenDrive.
|
|
15:30-15:35, Paper TuCT18.4 | Add to My Program |
Optimizing Efficiency of Mixed Traffic through Reinforcement Learning: A Topology-Independent Approach and Benchmark |
|
Xiao, Chuyang | ShanghaiTech University |
Wang, Dawei | The University of Hong Kong |
Tang, Xinzheng | The University of Hong Kong |
Pan, Jia | University of Hong Kong |
Ma, Yuexin | ShanghaiTech University |
Keywords: Intelligent Transportation Systems, Autonomous Agents, Multi-Robot Systems
Abstract: This paper presents a mixed traffic control policy designed to optimize traffic efficiency across diverse road topologies, addressing issues of congestion prevalent in urban environments. A model-free reinforcement learning (RL) approach is developed to manage large-scale traffic flow, using data collected by autonomous vehicles to influence human-driven vehicles. A real-world mixed traffic control benchmark is also released, which includes 444 scenarios from 20 countries, representing a wide geographic distribution and covering a variety of scenarios and road topologies. This benchmark serves as a foundation for future research, providing a realistic simulation environment for the development of effective policies. Comprehensive experiments demonstrate the effectiveness and adaptability of the proposed method, achieving better performance than existing traffic control methods in both intersection and roundabout scenarios. To the best of our knowledge, this is the first project to introduce a real-world, complex-scenario mixed traffic control benchmark. Videos and code of our work are available at https://sites.google.com/berkeley.edu/mixedtrafficplus/home.
|
|
15:35-15:40, Paper TuCT18.5 | Add to My Program |
Internal-Stably Energy-Saving Cooperative Control of Articulated Wheeled Robot with Distributed Drive Units |
|
Yang, Yi | Beijing Institute of Technology |
Peng, Huishuai | Beijing Institute of Technology |
Hu, Zhexi | Beijing Institute of Technology |
Li, Haoyu | Beijing Institute of Technology |
Xie, Shanshan | Beijing Institute of Technology |
Keywords: Intelligent Transportation Systems, Motion Control, Wheeled Robots
Abstract: Articulated wheeled robots play a crucial role in the logistics industry. However, conventional tractor-driven articulated wheeled robots exhibit poor internal stability and are prone to jackknifing, while also consuming a significant amount of energy. By deploying distributed drives and coordinating control among multiple drives, these issues can be effectively addressed. However, the flexible connections between the bodies of articulated vehicles pose significant challenges to the coordinated control of distributed drives. This paper proposes a multi-drive unit coordinated control algorithm based on driving force equivalence and allocation. A neural network is used to predict the driving force, and through nonlinear driving force equivalence, a feedforward driving force is obtained. This is combined with a closed-loop feedback compensation controller to form a control architecture that integrates feedforward and feedback, resulting in the equivalent total driving force for the vehicle queue. Subsequently, an equivalent distribution strategy allocates the required driving force to each drive, enabling the vehicle bodies to achieve accurate and stable speed tracking while allowing each drive to operate near its efficient operating point, thereby reducing total energy consumption. Experiments demonstrate that our algorithm significantly lowers the total energy consumption of the vehicle queue under standard operating conditions while ensuring speed-tracking accuracy and improving internal stability.
|
|
15:40-15:45, Paper TuCT18.6 | Add to My Program |
Fast-Poly: A Fast Polyhedral Algorithm for 3D Multi-Object Tracking |
|
Li, Xiaoyu | Harbin Institute of Technology |
Liu, Dedong | Harbin Institute of Technology |
Wu, Yitao | Harbin Institute of Technology |
Wu, Xian | Harbin Institute of Technology |
Zhao, Lijun | Harbin Institute of Technology |
Gao, Jinghan | Harbin Institute of Technology |
Keywords: Intelligent Transportation Systems, Computer Vision for Transportation
Abstract: 3D Multi-Object Tracking (MOT) captures stable and comprehensive motion states of surrounding obstacles, essential for robotic perception. However, current 3D trackers face issues with accuracy and latency consistency. In this paper, we propose Fast-Poly, a fast and effective filter-based method for 3D MOT. Building upon our previous work Poly-MOT, Fast-Poly addresses object rotational anisotropy in 3D space, enhances local computation densification, and leverages parallelization techniques, improving inference speed and precision. Fast-Poly is extensively tested on two large-scale tracking benchmarks with a Python implementation. On the nuScenes dataset, Fast-Poly achieves new state-of-the-art performance with 75.8% AMOTA among all methods and can run at 34.2 FPS on a personal CPU. On the Waymo dataset, Fast-Poly exhibits competitive accuracy with 63.6% MOTA and impressive inference speed (35.5 FPS). The source code is publicly available at https://github.com/lixiaoyu2000/FastPoly.
|
|
15:45-15:50, Paper TuCT18.7 | Add to My Program |
Importance Sampling-Guided Meta-Training for Intelligent Agents in Highly Interactive Environments |
|
Arief, Mansur | Stanford University |
Timmerman, Mike | Stanford University |
Li, Jiachen | University of California, Riverside |
Isele, David | University of Pennsylvania, Honda Research Institute USA |
Kochenderfer, Mykel | Stanford University |
Keywords: Intelligent Transportation Systems, Reinforcement Learning, Planning under Uncertainty
Abstract: Training intelligent agents to navigate highly interactive environments presents significant challenges. While the guided meta reinforcement learning (RL) approach, which first trains a guiding policy to supervise training of the ego agent, has proven effective in improving generalizability across scenarios with various levels of interaction, the state-of-the-art method tends to be overly sensitive to extreme cases, impairing the agents' performance in the more common scenarios. This study introduces a novel training framework that integrates guided meta RL with importance sampling (IS) to optimize training distributions iteratively for navigating highly interactive driving scenarios, such as T-intersections or roundabouts. Unlike traditional methods that may underrepresent critical interactions or overemphasize extreme cases during training, our approach strategically adjusts the training distribution towards more challenging driving behaviors using IS proposal distributions and applies the importance ratio to de-bias the result. By estimating a naturalistic distribution from real-world datasets and employing a mixture model for iterative training refinements, the framework ensures a balanced focus across common and extreme driving scenarios. Experiments conducted with both synthetic and naturalistic datasets demonstrate both accelerated training and performance improvements under highly interactive driving tasks.
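For the importance-ratio de-biasing mentioned above, a minimal sketch of self-normalized importance-sampling re-weighting is shown below, assuming scenarios are drawn from an IS proposal while evaluation targets the naturalistic distribution; the density callables and the quantity being estimated are placeholders, not the paper's implementation.

```python
import numpy as np

def debiased_return(returns, scenarios, p_natural, q_proposal):
    """Estimate the naturalistic expected return from IS-sampled scenarios.

    returns    : (N,) episode returns collected under the proposal
    scenarios  : (N, d) sampled scenario parameters
    p_natural  : callable, density under the naturalistic distribution
    q_proposal : callable, density under the IS proposal
    """
    w = p_natural(scenarios) / q_proposal(scenarios)   # importance ratios
    w = w / w.sum()                                    # self-normalized IS
    return float(np.dot(w, returns))
```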
|
|
TuCT19 Regular Session, 407 |
Add to My Program |
Medical Robot Systems |
|
|
Chair: Webster III, Robert James | Vanderbilt University |
Co-Chair: Menciassi, Arianna | Scuola Superiore Sant'Anna - SSSA |
|
15:15-15:20, Paper TuCT19.1 | Add to My Program |
Design and Modeling of a Compact Spooling Mechanism for the COAST Guidewire Robot |
|
Brumfiel, Timothy A. | Georgia Institute of Technology |
Grinberg, Jared | Georgia Institute of Technology |
Siopongco, Betina | Georgia Institute of Technology |
Desai, Jaydev P. | Georgia Institute of Technology |
Keywords: Medical Robots and Systems, Mechanism Design, Tendon/Wire Mechanism
Abstract: Many intravascular procedures begin with a clinician manually placing a guidewire at the target lesion to aid in placing other devices. Manually steering the guidewire is challenging due to the lack of direct tip control and the high tortuosity of vessel structures, potentially resulting in vessel perforation or guidewire fracture. These challenges can be alleviated through the use of robotically steerable guidewires that can improve guidewire tip control, provide force feedback, and, similar to commercial guidewires, are inherently safe due to their compliant structure. However, robotic guidewires are not yet clinically viable due to small robot lengths or large actuation systems. In this paper, we develop a highly compact spooling mechanism for the COaxially Aligned STeerable (COAST) guidewire robot, capable of dispensing a clinically viable length of 1.5 m of the robotic guidewire. The mechanism utilizes a spool with several interior armatures to actuate each component of the COAST guidewire. The kinematics of the robotic guidewire are then modeled considering additional friction forces caused by interactions within the mechanism. The actuating mechanisms of the compact spooling mechanism are calibrated, and the kinematics of the guidewire are validated, resulting in an average curvature RMSE of 0.24 m⁻¹.
|
|
15:20-15:25, Paper TuCT19.2 | Add to My Program |
Motion-Guided Dual-Camera Tracker for Endoscope Tracking and Motion Analysis in a Mechanical Gastric Simulator |
|
Zhang, Yuelin | CUHK |
Yan, Kim | The Chinese University of Hong Kong |
Lam, Chun Ping | The Chinese University of Hong Kong |
Fang, Chengyu | Tsinghua University |
Xie, Wenxuan | The Chinese University of Hong Kong |
Qiu, Yufu | The Chinese University of HongKong |
Tang, Raymond Shing-Yan | The Chinese University of Hong Kong, Department of Medicine And |
Cheng, Shing Shin | The Chinese University of Hong Kong |
Keywords: Deep Learning Methods, Visual Tracking, Computer Vision for Medical Robotics
Abstract: Flexible endoscope motion tracking and analysis in mechanical simulators have proven useful for endoscopy training. Common motion tracking methods based on electromagnetic trackers are, however, limited by their high cost and material susceptibility. In this work, a motion-guided dual-camera vision tracker is proposed to provide robust and accurate tracking of the endoscope tip's 3D position. The tracker addresses several unique challenges of tracking a flexible endoscope tip inside a dynamic, life-sized mechanical simulator. To address appearance variation and keep dual-camera tracking consistent, a cross-camera mutual template strategy (CMT) is proposed, introducing dynamic transient mutual templates. To alleviate large occlusion and light-induced distortion, a Mamba-based motion-guided prediction head (MMH) is presented to aggregate historical motion with visual tracking. The proposed tracker achieves superior performance against state-of-the-art vision trackers, with 42% and 72% improvements over the second-best method in average error and maximum error, respectively. Further motion analysis involving novice and expert endoscopists also shows that the tip 3D motion provided by the proposed tracker enables more reliable motion analysis and more substantial differentiation between different expertise levels, compared with other trackers. Project page: https://github.com/PieceZhang/MotionDCTrack
|
|
15:25-15:30, Paper TuCT19.3 | Add to My Program |
A System for Endoscopic Submucosal Dissection Featuring Concentric Push-Pull Manipulators |
|
Connor, Peter | Vanderbilt University |
Hatch, Carter | University of Tennessee |
Dang, Khoa | University of Tennessee, Knoxville |
Qin, Tony | University of North Carolina at Chapel Hill |
Alterovitz, Ron | University of North Carolina at Chapel Hill |
Rucker, Caleb | University of Tennessee |
Webster III, Robert James | Vanderbilt University |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Flexible Robotics
Abstract: Endoscopic Submucosal Dissection (ESD) is an effective minimally invasive approach to removing colon cancer, yet it is underutilized, since it is challenging to learn and perform. To promote the adoption of ESD by making it easier, we propose a system in which two small, flexible robotic manipulators are delivered through a colonoscope. Our system differs from prior robotic systems aimed at this application in that our manipulators are small enough to fit through a clinically used colonoscope. By not re-engineering the colonoscope, we maintain overall system diameter at the current clinical gold standard, and streamline the path to eventual clinical deployment. Our concentric push-pull robot (CPPR) manipulators offer dexterity and simultaneously provide a conduit for grasper or cutting tool deployment. Each manipulator in our system consists of two push-pull tube pairs, and we describe how they are actuated. We describe for the first time our approach to compensating for undesirable CPPR tip motion induced by differences in the tubes' transmission stiffness. We also evaluate the workspace of the manipulators and demonstrate teleoperation in a point-touching experiment. Lastly, we demonstrate the ability of the system to resect tissue via ex vivo animal experiments.
|
|
15:30-15:35, Paper TuCT19.4 | Add to My Program |
Quantitative Evaluation of Curved BioPrinted Constructs of an in Situ Robotic System towards Treatment of Volumetric Muscle Loss |
|
Rezayof, Omid | University of Texas at Austin |
Huang, Xinyuan | The University of Texas at Austin |
Kamaraj, Meenakshi | Terasaki Institute for Biomedical Innovation, Los Angeles, Calif |
John, Johnson V. | Terasaki Institute for Biomedical Innovation |
Alambeigi, Farshid | University of Texas at Austin |
Keywords: Medical Robots and Systems, Robotics and Automation in Life Sciences, Hardware-Software Integration in Robotics
Abstract: Tissue engineering techniques and particularly in situ bioprinting using handheld devices and robotic systems have recently demonstrated promising outcomes to address volumetric muscle loss injuries. Nevertheless, these approaches suffer from insufficient printing precision and/or lack of quantitative analysis of the thickness and uniformity of bioprinted constructs (BPCs) - which are critical for ensuring cell viability and growth. To address these limitations, in this study, we present a framework for robotic bioprinting and complementary vision-based algorithms to quantitatively analyze thickness and uniformity of BPCs with curved geometries. The performance of the proposed robotic bioprinting and complementary algorithms has been thoroughly evaluated using various simulation and experimental studies on BPCs with constant and variable thicknesses. The results clearly demonstrate the remarkable and accurate performance of the proposed method in calculating the thickness and its variations along the geometry of the BPCs.
|
|
15:35-15:40, Paper TuCT19.5 | Add to My Program |
Design and Hysteresis Compensation of a Telerobotic System for Transesophageal Echocardiography |
|
Zhang, Xiu | Politecnico Di Milano |
Tamadon, Izadyar | University of Twente |
Fortuno Jara, Benjamín Ignacio | Politecnico Di Milano |
Cannizzaro, Vanessa | Politecnico Di Milano |
Peloso, Angela | Politecnico Di Milano |
Bicchi, Anna | Politecnico Di Milano |
Aliverti, Andrea | Politecnico Di Milano |
Votta, Emiliano | Politecnico Di Milano |
Menciassi, Arianna | Scuola Superiore Sant'Anna - SSSA |
De Momi, Elena | Politecnico Di Milano |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Tendon/Wire Mechanism
Abstract: Transesophageal echocardiography (TEE) plays an important role in diagnosing cardiac conditions such as valvular diseases and cardiac embolism, as well as in guiding various cardiac interventions. It provides detailed cardiac imaging by inserting a probe into the esophagus, which offers an unobstructed view of the heart’s chambers and valves. Addressing the operational challenges and the health risks to the sonographer associated with the manual procedure, a novel robotic TEE system is developed to teleoperate the TEE probe across all four degrees of freedom (4-DoFs). This actuation device features an easily assembled design for post-operative cleaning and sanitization. Moreover, the system enhances the precision of tip bending angles through an optimization technique for offline calibration of the actuation plane. The hysteresis effect inherent in the tendon-driven mechanism is characterized and compensated using a free-knots B-spline method and a look-up table. Experiments are conducted in a realistic human cardiovascular phantom for preclinical evaluation. Repeatability experiments validate the system’s robustness. Furthermore, compared with a piecewise linear model, the proposed method achieves high accuracy with a median bending angle error of less than 0.8°. The results demonstrate the system’s potential to significantly improve the autonomy of TEE in cardiac diagnostic and therapeutic procedures.
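As a simplified illustration of spline-based hysteresis compensation in the spirit of the abstract above, the sketch below fits a smoothing B-spline to placeholder calibration data and builds an inverse look-up table; it uses SciPy's fixed-smoothing splrep rather than the paper's free-knot spline, and the data, names, and monotonicity assumption are all illustrative.

```python
import numpy as np
from scipy.interpolate import splrep, splev

# Placeholder calibration data: commanded tendon displacement vs. measured
# tip bending angle for one loading branch of the hysteresis loop.
cmd_load = np.linspace(0.0, 10.0, 50)
angle_load = 8.0 * np.sqrt(cmd_load) + np.random.normal(0.0, 0.2, 50)

# Fit a smoothing B-spline to the loading branch (a stand-in for the
# free-knot spline used in the paper).
tck = splrep(cmd_load, angle_load, s=1.0)

# Build an inverse look-up table (assumes the calibrated branch is monotonic):
# desired bending angle -> tendon command.
angles = splev(cmd_load, tck)

def command_for(desired_angle):
    return float(np.interp(desired_angle, angles, cmd_load))
```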
|
|
15:40-15:45, Paper TuCT19.6 | Add to My Program |
A Magnetic Capsule Robot with an Exoskeleton to Withstand Esophageal Pressure and Delivery Drug in Stomach |
|
Liu, Ruomao | City University of Hong Kong |
Chen, Yujun | Tongji University |
Yin, Zhen | Tongji University |
Zhang, Jiachen | City University of Hong Kong |
Keywords: Soft Robot Applications, Compliant Joints and Mechanisms, Robot Safety
Abstract: Capsule medicine is one of the most widely used methods of drug delivery into the human digestive tract. Packaging drugs into capsules not only prevents contamination of the drug before reaching the destination, but also protects the digestive organs and respiratory tract from potential damage caused by drug reactions. After reaching the targeted digestive organs, the drugs in orally taken capsules usually can only be released passively. Most capsule robots that have been proposed to release drugs actively do not consider the compressive pressure they experience when passing through the esophagus, which could lead to premature drug release. This letter proposes a magnetic capsule robot that can withstand intra-esophageal pressure and also offers the advantages of active locomotion and on-demand drug release. The proposed robot consists of two permanent magnets, an exoskeleton, and a soft non-magnetic container. Thus, it can withstand intra-esophageal pressure as it passes through the esophagus. This capsule robot can enter the stomach for targeted drug release without leaking liquid drugs along the path. The behavior of the robot is controllable using an external magnetic field thanks to the ring-shaped magnets mounted on the robot's top and bottom sections. The non-magnetic drug container is not influenced by the external magnetic field during locomotion, preventing leakage. The experiments show that the proposed capsule robot is more relevant to real-world medical applications thanks to its unique capability to withstand esophageal pressure.
|
|
15:45-15:50, Paper TuCT19.7 | Add to My Program |
MINRob: A Large Force-Outputting Miniature Robot Based on a Triple-Magnet System |
|
Xiang, Yuxuan | City University of Hong Kong |
Liu, Ruomao | City University of Hong Kong |
Wei, Zihan | City University of Hong Kong |
Wang, Xinliang | City University of Hong Kong |
Kang, Weida | Harbin Institute of Technology, Shenzhen |
Wang, Min | City University of Hong Kong |
Liu, Jun | The University of Hong Kong |
Liang, Xudong | Harbin Institute of Technology, Shenzhen |
Zhang, Jiachen | City University of Hong Kong |
Keywords: Medical Robots and Systems, Mechanism Design, Mobile Manipulation, Force Control
Abstract: Magnetically actuated miniature robots are limited in their mechanical output capability, because magnetic forces decrease significantly with decreasing robot size and increasing actuation distance. Hence, the output force of these robots can hardly meet the demands of specific biomedical applications (e.g., tissue penetration). This article proposes a tetherless magnetic impact needle robot (MINRob) based on a triple-magnet system with reversible and repeatable magnetic collisions to overcome this constraint on output force. The working procedure of the proposed system is divided into several states, and a mathematical model is developed to predict and optimize the force output. Measured force values indicate a 10-fold increase compared with existing miniature robots that only utilize magnetic attractive force. Finally, MINRob is integrated with a teleoperation system, enabling remote and precise control of the robot's position and orientation. The triple-magnet system offers promising locomotion patterns and penetration capacity via the notably increased force output, showing great potential in robot-assisted tissue penetration in minimally invasive healthcare.
|
|
TuCT20 Regular Session, 408 |
Add to My Program |
Mechanism Design and Control |
|
|
Chair: Della Santina, Cosimo | TU Delft |
Co-Chair: Luo, Wenhao | University of Illinois Chicago |
|
15:15-15:20, Paper TuCT20.1 | Add to My Program |
Model-Free Safety Filter for Soft Robots: A Q-Learning Approach |
|
Sue, Guo Ning (Andrew) | Carnegie Mellon University |
Choudhary, Yogita | Carnegie Mellon University |
Desatnik, Richard | Carnegie Mellon University |
Majidi, Carmel | Carnegie Mellon University |
Dolan, John M. | Carnegie Mellon University |
Shi, Guanya | Carnegie Mellon University |
Keywords: Robot Safety, Reinforcement Learning, Modeling, Control, and Learning for Soft Robots
Abstract: Ensuring safety via safety filters in real-world robotics presents significant challenges, particularly when the system dynamics are complex or unavailable. To handle this issue, learning-based safety filters have recently gained popularity and can be classified as model-based or model-free methods. Existing model-based approaches require various assumptions on the system model (e.g., control-affine dynamics), which limits their application to complex systems, while existing model-free approaches need substantial modifications to standard RL algorithms and lack versatility. This paper proposes a simple, plug-and-play, and effective model-free safety filter learning framework. We introduce a novel reward formulation and use Q-learning to learn Q-value functions that safeguard arbitrary task-specific nominal policies by filtering out their potentially unsafe actions. Due to its model-free nature and simplicity, our framework can be seamlessly integrated with various RL algorithms. We validate the proposed approach through simulations on double integrator and Dubins car systems and demonstrate its effectiveness in real-world experiments with a soft robotic limb.
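A minimal sketch of the filtering step described above: a learned safety Q-function vets the nominal action and, if it looks unsafe, substitutes the safest candidate; the threshold, callable signature, and discrete fallback set are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def filter_action(q_safety, state, nominal_action, candidate_actions, threshold=0.0):
    """Override the nominal action when its learned safety Q-value is too low.

    q_safety          : callable (state, action) -> scalar safety value
    nominal_action    : action proposed by the task-specific policy
    candidate_actions : list of fallback actions to search over
    """
    if q_safety(state, nominal_action) >= threshold:
        return nominal_action                         # nominal action deemed safe
    values = [q_safety(state, a) for a in candidate_actions]
    return candidate_actions[int(np.argmax(values))]  # safest fallback action
```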
|
|
15:20-15:25, Paper TuCT20.2 | Add to My Program |
Reachability Analysis for Black-Box Dynamical Systems |
|
Chilakamarri, Vamsi Krishna | Indian Institute of Technology Madras |
Feng, Zeyuan | Stanford University |
Bansal, Somil | Stanford University |
Keywords: Robot Safety, Machine Learning for Robot Control, Optimization and Optimal Control
Abstract: Hamilton-Jacobi (HJ) reachability analysis is a powerful framework for ensuring safety and performance in autonomous systems. However, existing methods typically rely on a white-box dynamics model of the system, limiting their applicability in many practical robotics scenarios where only a black-box model of the system is available. In this work, we propose a novel reachability method to compute reachable sets and safe controllers for black-box dynamical systems. Our approach efficiently approximates the Hamiltonian function using samples from the black-box dynamics. This Hamiltonian is then used to solve the HJ Partial Differential Equation (PDE), providing the reachable set of the system. The proposed method can be applied to general nonlinear systems and can be seamlessly integrated with existing reachability toolboxes for white-box systems to extend their use to black-box systems. Through simulation studies on a black-box slip-wheel car and a quadruped robot, we demonstrate the effectiveness of our approach in accurately obtaining the reachable sets for black-box dynamical systems.
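To illustrate the sample-based Hamiltonian approximation described above, the sketch below estimates H(x, ∇V) by querying a black-box dynamics function at sampled controls; the signature and the choice of maximizing (rather than minimizing) over controls are assumptions made for illustration.

```python
import numpy as np

def sample_hamiltonian(black_box_dynamics, x, grad_V, control_samples):
    """Approximate H(x, dV/dx) = max_u <dV/dx, f(x, u)> from dynamics queries.

    black_box_dynamics : callable (x, u) -> xdot, queried as a black box
    grad_V             : spatial gradient of the value function at x
    control_samples    : (M, m) candidate controls drawn from the control set
    """
    values = [float(np.dot(grad_V, black_box_dynamics(x, u))) for u in control_samples]
    return max(values)   # use min(values) when the control minimizes the value
```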
|
|
15:25-15:30, Paper TuCT20.3 | Add to My Program |
SAFE-GIL: SAFEty Guided Imitation Learning for Robotic Systems |
|
Ciftci, Yusuf Umut | University of Southern California |
Chiu, Darren | University of Southern California |
Feng, Zeyuan | Stanford University |
Sukhatme, Gaurav | University of Southern California |
Bansal, Somil | Stanford University |
Keywords: Robot Safety, Machine Learning for Robot Control, Imitation Learning
Abstract: Behavior cloning (BC) is a widely used approach in imitation learning, where a robot learns a control policy by observing an expert supervisor. However, the learned policy can make errors and might lead to safety violations, which limits their utility in safety-critical robotics applications. While prior works have tried improving a BC policy via additional real or synthetic action labels, adversarial training, or runtime filtering, none of them explicitly focus on reducing the BC policy's safety violations during training time. We propose SAFE-GIL, a design-time method to learn safety-aware behavior cloning policies. SAFE-GIL deliberately injects adversarial disturbance in the system during data collection to guide the expert towards safety-critical states. This disturbance injection simulates potential policy errors that the system might encounter during the test time. By ensuring that training more closely replicates expert behavior in safety-critical states, our approach results in safer policies despite policy errors during the test time. We further develop a reachability-based method to compute this adversarial disturbance. We compare SAFE-GIL with various behavior cloning techniques and online safety-filtering methods in three domains: autonomous ground navigation, aircraft taxiing, and aerial navigation on a quadrotor testbed. Our method demonstrates a significant reduction in safety failures, particularly in low data regimes where the likelihood of learning errors, and therefore safety violations, is higher. See our website here: https://y-u-c.github.io/safegil/
|
|
15:30-15:35, Paper TuCT20.4 | Add to My Program |
Computationally and Sample Efficient Safe Reinforcement Learning Using Adaptive Conformal Prediction |
|
Zhou, Hao | University of Illinois Chicago |
Zhang, Yanze | University of Illinois Chicago |
Luo, Wenhao | University of Illinois Chicago |
Keywords: Robot Safety, Model Learning for Control, Integrated Planning and Learning
Abstract: Safety is a critical concern in learning-enabled autonomous systems, especially when deploying these systems in real-world scenarios. An important challenge is accurately quantifying the uncertainty of unknown models to generate provably safe control policies that facilitate the gathering of informative data, thereby achieving both safe and optimal policies. Additionally, the selection of the data-driven model can significantly impact both the real-time implementation and the uncertainty quantification process. In this paper, we propose a provably sample-efficient episodic safe learning framework that remains robust across various model choices with quantified uncertainty for online control tasks. Specifically, we first employ Quadrature Fourier Features (QFF) for kernel function approximation of Gaussian Processes (GPs) to enable efficient approximation of unknown dynamics. Then, Adaptive Conformal Prediction (ACP) is used to quantify the uncertainty from online observations and is combined with Control Barrier Functions (CBFs) to characterize uncertainty-aware safe control constraints under the learned dynamics. Finally, an optimism-based exploration strategy is integrated with ACP-based CBFs for safe exploration and near-optimal safe nonlinear control. Theoretical proofs and simulation results are provided to demonstrate the effectiveness and efficiency of the proposed framework.
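As a rough sketch of the adaptive conformal prediction step mentioned above, the snippet below applies the standard online miscoverage update and recomputes an uncertainty radius from past prediction errors; how this radius enters the CBF constraint is not shown, and the names, step size, and quantile construction are illustrative assumptions.

```python
import numpy as np

def acp_update(alpha_t, scores, new_score, target_alpha=0.1, gamma=0.05):
    """One adaptive conformal prediction step for an online uncertainty bound.

    alpha_t   : current (adapted) miscoverage level
    scores    : array of past nonconformity scores (e.g., prediction errors)
    new_score : score observed at the current step
    Returns the updated miscoverage level and the new uncertainty radius.
    """
    q = np.quantile(scores, 1.0 - np.clip(alpha_t, 1e-3, 1.0 - 1e-3))
    err = float(new_score > q)                        # 1 if the bound was violated
    alpha_next = alpha_t + gamma * (target_alpha - err)
    radius = np.quantile(scores, 1.0 - np.clip(alpha_next, 1e-3, 1.0 - 1e-3))
    return alpha_next, radius
```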
|
|
15:35-15:40, Paper TuCT20.5 | Add to My Program |
Guaranteed Reach-Avoid for Black-Box Systems through Narrow Gaps Via Neural Network Reachability |
|
Chung, Long Kiu | Georgia Institute of Technology |
Jung, Wonsuhk | Georgia Institute of Technology |
Pullabhotla, Srivatsank | Georgia Institute of Technology |
Shinde, Parth Kishor | Georgia Institute of Technology |
Sunil, Yadu Krishna | Georgia Institute of Technology |
Kota, Saihari | Georgia Institute of Technology |
Batista, Luis F. W. | Georgia Instutue of Technology and Universite De Lorraine |
Pradalier, Cedric | GeorgiaTech Lorraine |
Kousik, Shreyas | Georgia Institute of Technology |
Keywords: Robot Safety, Model Learning for Control, Motion Control
Abstract: In the classical reach-avoid problem, autonomous mobile robots are tasked to reach a goal while avoiding obstacles. However, it is difficult to provide guarantees on the robot's performance when the obstacles form a narrow gap and the robot is a black-box (i.e. the dynamics are not known analytically, but interacting with the system is cheap). To address this challenge, this paper presents NeuralPARC. The method extends the authors' prior Piecewise Affine Reach-avoid Computation (PARC) method to systems modeled by rectified linear unit (ReLU) neural networks, which are trained to represent parameterized trajectory data demonstrated by the robot. NeuralPARC computes the reachable set of the network while accounting for modeling error, and returns a set of states and parameters with which the black-box system is guaranteed to reach the goal and avoid obstacles. NeuralPARC is shown to outperform PARC, generating provably-safe extreme vehicle drift parking maneuvers in simulations and in real life on a model car, as well as enabling safety on an autonomous surface vehicle (ASV) subjected to large disturbances and controlled by a deep reinforcement learning (RL) policy.
|
|
15:40-15:45, Paper TuCT20.6 | Add to My Program |
RAIL: Reachability-Aided Imitation Learning for Safe Policy Execution |
|
Jung, Wonsuhk | Georgia Institute of Technology |
Anthony, Dennis | Georgia Institute of Technology |
Mishra, Utkarsh | Georgia Institute of Technology |
Ranawaka Arachchige, Nadun | Georgia Institute of Technology |
Bronars, Matthew | Carnegie Mellon University |
Xu, Danfei | Georgia Institute of Technology |
Kousik, Shreyas | Georgia Institute of Technology |
Keywords: Robot Safety, Imitation Learning, Motion and Path Planning
Abstract: Imitation learning (IL) has shown great success in learning complex robot manipulation tasks. However, there remains a need for practical safety methods to justify widespread deployment. In particular, it is important to certify that a system obeys hard constraints on unsafe behavior in settings when it is unacceptable to design a tradeoff between performance and safety via tuning the policy (i.e. soft constraints). This leads to the question, how does enforcing hard constraints impact the performance (meaning safely completing tasks) of an IL policy? To answer this question, this paper builds a reachability-based safety filter to enforce hard constraints on IL, which we call Reachability-Aided Imitation Learning (RAIL). Through evaluations with state-of-the-art IL policies in mobile robots and manipulation tasks, we make two key findings. First, the highest-performing policies are sometimes only so because they frequently violate constraints, and significantly lose performance under hard constraints. Second, surprisingly, hard constraints on the lower-performing policies can occasionally increase their ability to perform tasks safely. Finally, hardware evaluation confirms the method can operate in real time. More results can be found at our website: https://safe-robotics-lab-gt.github.io/rail/.
|
|
15:45-15:50, Paper TuCT20.7 | Add to My Program |
Safe Reinforcement Learning of Robot Trajectories in the Presence of Moving Obstacles |
|
Kiemel, Jonas | Karlsruhe Institute of Technology |
Righetti, Ludovic | New York University |
Kroeger, Torsten | Intrinsic Innovation LLC |
Asfour, Tamim | Karlsruhe Institute of Technology (KIT) |
Keywords: Reinforcement Learning, Robot Safety, Motion Control
Abstract: In this paper, we present an approach for learning collision-free robot trajectories in the presence of moving obstacles. As a first step, we train a backup policy to generate evasive movements from arbitrary initial robot states using model-free reinforcement learning. When learning policies for other tasks, the backup policy can be used to estimate the potential risk of a collision and to offer an alternative action if the estimated risk is considered too high. No matter which action is selected, our action space ensures that the kinematic limits of the robot joints are not violated. We analyze and evaluate two different methods for estimating the risk of a collision. A physics simulation performed in the background is computationally expensive but provides the best results in deterministic environments. If a data-based risk estimator is used instead, the computational effort is significantly reduced, but an additional source of error is introduced. For evaluation, we successfully learn a reaching task and a basketball task while keeping the risk of collisions low. The results demonstrate the effectiveness of our approach for deterministic and stochastic environments, including a human-robot scenario and a ball environment, where no state can be considered permanently safe. By conducting experiments with a real robot, we show that our approach can generate safe trajectories in real time.
|
|
TuCT21 Regular Session, 410 |
Add to My Program |
Reinforcement Learning 3 |
|
|
Chair: Jiang, Chao | University of Wyoming |
Co-Chair: Subosits, John | Toyota Research Institute |
|
15:15-15:20, Paper TuCT21.1 | Add to My Program |
Decision Making for Multi-Robot Fixture Planning Using Multi-Agent Reinforcement Learning (I) |
|
Canzini, Ethan | University of Sheffield |
Auledas-Noguera, Marc | University of Sheffield |
Pope, Simon A. | The University of Sheffield |
Tiwari, Ashutosh | University of Sheffield |
Keywords: Intelligent and Flexible Manufacturing, Multi-Robot Systems, Reinforcement Learning
Abstract: Within the realm of flexible manufacturing, fixture layout planning allows manufacturers to rapidly deploy optimal fixturing plans that can reduce the surface deformation that leads to crack propagation in components during manufacturing tasks. The role of fixture layout planning has evolved from being performed by experienced engineers to computational methods due to the number of possible configurations for components. Current optimisation methods commonly fall into sub-optimal positions due to the existence of local optima, while data-driven machine learning techniques rely on labelled training data that is costly to collect. In this paper, we present a framework for multi-agent reinforcement learning with team decision theory to find optimal fixturing plans for manufacturing tasks. We demonstrate our approach on two representative aerospace components with complex geometries across a set of drilling tasks, illustrating the capabilities of our method, and compare it against state-of-the-art methods, showing a 3-fold improvement in deformation control within tolerance bounds.
|
|
15:20-15:25, Paper TuCT21.2 | Add to My Program |
Reference-Free Formula Drift with Reinforcement Learning: From Driving Data to Tire Energy-Inspired, Real-World Policies |
|
Djeumou, Franck | University of Texas, Austin |
Thompson, Michael | Toyota Research Institute |
Suminaka, Makoto | Toyota Research Institute |
Subosits, John | Toyota Research Institute |
Keywords: Reinforcement Learning, Model Learning for Control, Planning under Uncertainty
Abstract: The skill to drift a car, i.e., to operate in a state of controlled oversteer like professional drivers, could give future autonomous cars maximum flexibility when they need to retain control in adverse conditions or avoid collisions. We investigate real-time drifting strategies that put the car where needed while bypassing expensive trajectory optimization. To this end, we design a reinforcement learning agent that builds on the concept of tire energy absorption to autonomously drift through changing and complex waypoint configurations while safely staying within track bounds. We achieve zero-shot deployment on the car by training the agent in a simulation environment built on top of a neural stochastic differential equation vehicle model learned from pre-collected driving data. Experiments on a Toyota GR Supra and Lexus LC 500 show that the agent is capable of drifting smoothly through varying waypoint configurations with tracking error as low as 10 cm while stably pushing the vehicles to sideslip angles of up to 63°.
|
|
15:25-15:30, Paper TuCT21.3 | Add to My Program |
FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning |
|
Hu, Jiaheng | UT Austin |
Hendrix, Rose | Allen Institute for AI |
Farhadi, Ali | University of Washington |
Kembhavi, Aniruddha | Allen Institute for AI |
Martín-Martín, Roberto | University of Texas at Austin |
Stone, Peter | University of Texas at Austin |
Zeng, Kuo-Hao | Allen Institute for AI |
Ehsani, Kiana | Allen Institute for Artificial Intelligence |
Keywords: Reinforcement Learning, Mobile Manipulation, Vision-Based Navigation
Abstract: In recent years, the Robotics field has initiated several efforts toward building generalist robot policies through large-scale multi-task Behavior Cloning (BC). However, direct deployments of these policies have led to unsatisfactory performance, where the policy struggles with unseen states and tasks. How can we break through the performance plateau of these models and elevate their capabilities to new heights? In this paper, we propose FLaRe, a large-scale RL fine-tuning framework that integrates robust pre-trained representations, large-scale training, and gradient stabilization techniques. Our method aligns pre-trained policies towards task completion, achieving state-of-the-art performance on both previously demonstrated and entirely novel tasks and embodiments. Specifically, on a set of long-horizon mobile manipulation tasks, FLaRe achieves an average success rate of 79.5%, with absolute improvements of +23.6% in simulation and +30.7% in real-world settings over prior state-of-the-art methods. By utilizing only sparse rewards, our approach can enable generalizing to new capabilities beyond the pretraining data with minimal human effort. Moreover, we demonstrate rapid adaptation to new embodiments and behaviors with less than a day of fine-tuning. Code and website at https://robot-flare.github.io/.
|
|
15:30-15:35, Paper TuCT21.4 | Add to My Program |
Suite-IN: Aggregating Motion Features from Apple Suite for Robust Inertial Navigation |
|
Sun, Lan | Shanghai Jiao Tong University |
Xia, Songpengcheng | Shanghai Jiao Tong University |
Deng, Junyuan | Shanghai Jiao Tong University |
Yang, Jiarui | Shanghai Jiao Tong University |
Lai, Zengyuan | Shanghai Jiao Tong University |
Wu, Qi | Shanghai Jiao Tong University |
Pei, Ling | Shanghai Jiao Tong University |
Keywords: Localization, Datasets for Human Motion, Sensor Fusion
Abstract: With the rapid development of wearable technology, devices like smartphones, smartwatches, and headphones equipped with IMUs have become essential for applications such as pedestrian positioning. However, traditional pedestrian dead reckoning (PDR) methods struggle with diverse motion patterns, while recent data-driven approaches, though improving accuracy, often lack robustness due to reliance on a single device. In our work, we attempt to enhance the positioning performance using the low-cost commodity IMUs embedded in the wearable devices. We propose a multi-device deep learning framework named Suite-IN, aggregating motion data from Apple Suite for inertial navigation. Motion data captured by sensors on different body parts contains both local and global motion information, making it essential to reduce the negative effects of localized movements and extract global motion representations from multiple devices. Our model innovatively introduces a contrastive learning module to disentangle motion-shared and motion-private latent representations, enhancing positioning accuracy. We validate our method on a self-collected dataset consisting of Apple Suite: iPhone, Apple Watch and Airpods, which supports a variety of movement patterns and flexible device configurations. Experimental results demonstrate that our approach outperforms state-of-the-art models while maintaining robustness across diverse sensor configurations.
|
|
15:35-15:40, Paper TuCT21.5 | Add to My Program |
Sample-Efficient Unsupervised Policy Cloning from Ensemble Self-Supervised Labeled Videos |
|
Liu, Xin | Institute of Automation, Chinese Academy of Sciences |
Chen, Yaran | Institute of Automation, Chinese Academy of Sciense |
Li, Haoran | Institute of Automation, Chinese Academy of Sciences |
Keywords: Computer Vision for Automation, Reinforcement Learning
Abstract: Current advanced policy learning methodologies have demonstrated the ability to develop expert-level strategies when provided with enough information. However, their requirements, including task-specific rewards, action-labeled expert trajectories, and huge numbers of environmental interactions, can be expensive or even unavailable in many scenarios. In contrast, humans can efficiently acquire skills within a few trials and errors by imitating easily accessible internet videos, in the absence of any other supervision. In this paper, we try to let machines replicate this efficient watching-and-learning process through Unsupervised Policy from Ensemble Self-supervised labeled Videos (UPESV), a novel framework to efficiently learn policies from action-free videos without rewards or any other expert supervision. UPESV trains a video labeling model to infer the expert actions in expert videos through several organically combined self-supervised tasks. Each task plays its part, and together they enable the model to make full use of both action-free videos and reward-free interactions for robust dynamics understanding and advanced action prediction. Simultaneously, UPESV clones a policy from the labeled expert videos, in turn collecting environmental interactions for the self-supervised tasks. After a sample-efficient, unsupervised, and iterative training process, UPESV obtains an advanced policy based on a robust video labeling model. Extensive experiments in sixteen challenging procedurally generated environments demonstrate that the proposed UPESV achieves state-of-the-art interaction-limited policy learning performance (outperforming five current advanced baselines on 12/16 tasks) without exposure to any supervision other than videos.
|
|
15:40-15:45, Paper TuCT21.6 | Add to My Program |
PrivilegedDreamer: Explicit Imagination of Privileged Information for Rapid Adaptation of Learned Policies |
|
Byrd, Morgan | Georgia Institute of Technology |
Crandell, Jackson | Georgia Institute of Technology |
Das, Mili | Georgia Institute of Technology |
Inman, Jessica | Georgia Tech Research Institute |
Wright, Robert | Georgia Tech Research Institute |
Ha, Sehoon | Georgia Institute of Technology |
Keywords: Reinforcement Learning
Abstract: Numerous real-world control problems involve dynamics and objectives affected by unobservable hidden parameters, ranging from autonomous driving to robotic manipulation, which cause performance degradation during sim-to-real transfer. To represent these kinds of domains, we adopt hidden-parameter Markov decision processes (HIP-MDPs), which model sequential decision problems where hidden variables parameterize transition and reward functions. Existing approaches, such as domain randomization, domain adaptation, and meta-learning, simply treat the effect of hidden parameters as additional variance and often struggle to effectively handle HIP-MDP problems, especially when the rewards are parameterized by hidden variables. We introduce PrivilegedDreamer, a model-based reinforcement learning framework that extends the existing model-based approach by incorporating an explicit parameter estimation module. PrivilegedDreamer features its novel dual recurrent architecture that explicitly estimates hidden parameters from limited historical data and enables us to condition the model, actor, and critic networks on these estimated parameters. Our empirical analysis on five diverse HIP-MDP tasks demonstrates that PrivilegedDreamer outperforms state-of-the-art model-based, model-free, and domain adaptation learning algorithms. Additionally, we conduct ablation studies to justify the inclusion of each component in the proposed architecture.
|
|
15:45-15:50, Paper TuCT21.7 | Add to My Program |
Dynamic Non-Prehensile Object Transport Via Model-Predictive Reinforcement Learning |
|
Jawale, Neel Anand | University of Washington |
Boots, Byron | University of Washington |
Sundaralingam, Balakumar | NVIDIA Corporation |
Bhardwaj, Mohak | University of Washington |
Keywords: Reinforcement Learning, Optimization and Optimal Control, Learning from Demonstration
Abstract: We investigate the problem of teaching a robot manipulator to perform dynamic non-prehensile object transport, also known as the ‘robot waiter’ task, from a limited set of real-world demonstrations. We propose an approach that combines batch reinforcement learning (RL) with model-predictive control (MPC) by pretraining an ensemble of value functions from demonstration data, and utilizing them online within an uncertainty-aware MPC scheme to ensure robustness to limited data coverage. Our approach is straightforward to integrate with off-the-shelf MPC frameworks and enables learning solely from task space demonstrations with sparsely labeled transitions, while leveraging MPC to ensure smooth joint space motions and constraint satisfaction. We validate the proposed approach through extensive simulated and real-world experiments on a Franka Panda robot performing the robot waiter task and demonstrate robust deployment of value functions learned from 50-100 demonstrations. Furthermore, our approach enables generalization to novel objects not seen during training and can improve upon suboptimal demonstrations. We believe that such a framework can reduce the burden of providing extensive demonstrations and facilitate rapid training of robot manipulators to perform non-prehensile manipulation tasks.
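The combination of pretrained value-function ensembles with uncertainty-aware MPC can be illustrated compactly. Below is a minimal PyTorch sketch of scoring MPC rollouts by the ensemble's mean value minus a disagreement penalty; the function name, the standard-deviation penalty form, and the beta weight are assumptions for illustration, not the paper's exact formulation.

import torch

def terminal_value_with_uncertainty(value_ensemble, terminal_states, beta=1.0):
    # value_ensemble: list of torch.nn.Module, each mapping (N, state_dim) -> (N, 1)
    # terminal_states: (N, state_dim) terminal states of N candidate MPC rollouts
    with torch.no_grad():
        values = torch.stack([v(terminal_states).squeeze(-1) for v in value_ensemble], dim=0)  # (E, N)
    mean = values.mean(dim=0)
    std = values.std(dim=0)   # disagreement grows where demonstration coverage is poor
    return mean - beta * std  # higher is better; plug in as the MPC terminal cost-to-go

Penalizing ensemble disagreement discourages rollouts that leave the coverage of the limited demonstration data, which is the robustness mechanism the abstract alludes to.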
|
|
TuCT22 Regular Session, 411 |
Add to My Program |
Deep Learning for Visual Perception 1 |
|
|
Chair: Chung, Jen Jen | The University of Queensland |
Co-Chair: Jenkins, Odest Chadwicke | University of Michigan |
|
15:15-15:20, Paper TuCT22.1 | Add to My Program |
Feature Extractor or Decision Maker: Rethinking the Role of Visual Encoders in Visuomotor Policies |
|
Wang, Ruiyu | KTH Royal Institute of Technology |
Zhuang, Zheyu | KTH Royal Institute of Technology |
Jin, Shutong | KTH Royal Institute of Technology |
Ingelhag, Nils | KTH Royal Institute of Technology |
Kragic, Danica | KTH |
Pokorny, Florian T. | KTH Royal Institute of Technology |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation, Deep Learning for Visual Perception
Abstract: An end-to-end (E2E) visuomotor policy is typically treated as a unified whole, but recent approaches using out-of-domain (OOD) data to pretrain the visual encoder have cleanly separated the visual encoder from the network, with the remainder referred to as the policy. We propose Visual Alignment Testing, an experimental framework designed to evaluate the validity of this functional separation. Our results indicate that in E2E-trained models, visual encoders actively contribute to decision-making resulting from motor data supervision, contradicting the assumed functional separation. In contrast, OOD-pretrained models, where encoders lack this capability, experience an average performance drop of 42% in our benchmark results, compared to the state-of-the-art performance achieved by E2E policies. We believe this initial exploration of visual encoders' role can provide a first step towards guiding future pretraining methods to address their decision-making ability, such as developing task-conditioned or context-aware encoders.
|
|
15:20-15:25, Paper TuCT22.2 | Add to My Program |
JRN-Geo: A Joint Perception Network Based on RGB and Normal Images for Cross-View Geo-Localization |
|
Zhou, Hongyu | Northeastern University |
Zhang, Yunzhou | Northeastern University |
Huang, Tingsong | University of Sheffield |
Ge, Fawei | Northeastern University |
Qi, Man | Northeastern University |
Zhang, Xichen | Northeastern University |
Zhang, Yizhong | Northeastern University |
Keywords: Representation Learning, Localization, Deep Learning for Visual Perception
Abstract: Cross-view geo-localization plays a critical role in Unmanned Aerial Vehicle (UAV) localization and navigation. However, significant challenges arise from the drastic viewpoint differences and appearance variations between images. Existing methods predominantly rely on semantic features from RGB images, often neglecting the importance of spatial structural information in capturing viewpoint-invariant features. To address this issue, we incorporate geometric structural information from normal images and introduce a Joint perception network to integrate RGB and Normal images (JRN-Geo). Our approach utilizes a dual-branch feature extraction framework, leveraging a Difference-Aware Fusion Module (DAFM) and Joint-Constrained Interaction Aggregation (JCIA) strategy to enable deep fusion and joint-constrained semantic and structural information representation. Furthermore, we propose a 3D geographic augmentation technique to generate potential viewpoint variation samples, enhancing the network’s ability to learn viewpoint-invariant features. Extensive experiments on the University-1652 and SUES-200 datasets validate the robustness of our method against complex viewpoint variations, achieving state-of-the-art performance.
|
|
15:25-15:30, Paper TuCT22.3 | Add to My Program |
U^2Frame: A Unified and Unsupervised Learning Framework for LiDAR-Based Loop Closing |
|
Yixin, Zhang | Sun Yat-Sen University |
Ao, Sheng | Sun Yat-Sen University |
Zhang, Ye | Sun Yat-Sen University |
Song, Zhuo | University of Chinese Academy of Sciences |
Qingyong, Hu | University of Oxford |
Chang, Tao | National University of Defense Technology |
Guo, Yulan | Sun Yat-Sen University |
Keywords: Deep Learning Methods, Localization, Mapping
Abstract: Loop closing is critically important in Simultaneous Localization and Mapping (SLAM) due to its ability to correct accumulated localization errors. However, existing methods are hindered by the difficulty of acquiring pose labels and the unreliability of ground truth data. In this paper, we propose U^2Frame, a unified and unsupervised learning framework for LiDAR-based loop closing. Specifically, the natural temporal-spatial correlation in point cloud sequences is first leveraged to supervise the network training, where temporally nearby scans are treated as positives and distant scans as negatives. A new neural architecture is then constructed to jointly learn highly discriminative local and global features for loop closure detection. Additionally, an effective candidate verification module that exploits high-order geometric information is presented to further filter out false loop closures and estimate precise poses. We extensively evaluate U^2Frame on multiple datasets according to two tasks derived from loop closing: place recognition and loop pose estimation. Comparative experiments demonstrate that our method outperforms existing state-of-the-art supervised techniques and has a strong generalization ability across unseen scenarios. Code will be released soon.
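The temporal-spatial self-supervision described above (nearby scans as positives, distant scans as negatives) can be mined directly from scan timestamps. The short NumPy sketch below illustrates the idea; the function name and the two time thresholds are illustrative assumptions, not the paper's settings.

import numpy as np

def mine_temporal_pairs(timestamps, pos_window=2.0, neg_window=30.0):
    # Scans recorded within pos_window seconds of each other are assumed to overlap (positives);
    # scans more than neg_window seconds apart are assumed not to (negatives).
    # Pairs in between are left out as ambiguous.
    t = np.asarray(timestamps)
    dt = np.abs(t[:, None] - t[None, :])           # (N, N) pairwise time gaps
    positives = np.argwhere((dt > 0) & (dt <= pos_window))
    negatives = np.argwhere(dt >= neg_window)
    return positives, negatives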
|
|
15:30-15:35, Paper TuCT22.4 | Add to My Program |
Self-Supervised Place Recognition by Refining Temporal and Featural Pseudo Labels from Panoramic Data |
|
Chen, Chao | New York University |
Cheng, Zegang | New York University |
Liu, Xinhao | New York University |
Li, Yiming | New York University |
Ding, Li | Amazon |
Wang, Ruoyu | New York University |
Feng, Chen | New York University |
Keywords: Deep Learning for Visual Perception, Recognition, Localization
Abstract: Visual place recognition (VPR) using deep networks has achieved state-of-the-art performance. However, most of them require a training set with ground truth sensor poses to obtain positive and negative samples of each observation’s spatial neighborhood for supervised learning. When such information is unavailable, temporal neighborhoods from a sequentially collected data stream could be exploited for self-supervised training, although we find its performance suboptimal. Inspired by noisy label learning, we propose a novel self-supervised framework named TF-VPR that uses temporal neighborhoods and learnable feature neighborhoods to discover unknown spatial neighborhoods. Our method follows an iterative training paradigm which alternates between: (1) representation learning with data augmentation, (2) positive set expansion to include the current feature space neighbors, and (3) positive set contraction via geometric verification. We conduct auto-labeling and generalization tests on both simulated and real datasets, with either RGB images or point clouds as inputs. The results show that our method outperforms self-supervised baselines in recall rate, robustness, and heading diversity, a novel metric we propose for VPR. Our code and datasets can be found at https://ai4ce.github.io/TF-VPR/.
|
|
15:35-15:40, Paper TuCT22.5 | Add to My Program |
AiSDF: Structure-Aware Neural Signed Distance Fields in Indoor Scenes |
|
Jang, Jaehoon | Ulsan National Institute of Science and Technology |
Lee, Inha | Ulsan National Institute of Science & Technology |
Kim, Minje | Ulsan National Institute of Science & Technology |
Joo, Kyungdon | UNIST |
Keywords: Deep Learning for Visual Perception, Mapping, Incremental Learning
Abstract: The indoor scenes we live in are often visually homogeneous or textureless, yet they inherently have structural forms and provide enough structural priors for 3D scene reconstruction. Motivated by this fact, we propose a structure-aware online signed distance field (SDF) reconstruction framework in indoor scenes, especially under the Atlanta world (AW) assumption. Thus, we dub this incremental SDF reconstruction for AW as AiSDF. Within the online framework, we infer the underlying Atlanta structure of a given scene and then estimate planar surfel regions supporting the Atlanta structure. This Atlanta-aware surfel representation provides an explicit planar map for a given scene. In addition, based on these Atlanta planar surfel regions, we adaptively sample and constrain the structural regularity in the SDF reconstruction, which enables us to improve the reconstruction quality by maintaining a high-level structure while enhancing the details of a given scene. We evaluate the proposed AiSDF on the ScanNet and ReplicaCAD datasets, where we demonstrate that the proposed framework is capable of reconstructing fine details of objects implicitly, as well as structures explicitly in room-scale scenes.
|
|
15:40-15:45, Paper TuCT22.6 | Add to My Program |
Targeted Hard Sample Synthesis Based on Estimated Pose and Occlusion Error for Improved Object Pose Estimation |
|
Li, Alan | University of Toronto |
Schoellig, Angela P. | TU Munich |
Keywords: Computer Vision for Automation, RGB-D Perception, Deep Learning for Visual Perception
Abstract: 6D object pose estimation is a fundamental component in robotics enabling efficient interaction with the environment. It is particularly challenging in bin-picking applications, where objects may be in difficult poses, and occlusion between objects of the same type can cause confusion even in well-trained models. We propose a novel method of hard example synthesis that is model-agnostic, using existing simulators and the modelling of pose error in both the camera-to-object view-sphere and occlusion space. Through evaluation of the model performance with respect to the distribution of object poses and occlusions, we discover regions of high error and generate realistic training samples to specifically target these regions. We demonstrate an improvement in correct detection rate of up to 20% across several ROBI-dataset objects, as well as across state-of-the-art pose estimation models.
|
|
15:45-15:50, Paper TuCT22.7 | Add to My Program |
Configurable Embodied Data Generation for Class-Agnostic RGB-D Video Segmentation |
|
Opipari, Anthony | University of Michigan |
Krishnan, Aravindhan | Amazon Lab126 |
Gayaka, Shreekant | Amazon |
Sun, Min | National Tsing Hua University |
Kuo, Cheng-Hao | Amazon |
Sen, Arnab | Amazon |
Jenkins, Odest Chadwicke | University of Michigan |
Keywords: Object Detection, Segmentation and Categorization, Data Sets for Robotic Vision, RGB-D Perception
Abstract: This paper presents a method for generating large-scale datasets to improve class-agnostic video segmentation across robots with different form factors. Specifically, we consider the question of whether video segmentation models trained on generic segmentation data could be more effective for particular robot platforms if robot embodiment is factored into the data generation process. To answer this question, a pipeline is formulated for using 3D reconstructions (e.g. from HM3DSem[1]) to generate segmented videos that are configurable based on a robot’s embodiment (e.g. sensor type, sensor placement, and illumination source). A resulting massive RGB-D video panoptic segmentation dataset (MVPd) is introduced for extensive benchmarking with foundation and video segmentation models, as well as to support embodiment-focused research in video segmentation. Our experimental findings demonstrate that using MVPd for finetuning can lead to performance improvements when transferring foundation models to certain robot embodiments, such as specific camera placements. These experiments also show that using 3D modalities (depth images and camera pose) can lead to improvements in video segmentation accuracy and consistency. Project page: https://topipari.com/projects/MVPd
|
|
TuCT23 Regular Session, 412 |
Add to My Program |
Autonomous Vehicle Perception 1 |
|
|
Chair: Laugier, Christian | INRIA |
Co-Chair: Zhao, Hang | Tsinghua University |
|
15:15-15:20, Paper TuCT23.1 | Add to My Program |
Characterizing and Optimizing the Tail Latency for Autonomous Vehicle Systems |
|
Liu, Haolan | University of California San Diego |
Wang, Zixuan | University of California San Diego |
Zhao, Jishen | UC San Diego |
Keywords: Engineering for Robotic Systems, Software-Hardware Integration for Robot Systems, Methods and Tools for Robot System Design
Abstract: Autonomous vehicle (AV) systems are envisioned to revolutionize our lives by providing safe, relaxing, and convenient ground transportation. To ensure safety, AV systems need to make timely driving decisions in response to complicated and highly dynamic real-world driving environments. We present a systematic study to understand the causes of tail latency in AV systems and their impact on safety. We empirically analyze the design of two open-source industrial AV systems, Baidu Apollo and Autoware. We explore how pipelined computation design (such as module dependency and execution patterns), traffic factors (the AV's surrounding environment), and system factors (such as cache contention) impact AV systems' tail latency. Inspired by these insights, we propose a set of systematic designs that lead to performance and safety improvements of up to 1.65× and 14×, respectively.
|
|
15:20-15:25, Paper TuCT23.2 | Add to My Program |
MORDA: A Synthetic Dataset to Facilitate Adaptation of Object Detectors to Unseen Real-Target Domain While Preserving Performance on Real-Source Domain |
|
Lim, Hojun | MORAI |
Yoo, Heecheol | MORAI |
Lee, Jinwoo | MORAI |
Jeon, Seungmin | MORAI Inc |
Jeon, Hyeongseok | MORAI Inc |
Keywords: Data Sets for Robotic Vision, Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization
Abstract: Deep neural network (DNN) based perception models are indispensable in the development of autonomous vehicles (AVs). However, their reliance on large-scale, high-quality data is broadly recognized as a burdensome necessity due to the substantial cost of data acquisition and labeling. Further, the issue is not a one-time concern as AVs might need a new dataset if they are to be deployed to another region (real-target domain) that the in-hand dataset within the real-source domain cannot incorporate. To mitigate this burden, we propose leveraging synthetic environments as an auxiliary domain where the characteristics of real domains are reproduced. This approach could enable indirect experience about the real-target domain in a time- and cost-effective manner. As a practical demonstration of our methodology, nuScenes and South Korea are employed to represent real-source and real-target domains, respectively. To that end, we construct digital twins for several regions of South Korea, and the data-acquisition framework of nuScenes is reproduced. Blending the aforementioned components within a simulator allows us to obtain a synthetic-fusion domain in which we forge our novel driving dataset, MORDA: Mixture Of Real-domain characteristics for synthetic-data-assisted Domain Adaptation. To verify the value of synthetic features that MORDA provides in learning about the driving environments of South Korea, 2D/3D detectors are trained solely on a combination of nuScenes and MORDA. Afterward, their performance is evaluated on the unforeseen real-world dataset (AI-Hub) collected in South Korea. Our experiments show that MORDA can significantly improve mean Average Precision (mAP) on the AI-Hub dataset while that on nuScenes is retained or slightly enhanced. Details on MORDA can be accessed at https://morda-e8d07e.gitlab.io.
|
|
15:25-15:30, Paper TuCT23.3 | Add to My Program |
Towards Latency-Aware 3D Streaming Perception for Autonomous Driving |
|
Peng, Jiaqi | Tsinghua University |
Wang, Tai | Shanghai AI Laboratory |
Pang, Jiangmiao | Shanghai AI Laboratory |
Shen, Yuan | Tsinghua University |
Keywords: Deep Learning for Visual Perception, Sensor Fusion
Abstract: Although existing 3D perception algorithms have demonstrated significant improvements in performance, their deployment on edge devices continues to encounter critical challenges due to substantial runtime latency. We propose a new benchmark tailored for online evaluation by considering runtime latency. Based on the benchmark, we build a Latency-Aware 3D Streaming Perception (LASP) framework that addresses the latency issue through two primary components: 1) latency-aware history integration, which extends query propagation into a continuous process, ensuring the integration of historical data regardless of varying latency; 2) latency-aware predictive detection, a mechanism that compensates the detection results with the predicted trajectory and the posterior accessed latency. By incorporating the latency-aware mechanism, our method shows generalization across various latency levels, achieving an online performance that closely aligns with 80% of its offline evaluation on the Jetson AGX Orin without any acceleration techniques.
|
|
15:30-15:35, Paper TuCT23.4 | Add to My Program |
DRIVE: Dependable Robust Interpretable Visionary Ensemble Framework in Autonomous Driving |
|
Lai, Songning | The Hong Kong University of Science and Technology (Guangzhou) |
Xue, Tianlang | Hong Kong University of Science and Technology (Guangzhou) |
Xiao, Hongru | Tongji University |
Hu, Lijie | KAUST |
Wu, Jiemin | Hong Kong University of Science and Technology (Guangzhou) |
Feng, Ninghui | The Hong Kong University of Science and Technology (Guangzhou) |
Guan, Runwei | University of Liverpool |
Haicheng, Liao | University of Macau |
Li, Zhenning | University of Macau |
Yue, Yutao | Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Deep Learning for Visual Perception, Visual Learning, Computer Vision for Transportation
Abstract: Recent advancements in autonomous driving have seen a paradigm shift towards end-to-end learning, which maps sensory inputs directly to driving actions, thereby enhancing the robustness and adaptability of autonomous vehicles. However, these models often sacrifice interpretability, posing significant challenges to trust, safety, and regulatory compliance. To address these issues, we introduce DRIVE -- Dependable Robust Interpretable Visionary Ensemble Framework in Autonomous Driving, a comprehensive framework designed to improve the dependability and stability of explanations in end-to-end unsupervised autonomous driving models. Our work specifically targets the inherent instability problems observed in the Driving through the Concept Gridlock (DCG) model, which undermine the trustworthiness of its explanations and decision-making processes. We define four key attributes of DRIVE: consistent interpretability, stable interpretability, consistent output, and stable output. These attributes collectively ensure that explanations remain reliable and robust across different scenarios and perturbations. Through extensive empirical evaluations, we demonstrate the effectiveness of our framework in enhancing the stability and dependability of explanations, thereby addressing the limitations of current models. Our contributions include an in-depth analysis of the dependability issues within the DCG model, a rigorous definition of DRIVE with its fundamental properties, a framework to implement DRIVE, and novel metrics for evaluating the dependability of concept-based explainable autonomous driving models. These advancements lay the groundwork for the development of more reliable and trusted autonomous driving systems, paving the way for their broader acceptance and deployment in real-world applications.
|
|
15:35-15:40, Paper TuCT23.5 | Add to My Program |
Dur360BEV: A Real-World 360-Degree Single Camera Dataset and Benchmark for Bird-Eye View Mapping in Autonomous Driving |
|
E, Wenke | Durham University |
Yuan, Chao | Durham University |
Li, Li | Durham University |
Sun, Yixin | Durham University |
A. Gaus, Yona Falinie | Durham University |
Atapour-Abarghouei, Amir | Durham University |
Breckon, Toby | Durham University |
Keywords: Data Sets for Robotic Vision, Omnidirectional Vision, Deep Learning for Visual Perception
Abstract: We present Dur360BEV, a novel spherical camera autonomous driving dataset equipped with a high-resolution 128-channel 3D LiDAR and an RTK-refined GNSS/INS system, along with a benchmark architecture designed to generate Bird-Eye-View (BEV) maps using only a single spherical camera. This dataset and benchmark address the challenges of BEV generation in autonomous driving, particularly by reducing hardware complexity through the use of a single 360-degree camera instead of multiple perspective cameras. Within our benchmark architecture, we propose a novel spherical-image-to-BEV module that leverages spherical imagery and a refined sampling strategy to project features from 2D to 3D. Our approach also includes an innovative application of focal loss, specifically adapted to address the extreme class imbalance often encountered in BEV segmentation tasks, that demonstrates improved segmentation performance on the Dur360BEV dataset. The results show that our benchmark not only simplifies the sensor setup but also achieves competitive performance. Code + Dataset: https://github.com/Tom-E-Durham/Dur360BEV
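The abstract mentions a focal loss adapted to the extreme class imbalance of BEV segmentation. For reference, a standard binary focal loss in PyTorch is sketched below; the alpha and gamma defaults are common values, not the paper's tuned settings or its specific adaptation.

import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # logits:  (B, H, W) raw scores for each BEV cell
    # targets: (B, H, W) float {0., 1.} ground-truth occupancy masks
    # Down-weights easy, well-classified cells so the rare occupied cells dominate the gradient.
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()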
|
|
15:40-15:45, Paper TuCT23.6 | Add to My Program |
MVCTrack: Boosting 3D Point Cloud Tracking Via Multimodal-Guided Virtual Cues |
|
Hu, Zhaofeng | Stony Brook University |
Zhou, Sifan | Southeast University |
Yuan, Zhihang | Houmo AI |
Yang, Dawei | Houmo |
Zhao, Shibo | Carnegie Mellon University |
Liang, Ci-Jyun | Stony Brook University |
Keywords: Visual Tracking, Human Detection and Tracking, Sensor Fusion
Abstract: 3D single object tracking plays a crucial role in autonomous driving and robotics. Existing methods often struggle with sparse and incomplete point cloud scenarios. To overcome these limitations, we propose a Multimodal-guided Virtual Cue Projection (MVCP) scheme to generate virtual cues for sparse point clouds. Furthermore, we construct an enhanced tracker called MVCTrack based on the generated virtual cues. Specifically, the MVCP scheme seamlessly integrates RGB sensors into LiDAR-based systems, leveraging a set of 2D detections to generate dense 3D virtual points that enhance the originally sparse 3D point cloud. These virtual points can naturally integrate with existing LiDAR-based 3D detectors, resulting in significant performance improvements. Extensive experiments demonstrate that our method achieves competitive performance on the NuScenes dataset. Code is available at https://github.com/StiphyJay/MVCTrack
|
|
15:45-15:50, Paper TuCT23.7 | Add to My Program |
Chameleon: Fast-Slow Neuro-Symbolic Lane Topology Extraction |
|
Zhang, Zongzheng | Tsinghua University |
Li, Xinrun | Newcastle University |
Zou, Sizhe | Beijing Jiaotong University |
Chi, Guoxuan | Tsinghua University |
Li, Siqi | Zhejiang University |
Qiu, Xuchong | Bosch |
Wang, Guoliang | Institute for AI Industry Research (AIR), Tsinghua University |
Zheng, Guantian | Huazhong University of Science and Technology |
Wang, LeiChen | Robert Bosch CN |
Zhao, Hang | Tsinghua University |
Zhao, Hao | Tsinghua University |
Keywords: Cognitive Modeling, Object Detection, Segmentation and Categorization
Abstract: Lane topology extraction involves detecting lanes and traffic elements and determining their relationships, a key perception task for mapless autonomous driving. This task requires complex reasoning, such as determining whether it is possible to turn left into a specific lane. To address this challenge, we introduce neuro-symbolic methods powered by vision-language foundation models (VLMs). Existing approaches have notable limitations: (1) Dense visual prompting with VLMs can achieve strong performance but is costly in terms of both financial resources and carbon footprint, making it impractical for robotics applications. (2) Neuro-symbolic reasoning methods for 3D scene understanding fail to integrate visual inputs when synthesizing programs, making them ineffective in handling complex corner cases. To this end, we propose a fast-slow neuro-symbolic lane topology extraction algorithm, named Chameleon, which alternates between a fast system that directly reasons over detected instances using synthesized programs and a slow system that utilizes a VLM with a chain-of-thought design to handle corner cases. Chameleon leverages the strengths of both approaches, providing an affordable solution while maintaining high performance. We evaluate the method on the OpenLane-v2 dataset, showing consistent improvements across various baseline detectors. Our code, data, and models are publicly available at https://github.com/XR-Lee/neural-symbolic.
|
|
TuCT24 Regular Session, 401 |
Add to My Program |
Novel Sensors |
|
|
Chair: Sinapov, Jivko | Tufts University |
Co-Chair: Draelos, Mark | University of Michigan |
|
15:15-15:20, Paper TuCT24.1 | Add to My Program |
ERetinex: Event Camera Meets Retinex Theory for Low-Light Image Enhancement |
|
Guo, Xuejian | Xi'an Jiaotong University |
Tian, Zhiqiang | Xi'an Jiaotong University |
Wang, Yuehang | Jilin University |
Li, Siqi | Tsinghua University |
Jiang, Yu | Jilin University |
Du, Shaoyi | Xi'an Jiaotong University |
Gao, Yue | Tsinghua University |
Keywords: Deep Learning for Visual Perception, Visual Learning, Sensor Fusion
Abstract: Low-light image enhancement aims to restore the under-exposure image captured in dark scenarios. Under such scenarios, traditional frame-based cameras may fail to capture the structure and color information due to the exposure time limitation. Event cameras are bio-inspired vision sensors that respond to pixel-wise brightness changes asynchronously. Event cameras’ high dynamic range is pivotal for visual perception in extreme low-light scenarios, surpassing traditional cameras and enabling applications in challenging dark environments. In this paper, inspired by the success of Retinex theory for traditional frame-based low-light image restoration, we introduce the first method that combines Retinex theory with event cameras and propose a novel Retinex-based low-light image restoration framework named ERetinex. Among our contributions, the first is developing a new approach that leverages the high temporal resolution data from event cameras with traditional image information to estimate scene illumination accurately. This method outperforms traditional image-only techniques, especially in low-light environments, by providing more precise lighting information. Additionally, we propose an effective fusion strategy that combines the high dynamic range data from event cameras with the color information of traditional images to enhance image quality. Through this fusion, we can generate clearer and more detail-rich images, maintaining the integrity of visual information even under extreme lighting conditions. The experimental results indicate that our proposed method outperforms state-of-the-art (SOTA) methods, achieving a gain of 1.0613 dB in PSNR while reducing FLOPS by 84.28%. The code is available at https://github.com/lodew920/ERetinex.
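As background, Retinex theory models an observed image as the pixel-wise product of reflectance and illumination, I = R ⊙ L, so enhancement amounts to estimating the illumination map (here aided by event data) and recovering the reflectance. The PyTorch sketch below only illustrates this classical decomposition in log space; it is not the ERetinex network.

import torch

def retinex_decompose(image, illumination, eps=1e-4):
    # image:        (B, 3, H, W) low-light RGB image in [0, 1]
    # illumination: (B, 1, H, W) estimated illumination map in (0, 1]
    # Working in log space turns the product I = R * L into a difference and the
    # clamping avoids amplifying noise where the illumination estimate is near zero.
    log_r = torch.log(image.clamp(min=eps)) - torch.log(illumination.clamp(min=eps))
    return torch.exp(log_r).clamp(0.0, 1.0)   # reflectance, i.e. the enhanced image content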
|
|
15:20-15:25, Paper TuCT24.2 | Add to My Program |
ThermoStereoRT: Thermal Stereo Matching in Real Time Via Knowledge Distillation and Attention-Based Refinement |
|
Hu, Anning | Shanghai Jiao Tong University |
Li, Ang | Shanghai Jiao Tong University |
Jin, Xirui | Shanghai Jiao Tong University |
Zou, Danping | Shanghai Jiao Tong University |
Keywords: Deep Learning for Visual Perception, RGB-D Perception
Abstract: We introduce ThermoStereoRT, a real-time thermal stereo matching method designed for all-weather conditions that recovers disparity from two rectified thermal stereo images, envisioning applications such as night-time drone surveillance or under-bed cleaning robots. Leveraging a lightweight yet powerful backbone, ThermoStereoRT constructs a 3D cost volume from thermal images and employs multi-scale attention mechanisms to produce an initial disparity map. To refine this map, we design a novel channel and spatial attention module. Addressing the challenge of sparse ground truth data in thermal imagery, we utilize knowledge distillation to boost performance without increasing computational demands. Comprehensive evaluations on multiple datasets demonstrate that ThermoStereoRT delivers both real-time capability and robust accuracy, making it a promising solution for real-world deployment in various challenging environments. Our code will be released on https://github.com/SJTU-ViSYS-team/ThermoStereoRT.
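Distilling a heavier teacher into the real-time student is a natural fit when dense thermal ground truth is scarce. The PyTorch sketch below shows a generic recipe that mixes a sparse supervised disparity term with a dense teacher-imitation term; the function name, loss choices, and the w_kd weight are assumptions about the general approach, not the paper's exact losses.

import torch
import torch.nn.functional as F

def distilled_disparity_loss(student_disp, teacher_disp, gt_disp, gt_mask, w_kd=0.5):
    # student_disp, teacher_disp: (B, H, W) predicted disparity maps
    # gt_disp: (B, H, W) sparse ground truth, valid only where gt_mask (bool) is True
    sup = F.smooth_l1_loss(student_disp[gt_mask], gt_disp[gt_mask])   # sparse supervision
    kd = F.smooth_l1_loss(student_disp, teacher_disp.detach())        # dense teacher imitation
    return sup + w_kd * kd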
|
|
15:25-15:30, Paper TuCT24.3 | Add to My Program |
Tool-Mediated Robot Perception of Granular Substances Using Multiple Sensory Modalities |
|
Liu, Si | TUFTS |
Sinapov, Jivko | Tufts University |
Keywords: Recognition, Learning Categories and Concepts, Robot Audition
Abstract: People use tools to interact with and perceive the world, with multimodal sensory inputs forming the basis of how we understand our environment. For example, a blind person uses a walking cane to tap the road and detect obstacles, and a builder uses a hammer to strike a wall to assess its structural integrity. Using tools extends our sensory capabilities during exploratory behaviors, enabling us to perceive object properties that are otherwise inaccessible. Inspired by this cognitive process, we propose a framework in which a multi-sensory robot employs exploratory behaviors using various tools to recognize granular substances. Our framework effectively integrates multiple non-visual sensory inputs (e.g., audio, haptic, and tactile) gathered through multiple tools (e.g., spoon, fork) and behaviors (e.g., stirring, poking) to perceive object properties. The framework segments interactions into time windows and aligns different modalities, enhancing data efficiency and interactive perception. Additionally, we conducted tool-transfer experiments to evaluate similarities between tools. Our experiments demonstrate that combining multiple tools and behaviors outperforms single-tool and single-behavior approaches. While the audio modality dominates the non-visual multimodal system, other modalities contribute. We further demonstrate that tool similarities vary depending on the behavior, and notably, the robot does not need to complete entire interactions to achieve optimal recognition accuracy.
|
|
15:30-15:35, Paper TuCT24.4 | Add to My Program |
FisheyeDepth: A Real Scale Self-Supervised Depth Estimation Model for Fisheye Camera |
|
Zhao, Guoyang | HKUST(GZ) |
Liu, Yuxuan | Hong Kong University of Science and Technology |
Qi, Weiqing | HKUST |
Ma, Fulong | The Hong Kong University of Science and Technology |
Liu, Ming | Hong Kong University of Science and Technology (Guangzhou) |
Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Deep Learning for Visual Perception, Autonomous Vehicle Navigation, Vision-Based Navigation
Abstract: Accurate depth estimation is crucial for 3D scene comprehension in robotics and autonomous vehicles. Fisheye cameras, known for their wide field of view, have inherent geometric benefits. However, their use in depth estimation is restricted by a scarcity of ground truth data and image distortions. We present FisheyeDepth, a self-supervised depth estimation model tailored for fisheye cameras. We incorporate a fisheye camera model into the projection and reprojection stages during training to handle image distortions, thereby improving depth estimation accuracy and training stability. Furthermore, we incorporate real-scale pose information into the geometric projection between consecutive frames, replacing the poses estimated by the conventional pose network. Essentially, this method offers the necessary physical depth for robotic tasks, and also streamlines the training and inference procedures. Additionally, we devise a multi-channel output strategy to improve robustness by adaptively fusing features at various scales, which reduces the noise from real pose data. We demonstrate the superior performance and robustness of our model in fisheye image depth estimation through evaluations on public datasets and real-world scenarios. The project website is available at: https://github.com/guoyangzhao/FisheyeDepth.
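Replacing the pinhole model with a fisheye model in the projection and reprojection steps is the core idea above. The NumPy sketch below shows one common fisheye model (equidistant, r = f·theta) used to project camera-frame points to pixels; the paper's calibrated camera model may differ, so treat this purely as an illustration of where the fisheye geometry enters a self-supervised reprojection loss.

import numpy as np

def equidistant_fisheye_project(points_cam, fx, fy, cx, cy):
    # points_cam: (N, 3) points in the camera frame, Z pointing forward.
    # Equidistant model: image radius is proportional to the angle from the optical axis.
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    r_xy = np.sqrt(x**2 + y**2)
    theta = np.arctan2(r_xy, z)                                   # angle from the optical axis
    scale = np.where(r_xy > 1e-9, theta / np.maximum(r_xy, 1e-9), 0.0)
    u = fx * (x * scale) + cx
    v = fy * (y * scale) + cy
    return np.stack([u, v], axis=1)                               # (N, 2) pixel coordinates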
|
|
15:35-15:40, Paper TuCT24.5 | Add to My Program |
Geometry-Aware Volumetric Data Stitching Using Local Surface Mapping and Robot Optical Coherence Tomography |
|
Ma, Guangshen | Duke University |
Draelos, Mark | University of Michigan |
Keywords: Sensor-based Control, Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: Optical coherence tomography (OCT) has been widely used for high-fidelity biological tissue scanning but is traditionally limited to small lateral fields of view that preclude large-area scanning. To overcome this problem, we propose integrating an OCT sensor onto a 6-DOF robot arm end-effector combined with a geometry-aware stitching model for surface and volumetric data stitching. We first develop a simple but efficient Robot-OCT calibration method by using a three-marker calibration pattern and implement an optimization solver. Given a pre-defined trajectory, a local planner is developed to update the sensor pose by using the OCT point cloud information in order to maintain the effective imaging depth based on the distance and orientation constraints. The system calibration method is verified through repeated experiments with the three-marker targets and the result shows an average testing error of 0.132 ± 0.071 mm. The geometry-aware OCT stitching framework is demonstrated through experiments with different scanning trajectories and 3D-printed phantoms for large-area scanning. The OCT stitched point cloud is compared with the ground truth from the phantom CAD model and the results show an average surface alignment error of 0.441 ± 0.241 mm for the path-following tasks.
|
|
15:40-15:45, Paper TuCT24.6 | Add to My Program |
Thermal Chameleon: Task-Adaptive Tone-Mapping for Radiometric Thermal-Infrared Images |
|
Lee, DongGuw | Seoul National University (SNU) |
Kim, Jeongyun | SNU |
Cho, Younggun | Inha University |
Kim, Ayoung | Seoul National University |
Keywords: Object Detection, Segmentation and Categorization, Representation Learning, Deep Learning for Visual Perception
Abstract: Thermal Infrared (TIR) imaging provides robust perception for navigating in challenging outdoor environments but faces issues with poor texture and low image contrast due to its 14/16-bit format. Conventional methods utilize various tone-mapping methods to enhance contrast and photometric consistency of TIR images; however, choosing a tone-mapping that works well largely depends on knowing the task and temperature-dependent priors in advance. In this paper, we present the Thermal Chameleon Network (TCNet), a task-adaptive tone-mapping approach for RAW 14-bit TIR images. Given the same image, TCNet tone-maps different representations of TIR images tailored for each specific task, eliminating the heuristic image rescaling preprocessing and reliance on extensive prior knowledge of the scene temperature or task-specific characteristics. TCNet exhibits improved generalization performance across object detection and monocular depth estimation, with minimal computational overhead and modular integration into existing architectures for various tasks.
|
|
TuDT1 Regular Session, 302 |
Add to My Program |
Award Finalists 4 |
|
|
Chair: Chli, Margarita | ETH Zurich & University of Cyprus |
Co-Chair: Kosuge, Kazuhiro | The University of Hong Kong |
|
16:35-16:40, Paper TuDT1.1 | Add to My Program |
MAC-VO: Metrics-Aware Covariance for Learning-Based Stereo Visual Odometry |
|
Qiu, Yuheng | Carnegie Mellon University |
Chen, Yutian | Carnegie Mellon University |
Zhang, Zihao | Shanghai Jiao Tong University |
Wang, Wenshan | Carnegie Mellon University |
Scherer, Sebastian | Carnegie Mellon University |
Keywords: SLAM, Localization, Mapping
Abstract: We propose MAC-VO, a novel learning-based stereo VO that leverages the learned metrics-aware matching uncertainty for dual purposes: selecting keypoints and weighting the residuals in pose graph optimization. Compared to traditional geometric methods prioritizing texture-affluent features like edges, our keypoint selector employs the learned uncertainty to filter out the low-quality features based on global inconsistency. In contrast to the learning-based algorithms that model the scale-agnostic diagonal weight matrix for covariance, we design a metrics-aware covariance model to capture the spatial error during keypoint registration and the correlations between different axes. Integrating this covariance model into pose graph optimization enhances the robustness and reliability of pose estimation, particularly in challenging environments with varying illumination, feature density, and motion patterns. On public benchmark datasets, MAC-VO outperforms existing VO algorithms and even some SLAM algorithms in challenging environments. The covariance map also provides valuable information about the reliability of the estimated poses, which can benefit decision-making for autonomous systems.
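The way a full, learned covariance enters pose-graph optimization can be shown in a few lines: the residual is whitened so that its squared norm equals the Mahalanobis distance. The NumPy sketch below is the textbook construction, not MAC-VO's implementation.

import numpy as np

def mahalanobis_residual(predicted_pt, observed_pt, covariance):
    # predicted_pt, observed_pt: (3,) registered keypoint positions
    # covariance: (3, 3) learned, metrics-aware covariance (correlated axes allowed,
    # unlike a scale-agnostic diagonal model)
    r = predicted_pt - observed_pt
    # Cholesky factor L with covariance = L @ L.T; solving L w = r whitens the residual,
    # so that ||w||^2 = r^T covariance^{-1} r, the weight used by the pose-graph optimizer.
    L = np.linalg.cholesky(covariance)
    return np.linalg.solve(L, r)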
|
|
16:40-16:45, Paper TuDT1.2 | Add to My Program |
Ground-Optimized 4D Radar-Inertial Odometry Via Continuous Velocity Integration Using Gaussian Process |
|
Yang, Wooseong | Seoul National University |
Jang, Hyesu | Seoul National University |
Kim, Ayoung | Seoul National University |
Keywords: Range Sensing, SLAM, Localization
Abstract: Radar ensures robust sensing capabilities in adverse weather conditions, yet challenges remain due to its high inherent noise level. Existing radar odometry has overcome these challenges with strategies such as filtering spurious points, exploiting Doppler velocity, or integrating with inertial measurements. This paper presents two novel improvements beyond the existing radar-inertial odometry: ground-optimized noise filtering and continuous velocity preintegration. Despite the widespread use of ground planes in LiDAR odometry, imprecise ground point distributions of radar measurements cause naive plane fitting to fail. Unlike plane fitting in LiDAR, we introduce a zone-based uncertainty-aware ground modeling specifically designed for radar. Secondly, we note that radar velocity measurements can be better combined with IMU for a more accurate preintegration in radar-inertial odometry. Existing methods often ignore temporal discrepancies between radar and IMU by simplifying the complexities of asynchronous data streams with discretized propagation models. Tackling this issue, we leverage Gaussian processes (GP) and formulate a continuous preintegration method for tightly integrating 3-DOF linear velocity with IMU, facilitating full 6-DOF motion directly from the raw measurements. Our approach demonstrates remarkable performance (less than 1% vertical drift) in public datasets with meticulous conditions, illustrating substantial improvement in elevation accuracy. The code will be released as open source for the community: https://github.com/wooseongY/Go-RIO.
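Continuous velocity preintegration boils down to integrating an interpolated ego-velocity profile between two times. The NumPy sketch below uses simple linear interpolation as a stand-in for the paper's Gaussian-process model, so it only illustrates the integration step, not the GP machinery or the coupling with IMU rotation.

import numpy as np

def integrate_velocity(times, velocities, t0, t1, n=100):
    # times: (K,) timestamps of sparse radar ego-velocity samples
    # velocities: (K, 3) the corresponding 3-DOF linear velocities
    # Returns the (3,) displacement accumulated between t0 and t1.
    ts = np.linspace(t0, t1, n)
    v = np.stack([np.interp(ts, times, velocities[:, i]) for i in range(3)], axis=1)
    return np.trapz(v, ts, axis=0)   # trapezoidal integration of the velocity profile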
|
|
16:45-16:50, Paper TuDT1.3 | Add to My Program |
UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation |
|
Tang, Yihe | Stanford University |
Huang, Wenlong | Stanford University |
Wang, Yingke | Stanford University |
Li, Chengshu | Stanford University |
Yuan, Roy | Stanford University |
Zhang, Ruohan | Stanford University |
Wu, Jiajun | Stanford University |
Fei-Fei, Li | Stanford University |
Keywords: Representation Learning, Deep Learning for Visual Perception, Sensorimotor Learning
Abstract: Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing methods for visual affordance prediction often rely on manually annotated data or condition only on a predefined set of tasks. We introduce Unsupervised Affordance Distillation (UAD), a method for distilling affordance knowledge from foundation models into a task-conditioned affordance model without any manual annotations. By leveraging the complementary strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset with detailed pairs. Training only a lightweight task-conditioned decoder atop frozen features, UAD exhibits notable generalization to in-the-wild robotic scenes as well as to various human activities despite only being trained on rendered objects in simulation. Using affordance provided by UAD as the observation space, we show an imitation learning policy that demonstrates promising generalization to unseen object instances, object categories, and even variations in task instructions after training on as few as 10 demonstrations.
|
|
16:50-16:55, Paper TuDT1.4 | Add to My Program |
Bat-VUFN: Bat-Inspired Visual-And-Ultrasound Fusion Network for Robust Perception in Adverse Conditions |
|
Lim, Gyeongrok | KAIST |
Hong, Jeong-ui | KAIST |
Bae, Hyeon Min | Kaist |
Keywords: Sensor Fusion, Localization
Abstract: Environmental factors like weather and road conditions significantly impact object recognition in autonomous vehicles. While cameras provide rich semantic information, their reliance on electromagnetic waves makes them vulnerable to performance degradation in adverse conditions such as low light and rain. In contrast, ultrasonic sensors offer reliable short-range detection, unaffected by such conditions. We introduce Bat-VUFN, a bio-inspired multi-sensory system that merges camera and ultrasonic data using an Input Quality Score (IQS)-based fusion technique to enhance near-field perception in challenging environments. Bat-VUFN dynamically adjusts sensor contributions based on prevailing conditions, achieving impressive results on the K-Bat dataset (average precision: 0.95, MAE: 0.52m, RMSE: 0.55m), demonstrating its robustness in adverse scenarios.
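Quality-score-based fusion of the two sensors can be illustrated with a simple weighted average. The NumPy sketch below is a toy version; how the Input Quality Scores are actually computed and combined in Bat-VUFN is not detailed here, so the normalization scheme is an assumption.

import numpy as np

def quality_weighted_fusion(camera_est, ultrasonic_est, camera_iqs, ultrasonic_iqs, eps=1e-6):
    # camera_est, ultrasonic_est: (N,) per-object range estimates from each sensor
    # camera_iqs, ultrasonic_iqs: (N,) non-negative quality scores, e.g. low for the camera
    # in darkness or rain, low for the ultrasonic sensor at long range
    w_cam = camera_iqs / (camera_iqs + ultrasonic_iqs + eps)
    w_us = 1.0 - w_cam
    return w_cam * camera_est + w_us * ultrasonic_est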
|
|
16:55-17:00, Paper TuDT1.5 | Add to My Program |
TinySense: A Lighter Weight and More Power-Efficient Avionics System for Flying Insect-Scale Robots |
|
Yu, Zhitao | University of Washington |
Tran, Josh | University of Washington |
Li, Claire | University of Washington |
Weber, Aaron | University of Washington |
Talwekar, Yash P. | University of Washington |
Fuller, Sawyer | University of Washington |
Keywords: Biologically-Inspired Robots, Micro/Nano Robots, Sensor Fusion
Abstract: In this paper, we introduce advances in the sensor suite of an autonomous flying insect robot (FIR) weighing less than a gram. FIRs, because of their small weight and size, offer unparalleled advantages in terms of material cost and scalability. However, their size introduces considerable control challenges, notably high-speed dynamics, restricted power, and limited payload capacity. While there have been advancements in developing lightweight sensors, often drawing inspiration from biological systems, no sub-gram aircraft has been able to attain sustained hover without relying on feedback from external sensing such as a motion capture system. The lightest vehicle capable of sustained hovering, the first level of "sensor autonomy", is the much larger 28 g Crazyflie. Previous work reported a reduction in size of that vehicle's avionics suite to 187 mg and 21 mW. Here, we report a further reduction in mass and power to only 78.4 mg and 15 mW. We replaced the laser rangefinder with a lighter and more efficient pressure sensor, and built a smaller optic flow sensor around a global-shutter imaging chip. A Kalman Filter (KF) fuses these measurements to estimate the state variables that are needed to control hover: pitch angle, translational velocity, and altitude. Our system achieved performance comparable to that of the Crazyflie's estimator while in flight, with root mean squared errors of 1.573 deg, 0.186 m/s, and 0.136 m, respectively, relative to motion capture.
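Fusing the pressure-based altitude and optic-flow measurements relies on a Kalman filter. The NumPy sketch below is a generic linear Kalman measurement update for reference; the vehicle's actual state layout (pitch angle, translational velocity, altitude) and noise parameters are not reproduced here.

import numpy as np

def kalman_update(x, P, z, H, R):
    # x: (n,) state estimate, e.g. [altitude, vertical velocity, forward velocity]
    # P: (n, n) state covariance
    # z: (m,) measurement, e.g. [pressure altitude, optic-flow velocity]
    # H: (m, n) measurement model, R: (m, m) measurement noise covariance
    y = z - H @ x                           # innovation
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new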
|
|
17:00-17:05, Paper TuDT1.6 | Add to My Program |
TSCLIP: Robust CLIP Fine-Tuning for Worldwide Cross-Regional Traffic Sign Recognition |
|
Zhao, Guoyang | HKUST(GZ) |
Ma, Fulong | The Hong Kong University of Science and Technology |
Qi, Weiqing | HKUST |
Zhang, Chenguang | Wuhan Polytechnic University |
Liu, Yuxuan | Hong Kong University of Science and Technology |
Liu, Ming | Hong Kong University of Science and Technology (Guangzhou) |
Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Recognition, Computer Vision for Transportation, Autonomous Vehicle Navigation
Abstract: Traffic signs are critical map features for navigation and traffic control. Nevertheless, current methods for traffic sign recognition rely on traditional deep learning models, which typically suffer from significant performance degradation under variations in data distribution across different regions. In this paper, we propose TSCLIP, a robust fine-tuning approach with the contrastive language-image pre-training (CLIP) model for worldwide cross-regional traffic sign recognition. We first curate a cross-regional traffic sign benchmark dataset by combining data from ten different sources. Then, we propose a prompt engineering scheme tailored to the characteristics of traffic signs, which involves specific scene descriptions and corresponding rules to generate targeted text descriptions for optimizing the model training process. During the TSCLIP fine-tuning process, we implement adaptive dynamic weight ensembling (ADWE) to seamlessly incorporate outcomes from each training iteration with the zero-shot CLIP model. This approach ensures that the model retains its ability to generalize while acquiring new knowledge about traffic signs. Our method surpasses conventional classification benchmark models in cross-regional traffic sign evaluations, and it achieves state-of-the-art performance compared to existing CLIP fine-tuning techniques (Fig. 1). To the best of the authors' knowledge, TSCLIP is the first contrastive language-image model used for the worldwide cross-regional traffic sign recognition task. The project website is available at: https://github.com/guoyangzhao/TSCLIP.
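Weight-space ensembling of a fine-tuned checkpoint with the zero-shot CLIP weights, which ADWE performs adaptively at each iteration, can be sketched as a parameter-wise interpolation. The snippet below uses a fixed coefficient purely for illustration; the adaptive schedule is the paper's contribution and is not shown.

import torch

def ensemble_state_dicts(zero_shot_sd, finetuned_sd, alpha=0.5):
    # zero_shot_sd, finetuned_sd: state_dicts of the same CLIP image encoder.
    # alpha is the share given to the fine-tuned weights; the paper adapts this
    # per training iteration, a constant is used here only for illustration.
    return {k: (1.0 - alpha) * zero_shot_sd[k] + alpha * finetuned_sd[k]
            for k in zero_shot_sd}

The interpolated state_dict can then be loaded back into the encoder before the next training iteration, preserving zero-shot generalization while absorbing traffic-sign knowledge.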
|
|
TuDT2 Regular Session, 301 |
Add to My Program |
Integrating Motion Planning and Learning 1 |
|
|
Chair: Mao, Jiayuan | MIT |
Co-Chair: Righetti, Ludovic | New York University |
|
16:35-16:40, Paper TuDT2.1 | Add to My Program |
Adaptive Abrupt Disturbance Rejection Tracking Control for Wheeled Mobile Robots |
|
Wu, Hao | Huazhong University of Science and Technology |
Wang, Shuting | Huazhong University of Science and Technology |
Xie, Yuanlong | Huazhong University of Science and Technology |
Li, Hu | Huazhong University of Science and Technology |
Zheng, Shiqi | China University of Geosciences Wuhan Campus |
Jiang, Liquan | Wuhan Textile University |
Keywords: Motion Control, Wheeled Robots, Robust/Adaptive Control
Abstract: Uncertain disturbances increase the difficulty of robust tracking control for wheeled mobile robots (WMRs) in industrial scenarios, especially when the disturbances exhibit abrupt changes. This letter proposes an adaptive abrupt disturbance-rejection sliding mode controller (SMC). To address the increased variability in the disturbance boundaries caused by abrupt transitions, a new adaptive disturbance observer (ADOB) is designed to improve the tracking robustness and attenuate the chattering of the SMC by generating auxiliary system variables without depending on any prior boundary information about the disturbance and its change rate. Then, a novel barrier function (BF)-based switching law is constructed to suppress the residual-disturbance estimation error of the ADOB at the transient state, which achieves a tradeoff between a sufficiently large gain and chattering by avoiding gain overestimation. The finite-time Lyapunov stability of the sliding variables and the estimation errors is proved theoretically. The practical effectiveness is illustrated in experiments with the custom-developed WMRs.
|
|
16:40-16:45, Paper TuDT2.2 | Add to My Program |
OPPA: Online Planner's Parameter Adaptation for Enhanced Mobile Robot Navigation |
|
Chang, Minsu | Samsung Electronics |
Jang, Junwon | Samsung Electronics Co., Ltd |
Han, Daewoong | Samsung Electronics Co. Ltd |
Choi, Wonje | Samsung Electronics |
Kim, Seungyeon | Graduate School of Convergence Science and Technology, Seoul National University |
Park, Hyunkyu | Samsung Advanced Institute of Technology |
Choi, Hyundo | Samsung Electronics |
Keywords: Collision Avoidance, Integrated Planning and Learning, AI-Based Methods
Abstract: Autonomous navigation in mobile robots has made significant advancements; however, traditional methods often struggle to adapt in real-time to dynamic or unstructured environments. This paper presents the Online Planner’s Parameter Adaptation (OPPA) framework, which enhances both adaptability and safety in mobile robot navigation by dynamically adjusting planner parameters. OPPA integrates a rule-based system for estimating tunnel width using 2D LiDAR and path data with a learning-based approach utilizing a shallow transformer model. By incorporating a human-in-the-loop process to refine training data, OPPA improves accuracy and reliability in complex environments. Designed for real-time efficiency on resource-constrained platforms, OPPA has been validated through simulation and real-world experiments, demonstrating its ability to enhance both safety and performance. These results highlight OPPA as a viable solution for dynamic and complex robotic applications.
|
|
16:45-16:50, Paper TuDT2.3 | Add to My Program |
Learning to Refine Input Constrained Control Barrier Functions Via Uncertainty-Aware Online Parameter Adaptation |
|
Kim, Taekyung | University of Michigan |
Kee, Robin Inho | University of Michigan |
Panagou, Dimitra | University of Michigan, Ann Arbor |
Keywords: Integrated Planning and Learning, Integrated Planning and Control, Machine Learning for Robot Control
Abstract: Control Barrier Functions (CBFs) have become powerful tools for ensuring safety in nonlinear systems. However, finding valid CBFs that guarantee persistent safety and feasibility remains an open challenge, especially in systems with input constraints. Traditional approaches often rely on manually tuning the parameters of the class K functions of the CBF conditions a priori. The performance of CBF-based controllers is highly sensitive to these fixed parameters, potentially leading to overly conservative behavior or safety violations. To overcome these issues, this paper introduces a learning-based optimal control framework for online adaptation of Input Constrained CBF (ICCBF) parameters in discrete-time nonlinear systems. Our method employs a probabilistic ensemble neural network to predict the performance and risk metrics, as defined in this work, for candidate parameters, accounting for both epistemic and aleatoric uncertainties. We propose a two-step verification process using Jensen-Rényi Divergence and distributionally-robust Conditional Value at Risk to identify valid parameters. This enables dynamic refinement of ICCBF parameters based on current state and nearby environments, optimizing performance while ensuring safety within the verified parameter set. Experimental results demonstrate that our method outperforms both fixed-parameter and existing adaptive methods in robot navigation scenarios across safety and performance metrics.
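Reading out performance and risk predictions from a probabilistic ensemble, including the split into aleatoric and epistemic uncertainty, follows a standard pattern. The PyTorch sketch below assumes each ensemble member returns a (mean, log-variance) pair; this interface is a hypothetical placeholder, not the paper's network or its Jensen-Rényi/CVaR verification steps.

import torch

def ensemble_prediction(ensemble, inputs):
    # ensemble: list of networks, each mapping (N, d_in) -> ((N, d_out), (N, d_out))
    # Returns predictive mean, aleatoric variance (average predicted noise), and
    # epistemic variance (disagreement between ensemble members).
    with torch.no_grad():
        means, log_vars = zip(*[net(inputs) for net in ensemble])
    means = torch.stack(means)                       # (E, N, d_out)
    aleatoric = torch.stack(log_vars).exp().mean(0)  # mean of the predicted noise variances
    epistemic = means.var(0)                         # spread of the member means
    return means.mean(0), aleatoric, epistemic

Candidate ICCBF parameters whose predicted risk or epistemic uncertainty exceeds a threshold would then be excluded from the verified set before the online adaptation step.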
|
|
16:50-16:55, Paper TuDT2.4 | Add to My Program |
GA-TEB: Goal-Adaptive Framework for Efficient Navigation Based on Goal Lines |
|
Zhang, Qianyi | Nankai University |
Luo, Wentao | Huawei |
Zhang, Ziyang | Huawei, China |
Wang, Yaoyuan | Huawei |
Liu, Jingtai | Nankai University |
Keywords: Motion and Path Planning, Human-Aware Motion Planning
Abstract: In crowd navigation, the local goal plays a crucial role in trajectory initialization, optimization, and evaluation. Recognizing that when the global goal is distant, the robot's primary objective is avoiding collisions, making it less critical to pass through the exact local goal point, this work introduces the concept of goal lines, which extend the traditional local goal from a single point to multiple candidate lines. Coupled with a topological map construction strategy that groups obstacles to be as convex as possible, a goal-adaptive navigation framework is proposed to efficiently plan multiple candidate trajectories. Simulations and experiments demonstrate that the proposed GA-TEB framework effectively prevents deadlock situations, where the robot becomes frozen due to a lack of feasible trajectories in crowded environments. Additionally, the framework greatly increases planning frequency in scenarios with numerous non-convex obstacles, enhancing both robustness and safety.
|
|
16:55-17:00, Paper TuDT2.5 | Add to My Program |
Reinforcement Learning for Adaptive Planner Parameter Tuning: A Perspective on Hierarchical Architecture |
|
Lu, Wangtao | Zhejiang University |
Wei, Yufei | Zhejiang University |
Xu, Jiadong | Zhejiang University |
Jia, Wenhao | Zhejiang University of Technology |
Li, Liang | Zhejiang University |
Xiong, Rong | Zhejiang University |
Wang, Yue | Zhejiang University |
Keywords: Motion and Path Planning, Reinforcement Learning
Abstract: Automatic parameter-tuning methods for planning algorithms, which integrate pipeline approaches with learning-based techniques, are considered promising because of their stability and their ability to handle unstructured environments. Although existing parameter-tuning methods have demonstrated considerable success, further performance improvements require a more structured approach. In this paper, we propose a hierarchical architecture for reinforcement learning-based parameter tuning. The architecture introduces a hierarchical structure with low-frequency parameter tuning, mid-frequency planning, and high-frequency control, enabling the simultaneous enhancement of upper-level parameter tuning and lower-level control through iterative training. Experimental evaluations in both simulated and real-world environments show that our method surpasses existing parameter-tuning approaches. In addition, our method is further validated in the Benchmark Autonomous Robot Navigation (BARN) Challenge.
|
|
17:00-17:05, Paper TuDT2.6 | Add to My Program |
Integrating One-Shot View Planning with a Single Next-Best View Via Long-Tail Multiview Sampling |
|
Pan, Sicong | University of Bonn |
Hu, Hao | Fudan University |
Wei, Hui | Fudan University |
Dengler, Nils | University of Bonn |
Zaenker, Tobias | University of Bonn |
Elnagdi, Murad | University of Bonn |
Bennewitz, Maren | University of Bonn |
Keywords: View Planning, Deep Learning in Robotics and Automation, Motion and Path Planning, Computer Vision for Automation
Abstract: Existing view planning systems either adopt an iterative paradigm using next-best views (NBV) or a one-shot pipeline relying on the set-covering view-planning (SCVP) network. However, neither of these methods can concurrently guarantee both high-quality and high-efficiency reconstruction of 3D unknown objects. To tackle this challenge, we introduce a crucial hypothesis: with the availability of more information about the unknown object, the prediction quality of the SCVP network improves. There are two ways to provide extra information: (1) leveraging perception data obtained from NBVs, and (2) training on an expanded dataset of multiview inputs. In this work, we introduce a novel combined pipeline that incorporates a single NBV before activating the proposed multiview-activated (MA-)SCVP network. The MA-SCVP is trained on a multiview dataset generated by our long-tail sampling method, which addresses the issue of unbalanced multiview inputs and enhances the network performance. Extensive simulated experiments substantiate that our system demonstrates a significant surface coverage increase and a substantial 45% reduction in movement cost compared to state-of-the-art systems. Real-world experiments justify the capability of our system for generalization and deployment.
|
|
TuDT3 Regular Session, 303 |
Add to My Program |
Verification and Formal Methods |
|
|
Chair: Luo, Xusheng | Carnegie Mellon University |
Co-Chair: Liu, Wenliang | Amazon |
|
16:35-16:40, Paper TuDT3.1 | Add to My Program |
Decomposition-Based Hierarchical Task Allocation and Planning for Multi-Robots under Hierarchical Temporal Logic Specifications |
|
Luo, Xusheng | Carnegie Mellon University |
Xu, Shaojun | Zhejiang University |
Liu, Ruixuan | Carnegie Mellon University |
Liu, Changliu | Carnegie Mellon University |
Keywords: Formal Methods in Robotics and Automation, Planning, Scheduling and Coordination, Multi-Robot Systems
Abstract: Past research into robotic planning with temporal logic specifications, notably Linear Temporal Logic (LTL), has largely been based on single formulas for individual robots or groups of robots. However, with increasing task complexity, LTL formulas unavoidably grow lengthy, complicating interpretation and specification generation and straining the computational capacities of planners. A recent development is the hierarchical representation of LTL [Luo et al., 2024] that contains multiple temporal logic specifications, providing a more interpretable framework. However, the associated planning algorithm assumes the independence of robots within each specification, limiting its application to multi-robot coordination with complex temporal constraints. In this work, we formulate a decomposition-based hierarchical framework. At the high level, each specification is first decomposed into a set of atomic sub-tasks. We further infer the temporal relations among the sub-tasks of different specifications to construct a task network. Subsequently, a Mixed Integer Linear Program is utilized to assign sub-tasks to the robots. At the lower level, domain-specific controllers are employed to execute the sub-tasks. Our approach was experimentally applied to the domains of robotic navigation and manipulation. The simulations demonstrate that our approach finds better solutions with shorter runtimes.
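As a purely illustrative stand-in for the MILP-based sub-task assignment mentioned above, the brute-force sketch below enumerates assignments and minimizes a hypothetical cost matrix; the actual formulation with inferred temporal relations is the paper's contribution and is not reproduced here.

    from itertools import permutations

    # Hypothetical costs[r][t]: cost for robot r to execute atomic sub-task t.
    costs = [
        [4.0, 2.5, 6.0],
        [3.0, 5.0, 2.0],
        [5.5, 3.5, 4.5],
    ]

    def best_assignment(costs):
        # Exhaustive stand-in for the MILP: one sub-task per robot, minimizing total cost.
        n = len(costs)
        best, best_perm = float("inf"), None
        for perm in permutations(range(n)):        # perm[r] = sub-task given to robot r
            total = sum(costs[r][perm[r]] for r in range(n))
            if total < best:
                best, best_perm = total, perm
        return best_perm, best

    assignment, total_cost = best_assignment(costs)
    print("robot -> sub-task:", list(enumerate(assignment)), "total cost:", total_cost)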
|
|
16:40-16:45, Paper TuDT3.2 | Add to My Program |
Hand It to Me Formally! Data-Driven Control for Human-Robot Handovers with Signal Temporal Logic |
|
Khanna, Parag | KTH Royal Institute of Technology |
Fredberg, Jonathan | KTH Royal Institute of Technology |
Björkman, Mårten | KTH |
Smith, Claes Christian | KTH Royal Institute of Technology |
Linard, Alexis | KTH Royal Institute of Technology |
Keywords: Formal Methods in Robotics and Automation, Human-Aware Motion Planning, Motion and Path Planning
Abstract: To facilitate human-robot interaction (HRI), we aim for robot behavior that is efficient, transparent, and closely resembles human actions. Signal Temporal Logic (STL) is a formal language that enables the specification and verification of complex temporal properties in robotic systems, helping to ensure their correctness. STL can be used to generate explainable robot behaviour, the degree of satisfaction of which can be quantified by checking its STL robustness. In this work, we use data-driven STL inference techniques to model human behavior in human-human interactions, on a handover dataset. We then use the learned model to generate robot behavior in human-robot interactions. We present a handover planner based on inferred STL specifications to command robotic motion in human-robot handovers. We also validate our method in a human-to-robot handover experiment.
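As a purely illustrative aside, the STL robustness mentioned above can be computed in a few lines for a simple reach specification; the specification and numbers below are hypothetical examples, not the specifications inferred in the paper.

    def robustness_eventually_close(distances, d_max):
        # rho(F (dist <= d_max)) = max over time of (d_max - dist_t):
        # positive means the specification is satisfied, with larger values giving more margin.
        return max(d_max - d for d in distances)

    # Example handover trace: end-effector-to-hand distance in metres over time.
    trace = [0.80, 0.55, 0.31, 0.12, 0.05, 0.09]
    print(robustness_eventually_close(trace, d_max=0.10))  # 0.05 > 0, spec satisfied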
|
|
16:45-16:50, Paper TuDT3.3 | Add to My Program |
Forward Invariance in Trajectory Spaces for Safety-Critical Control |
|
Vahs, Matti | KTH Royal Institute of Technology, Stockholm |
Cabral Muchacho, Rafael Ignacio | KTH Royal Institute of Technology |
Pokorny, Florian T. | KTH Royal Institute of Technology |
Tumova, Jana | KTH Royal Institute of Technology |
Keywords: Formal Methods in Robotics and Automation, Robot Safety
Abstract: Useful robot control algorithms should not only achieve performance objectives but also adhere to hard safety constraints. Control Barrier Functions (CBFs) have been developed to provably ensure system safety through forward invariance. However, they often unnecessarily sacrifice performance for safety since they are purely reactive. Receding horizon control (RHC), on the other hand, considers planned trajectories to account for the future evolution of a system. This work provides a new perspective on safety-critical control by introducing Forward Invariance in Trajectory Spaces (FITS). We lift the problem of safe RHC into the trajectory space and describe the evolution of planned trajectories as a controlled dynamical system. Safety constraints defined over states can be converted into sets in the trajectory space, which we render forward invariant via a CBF framework. We derive an efficient quadratic program (QP) to synthesize trajectories that provably satisfy safety constraints. Our experiments show that FITS improves adherence to safety specifications without sacrificing performance relative to alternative CBF and NMPC methods.
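For context, the safety filter in CBF frameworks is typically a small QP; with a single affine constraint the minimally invasive correction has a closed form, sketched below on a generic nominal input. This is a generic CBF-QP illustration, not the trajectory-space formulation of the paper.

    import numpy as np

    def cbf_qp_single_constraint(u_nom, a, b):
        # Solve min ||u - u_nom||^2  s.t.  a^T u >= b  (one affine CBF constraint).
        # Closed form: project u_nom onto the half-space only if it violates the constraint.
        slack = a @ u_nom - b
        if slack >= 0.0:
            return u_nom
        return u_nom + (-slack / (a @ a)) * a

    u_nom = np.array([1.0, 0.0])   # nominal (performance) input
    a = np.array([0.5, 1.0])       # hypothetical constraint row from the CBF condition
    b = 1.2
    print(cbf_qp_single_constraint(u_nom, a, b))  # satisfies a^T u = b exactly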
|
|
16:50-16:55, Paper TuDT3.4 | Add to My Program |
Scalable Multi-Robot Task Allocation and Coordination under Signal Temporal Logic Specifications |
|
Liu, Wenliang | Amazon |
Majcherczyk, Nathalie | Worcester Polytechnic Institute |
Pecora, Federico | Amazon Robotics |
Keywords: Formal Methods in Robotics and Automation, Multi-Robot Systems, Planning, Scheduling and Coordination
Abstract: Motion planning with simple objectives, such as collision-avoidance and goal-reaching, can be solved efficiently using modern planners. However, the complexity of the allowed tasks for these planners is limited. On the other hand, signal temporal logic (STL) can specify complex requirements, but STL-based motion planning and control algorithms often face scalability issues, especially in large multi-robot systems with complex dynamics. In this paper, we propose an algorithm that leverages the best of the two worlds. We first use a single-robot motion planner to efficiently generate a set of alternative reference paths for each robot. Then coordination requirements are specified using STL, which is defined over the assignment of paths and robots' progress along those paths. We use a Mixed Integer Linear Program (MILP) to compute task assignments and robot progress targets over time such that the STL specification is satisfied. Finally, a local controller is used to track the target progress. Simulations demonstrate that our method can handle tasks with complex constraints and scales to large multi-robot teams and intricate task allocation scenarios.
|
|
16:55-17:00, Paper TuDT3.5 | Add to My Program |
Planning with Linear Temporal Logic Specifications: Handling Quantifiable and Unquantifiable Uncertainty |
|
Yu, Pian | University College London |
Li, Yong | University of Liverpool |
Parker, David | University of Oxford |
Kwiatkowska, Marta | University of Oxford |
Keywords: Formal Methods in Robotics and Automation, Planning under Uncertainty, Task Planning
Abstract: This work studies the planning problem for robotic systems under both quantifiable and unquantifiable uncertainty. The objective is to enable the robotic systems to optimally fulfill high-level tasks specified by Linear Temporal Logic (LTL) formulas. To capture both types of uncertainty in a unified modelling framework, we utilise Markov Decision Processes with Set-valued Transitions (MDPSTs). We introduce a novel solution technique for optimal robust strategy synthesis of MDPSTs with LTL specifications. To improve efficiency, our work leverages limit-deterministic Büchi automata (LDBAs) as the automaton representation for LTL to take advantage of their efficient constructions. To tackle the inherent nondeterminism in MDPSTs, which presents a significant challenge for reducing the LTL planning problem to a reachability problem, we introduce the concept of a Winning Region (WR) for MDPSTs. Additionally, we propose an algorithm for computing the WR over the product of the MDPST and the LDBA. Finally, a robust value iteration algorithm is invoked to solve the reachability problem. We validate the effectiveness of our approach through a case study involving a mobile robot operating in the hexagonal world, demonstrating promising efficiency gains.
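For intuition, robust value iteration over set-valued transitions takes the worst case over each transition set before maximizing over actions; the toy reachability example below uses hypothetical states and probabilities, not the product construction of the paper.

    # Robust value iteration for a reachability objective on a toy MDP with set-valued
    # transitions: each (state, action) maps to a set of possible distributions, and the
    # adversary picks the worst one. All numbers are illustrative.
    transitions = {
        ("s0", "a"): [{"s1": 1.0}, {"s0": 0.5, "s1": 0.5}],
        ("s0", "b"): [{"s0": 1.0}],
        ("s1", "a"): [{"s1": 1.0}],
    }
    goal = {"s1"}
    states = {"s0", "s1"}

    V = {s: (1.0 if s in goal else 0.0) for s in states}
    for _ in range(50):
        new_V = {}
        for s in states:
            if s in goal:
                new_V[s] = 1.0
                continue
            acts = [a for (st, a) in transitions if st == s]
            # max over actions, min over the distributions the adversary may pick
            new_V[s] = max(
                min(sum(p * V[t] for t, p in dist.items()) for dist in transitions[(s, a)])
                for a in acts
            )
        V = new_V
    print(V)  # worst-case probability of eventually reaching the goal from each state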
|
|
17:00-17:05, Paper TuDT3.6 | Add to My Program |
Lyapunov-Certified Trajectory Tracking for Mobile Robot with a Tail Wheel: Differential-Flatness and Adaptive Backstepping Design |
|
Nishizawa, Yuta | Honda R&D Co., Ltd |
Koga, Shumon | Honda Research and Development |
Aizawa, Koki | Honda R&D Co., Ltd |
Yasui, Yuji | Honda R&D Co., Ltd |
Keywords: Nonholonomic Mechanisms and Systems, Robust/Adaptive Control, Wheeled Robots
Abstract: This paper proposes a trajectory tracking control law for a mobile robot with two front differential wheels and a tail wheel. The dynamics are given by mimicking the Ackermann steering model for the position and orientation, combined with the actuator dynamics of the tail wheel's angle, modeled as a first-order response with respect to the robot's angular velocity. First, we develop a nominal trajectory tracking control law to track a given desired trajectory by applying the differential-flatness property of the unicycle model and a backstepping approach to handle the actuator dynamics. The effectiveness of the trajectory tracking is demonstrated through hardware robot experiments conducted after system identification, which illustrate superior performance over a benchmark method. The design is also extended to adaptive tracking control under parameter uncertainty in the tail wheel dynamics by introducing an adaptation law for the parameters, and the performance is demonstrated in numerical simulation.
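As background, the differential-flatness property of the unicycle model mentioned above lets the forward speed, heading, and angular rate be recovered from a smooth position reference and its derivatives; the sketch below shows the standard flatness relations only, not Honda's full tail-wheel model.

    import numpy as np

    def flat_outputs_to_unicycle_inputs(xd, yd, xdd, ydd):
        # Unicycle flatness: from (x_dot, y_dot, x_ddot, y_ddot) recover forward speed v,
        # heading theta, and angular rate omega.
        v = np.hypot(xd, yd)
        theta = np.arctan2(yd, xd)
        omega = (xd * ydd - yd * xdd) / max(v**2, 1e-9)  # guard against v = 0
        return v, theta, omega

    # Example: a circular reference of radius 2 m traversed at 1 rad/s, evaluated at t = 0.3 s.
    t, R, w = 0.3, 2.0, 1.0
    xd, yd = -R * w * np.sin(w * t), R * w * np.cos(w * t)
    xdd, ydd = -R * w**2 * np.cos(w * t), -R * w**2 * np.sin(w * t)
    print(flat_outputs_to_unicycle_inputs(xd, yd, xdd, ydd))  # v = 2.0, omega = 1.0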
|
|
TuDT4 Regular Session, 304 |
Add to My Program |
Object Detection 2 |
|
|
Chair: Li, Yingke | Massachusetts Institute of Technology |
Co-Chair: Joffe, Benjamin | Georgia Institute of Technology |
|
16:35-16:40, Paper TuDT4.1 | Add to My Program |
OPRNet: Object-Centric Point Reconstruction Network for Multimodal 3D Object Detection in Adverse Weathers |
|
Yoon, Jaehyun | Chonnam National University |
Jung, JongWon | CHONNAM University |
Lee, Eungi | Chonnam National University |
Yoo, Seok Bong | Chonnam National University |
Keywords: Object Detection, Segmentation and Categorization, Computer Vision for Automation, Sensor Fusion
Abstract: The development of multimodal fusion techniques utilizing LiDAR-camera data has enabled precise 3D object detection for self-driving vehicles, particularly in ideal conditions with clear weather. Nevertheless, adverse weather conditions such as fog, snow, and rain remain a challenge for existing multimodal methods. These conditions reduce the density of point clouds as a result of laser signal occlusion and attenuation. Additionally, as the distance grows, the point cloud becomes sparser, further challenging object detection tasks. To address these problems, we introduce a point reconstruction network employing equirectangular projection tailored for multimodal 3D object detection. This network incorporates a range-constrained noise filter to remove noise caused by adverse weather and an object-centric point generator designed to flexibly generate points for distant objects. Moreover, we propose a dual 2D auxiliary module to enhance image features and support the point reconstruction. Experimental evaluations conducted on adverse weather datasets demonstrate that the suggested approach surpasses current techniques. The implementation can be accessed at https://github.com/jhyoon964/oprnet.
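For illustration, an equirectangular projection of a LiDAR point cloud maps each point's azimuth and elevation to pixel coordinates of a range image; the image size and field of view below are hypothetical and not the network's actual configuration.

    import numpy as np

    def equirectangular_project(points, width=1024, height=64, fov_up=15.0, fov_down=-15.0):
        # Map 3D points (N, 3) to (row, col) pixels of a range image via azimuth/elevation.
        x, y, z = points[:, 0], points[:, 1], points[:, 2]
        r = np.linalg.norm(points, axis=1)
        azimuth = np.arctan2(y, x)  # [-pi, pi]
        elevation = np.arcsin(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))
        fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
        col = ((azimuth + np.pi) / (2 * np.pi)) * width
        row = (1.0 - (elevation - fov_down_r) / (fov_up_r - fov_down_r)) * height
        return np.clip(row, 0, height - 1).astype(int), np.clip(col, 0, width - 1).astype(int), r

    pts = np.array([[10.0, 0.0, 1.0], [5.0, 5.0, -0.5]])
    print(equirectangular_project(pts))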
|
|
16:40-16:45, Paper TuDT4.2 | Add to My Program |
Hierarchical Spatiotemporal Fusion for Event-Visible Object Detection |
|
Jhong, Sin-Ye | Tamkang University |
Lin, Hsin-Chun | National Taiwan University of Science and Technology |
Liu, Tzu-Chi | National Taiwan University of Science and Technology |
Hua, Kai-Lung | National Taiwan University of Science and Technology |
Chen, Yung-Yao | National Taiwan University of Science and Technology |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning Methods, Visual Learning
Abstract: Traditional visible light cameras are prone to performance degradation under varying weather and lighting conditions. To address this challenge, we introduce an event-based camera and propose a novel hierarchical spatiotemporal fusion approach for event-visible object detection. Our method enhances detection performance by integrating data from both event-based and visible light cameras. We have designed three key modules: the Gated Event Accumulation Representation (GEAR) module, the Temporal Feature Selection (TFS) module, and the Adaptive Fusion (AF) module. GEAR and TFS enhance temporal feature fusion at both the image and feature levels, while AF effectively integrates multi-modal features with low computational complexity. Our approach has been trained and validated on the publicly available DSEC-Detection dataset, achieving mAP50 and mAP50-95 scores of 67.2% and 45.6%, respectively, demonstrating superior detection performance and validating the effectiveness of the proposed method.
|
|
16:45-16:50, Paper TuDT4.3 | Add to My Program |
Dark-DENet: A Lightweight Enhancement Network for Low-Light Object Detection |
|
Wu, Xiaoyu | China University of Geoscience |
Shao, Yuxiang | China University of Geoscience |
Jin, Xinyu | China University of Geoscience |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, Recognition
Abstract: Deep learning-based object detection methods have shown significant success, particularly in robotic vision tasks like autonomous navigation and object manipulation. However, their performance drops sharply in low-light conditions, challenging robots in poorly lit environments. To address this, we propose Dark-DENet, a lightweight detection-driven enhancement network specifically designed for low-light conditions. Dark-DENet introduces an Improved Global Enhancement Module for low-frequency components to capture multiscale features, and a multi-layer convolutional structure in the Detail Enhancement Module to enhance high-frequency components. Additionally, the Scale-Aware Pooling Fusion Module enriches the semantic information of the high-frequency components. Dark-DENet is a plug-and-play network that can be easily integrated into the backbone of various detectors for joint training. Integrated with YOLOv5 as DD-YOLO, and combined with other models including the YOLO series, RT-DETR, RetinaNet, and Faster R-CNN, experimental results show that Dark-DENet consistently improves detection performance across all models. It effectively enhances latent features under limited runtime, making it a robust solution for robotic vision in low-light environments.
|
|
16:50-16:55, Paper TuDT4.4 | Add to My Program |
CubeDN: Real-Time Drone Detection in 3D Space from Dual mmWave Radar Cubes |
|
Fang, Yuan | University College London |
Shi, Fangzhan | University College London |
Wei, Xijia | University College London |
Chen, Qingchao | Peking University |
Chetty, Kevin | University College London |
Julier, Simon | University College London |
Keywords: Object Detection, Segmentation and Categorization, Sensor Fusion, Surveillance Robotic Systems
Abstract: As drone use has become more widespread, there is a critical need to ensure safety and security. A key element of this is robust and accurate drone detection and localization. While cameras and other optical sensors like LiDAR are commonly used for object detection, their performance degrades under adverse lighting and environmental conditions. Therefore, this has generated interest in finding more reliable alternatives, such as millimeter-wave (mmWave) radar. Recent research on mmWave radar object detection has predominantly focused on 2D detection of road users. Although these systems demonstrate excellent performance for 2D problems, they lack the sensing capability to measure elevation, which is essential for 3D drone detection. To address this gap, we propose CubeDN, a single-stage end-to-end radar object detection network specifically designed for flying drones. CubeDN overcomes challenges such as poor elevation resolution by utilizing a dual radar configuration and a novel deep learning pipeline. It simultaneously detects, localizes, and classifies drones of two sizes, achieving decimeter-level tracking accuracy at closer ranges with overall 95% average precision (AP) and 85% average recall (AR). Furthermore, CubeDN completes data processing and inference at 10Hz, making it highly suitable for practical applications.
|
|
16:55-17:00, Paper TuDT4.5 | Add to My Program |
CA-IoU: Central-Gaussian Angle-IoU for Robust Bounding Box Regression |
|
Jang, Junbo | Chung-Ang University |
Kim, Dohoon | Chung-Ang University |
Paik, Joonki | Chung-Ang University |
Keywords: Object Detection, Segmentation and Categorization
Abstract: Accurate object detection depends on the precise refinement of bounding box regression. Recent advancements in bounding box regression have introduced a variety of methodologies aimed at reducing the disparity between predicted and ground truth bounding boxes. The prevailing objective functions for bounding box regression typically encompass three key perspectives: 1) Intersection over Union (IoU), 2) distance between central points, and 3) aspect ratio alignment. Nonetheless, these existing loss functions encounter two primary challenges: slow convergence of the distance term and aspect ratio variation that is irrelevant to bounding box localization. This paper presents two novel loss terms to address these challenges. Firstly, we introduce the concept of the Integral of Central-Gaussian, a novel approach that leverages the cumulative distribution function (CDF) derived from a closed-form Gaussian distribution based on the central points of bounding boxes. Secondly, we introduce an alternative aspect ratio representation by minimizing the angle between two bounding boxes in direct proportion to their IoU. We term this comprehensive loss function "Central-Gaussian Angle-IoU" (CA-IoU), seamlessly incorporating the Integral of Central-Gaussian with angle-based IoU. Extensive experiments on various models and benchmarks for object detection highlight the superior performance of the CA-IoU loss compared to existing bounding box regression methods. The source code and the corresponding trained models will be made available.
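The sketch below combines, in a simplified form, the two ingredients named in the abstract: a Gaussian CDF over the centre distance and an angle term scaled by IoU. The weights and normalizations are hypothetical and do not reproduce the paper's exact loss.

    import math

    def center_gaussian_term(cx_pred, cy_pred, cx_gt, cy_gt, sigma=10.0):
        # CDF of a zero-mean Gaussian evaluated at the centre distance: grows smoothly
        # from 0.5 towards 1 as the predicted centre drifts away from the ground truth.
        d = math.hypot(cx_pred - cx_gt, cy_pred - cy_gt)
        return 0.5 * (1.0 + math.erf(d / (sigma * math.sqrt(2.0))))

    def angle_term(w_pred, h_pred, w_gt, h_gt):
        # Angle between the two boxes' diagonal directions as an aspect-ratio mismatch proxy.
        return abs(math.atan2(h_pred, w_pred) - math.atan2(h_gt, w_gt))

    def ca_iou_style_loss(iou, box_pred, box_gt, lam=1.0):
        (cx_p, cy_p, w_p, h_p), (cx_g, cy_g, w_g, h_g) = box_pred, box_gt
        center = center_gaussian_term(cx_p, cy_p, cx_g, cy_g) - 0.5
        angle = iou * angle_term(w_p, h_p, w_g, h_g)  # angle penalty scaled by IoU
        return (1.0 - iou) + center + lam * angle

    print(ca_iou_style_loss(0.6, (50, 52, 20, 40), (48, 50, 22, 36)))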
|
|
17:00-17:05, Paper TuDT4.6 | Add to My Program |
On Onboard LiDAR-Based Flying Object Detection |
|
Vrba, Matous | Faculty of Electrical Engineering, Czech Technical University In |
Walter, Viktor | Czech Technical University |
Pritzl, Vaclav | Czech Technical University in Prague |
Pliska, Michal | Czech Technical University in Prague, Faculty of Electrical Engi |
Baca, Tomas | Czech Technical University in Prague FEE |
Spurny, Vojtech | Ceske Vysoke Uceni Technicke V Praze, FEL |
Hert, Daniel | Czech Technical University in Prague |
Saska, Martin | Czech Technical University in Prague |
Keywords: Aerial Systems: Perception and Autonomy, Multi-Robot Systems, Object Detection, Segmentation and Categorization, Autonomous Aerial Interception
Abstract: A new robust and accurate approach for the detection and localization of flying objects for the purpose of highly dynamic aerial interception and agile multi-robot interaction is presented in this paper. The approach is proposed for use on board autonomous aerial vehicles equipped with a 3D LiDAR sensor. It relies on a novel 3D occupancy voxel mapping method for target detection that provides high localization accuracy and robustness with respect to varying environments and appearance changes of the target. In combination with a proposed cluster-based multi-target tracker, sporadic false positives are suppressed, state estimation of the target is provided, and the detection latency is negligible. This makes the system suitable for tasks of agile multi-robot interaction, such as autonomous aerial interception or formation control, where fast, precise, and robust relative localization of other robots is crucial. We evaluate the viability and performance of the system in simulated and real-world experiments, which demonstrate that at a range of 20 m, our system is capable of reliably detecting a micro-scale UAV with almost 100% recall, 0.2 m accuracy, and 20 ms delay.
|
|
TuDT5 Regular Session, 305 |
Add to My Program |
Aerial Robots: Planning and Control |
|
|
Chair: Zheng, Minghui | Texas A&M University |
Co-Chair: Faigl, Jan | Czech Technical University in Prague |
|
16:35-16:40, Paper TuDT5.1 | Add to My Program |
Improving Disturbance Estimation and Suppression Via Learning among Systems with Mismatched Dynamics |
|
Modi, Harsh Jashvantbhai | Texas A&M University |
Chen, Zhu | University at Buffalo |
Liang, Xiao | Texas A&M University |
Zheng, Minghui | Texas A&M University |
Keywords: Aerial Systems: Applications, Motion Control
Abstract: Iterative learning control (ILC) is a method for reducing system tracking or estimation errors over multiple iterations by using information from past iterations. The disturbance observer (DOB) is used to estimate and mitigate disturbances within the system, while the system is being affected by them. ILC enhances system performance by introducing a feedforward signal in each iteration. However, its effectiveness may diminish if the conditions change during the iterations. On the other hand, although DOB effectively mitigates the effects of new disturbances, it cannot entirely eliminate them as it operates reactively. Therefore, neither ILC nor DOB alone can ensure sufficient robustness in challenging scenarios. This study focuses on the simultaneous utilization of ILC and DOB to enhance system robustness. The proposed methodology specifically targets dynamically different linearized systems performing repetitive tasks. The systems share similar forms but differ in dynamics (e.g. sizes, masses, and controllers). Consequently, the design of learning filters must account for these differences in dynamics. To validate the approach, the study establishes a theoretical framework for designing learning filters in conjunction with DOB. The validity of the framework is then confirmed through numerical studies and experimental tests conducted on unmanned aerial vehicles (UAVs). Although UAVs are nonlinear systems, the study employs a linearized controller as they operate in proximity to the hover condition.
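For intuition, the core of ILC is the iteration-domain feedforward update u_{k+1} = Q(u_k + L e_k), which can be combined with a reactive disturbance estimate inside each iteration. The scalar sketch below uses illustrative gains and a trivial plant, not the learning filters designed in the paper.

    import numpy as np

    # Scalar sketch of ILC with a reactive disturbance estimate. The plant y = u + d is a
    # stand-in; L, Q, and the observer gain are illustrative, not the designed filters.
    N, iterations = 50, 8
    L_gain, Q_gain, dob_gain = 0.8, 1.0, 0.5
    reference = np.ones(N)
    disturbance = 0.3 * np.sin(np.linspace(0, 2 * np.pi, N))

    u_ff = np.zeros(N)  # iteration-domain feedforward (ILC)
    for k in range(iterations):
        d_hat = 0.0
        y = np.zeros(N)
        for t in range(N):
            u = u_ff[t] - d_hat                        # disturbance compensation within the iteration
            y[t] = u + disturbance[t]
            d_hat += dob_gain * ((y[t] - u) - d_hat)   # crude reactive disturbance estimate
        error = reference - y
        u_ff = Q_gain * (u_ff + L_gain * error)        # ILC update between iterations
        print(f"iteration {k}: RMS error = {np.sqrt(np.mean(error**2)):.4f}")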
|
|
16:40-16:45, Paper TuDT5.2 | Add to My Program |
Learning Speed Adaptation for Flight in Clutter |
|
Zhao, Guangyu | Zhejiang University |
Wu, Tianyue | Zhejiang University |
Chen, Yeke | Zhejiang University |
Gao, Fei | Zhejiang University |
Keywords: Aerial Systems: Applications, Motion and Path Planning, Reinforcement Learning
Abstract: Animals learn to adapt the speed of their movements to their capabilities and the environment they observe. Mobile robots should likewise be able to trade off aggressiveness and safety to accomplish tasks efficiently. The aim of this work is to endow flight vehicles with the ability to adapt their speed in previously unknown, partially observable cluttered environments. We propose a hierarchical learning and planning system that combines well-established model-based trajectory generation with trial-and-error learning of a policy that dynamically configures the speed constraint. Technically, we use online reinforcement learning to obtain the deployable policy. Statistical results in simulation demonstrate the advantages of our method over constant-speed-constraint baselines and an alternative method in terms of flight efficiency and safety. In particular, the policy exhibits perception awareness, which distinguishes it from alternative approaches. By deploying the policy to hardware, we verify that these advantages carry over to the real world.
|
|
16:45-16:50, Paper TuDT5.3 | Add to My Program |
Design, Contact Modeling, and Collision-Inclusive Planning of a Dual-Stiffness Aerial RoboT (DART) |
|
Kumar, Yogesh | Arizona State University |
Patnaik, Karishma | Arizona State University |
Zhang, Wenlong | Arizona State University |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications, Motion Control
Abstract: Collision-resilient quadrotors have gained significant attention for operating in cluttered environments and leveraging impacts to perform agile maneuvers. However, existing designs are typically single-mode: either safeguarded by propeller guards that prevent deformation or deformable but lacking rigidity, which is crucial for stable flight in open environments. This paper introduces DART, a Dual-stiffness Aerial RoboT, that adapts its post-collision response by either engaging a locking mechanism for a rigid mode or disengaging it for a flexible mode, respectively. Comprehensive characterization tests highlight the significant difference in post-collision responses between its rigid and flexible modes, with the rigid mode offering seven times higher stiffness compared to the flexible mode. To understand and harness the collision dynamics, we propose a novel collision response prediction model based on the linear complementarity system theory. We demonstrate the accuracy of predicting collision forces for both the rigid and flexible modes of DART. Experimental results confirm the accuracy of the model and underscore its potential to advance collision-inclusive trajectory planning in aerial robotics.
|
|
16:50-16:55, Paper TuDT5.4 | Add to My Program |
Learning Quadrotor Control from Visual Features Using Differentiable Simulation |
|
Heeg, Johannes | University of Zürich |
Song, Yunlong | University of Zurich |
Scaramuzza, Davide | University of Zurich |
Keywords: Aerial Systems: Mechanics and Control, Machine Learning for Robot Control, Reinforcement Learning
Abstract: The sample inefficiency of reinforcement learning (RL) remains a significant challenge in robotics. RL requires large-scale simulation and can still lead to long training times, slowing research and innovation. This issue is particularly pronounced in vision-based control tasks where reliable state estimates are not accessible. Differentiable simulation offers an alternative by enabling gradient back-propagation through the dynamics model, providing low-variance analytical policy gradients and, hence, higher sample efficiency. However, its use for real-world robotic tasks has so far been limited. This work demonstrates the great potential of differentiable simulation for learning quadrotor control. We show that training in differentiable simulation significantly outperforms model-free RL in terms of both sample efficiency and training time, allowing a policy to learn to recover a quadrotor in seconds when vehicle states are provided and in minutes when relying solely on visual features. The key to our success is two-fold. First, the use of a simple surrogate model for gradient computation greatly accelerates training without sacrificing control performance. Second, combining state representation learning with policy learning enhances convergence speed in tasks where only visual features are observable. These findings highlight the potential of differentiable simulation for real-world robotics and offer a compelling alternative to conventional RL approaches.
|
|
16:55-17:00, Paper TuDT5.5 | Add to My Program |
Real-Time Planning of Minimum-Time Trajectories for Agile UAV Flight |
|
Teissing, Krystof | Czech Technical University in Prague |
Novosad, Matej | Czech Technical University in Prague |
Penicka, Robert | Czech Technical University in Prague |
Saska, Martin | Czech Technical University in Prague |
Keywords: Aerial Systems: Applications, Motion and Path Planning
Abstract: We address the challenge of real-time planning of minimum-time trajectories over multiple waypoints onboard multirotor UAVs. Previous works demonstrated that achieving a truly time-optimal trajectory is computationally too demanding to enable frequent replanning during agile flight, especially on less powerful flight computers. Our approach overcomes this stumbling block by utilizing a point-mass model with a novel iterative thrust decomposition algorithm, enabling the UAV to use all of its collective thrust, something previous point-mass approaches could not achieve. The approach enables the integration of gravity and drag modeling, significantly reducing tracking errors in high-speed trajectories, which is proven through an ablation study. When combined with a new multi-waypoint optimization algorithm, which uses a gradient-based method to converge to optimal velocities at the waypoints, the proposed method generates minimum-time multi-waypoint trajectories within milliseconds. The proposed approach, which we provide as an open-source package, is validated both in simulation and in the real world using Nonlinear Model Predictive Control. With accelerations of up to 3.5 g and speeds over 100 km/h, trajectories generated by the proposed method yield similar or even smaller tracking errors than the trajectories generated for a full multirotor model.
|
|
17:00-17:05, Paper TuDT5.6 | Add to My Program |
Variable Time-Step MPC for Agile Multi-Rotor UAV Interception of Dynamic Targets |
|
Ghotavadekar, Atharva | BITS Pilani K.K.Birla Goa Campus |
Nekovar, Frantisek | Czech Technical University in Prague |
Saska, Martin | Czech Technical University in Prague |
Faigl, Jan | Czech Technical University in Prague |
Keywords: Aerial Systems: Applications, Motion and Path Planning, Autonomous Vehicle Navigation
Abstract: Agile trajectory planning can improve the efficiency of multi-rotor Uncrewed Aerial Vehicles (UAVs) in scenarios with combined task-oriented and kinematic trajectory planning, such as monitoring spatio-temporal phenomena or intercepting dynamic targets. Agile planning using existing non-linear model predictive control methods is limited by the number of planning steps as it becomes increasingly computationally demanding. This reduces the prediction horizon length, which leads to a decrease in solution quality. Besides, the fixed time-step length limits the utilization of the available UAV dynamics in the target neighbourhood. In this paper, we propose to address these limitations by introducing variable time-steps and coupling them with the prediction horizon length. A simplified point-mass motion primitive is used to leverage the differential flatness of quadrotor dynamics and the generation of feasible trajectories in the flat output space. Based on evaluation results and experimentally validated deployment, the proposed method increases the solution quality by enabling planning for long flight segments but allowing tightly sampled maneuvering.
|
|
TuDT6 Regular Session, 307 |
Add to My Program |
Perception for Medical Robotics |
|
|
Chair: Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Co-Chair: Valdastri, Pietro | University of Leeds |
|
16:35-16:40, Paper TuDT6.1 | Add to My Program |
REMOTE: Real-Time Ego-Motion Tracking for Various Endoscopes Via Multimodal Visual Feature Learning |
|
Shao, Liangjing | Fudan University |
Chen, Benshuang | Fudan University |
Zhao, Shuting | Fudan University |
Chen, Xinrong | Fudan University |
Keywords: Computer Vision for Medical Robotics, Deep Learning for Visual Perception, Visual Tracking
Abstract: Real-time ego-motion tracking for endoscopes is a significant task for efficient navigation and robotic automation of endoscopy. In this paper, a novel framework is proposed to perform real-time ego-motion tracking for endoscopes. Firstly, a multi-modal visual feature learning network is proposed to perform relative pose prediction, in which the motion feature from the optical flow, the scene features, and the joint feature from two adjacent observations are all extracted for the prediction. Because the channel dimension of the concatenated image carries more correlation information, a novel feature extractor is designed based on an attention mechanism to integrate multi-dimensional information from the concatenation of two continuous frames. To extract a more complete feature representation from the fused features, a novel pose decoder is proposed to predict the pose transformation from the concatenated feature map at the end of the framework. Finally, the absolute pose of the endoscope is calculated from the relative poses. Experiments are conducted on three datasets covering various endoscopic scenes, and the results demonstrate that the proposed method outperforms state-of-the-art methods. Besides, the inference speed of the proposed method is over 30 frames per second, which meets the real-time requirement. The project page is here: https://remote-bmxs.netlify.app.
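The final step described above, turning relative poses into an absolute trajectory, is a chain of homogeneous-transform compositions; the sketch below uses hypothetical frame-to-frame motions and is not the paper's network output.

    import numpy as np

    def compose_trajectory(T_init, relative_poses):
        # Chain relative 4x4 homogeneous transforms into absolute endoscope poses:
        # T_k = T_{k-1} @ dT_k, starting from a known initial pose.
        poses = [T_init]
        for dT in relative_poses:
            poses.append(poses[-1] @ dT)
        return poses

    def make_T(yaw, t):
        # Build a simple planar-rotation transform with translation t (illustrative only).
        c, s = np.cos(yaw), np.sin(yaw)
        T = np.eye(4)
        T[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
        T[:3, 3] = t
        return T

    # Two hypothetical frame-to-frame motions predicted by a relative-pose network.
    rel = [make_T(0.05, [0.002, 0.0, 0.01]), make_T(0.03, [0.001, 0.0, 0.008])]
    for T in compose_trajectory(np.eye(4), rel):
        print(T[:3, 3])  # absolute positions along the trajectory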
|
|
16:40-16:45, Paper TuDT6.2 | Add to My Program |
Intraoperative Trocar-Based Eyeball Rotation Estimation Using Only 2D Microscope Images |
|
Yang, Junjie | TUM |
Inagaki, Satoshi | NSK.Ltd |
Zhao, Zhihao | Technische Universität München |
Zapp, Daniel | Klinikum Rechts Der Isar Der TU München |
Maier, Mathias | Klinikum Rechts Der Isar Der TU München |
Issa, Peter Charbel | Klinikum Rechts Der Isar, Technical University of Munich |
Huang, Kai | Sun Yat-Sen University |
Navab, Nassir | TU Munich |
Nasseri, M. Ali | Technische Universitaet Muenchen |
Keywords: Computer Vision for Medical Robotics, Visual Tracking, Recognition
Abstract: In ophthalmic surgery, surgeons or robots manipulate a light probe and an instrument around two separate trocars following sclerotomy to achieve orbital control for eyeball pose adjustment and subsequent surgical tasks referenced to microscope frames. However, current methods face significant challenges in directly extracting the eyeball pose from real-time microscope frames due to the limited microscope perspective and the darkened operating room (OR). This paper decomposes eyeball rotations along only the x and y axes. A method for calculating eyeball poses using eyeball geometry and microscopic trocar positions is then presented. The method is tested in simulation and on a phantom system, currently achieving an error within [-2.0, 2.8] degrees, and provides assistive intraoperative eyeball status in the dark OR, together with an extended discussion of the method.
|
|
16:45-16:50, Paper TuDT6.3 | Add to My Program |
Toward Zero-Shot Learning for Visual Dehazing of Urological Surgical Robots |
|
Wu, Renkai | Shanghai University |
Wang, Xianjin | Ruijin Hospital, Shanghai Jiaotong University School of Medicine |
Liang, Pengchen | Ruijin Hospital, Shanghai Jiaotong University School of Medicine |
Zhang, Zhenyu | Nanjing University |
Chang, Qing | Ruijin Hospital, Shanghai Jiao Tong University School of Medicin |
Tang, Hao | Peking University |
Keywords: Computer Vision for Medical Robotics, Data Sets for Robotic Vision, Robotics and Automation in Life Sciences
Abstract: Robot-assisted surgery has profoundly influenced current forms of minimally invasive surgery. However, transurethral urological surgical robots need to work in a liquid environment. This causes vaporization of the liquid when shearing and heating are performed, resulting in bubble atomization that affects the visual perception of the robot. This can require repeated pauses in the surgical procedure, which prolongs the surgery. To address the atomization characteristics of liquids under urological surgical robotic vision, we propose an unsupervised zero-shot dehazing method (RSF-Dehaze). Specifically, the proposed Region Similarity Filling Module (RSFM) of RSF-Dehaze significantly improves the recovery of blurred tissue regions. In addition, we organize and propose a dehazing dataset for robotic vision in urological surgery (the USRobot-Dehaze dataset). In particular, this dataset contains the three most common urological surgical robot operation scenarios. To the best of our knowledge, we are the first to organize and release a publicly available dehazing dataset for urological surgical robot vision. Extensive comparative experiments against 20 classical and state-of-the-art dehazing and image restoration algorithms demonstrate the effectiveness of RSF-Dehaze in the three urological surgical robot operation scenarios. The proposed source code and dataset are available at https://github.com/wurenkai/RSF-Dehaze.
|
|
16:50-16:55, Paper TuDT6.4 | Add to My Program |
Sim2Real within 5 Minutes: Efficient Domain Transfer with Stylized Gaussian Splatting for Endoscopic Images |
|
Wu, Junyang | Shanghai Jiao Tong University |
Gu, Yun | Shanghai Jiao Tong University |
Yang, Guang-Zhong | Shanghai Jiao Tong University |
Keywords: Computer Vision for Medical Robotics, Surgical Robotics: Laparoscopy, Surgical Robotics: Planning
Abstract: Robot-assisted endoluminal intervention is an emerging technique for both benign and malignant luminal lesions. With vision-based navigation, when combined with pre-operative imaging data as priors, it is possible to recover the position and pose of the endoscope without the need for additional sensors. In practice, however, aligning the pre-operative and intra-operative domains is complicated by significant texture differences. Although methods such as style transfer can be used to address this issue, they require large datasets from both the source and target domains and prolonged training times. This paper proposes an efficient domain transfer method based on stylized Gaussian splatting that requires only a few real images (10 images) and a very short training time. Specifically, the transfer process includes two phases. In the first phase, 3D models reconstructed from CT scans are represented as differential Gaussian point clouds. In the second phase, only color-appearance-related parameters are optimized to transfer the style and preserve the visual content. A novel structure consistency loss is applied to latent features and depth levels to enhance the stability of the transferred images. Detailed validation was performed to demonstrate the performance advantages of the proposed method compared to the current state-of-the-art, highlighting its potential for intra-operative surgical navigation.
|
|
16:55-17:00, Paper TuDT6.5 | Add to My Program |
Advancing Dense Endoscopic Reconstruction with Gaussian Splatting-Driven Surface Normal-Aware Tracking and Mapping |
|
Huang, Yiming | The Chinese University of Hong Kong |
Cui, Beilei | The Chinese University of Hong Kong |
Bai, Long | The Chinese University of Hong Kong |
Chen, Zhen | Centre for Artificial Intelligence and Robotics (CAIR), Hong Kon |
Wu, Jinlin | Institute of Automation, Chinese Academy of Sciences |
Li, Zhen | Qilu Hospital of Shandong University |
Liu, Hongbin | Hong Kong Institute of Science & Innovation, Chinese Academy Of |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: SLAM, Surgical Robotics: Laparoscopy, Computer Vision for Medical Robotics
Abstract: Simultaneous Localization and Mapping (SLAM) is essential for precise surgical interventions and robotic tasks in minimally invasive procedures. While recent advancements in 3D Gaussian Splatting (3DGS) have improved SLAM with high-quality novel view synthesis and fast rendering, these systems struggle with accurate depth and surface reconstruction due to multi-view inconsistencies. Simply incorporating SLAM and 3DGS leads to mismatches between the reconstructed frames. In this work, we present Endo-2DTAM, a real-time endoscopic SLAM system with 2D Gaussian Splatting (2DGS) to address these challenges. Endo-2DTAM incorporates a surface normal-aware pipeline, which consists of tracking, mapping, and bundle adjustment modules for geometrically accurate reconstruction. Our robust tracking module combines point-to-point and point-to-plane distance metrics, while the mapping module utilizes normal consistency and depth distortion to enhance surface reconstruction quality. We also introduce a pose-consistent strategy for efficient and geometrically coherent keyframe sampling. Extensive experiments on public endoscopic datasets demonstrate that Endo-2DTAM achieves an RMSE of 1.87±0.63 mm for depth reconstruction of surgical scenes while maintaining computationally efficient tracking, high-quality visual appearance, and real-time rendering. Our code will be released at github.com/lastbasket/Endo-2DTAM.
|
|
17:00-17:05, Paper TuDT6.6 | Add to My Program |
HFUS-NeRF: Hybrid Representation for Fast Ultrasound Reconstruction in Robotic Ultrasound System |
|
Zhang, Shuai | Hefei University of Technology |
Zhao, Cancan | Hefei University of Technology |
Ouyang, Bo | Hefei University of Technology |
Keywords: Health Care Management, Computer Vision for Medical Robotics, AI-Enabled Robotics
Abstract: Telemedicine is promising in digital healthcare management, for example in supporting the response to the coronavirus disease 2019 (COVID-19) pandemic. Three-dimensional (3D) ultrasound reconstruction and novel view image synthesis, which can assist in diagnosis and reexamination, have significant potential in tele-ultrasound, especially when integrated with robotic ultrasound systems (RUSS). The Neural Radiance Field (NeRF), an impressive reconstruction method, requires long training times, limiting its practicality in ultrasound. Although NeRF variants achieve faster optimization, their performance remains confined to natural scene reconstruction. To address this limitation, we propose HFUS-NeRF, a hybrid representation method designed for fast and accurate ultrasound reconstruction. HFUS-NeRF integrates multi-resolution hash-grid and tri-plane representations to represent each sampling point of the ultrasonic wave. A unified model for sampling points from different ultrasonic probes is presented to simulate the wave's propagation through tissues, and the final ultrasound image is rendered using volume rendering. Compared with NeRF-based ultrasound reconstruction, both the hash-grid and tri-plane resolutions can be scaled up more efficiently, improving reconstruction speed. Experimental results demonstrate that HFUS-NeRF enhances reconstruction quality while significantly reducing reconstruction time to mere minutes. Furthermore, we validated the adaptability of HFUS-NeRF by reconstructing from images acquired with different types of ultrasound probes, and real-world experiments confirmed its feasibility and transferability, enabling fast ultrasound reconstruction on human subjects.
|
|
TuDT7 Regular Session, 309 |
Add to My Program |
Marine Robotics 2 |
|
|
Chair: Sukhatme, Gaurav | University of Southern California |
Co-Chair: Englot, Brendan | Stevens Institute of Technology |
|
16:35-16:40, Paper TuDT7.1 | Add to My Program |
Mission-Oriented Gaussian Process Motion Planning for UUVs Over Complex Seafloor Terrain and Current Flows |
|
Huang, Yewei | Stevens Institute of Technology |
Lin, Xi | Stevens Institute of Technology |
Hernandez-Rocha, Mariana | Stevens Institute of Technology |
Narain, Sanjai | Peraton Labs |
Pochiraju, Kishore | Stevens Institute of Technology |
Englot, Brendan | Stevens Institute of Technology |
Keywords: Marine Robotics, Motion and Path Planning, Constrained Motion Planning
Abstract: We present a novel motion planning framework for unmanned underwater vehicles (UUVs) - the first framework that applies Gaussian process motion planning to solve a 3D path planning problem for a 6-DoF robot in underwater environments. We address missions requiring UUVs to remain in close proximity to seafloor terrain, which must be achieved alongside collision avoidance. Our framework also considers the influence of current flows as part of the cost function, allowing for more accurate planning. To evaluate the performance of our proposed framework, we compare it with the widely used RRT* and STOMP algorithms over a range of underwater environments. Our experimental results demonstrate the stability and time efficiency of our framework.
|
|
16:40-16:45, Paper TuDT7.2 | Add to My Program |
Word2Wave: Language Driven Mission Programming for Efficient Subsea Deployments of Marine Robots |
|
Chen, Ruo | University of Florida |
Blow, David | University of Florida |
Abdullah, Adnan | University of Florida |
Islam, Md Jahidul | University of Florida |
Keywords: Marine Robotics, Field Robots
Abstract: This paper explores the design and development of a language-based interface for dynamic mission programming of autonomous underwater vehicles (AUVs). The proposed "Word2Wave" (W2W) framework enables interactive programming and parameter configuration of AUVs for remote subsea missions. The W2W framework includes: (i) a set of novel language rules and command structures for efficient language-to-mission mapping; (ii) a GPT-based prompt engineering module for training data generation; (iii) a small language model (SLM)-based sequence-to-sequence learning pipeline for mission command generation from human speech or text; and (iv) a novel user interface for 2D mission map visualization and human-machine interfacing. The proposed learning pipeline adapts an SLM named T5-Small that can learn language-to-mission mapping from processed language data effectively, providing robust and efficient performance. In addition to a benchmark evaluation with state-of-the-art, we conduct a user interaction study to demonstrate the effectiveness of W2W over commercial AUV programming interfaces. Across participants, W2W-based programming required less than 10% time for mission programming compared to traditional interfaces; it is deemed to be a simpler and more natural paradigm for subsea mission programming with a usability score of 76.25. W2W opens up promising future research opportunities on hands-free AUV mission programming for efficient subsea deployments.
|
|
16:45-16:50, Paper TuDT7.3 | Add to My Program |
Three-Dimensional Obstacle Avoidance and Path Planning for Unmanned Underwater Vehicles Using Elastic Bands (I) |
|
Amundsen, Herman Biørn | NTNU |
Føre, Martin | NTNU |
Ohrem, Sveinung Johan | SINTEF Ocean AS |
Haugaløkken, Bent | SINTEF Ocean |
Kelasidi, Eleni | NTNU |
Keywords: Collision Avoidance, Path Planning for Multiple Mobile Robots or Agents, Field Robots
Abstract: Unmanned underwater vehicles (UUVs) have become indispensable tools for inspection, maintenance, and repair (IMR) operations in the underwater domain. The major focus and novelty of this work is collision-free autonomous navigation of UUVs in dynamically changing environments. Path planning and obstacle avoidance are fundamental concepts for enabling autonomy for mobile robots. This remains a challenge, particularly for underwater vehicles operating in complex and dynamically changing environments. The elastic band method has been a suggested method for planning collision-free paths and is based on modeling the path as a dynamic system that will continuously be reshaped based on its surroundings. This article proposes adaptations to the method for underwater applications and presents a thorough investigation of the method for 3-D path planning and obstacle avoidance, both through simulations and extensive lab and field experiments. In the experiments, the method was used by a UUV operating autonomously at an industrial-scale fish farm and demonstrated that the method was able to successfully guide the vehicle through a challenging and constantly changing environment. The proposed work has broad applications for field deployment of marine robots in environments that require the vehicle to quickly react to changes in its surroundings.
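For illustration, one deformation step of an elastic band balances an internal tension force with an external obstacle repulsion; the 2D sketch below uses a single circular obstacle and hypothetical gains, not the 3D underwater adaptation proposed in the paper.

    import numpy as np

    def elastic_band_step(path, obstacle, safe_dist, k_int=0.5, k_ext=1.0, step=0.2):
        # One deformation step: interior nodes feel a tension force pulling them toward the
        # midpoint of their neighbours and a repulsive force pushing them away from obstacles.
        new_path = path.copy()
        for i in range(1, len(path) - 1):
            tension = k_int * (0.5 * (path[i - 1] + path[i + 1]) - path[i])
            diff = path[i] - obstacle
            dist = np.linalg.norm(diff)
            repulsion = np.zeros(2)
            if dist < safe_dist:
                repulsion = k_ext * (safe_dist - dist) * diff / max(dist, 1e-9)
            new_path[i] = path[i] + step * (tension + repulsion)
        return new_path

    path = np.linspace([0.0, 0.0], [10.0, 0.0], 11)  # straight initial band
    obstacle = np.array([5.0, 0.2])
    for _ in range(30):
        path = elastic_band_step(path, obstacle, safe_dist=1.5)
    print(path[5])  # the middle node has been pushed away from the obstacle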
|
|
16:50-16:55, Paper TuDT7.4 | Add to My Program |
A Data-Driven Velocity Estimator for Autonomous Underwater Vehicles Experiencing Unmeasurable Flow and Wave Disturbance |
|
Cai, Jinzhi | Hong Kong University of Science and Technology |
Mayberry, Scott | Georgia Institute of Technology |
Yin, Huan | Hong Kong University of Science and Technology |
Zhang, Fumin | Hong Kong University of Science and Technology |
Keywords: Marine Robotics, AI-Based Methods, Software-Hardware Integration for Robot Systems
Abstract: Autonomous Underwater Vehicles (AUVs) encounter significant challenges in confined spaces like ports and testing tanks, where vehicle-environment interactions, such as wave reflections and unsteady flows, introduce complex, time-varying disturbances. Model-based state estimation methods can struggle to handle these dynamics, leading to localization errors. To address this, we propose a data-driven velocity estimation approach using Inertial Measurement Units (IMUs) and a Gated Recurrent Unit (GRU) neural network, capturing temporal dependencies and rejecting external disturbances. This velocity estimator is then integrated into a sensor fusion framework using an asynchronous Kalman filter to improve localization by fusing on-board and off-board sensor information. Experimental validation on miniature AUVs demonstrates the effectiveness of the proposed method in enhancing accuracy for velocity and position estimation in environments with significant disturbances due to interactions between the vehicle and the environment.
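Fusing a learned velocity and its predicted uncertainty into a Kalman filter amounts to a standard measurement update in which the network's variance enters as measurement noise; the small sketch below uses hypothetical numbers and a two-state model, not the paper's asynchronous filter.

    import numpy as np

    def kf_velocity_update(x, P, v_meas, v_var):
        # Measurement update for a [position, velocity] state when a learned velocity
        # estimate arrives with its predicted variance. H selects the velocity component.
        H = np.array([[0.0, 1.0]])
        R = np.array([[v_var]])
        y = np.array([v_meas]) - H @ x           # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
        x_new = x + (K @ y).flatten()
        P_new = (np.eye(2) - K @ H) @ P
        return x_new, P_new

    x = np.array([1.2, 0.4])                     # position, velocity
    P = np.diag([0.5, 0.3])
    print(kf_velocity_update(x, P, v_meas=0.55, v_var=0.02))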
|
|
16:55-17:00, Paper TuDT7.5 | Add to My Program |
Dynamic End Effector Trajectory Tracking for Small-Scale Underwater Vehicle-Manipulator Systems (UVMS): Modeling, Control, and Experimental Validation |
|
Trekel, Niklas | University of Bonn |
Bauschmann, Nathalie | Hamburg University of Technology |
Alff, Thies Lennart | Technische Universität Hamburg |
Duecker, Daniel Andre | Technical University of Munich (TUM) |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Seifried, Robert | Hamburg University of Technology |
Keywords: Marine Robotics, Field Robots
Abstract: With ongoing miniaturization, lightweight commercial underwater vehicle-manipulator systems (UVMSs) have recently emerged that massively lower the entry barrier to underwater manipulation. Within this research field, dynamic and accurate end effector trajectory tracking is a crucial first step towards developing autonomous capabilities. In this context, coupling effects between the manipulator and vehicle dynamics are expected to pose a considerable challenge. However, UVMS control strategies analyzed in detailed experimental studies are particularly rare. We present a holistic approach based on task-priority control that we describe and discuss from modeling to extensive experimental studies, which are crucial for the notoriously hard-to-simulate underwater domain. We demonstrate this framework on the widely used platform of a BlueROV2 and an Alpha 5 manipulator. The end effector trajectory tracking is shown to be highly accurate, with a median position error of < 4 cm. Moreover, our experimental findings on the consideration of dynamic coupling within UVMS control motivate further research. The code is available at https://github.com/HippoCampusRobotics/uvms. A video of the results is available at https://youtu.be/IDMlI5KqlVI.
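As background on task-priority control, a secondary task is projected into the null space of the primary task's Jacobian so it cannot disturb the primary objective; the sketch below uses hypothetical Jacobians rather than the UVMS-specific kinematics of the paper.

    import numpy as np

    def task_priority_velocities(J1, dx1, J2, dx2):
        # Primary task tracked in a least-squares sense; the secondary task acts only in
        # the null space of the primary Jacobian so it cannot disturb the primary task.
        J1_pinv = np.linalg.pinv(J1)
        N1 = np.eye(J1.shape[1]) - J1_pinv @ J1  # null-space projector
        q_dot = J1_pinv @ dx1 + N1 @ np.linalg.pinv(J2 @ N1) @ (dx2 - J2 @ J1_pinv @ dx1)
        return q_dot

    J1 = np.array([[1.0, 0.0, 0.5, 0.2]])        # hypothetical end-effector task row
    J2 = np.array([[0.0, 1.0, 0.0, 1.0]])        # hypothetical posture/vehicle task row
    q_dot = task_priority_velocities(J1, np.array([0.1]), J2, np.array([0.05]))
    print(q_dot, J1 @ q_dot)                     # primary task velocity is matched exactly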
|
|
17:00-17:05, Paper TuDT7.6 | Add to My Program |
DeepVL: Dynamics and Inertial Measurements-Based Deep Velocity Learning for Underwater Odometry |
|
Singh, Mohit | NTNU: Norwegian University of Science and Technology |
Alexis, Kostas | NTNU - Norwegian University of Science and Technology |
Keywords: Marine Robotics, Visual-Inertial SLAM
Abstract: This paper presents a learned model to predict the robot-centric velocity of an underwater robot through dynamics-aware proprioception. The method exploits a recurrent neural network using as inputs inertial cues, motor commands, and battery voltage readings alongside the hidden state of the previous time-step to output robust velocity estimates and their associated uncertainty. An ensemble of networks is utilized to enhance the velocity and uncertainty predictions. Fusing the network's outputs into an Extended Kalman Filter, alongside inertial predictions and barometer updates, the method enables long-term underwater odometry without further exteroception. Furthermore, when integrated into visual-inertial odometry, the method assists in enhanced estimation resilience when dealing with an order of magnitude fewer total features tracked (as few as 1) as compared to conventional visual-inertial systems. Tested onboard an underwater robot deployed both in a laboratory pool and the Trondheim Fjord, the method takes less than 5ms for inference either on the CPU or the GPU of an NVIDIA Orin AGX and demonstrates less than 4% relative position error in novel trajectories during complete visual blackout, and approximately 2% relative error when a maximum of 2 visual features from a monocular camera are available.
|
|
TuDT8 Regular Session, 311 |
Add to My Program |
Representation Learning 1 |
|
|
Chair: Katz, Sydney | Stanford University |
Co-Chair: Pinto, Lerrel | New York University |
|
16:35-16:40, Paper TuDT8.1 | Add to My Program |
A Frequency-Based Attention Neural Network and Subject-Adaptive Transfer Learning for sEMG Hand Gesture Classification |
|
Nguyen, Phuc Thanh-Thien | National Taiwan University of Science and Technology |
Su, Shun-Feng | National Taiwan University of Science and Technology |
Kuo, Chung-Hsien | National Taiwan University |
Keywords: Gesture, Posture and Facial Expressions, Transfer Learning
Abstract: This study introduces a novel approach for real-time hand gesture classification through the integration of a Frequency-based Attention Neural Network (FANN) with subject-adaptive transfer learning, specifically tailored for surface electromyography (sEMG) data. By utilizing the Fourier transform, the proposed methodology leverages the inherent frequency characteristics of sEMG signals to enhance the discriminative features for accurate gesture recognition. Additionally, a subject-adaptive transfer learning strategy is employed to improve model generalization across different individuals. The combination of these techniques results in an effective and versatile system for sEMG-based hand gesture classification, demonstrating promising performance in adapting to individual variability and improving classification accuracy. The proposed method's performance is evaluated and compared with established approaches using the publicly available NinaPro DB5 dataset. Notably, the proposed simple model, coupled with frequency-based attention modules, achieves an accuracy of 89.56% with a quick prediction time of 5 ms, showcasing its potential for dexterous control of robots and bionic hands. The findings of this research contribute to the advancement of gesture recognition systems, particularly in the domains of human-computer interaction and prosthetic control.
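For illustration, frequency-domain descriptors of an sEMG window can be obtained with an FFT before any attention weighting; the sampling rate and window length below are hypothetical and are not the NinaPro DB5 settings.

    import numpy as np

    def frequency_features(window, fs=200.0):
        # Magnitude spectrum of one sEMG window plus two classic descriptors:
        # mean frequency and median frequency. fs and window length are illustrative.
        spectrum = np.abs(np.fft.rfft(window))
        freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
        power = spectrum ** 2
        mean_freq = np.sum(freqs * power) / np.sum(power)
        cumulative = np.cumsum(power)
        median_freq = freqs[np.searchsorted(cumulative, 0.5 * cumulative[-1])]
        return spectrum, mean_freq, median_freq

    rng = np.random.default_rng(0)
    window = rng.standard_normal(256)            # stand-in for a 256-sample sEMG window
    _, mf, mdf = frequency_features(window)
    print(f"mean frequency {mf:.1f} Hz, median frequency {mdf:.1f} Hz")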
|
|
16:40-16:45, Paper TuDT8.2 | Add to My Program |
P3-PO: Prescriptive Point Priors for Visuo-Spatial Generalization of Robot Policies |
|
Levy, Mara | University of Maryland, College Park |
Haldar, Siddhant | New York University |
Pinto, Lerrel | New York University |
Shrivastava, Abhinav | University of Maryland, College Park |
Keywords: Deep Learning Methods, Representation Learning, Visual Learning
Abstract: Developing generalizable robot policies that can robustly handle varied environmental conditions and object instances remains a fundamental challenge in robot learning. While considerable efforts have focused on collecting large robot datasets and developing policy architectures to learn from such data, naively learning from visual inputs often results in brittle policies that fail to transfer beyond the training data. This work presents Prescriptive Point Priors for Policies or P3-PO, a novel framework that constructs a unique state representation of the environment leveraging recent advances in computer vision and robot learning to achieve improved out-of-distribution generalization for robot manipulation. This representation is obtained through two steps. First, a human annotator prescribes a set of semantically meaningful points on a single demonstration frame. These points are then propagated through the dataset using off-the-shelf vision models. The derived points serve as an input to state-of-the-art policy architectures for policy learning. Our experiments across four real-world tasks demonstrate an overall 43% absolute improvement over prior methods when evaluated in identical settings as training. Further, P3-PO exhibits 58% and 80% gains across tasks for new object instances and more cluttered environments respectively. Videos illustrating the robot's performance are best viewed at point-priors.github.io.
|
|
16:45-16:50, Paper TuDT8.3 | Add to My Program |
APA-BI: Adaptive Partition Aggregation and Bidirectional Integration for UAV-View Geo-Localization |
|
Zhang, Xichen | Northeastern University |
Zhao, Shuying | Northeastern University |
Zhang, Yunzhou | Northeastern University |
Ge, Fawei | Northeastern University |
Zhao, Bin | Northeastern University |
Zhang, Yizhong | Northeastern University |
Keywords: Representation Learning, Localization, Deep Learning Methods
Abstract: The task of UAV-view geo-localization is to match a query image with database images to estimate the current geographic location of the query image. This is particularly useful in environments where GPS is not available or when the device fails. Although deep learning methods have made substantial progress in UAV-view geo-localization, they still face challenges in improving the distinguishability of features. For instance, some feature aggregation methods do not consider semantic integrity, and robust elements in the image are not given enough attention. This paper proposes a UAV-view geo-localization method (APA-BI) to tackle the above issues. Specifically, we propose an adaptive partition aggregation method to ensure feature integrity at the semantic level by increasing the receptive field of the classifier module. At the same time, we design a bidirectional integration module to further enhance feature distinguishability by extracting robust tubular topological structures from images. Experimental results on public datasets demonstrate that APA-BI achieves impressive retrieval accuracy and outperforms most state-of-the-art methods. Moreover, the test results of APA-BI in real-world scenarios also show excellent performance.
|
|
16:50-16:55, Paper TuDT8.4 | Add to My Program |
Robo-MUTUAL: Robotic Multimodal Task Specification Via Unimodal Learning |
|
Li, Jianxiong | Tsinghua University |
Wang, Zhihao | Peking University |
Zheng, Jinliang | Tsinghua University |
Zhou, Xiaoai | University of Toronto |
Wang, Guanming | University College London |
Song, Guanglu | SenseTime Research |
Liu, Yu | SenseTime Group Limited |
Liu, Jingjing | Institute for AI Industry Research (AIR), Tsinghua University |
Zhang, Ya-Qin | Institute for AI Industry Research(AIR), Tsinghua University |
Yu, Junzhi | Peking University |
Zhan, Xianyuan | Tsinghua University |
Keywords: Representation Learning, Imitation Learning
Abstract: Multimodal task specification is essential for enhanced robotic performance, where Cross-modality Alignment enables the robot to holistically understand complex task instructions. Directly annotating multimodal instructions for model training proves impractical, due to the sparsity of paired multimodal data. In this study, we demonstrate that by leveraging unimodal instructions abundant in real data, we can effectively teach robots to learn multimodal task specifications. First, we endow the robot with strong Cross-modality Alignment capabilities, by pretraining a robotic multimodal encoder using extensive out-of-domain data. Then, we employ two Collapse and Corrupt operations to further bridge the remaining modality gap in the learned multimodal representation. This approach projects different modalities of the same task goal into interchangeable representations, thus enabling accurate robotic operations within a well-aligned multimodal latent space. Evaluation across more than 130 tasks and 4000 evaluations on both the simulated LIBERO benchmark and real robot platforms showcases the superior capabilities of our proposed framework, demonstrating significant potential in overcoming data constraints in robotic learning. Website: zh1hao.wang/Robo MUTUAL
|
|
16:55-17:00, Paper TuDT8.5 | Add to My Program |
Model Free Method of Screening Training Data for Adversarial Datapoints through Local Lipschitz Quotient Analysis |
|
Kamienski, Emily | Massachusetts Institute of Technology |
Asada, Harry | MIT |
Keywords: Deep Learning Methods, Data Sets for Robot Learning, Lipschitz Quotient, Data Preparation, Adversarial Data
Abstract: It is often challenging to pick suitable data features for learning problems. Sometimes certain regions of the data are harder to learn because they are not well characterized by the data features. The challenge is amplified when resources for sensing and computation are limited and time-critical yet reliable decisions must be made. For example, a robotic system for preventing falls of elderly people needs a real-time fall predictor, with low false positive and false negative rates, using a simple wearable sensor to activate a fall prevention mechanism. Here we present a methodology for assessing the learnability of data based on the Lipschitz quotient. We develop a procedure for determining which regions of the dataset contain adversarial data points, input data that look similar but belong to different target classes. Regardless of the learning model, it will be hard to learn such data. We then present a method for determining which additional feature(s) are most effective in improving the predictability of each of these regions. This is a model-independent data analysis that can be executed before constructing a prediction model through machine learning or other techniques. We demonstrate this method on two synthetic datasets and a dataset of human falls, which uses inertial measurement unit signals. For the fall dataset, we were able to identify two groups of adversarial data points and improve the predictability of each group over the baseline dataset, as assessed by the Lipschitz quotient, by using two different sets of features. This work offers a valuable tool for assessing data learnability that can be applied not only to fall prediction problems, but also to other robotics applications that learn from data.
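The screening criterion is the local Lipschitz quotient, the ratio of target distance to input distance between nearby samples: points that look alike but carry different targets produce large quotients and are hard to learn regardless of the model. Below is a minimal numpy sketch of this scoring step; the neighbourhood size and percentile cut-off are arbitrary illustration choices, not values from the paper.

```python
import numpy as np

def local_lipschitz_quotients(X, y, k=5):
    """For each point, return the largest quotient |y_i - y_j| / ||x_i - x_j||
    over its k nearest neighbours (a rough, model-free learnability score)."""
    n = X.shape[0]
    scores = np.zeros(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                          # exclude the point itself
        nbrs = np.argsort(d)[:k]
        scores[i] = np.max(np.abs(y[nbrs] - y[i]) / d[nbrs])
    return scores

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))
y = (X[:, 0] > 0.5).astype(float)              # toy binary target

scores = local_lipschitz_quotients(X, y)
adversarial = np.where(scores > np.percentile(scores, 95))[0]  # arbitrary cut-off
print(f"{len(adversarial)} candidate adversarial points")
```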
|
|
17:00-17:05, Paper TuDT8.6 | Add to My Program |
3D Space Perception Via Disparity Learning Using Stereo Images and an Attention Mechanism: Real-Time Grasping Motion Generation for Transparent Objects |
|
Cai, Xianbo | Waseda University |
Ito, Hiroshi | Hitachi, Ltd. / Waseda University |
Hiruma, Hyogo | Waseda University |
Ogata, Tetsuya | Waseda University |
Keywords: Representation Learning, Perception-Action Coupling
Abstract: Object grasping in 3D space is crucial for robotic applications. Such tasks are performed by utilizing depth map data acquired from RGB-D images or 3D point cloud data. However, these methods struggle when dealing with transparent objects, as transparency limits sensor performance when predicting depth maps. Additionally, the grasping motions are predicted without incorporating the relationship between depth data and motion information, which limits the motion’s flexibility. In this paper, to address these problems, we propose an end-to-end motion generation model using stereo RGB images, a deep-learning model that incorporates image and motion information. Furthermore, visual attention mechanisms are used for extracting task-related attention points, which is essential for building spatial cognition constructs. Real-robot experimental results confirmed that the proposed model is able to grasp transparent objects under various situations, including unseen positions, heights, and backgrounds. It was also found that the model self-organized a spatial cognition representation within its hidden states, suggesting that the integrated learning of robot motion and stable spatial attention points is important for spatial perception. Such explicit feature representations cannot be obtained via learning motion alone.
|
|
TuDT9 Regular Session, 312 |
Add to My Program |
Motion Planning 4 |
|
|
Chair: Uwacu, Diane | Texas A&M University |
Co-Chair: Lee, Dongjun | Seoul National University |
|
16:35-16:40, Paper TuDT9.1 | Add to My Program |
Making a Complete Mess and Getting Away with It: Traveling Salesperson Problems with Circle Placement Variants |
|
Woller, David | Czech Technical University in Prague |
Mansouri, Masoumeh | Birmingham University |
Kulich, Miroslav | Czech Technical University in Prague |
Keywords: Task and Motion Planning, Constrained Motion Planning, Computational Geometry
Abstract: This paper explores a variation of the Traveling Salesperson Problem, where the agent places a circular obstacle next to each node once it visits it. In this variant, referred to as the Traveling Salesperson Problem with Circle Placement (TSP-CP), the aim is to maximize the obstacle radius for which a valid closed tour exists and then minimize the tour cost. The TSP-CP finds relevance in various real-world applications, such as harvesting, quarrying, and open-pit mining. We propose several novel solvers to address the TSP-CP, its variant tailored for Dubins vehicles, and a crucial subproblem known as the Traveling Salesperson Problem on self-deleting graphs (TSP-SD). Our extensive experimental results show that the proposed solvers outperform the current state-of-the-art on related problems in solution quality.
|
|
16:40-16:45, Paper TuDT9.2 | Add to My Program |
Narrow Passage Path Planning Using Collision Constraint Interpolation |
|
Lee, Minji | Seoul National University |
Lee, Jeongmin | Seoul National University |
Lee, Dongjun | Seoul National University |
Keywords: Motion and Path Planning, Constrained Motion Planning, Manipulation Planning
Abstract: Narrow passage path planning is a prevalent problem from industrial to household sites, often facing difficulties in finding feasible paths or requiring excessive computational resources. Given that deep penetration into the environment can cause optimization failure, we propose a framework to ensure feasibility throughout the process using a series of subproblems tailored to the narrow passage problem. We begin by decomposing the environment into convex objects and initializing collision constraints with a subset of these objects. By continuously interpolating the collision constraints through the process of sequentially introducing the remaining objects, our proposed framework generates subproblems that guide the optimization toward solving the narrow passage problem. Several examples are presented to demonstrate how the proposed framework addresses narrow passage path planning problems.
|
|
16:45-16:50, Paper TuDT9.3 | Add to My Program |
Trajectory Planning with Signal Temporal Logic Costs Using Deterministic Path Integral Optimization |
|
Halder, Patrick | ZF Friedrichshafen AG |
Homburger, Hannes | HTWG Konstanz, Institute of System Dynamics |
Kiltz, Lothar | ZF Friedrichshafen AG |
Reuter, Johannes | University of Applied Sciences Constance |
Althoff, Matthias | Technische Universität München |
Keywords: Task and Motion Planning, Optimization and Optimal Control, Motion and Path Planning
Abstract: Formulating the intended behavior of a dynamic system can be challenging. Signal temporal logic (STL) is frequently used for this purpose due to its suitability in formalizing comprehensible, modular, and versatile spatio-temporal specifications. Due to scaling issues with respect to the complexity of the specifications and the potential occurrence of non-differentiable terms, classical optimization methods often solve STL-based problems inefficiently. Smoothing and approximation techniques can alleviate these issues but require changing the optimization problem. This paper proposes a novel sampling-based method based on model predictive path integral control to solve optimal control problems with STL cost functions. We demonstrate the effectiveness of our method on benchmark motion planning problems and compare its performance with state-of-the-art methods. The results show that our method efficiently solves optimal control problems with STL costs.
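In model predictive path integral control, sampled input sequences are weighted by the exponential of their (shifted) costs, which is why non-differentiable STL robustness terms can be used directly as costs. The compressed numpy sketch below applies that weighting to a 1-D double integrator with an "eventually reach the goal region" style cost; the dynamics, horizon, temperature, and cost are illustrative stand-ins and omit the paper's full STL machinery.

```python
import numpy as np

rng = np.random.default_rng(2)
H, K, dt, lam = 30, 256, 0.1, 1.0          # horizon, samples, step, temperature

def rollout(u_seq, x0=0.0, v0=0.0):
    """Integrate a 1-D double integrator and return the position trace."""
    x, v, xs = x0, v0, []
    for u in u_seq:
        v += u * dt
        x += v * dt
        xs.append(x)
    return np.array(xs)

def stl_like_cost(xs, goal_lo=0.9, goal_hi=1.1):
    # Cost of "eventually be inside [goal_lo, goal_hi]": distance of the
    # closest visit to the region (0 if satisfied), a crude robustness proxy.
    dist = np.maximum.reduce([goal_lo - xs, xs - goal_hi, np.zeros_like(xs)])
    return dist.min()

u_nom = np.zeros(H)
for _ in range(20):                         # a few MPPI iterations
    noise = rng.normal(scale=0.5, size=(K, H))
    costs = np.array([stl_like_cost(rollout(u_nom + n)) for n in noise])
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    u_nom = u_nom + (w[:, None] * noise).sum(axis=0)

print("final miss distance:", stl_like_cost(rollout(u_nom)))
```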
|
|
16:50-16:55, Paper TuDT9.4 | Add to My Program |
Multi-Agent Path Finding Using Conflict-Based Search and Structural-Semantic Topometric Maps |
|
Fredriksson, Scott | Luleå University of Technology |
Bai, Yifan | Luleå University of Technology |
Saradagi, Akshit | Luleå University of Technology, Luleå, Sweden |
Nikolakopoulos, George | Luleå University of Technology |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning, Multi-Robot Systems
Abstract: As industries increasingly adopt large robotic fleets, there is a pressing need for computationally efficient, practical, and optimal conflict-free path planning for multiple robots. Conflict-Based Search (CBS) is a popular method for multi-agent path finding (MAPF) due to its completeness and optimality; however, it is often impractical for real-world applications, as it is computationally intensive to solve and relies on assumptions about agents and operating environments that are difficult to realize. This article proposes a solution to overcome computational challenges and practicality issues of CBS by utilizing structural-semantic topometric maps. Instead of running CBS over large grid-based maps, the proposed solution runs CBS over a sparse topometric map containing structural-semantic cells representing intersections, pathways, and dead ends. This approach significantly accelerates the MAPF process and reduces the number of conflict resolutions handled by CBS while operating in continuous time. In the proposed method, robots are assigned time ranges to move between topometric regions, departing from the traditional CBS assumption that a robot can move to any connected cell in a single time step. The approach is validated through real-world multi-robot path-finding experiments and benchmarking simulations. The results demonstrate that the proposed MAPF method can be applied to real-world non-holonomic robots and yields significant improvement in computational efficiency compared to traditional CBS methods while improving conflict detection and resolution in cases of corridor symmetries.
|
|
16:55-17:00, Paper TuDT9.5 | Add to My Program |
Topo-Geometrically Distinct Path Computation Using Neighborhood-Augmented Graph, and Its Application to Path Planning for a Tethered Robot in 3D |
|
Sahin, Alp | Lehigh University |
Bhattacharya, Subhrajit | Lehigh University |
Keywords: Motion and Path Planning, Optimization and Optimal Control, Foundations of Automation, Multi Path Planning
Abstract: Many robotics applications benefit from being able to compute multiple geodesic paths in a given configuration space. The existing paradigm is to use topological path planning, which can compute optimal paths in distinct topological classes. However, these methods usually require non-trivial geometric constructions which are prohibitively expensive in 3D, and are unable to distinguish between distinct topologically equivalent geodesics that are created due to high-cost/curvature regions or prismatic obstacles in 3D. In this paper, we propose an approach to compute k geodesic paths using the concept of a novel neighborhood-augmented graph, on which graph search algorithms can compute multiple optimal paths that are topo-geometrically distinct. Our approach does not require complex geometric constructions, and the resulting paths are not restricted to distinct topological classes, making the algorithm suitable for problems where finding and distinguishing between geodesic paths are of interest. We demonstrate the application of our algorithm to planning shortest traversable paths for a tethered robot in 3D with a cable-length constraint.
|
|
17:00-17:05, Paper TuDT9.6 | Add to My Program |
Homotopy-Aware Efficiently Adaptive State Lattices for Mobile Robot Motion Planning in Cluttered Environments |
|
Menon, Ashwin | University of Rochester |
Damm, Eric | University of Rochester |
Howard, Thomas | University of Rochester |
Keywords: Field Robots, Motion and Path Planning
Abstract: Mobile robot navigation architectures that employ a planning algorithm to provide a single optimal path to follow are flawed in the presence of unstructured, rapidly changing environments. As the environment updates, optimal plans often oscillate around discrete obstacles, which is problematic for path following controllers that are biased to follow the planned route. A potentially better approach involves the generation of multiple plans, each optimal within their own homotopy class, to provide a more comprehensive approximation of cost to goal for a path-following controller. In this paper, we present Homotopy-Aware Efficiently Adaptive State Lattices (HAEASL), which uses multiple open lists to bias search towards routes with distinct homotopy classes. Experiments are presented that measure the number, the optimality, and the diversity of solutions generated across 3,200 planning problems in 80 randomly generated environments. The performance of HAEASL is benchmarked against two previous approaches: Search-Based Path Planning with Homotopy Class Constraints (A*HC) and Homotopy-Aware RRT* (HARRT*). Experimental results demonstrate that HAEASL can generate a greater number of paths and more diverse paths than A*HC without a significant reduction of optimality. Additionally, results demonstrate that HAEASL generates a greater number of paths and ones with lower costs than HARRT*. A final demonstration of HAEASL generating multiple solutions subject to temporal, resource, and kinodynamic constraints using data collected from an off-road mobile robot illustrates the suitability of the approach for the motivating example.
|
|
TuDT10 Regular Session, 313 |
Add to My Program |
Multi-Robot and Human-Robot Teams |
|
|
Chair: Min, Byung-Cheol | Purdue University |
Co-Chair: Sevil, Hakki Erhan | University of West Florida |
|
16:35-16:40, Paper TuDT10.1 | Add to My Program |
Initial Task Allocation in Multi-Human Multi-Robot Teams: An Attention-Enhanced Hierarchical Reinforcement Learning Approach |
|
Wang, Ruiqi | Purdue University |
Zhao, Dezhong | Beijing University of Chemical Technology |
Gupte, Arjun | Purdue University |
Min, Byung-Cheol | Purdue University |
Keywords: Human-Robot Teaming, Human-Robot Collaboration, Design and Human Factors
Abstract: Multi-human multi-robot teams (MH-MR) hold tremendous potential in tackling intricate and massive missions by merging distinct strengths and expertise of individual members. The inherent heterogeneity of these teams necessitates advanced initial task allocation (ITA) methods that align tasks with the intrinsic capabilities of team members from the outset. While existing reinforcement learning approaches show encouraging results, they might fall short in addressing the nuances of long-horizon ITA problems, particularly in settings with large-scale MH-MR teams or multifaceted tasks. To bridge this gap, we propose an attention-enhanced hierarchical reinforcement learning approach that decomposes the complex ITA problem into structured sub-problems, facilitating more efficient allocations. To bolster sub-policy learning, we introduce a hierarchical cross-attribute attention (HCA) mechanism, encouraging each sub-policy within the hierarchy to discern and leverage the specific nuances in the state space that are crucial for its respective decision-making phase. Through an extensive environmental surveillance case study, we demonstrate the benefits of our model and the HCA mechanism within it.
|
|
16:40-16:45, Paper TuDT10.2 | Add to My Program |
Enabling Multi-Robot Collaboration from Single-Human Guidance |
|
Ji, Zhengran | Duke University |
Zhang, Lingyu | Duke University |
Sajda, Paul | Columbia University |
Chen, Boyuan | Duke University |
Keywords: Human Factors and Human-in-the-Loop, Learning from Demonstration, Multi-Robot Systems
Abstract: Learning collaborative behaviors is essential for multi-agent systems. Traditionally, multi-agent reinforcement learning solves this implicitly through a joint reward and centralized observations, assuming collaborative behavior will emerge. Other studies propose to learn from demonstrations of a group of collaborative experts. We instead propose an efficient and explicit way of learning collaborative behaviors in multi-agent systems by leveraging expertise from only a single human. Our insight is that humans have the natural ability to take on various roles in a team. We show that by allowing a human operator to dynamically switch between controlling agents for a short period of time and incorporating a human-like theory-of-mind model of teammates, agents can effectively learn to collaborate. Our experiments showed that our method improves the success rate of a challenging collaborative hide-and-seek task by up to 58% with only 40 minutes of human guidance. We further demonstrate our findings transfer to the real world by conducting multi-robot experiments.
|
|
16:45-16:50, Paper TuDT10.3 | Add to My Program |
Fan-Out Revisited: The Impact of the Human Element on Scalability of Human Multi-Robot Teams |
|
Perkins, Lawrence Dale | University of West Florida |
Johnson, Matthew | Inst. for Human & Machine Cognition |
Sevil, Hakki Erhan | University of West Florida |
Goodrich, Michael A. | Brigham Young University |
Keywords: Human-Robot Teaming, Human-Robot Collaboration, Multi-Robot Systems
Abstract: This paper introduces a novel fan-out model that improves accuracy over previous models. The commonly used models rely on neglect time, the time an agent operates independently, which confounds both human and robot abilities. The proposed model separates neglect time into two functionally distinct concepts: the time a robot can operate self-sufficiently, and the time a human estimates the robot can do so. Previous research indicates fan-out is often overestimated. This work explains why robot ability provides an upper bound on fan-out, while the actually achieved fan-out is influenced by both the human and robot abilities. We conduct a study to validate this new model and show improved performance over the two most common fan-out models. The results show that both previous models overestimate as predicted. Using the new fan-out model, we show that as the difference between human estimation and robot abilities grows, the actual fan-out falls further below the upper-bound potential fan-out. By including assessments of both the robotic and human elements, the new model provides a more nuanced understanding of the dynamics at play and the factors involved in scaling Human Multi-Robot Teams.
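For context, the classical fan-out estimate is neglect time divided by interaction time, plus one. The toy calculation below only illustrates the abstract's argument that an operator's estimate of neglect time can exceed the robot's true self-sufficiency time and thus inflate the predicted fan-out; the numbers and the exact split are invented and do not reproduce the paper's model.

```python
# Classic fan-out estimate: FO = NT / IT + 1, where NT is neglect time and
# IT is interaction time.  The split below is only an illustration of the
# abstract's argument, not the paper's actual formulation.

robot_self_sufficiency_time = 40.0   # s the robot can truly operate alone (assumed)
human_estimate_of_neglect = 60.0     # s the operator believes it can (assumed)
interaction_time = 20.0              # s needed to service one robot (assumed)

upper_bound_fan_out = robot_self_sufficiency_time / interaction_time + 1   # 3.0
naive_fan_out = human_estimate_of_neglect / interaction_time + 1           # 4.0

print(f"upper bound: {upper_bound_fan_out:.1f}, naive estimate: {naive_fan_out:.1f}")
# The larger the gap between the estimate and the true self-sufficiency time,
# the further the achieved fan-out falls below the naive prediction.
```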
|
|
16:50-16:55, Paper TuDT10.4 | Add to My Program |
HARP: Human-Assisted Regrouping with Permutation Invariant Critic for Multi-Agent Reinforcement Learning |
|
Hu, Huawen | Northwestern Polytechnical University |
Shi, Enze | Northwestern Polytechnical University |
Yue, Chenxi | Northwestern Polytechnical University |
Yang, Shuocun | Northwestern Polytechnical University |
Wu, Zihao | University of Georgia |
Li, Yiwei | UGA |
Zhong, Tianyang | Northwestern Polytechnical University |
Zhang, Tuo | Northwestern Polytechnical University |
Liu, Tianming | University of Georgia |
Zhang, Shu | Northwestern Polytechnical University |
Keywords: Human Factors and Human-in-the-Loop, Reinforcement Learning, Human-Robot Collaboration
Abstract: Human-in-the-loop reinforcement learning integrates human expertise to accelerate agent learning and provide critical guidance and feedback in complex fields. However, many existing approaches focus on single-agent tasks and require continuous human involvement during the training process, significantly increasing the human workload and limiting scalability. In this paper, we propose HARP (Human-Assisted Regrouping with Permutation Invariant Critic), a multi-agent reinforcement learning framework designed for group-oriented tasks. HARP integrates automatic agent regrouping with strategic human assistance during deployment, allowing non-experts to offer effective guidance with minimal intervention. During training, agents dynamically adjust their groupings to optimize collaborative task completion. When deployed, they actively seek human assistance and utilize the Permutation Invariant Group Critic to evaluate and refine human-proposed groupings, allowing non-expert users to contribute valuable suggestions. In multiple collaboration scenarios, our approach is able to leverage limited guidance from non-experts and enhance performance. The project can be found at https://github.com/huawen-hu/HARP.
|
|
16:55-17:00, Paper TuDT10.5 | Add to My Program |
Training Human-Robot Teams by Improving Transparency through a Virtual Spectator Interface |
|
Dallas, Sean | Oakland University |
Qiang, Hongjiao | University of Michigan |
AbuHijleh, Motaz | Oakland University |
Jo, Wonse | University of Michigan |
Riegner, Kayla | Ground Vehicle Systems Center (GVSC) |
Smereka, Jonathon M. | U.S. Army TARDEC |
Robert, Lionel | University of Michigan |
Louie, Wing-Yue Geoffrey | Oakland University |
Tilbury, Dawn | University of Michigan |
Keywords: Human-Robot Teaming, Human Factors and Human-in-the-Loop, Human-Centered Robotics
Abstract: After-action reviews (AARs) are professional discussions that help operators and teams enhance their task performance by analyzing completed missions with peers and professionals. Previous studies comparing different formats of AARs have focused mainly on human teams. However, the inclusion of robotic teammates brings along new challenges in understanding teammate intent and communication. Traditional AAR between human teammates may not be satisfactory for human-robot teams. To address this limitation, we propose a new training review (TR) tool, called the Virtual Spectator Interface (VSI), to enhance human-robot team performance and situational awareness (SA) in a simulated search mission. The proposed VSI primarily utilizes visual feedback to review subjects’ behavior. To examine the effectiveness of VSI, we took elements from AAR to conduct our own TR, and designed a 1 × 3 between-subjects experiment with experimental conditions: TR with (1) VSI, (2) screen recording, and (3) non-technology (only verbal descriptions). The results of our experiments demonstrated that the VSI did not result in significantly better team performance than other conditions. However, the TR with VSI led to more improvement in the subjects’ SA over the other conditions.
|
|
17:00-17:05, Paper TuDT10.6 | Add to My Program |
Adaptive Task Allocation in Multi-Human Multi-Robot Teams under Team Heterogeneity and Dynamic Information Uncertainty |
|
Yuan, Ziqin | Purdue University |
Wang, Ruiqi | Purdue University |
Kim, Taehyeon | Purdue University |
Zhao, Dezhong | Beijing University of Chemical Technology |
Obi, Ike | Purdue University |
Min, Byung-Cheol | Purdue University |
Keywords: Human-Robot Teaming, Task Planning, Reinforcement Learning
Abstract: Task allocation in multi-human multi-robot (MH-MR) teams presents significant challenges due to the inherent heterogeneity of team members, the dynamics of task execution, and the information uncertainty of operational states. Existing approaches often fail to address these challenges simultaneously, resulting in suboptimal performance. To tackle this, we propose an adaptive task allocation method using hierarchical reinforcement learning (HRL), incorporating initial task allocation (ITA) that leverages team heterogeneity and conditional task reallocation in response to dynamic operational states. Additionally, we introduce an auxiliary state representation learning task to manage information uncertainty and enhance task execution. Through an extensive case study in large-scale environmental monitoring tasks, we demonstrate the benefits of our approach. More details are available on our website: https://sites.google.com/view/ata-hrl.
|
|
TuDT11 Regular Session, 314 |
Add to My Program |
Human-Robot Interaction 2 |
|
|
Chair: Dogan, Fethiye Irmak | University of Cambridge |
Co-Chair: Alves-Oliveira, Patrícia | Amazon Lab126 |
|
16:35-16:40, Paper TuDT11.1 | Add to My Program |
“Don’t Forget to Put the Milk Back!” Dataset for Enabling Embodied Agents to Detect Anomalous Situations |
|
Mullen, James | University of Maryland |
Goyal, Prasoon | Amazon |
Piramuthu, Robinson | Amazon |
Johnston, Michael | Amazon |
Manocha, Dinesh | University of Maryland |
Ghanadan, Reza | Amazon |
Keywords: Robot Companions, AI-Based Methods, Semantic Scene Understanding
Abstract: Home robots are intended to make their users' lives easier. Our work aims to assist in this goal by enabling robots to inform their users of dangerous or unsanitary anomalies in their home. Some examples of these anomalies include the user leaving their milk out, forgetting to turn off the stove, or leaving poison accessible to children. To move towards enabling home robots with these abilities, we have created a new dataset, which we call SafetyDetect. The SafetyDetect dataset consists of 1000 anomalous home scenes, each of which contains unsafe or unsanitary situations for an agent to detect. Our approach utilizes large language models (LLMs) alongside both a graph representation of the scene and the relationships between the objects in the scene. Our key insight is that this connected scene graph and the object relationships it encodes enable the LLM to better reason about the scene --- especially as it relates to detecting dangerous or unsanitary situations. Our most promising approach utilizes GPT-4 and pursues a classification technique where object relations from the scene graph are classified as normal, dangerous, unsanitary, or dangerous for children. This method is able to correctly identify over 90% of anomalous scenarios in the SafetyDetect dataset. Additionally, we conduct real world experiments on a ClearPath TurtleBot where we generate a scene graph from visuals of the real world scene, and run our approach with no modification. This setup resulted in little performance loss. The SafetyDetect dataset and code will be released to the public upon this paper's publication.
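At its core, the described classification step hands scene-graph relations to an LLM and asks for one of four labels per relation. The sketch below shows one plausible prompt construction; call_llm is a placeholder for a GPT-4 client, and the example relations and canned answers are invented for illustration.

```python
LABELS = ["normal", "dangerous", "unsanitary", "dangerous for children"]

def build_prompt(relations):
    """Turn scene-graph edges into a per-relation classification prompt."""
    lines = [f"- {subj} {pred} {obj}" for subj, pred, obj in relations]
    return (
        "You are a household safety assistant. For each object relation below, "
        f"answer with exactly one of: {', '.join(LABELS)}.\n" + "\n".join(lines)
    )

def call_llm(prompt):
    # Placeholder for an actual GPT-4 API call; returns canned answers here.
    return ["unsanitary", "dangerous for children", "unsanitary"]

relations = [
    ("milk", "on", "kitchen counter"),
    ("bleach", "on", "floor"),
    ("raw chicken", "next to", "salad"),
]
for rel, label in zip(relations, call_llm(build_prompt(relations))):
    print(rel, "->", label)
```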
|
|
16:40-16:45, Paper TuDT11.2 | Add to My Program |
Development of Contactless Delivery Service Robot with Modular Working Platform in Isolation Wards |
|
Yang, Kyon-Mo | Korea Institute of Robot and Convergence |
Koo, Jaewan | Korea Institute of Robotics and Technology Convergence ; KIRO |
Seo, Kap-Ho | Korea Institute of Robot and Convergence |
Keywords: Human-Centered Automation, Medical Robots and Systems, Human-Centered Robotics
Abstract: Preventing cross-infection is crucial for robots designed to assist medical staff in isolation wards during outbreaks of infectious diseases like COVID-19. This paper proposes a modular robotic system with a working platform and a mobile base to prevent cross-infection during item delivery and waste transport. An alignment structure for combining the two platforms is introduced, and a marker map and barcode-based destination input system were developed to allow medical staff without specialized robotics knowledge to use the system without additional training. The effectiveness of this robot's service was evaluated through a System Usability Scale (SUS) test with twenty medical staff working in isolation wards, achieving an average score of 77.12. This indicates a high level of usability, suggesting that this robot can significantly contribute to safe and efficient hospital operations during pandemic situations.
|
|
16:45-16:50, Paper TuDT11.3 | Add to My Program |
RACCOON: Grounding Embodied Question-Answering with State Summaries from Existing Robot Modules |
|
Bustamante, Samuel | German Aerospace Center (DLR), Robotics and Mechatronics Center |
Knauer, Markus | German Aerospace Center (DLR) |
Thun, Jeremias | University Bremen |
Schneyer, Stefan | German Aerospace Center (DLR) |
Albu-Schäffer, Alin | DLR - German Aerospace Center |
Weber, Bernhard | German Aerospace Center |
Stulp, Freek | DLR - Deutsches Zentrum Für Luft Und Raumfahrt E.V |
Keywords: Human-Centered Robotics
Abstract: Explainability is vital for establishing user trust, including in robotics. Recently, foundation models (e.g. vision-language models, VLMs) fostered a wave of embodied agents that answer arbitrary queries about their environment and their interactions with it. However, naively prompting VLMs to answer queries based on camera images does not take into account existing robot architectures which represent the robot's tasks, skills, and beliefs about the state of the world. To overcome this limitation, we propose RACCOON, a framework that combines foundation models' responses with a robot's internal knowledge. Inspired by Retrieval-Augmented Generation (RAG), RACCOON selects relevant context, retrieves information from the robot's state, and utilizes it to refine prompts for an LLM to answer questions accurately. This bridges the gap between the model's adaptability and the robot's domain expertise.
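The RAG-inspired loop can be pictured as: retrieve the robot-state summaries most relevant to the query, then prepend them to the LLM prompt. The toy sketch below uses naive word overlap for retrieval; the state entries, scoring function, and answer_with_llm placeholder are assumptions rather than components of RACCOON itself.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, state_summaries, top_k=2):
    """Rank state summaries by naive word overlap with the user query."""
    q = tokens(query)
    return sorted(state_summaries,
                  key=lambda s: len(q & tokens(s)), reverse=True)[:top_k]

def answer_with_llm(prompt):
    return "(LLM response would go here)"   # placeholder for a VLM/LLM call

state_summaries = [
    "Current task: set the table; step 3 of 5 completed.",
    "Gripper holds: blue mug, grasp confidence 0.92.",
    "Navigation: localized in kitchen, battery at 64%.",
]
query = "Why are you holding the mug?"
context = "\n".join(retrieve(query, state_summaries))
prompt = f"Robot state:\n{context}\n\nUser question: {query}\nAnswer briefly."
print(answer_with_llm(prompt))
```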
|
|
16:50-16:55, Paper TuDT11.4 | Add to My Program |
GRACE: Generating Socially Appropriate Robot Actions Leveraging LLMs and Human Explanations |
|
Dogan, Fethiye Irmak | University of Cambridge |
Ozyurt, Umut | Middle East Technical University |
Çınar, Gizem | Bilkent University |
Gunes, Hatice | University of Cambridge |
Keywords: Human Factors and Human-in-the-Loop, Human-Centered Robotics, Deep Learning Methods
Abstract: When operating in human environments, robots need to handle complex tasks while both adhering to social norms and accommodating individual preferences. For instance, based on common sense knowledge, a household robot can predict that it should avoid vacuuming during a social gathering, but it may still be uncertain whether it should vacuum before or after having guests. In such cases, integrating common-sense knowledge with human preferences, often conveyed through human explanations, is fundamental yet a challenge for existing systems. In this paper, we introduce GRACE, a novel approach addressing this while generating socially appropriate robot actions. GRACE leverages common sense knowledge from LLMs, and it integrates this knowledge with human explanations through a generative network. The bidirectional structure of GRACE enables robots to refine and enhance LLM predictions by utilizing human explanations and makes robots capable of generating such explanations for human-specified actions. Our evaluations show that integrating human explanations boosts GRACE's performance, where it outperforms several baselines and provides sensible explanations.
|
|
16:55-17:00, Paper TuDT11.5 | Add to My Program |
Robi Butler: Multimodal Remote Interaction with a Household Robot Assistant |
|
Xiao, Anxing | National University of Singapore |
Janaka, Nuwan | National University of Singapore |
Hu, Tianrun | National University of Singapore |
Gupta, Anshul | National University of Singapore |
Li, Kaixin | National University of Singapore |
Yu, Cunjun | NUS |
Hsu, David | National University of Singapore |
Keywords: Human-Centered Robotics, AI-Enabled Robotics, Virtual Reality and Interfaces
Abstract: Imagine a future when we can Zoom-call a robot to manage household chores remotely. This work takes one step in this direction. Robi Butler is a new household robot assistant that enables seamless multimodal remote interaction. It allows the human user to monitor its environment from a first-person view, issue voice or text commands, and specify target objects through hand-pointing gestures. At its core, a high-level behavior module, powered by Large Language Models (LLMs), interprets multimodal instructions to generate multistep action plans. Each plan consists of open-vocabulary primitives supported by vision-language models, enabling the robot to process both textual and gestural inputs. Zoom provides a convenient interface to implement remote interactions between the human and the robot. The integration of these components allows Robi Butler to ground remote multimodal instructions in real-world home environments in a zero-shot manner. We evaluated the system on various household tasks, demonstrating its ability to execute complex user commands with multimodal inputs. We also conducted a user study to examine how multimodal interaction influences user experiences in remote human-robot interaction. These results suggest that with the advances in robot foundation models, we are moving closer to the reality of remote household robot assistants.
|
|
17:00-17:05, Paper TuDT11.6 | Add to My Program |
AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-To-Specific Task Decomposition and Knowledge Refinement |
|
Singh, Shivam | International Institute of Information Technology Hyderabad |
Swaminathan, Karthik | International Institute of Information Technology - Hyderabad ( |
Dash, Nabanita | International Institute of Information Technology, Hyderabad |
Singh, Ramandeep | International Institute of Information Technology, Hyderabad |
Banerjee, Snehasis | Iiit-H / Tcs |
Sridharan, Mohan | University of Edinburgh |
Krishna, Madhava | IIIT Hyderabad |
Keywords: Human Factors and Human-in-the-Loop, AI-Based Methods, Task Planning
Abstract: An embodied agent assisting humans is often asked to complete new tasks, and there may not be sufficient time or labeled examples to train the agent to perform these new tasks. Large Language Models (LLMs) trained on considerable knowledge across many domains can be used to predict a sequence of abstract actions for completing such tasks, although the agent may not be able to execute this sequence due to task-, agent-, or domain-specific constraints. Our framework addresses these challenges by leveraging the generic predictions provided by the LLM and the prior domain knowledge encoded in a Knowledge Graph (KG), enabling an agent to quickly adapt to new tasks. The robot also solicits and uses human input as needed to refine its existing knowledge. Based on experimental evaluation in the context of cooking and cleaning tasks in simulation domains, we demonstrate that the interplay between LLM, KG, and human input leads to substantial performance gains compared with just using the LLM. Project website: https://sssshivvvv.github.io/adaptbot/
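One way to picture the LLM-KG-human interplay is: the LLM proposes generic steps, each step is checked against the agent's known actions (standing in for the knowledge-graph content), and unknown steps trigger a human query whose answer is stored for reuse. The sketch below is a deliberately simplified illustration of that loop; the action set, string matching, and the lambda standing in for human input are all invented.

```python
KNOWN_ACTIONS = {"pick up pot", "fill pot with water", "add pasta"}   # toy stand-in for the KG

def llm_plan(task):
    # Placeholder for an LLM call that returns generic abstract steps.
    return ["pick up pot", "fill pot with water", "boil water", "add pasta"]

def refine(steps, ask_human):
    """Ground each generic step; fall back to human input for unknown ones."""
    grounded = []
    for step in steps:
        if step in KNOWN_ACTIONS:
            grounded.append(step)
        else:
            fix = ask_human(step)        # solicit a substitute action
            KNOWN_ACTIONS.add(fix)       # remember it for future tasks
            grounded.append(fix)
    return grounded

plan = refine(llm_plan("cook pasta"),
              ask_human=lambda step: "turn on stove under pot")
print(plan)
```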
|
|
TuDT12 Regular Session, 315 |
Add to My Program |
Information Gathering, Planning and Control in Challenging Environments |
|
|
Chair: Oh, Hyondong | UNIST |
Co-Chair: Bobadilla, Leonardo | Florida International University |
|
16:35-16:40, Paper TuDT12.1 | Add to My Program |
LCD-RIG: Limited Communication Decentralized Robotic Information Gathering Systems |
|
Redwan Newaz, Abdullah Al | University of New Orleans |
Padrao, Paulo | Florida International University |
Fuentes, Jose | Florida International University |
Alam, Tauhidul | Lamar University |
Govindarajan, Ganesh | Florida International University |
Bobadilla, Leonardo | Florida International University |
Keywords: Environment Monitoring and Management, Planning, Scheduling and Coordination, Distributed Robot Systems
Abstract: Effective data collection in collaborative information-gathering systems relies heavily on maintaining uninterrupted connectivity. Yet, real-world communication disruptions often pose challenges to information-gathering processes. To address this issue, we introduce a novel method: a limited communication decentralized information gathering system for multiple robots to explore environmental phenomena characterized as unknown spatial fields. Our method leverages quadtree structures to ensure comprehensive workspace coverage and efficient exploration. Unlike traditional systems that depend on global and synchronous communication, our method enables robots to share local experiences within a limited transmission range and coordinate their tasks through pairwise and asynchronous communication. Information estimation is facilitated by a Gaussian Process with an Attentive Kernel, allowing adaptive capturing of crucial behavior and data patterns. Our proposed system undergoes validation through simulated scalar field studies in non-stationary environments where multiple robots explore spatial fields. Theoretical guarantees ensure the convergence of distributed area coverage and the regret bounds of distributed online scalar field mapping. We also validate the applicability of our method empirically in a water quality monitoring scenario featuring two Autonomous Surface Vehicles, tasked with constructing a spatial field.
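Information estimation in the abstract is a Gaussian Process over the scalar field; the minimal scikit-learn sketch below fits a GP to sparse samples and queries predictive mean and uncertainty on a grid, where the uncertainty could drive where to sample next. A plain RBF kernel is used purely for illustration, whereas the paper's Attentive Kernel is a different, learned construction.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(40, 2))                   # sampled locations
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)    # toy scalar field readings

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(1e-2),
                              normalize_y=True)
gp.fit(X, y)

# Query a coarse grid; the predictive std could drive where to sample next.
gx, gy = np.meshgrid(np.linspace(0, 10, 20), np.linspace(0, 10, 20))
grid = np.column_stack([gx.ravel(), gy.ravel()])
mean, std = gp.predict(grid, return_std=True)
print("most uncertain cell:", grid[std.argmax()])
```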
|
|
16:40-16:45, Paper TuDT12.2 | Add to My Program |
Multi-Agent Path Planning in Complex Environments Using Gaussian Belief Propagation with Global Path Finding |
|
Jensen, Jens Høigaard | Aarhus University |
Plagborg Bak Sørensen, Kristoffer | Aarhus University |
le Fevre Sejersen, Jonas | Aarhus University |
Sarabakha, Andriy | Aarhus University |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents, Collision Avoidance
Abstract: Multi-agent path planning is a critical challenge in robotics, requiring agents to navigate complex environments while avoiding collisions and optimizing travel efficiency. This work addresses the limitations of existing approaches by combining Gaussian belief propagation with path integration and introducing a novel tracking factor to ensure strict adherence to global paths. The proposed method is tested with two different global path-planning approaches: rapidly exploring random trees and a structured planner, which leverages predefined lane structures to improve coordination. A simulation environment was developed to validate the proposed method across diverse scenarios, each posing unique challenges in navigation and communication. Simulation results demonstrate that the tracking factor reduces path deviation by 28% in single-agent and 16% in multi-agent scenarios, highlighting its effectiveness in improving multi-agent coordination, especially when combined with structured global planning.
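The tracking factor can be thought of as a quadratic penalty pulling every waypoint toward its reference on the global path, balanced against smoothness and inter-agent separation terms. The numpy sketch below minimizes such a sum by plain gradient descent as a stand-in; the actual method performs Gaussian belief propagation over a factor graph, which this illustration does not implement, and all weights are made up.

```python
import numpy as np

def plan(path_ref, other_agent, w_track=1.0, w_smooth=4.0, w_sep=2.0,
         safe_dist=0.5, iters=300, lr=0.05):
    """Gradient-descent stand-in for the factor-graph optimization."""
    x = path_ref.copy()                      # initialize on the global path
    for _ in range(iters):
        g = np.zeros_like(x)
        # Tracking factor: pull every waypoint toward its global-path reference.
        g += w_track * (x - path_ref)
        # Smoothness factor on consecutive waypoints.
        g[1:-1] += w_smooth * (2 * x[1:-1] - x[:-2] - x[2:])
        # Separation factor against the other agent's current plan.
        d = x - other_agent
        dist = np.linalg.norm(d, axis=1, keepdims=True) + 1e-9
        viol = np.maximum(safe_dist - dist, 0.0)
        g -= w_sep * viol * d / dist
        x -= lr * g
    return x

t = np.linspace(0, 1, 20)[:, None]
global_path = np.hstack([t * 10.0, np.zeros_like(t)])          # straight reference
other = np.hstack([t * 10.0, 0.3 * np.ones_like(t)])           # nearby second agent
print(plan(global_path, other)[:3])
```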
|
|
16:45-16:50, Paper TuDT12.3 | Add to My Program |
Olympus: A Jumping Quadruped for Planetary Exploration Utilizing Reinforcement Learning for In-Flight Attitude Control |
|
Olsen, Jørgen Anker | Norwegian University of Science and Technology |
Malczyk, Grzegorz | NTNU |
Alexis, Kostas | NTNU - Norwegian University of Science and Technology |
Keywords: Space Robotics and Automation, Legged Robots
Abstract: Exploring planetary bodies with lower gravity, such as the moon and Mars, allows legged robots to utilize jumping as an efficient form of locomotion, thus giving them a valuable advantage over traditional rovers for exploration. Motivated by this fact, this paper presents the design, simulation, and learning-based "in-flight" attitude control of Olympus, a jumping legged robot tailored to the gravity of Mars. First, the design requirements are outlined, followed by a description of how simulation enabled optimizing the robot's design - from its legs to the overall configuration - towards high vertical jumping, forward jumping distance, and in-flight attitude reorientation. Subsequently, the reinforcement learning policy used to track desired in-flight attitude maneuvers is presented. Successfully crossing the sim2real gap, the learned policy is validated through extensive experimental studies of attitude reorientation maneuvers.
|
|
16:50-16:55, Paper TuDT12.4 | Add to My Program |
THAMP-3D: Tangent-Based Hybrid A* Motion Planning for Tethered Robots in Sloped 3D Terrains |
|
Kumar, Rahul | Northeastern University |
Chipade, Vishnu S. | University of Michigan |
Yong, Sze Zheng | Northeastern University |
Keywords: Motion and Path Planning, Constrained Motion Planning, Nonholonomic Motion Planning
Abstract: This paper introduces a novel motion planning algorithm designed for a team of curvature-constrained tethered robots operating on sloped 3D terrains. Our approach addresses the critical issues of tether-terrain interaction, robot stability, and tether entanglement avoidance. The study focuses on a two-robot system, where stability is primarily dependent on tether tension, which is in turn limited by wheel traction. We propose a path-planning method that strategically utilizes terrain features (e.g., rocks) to augment tether tension through additional friction, thereby enhancing overall system stability. Our algorithm employs a modified tangent graph as the underlying structure for a hybrid A* search, incorporating stability constraints throughout the planning process. The proposed method is extensively evaluated through various simulation experiments, demonstrating its effectiveness in planning safe and efficient paths.
|
|
16:55-17:00, Paper TuDT12.5 | Add to My Program |
Deep Learning Based Topography Aware Gas Source Localization with Mobile Robot |
|
Tian, Changhao | Nanyang Technological University |
Wang, Annan | Nanyang Technological University |
Fan, Han | Örebro University |
Wiedemann, Thomas | German Aerospace Center (DLR) |
Luo, Yifei | Institute of Materials Research and Engineering (IMRE), Agency F |
Yang, Le | Institute of Materials Research and Engineering, Agency for Scie |
Lin, Weisi | Nanyang Technological University |
Lilienthal, Achim J. | Orebro University |
Chen, Xiaodong | Nanyang Technological University |
Keywords: Environment Monitoring and Management, Sensor Fusion, Deep Learning Methods
Abstract: Gas source localization in complex environments is critical for applications such as environmental monitoring, industrial safety, and disaster response. Traditional methods often struggle with the challenges posed by a lack of environmental topography integration, especially when interactions between wind and obstacles distort gas dispersion patterns. In this paper, we propose a deep learning-based approach, which leverages spatial context and environmental mapping to enhance gas source localization. By integrating Simultaneous Localization and Mapping (SLAM) with a U-Net-based model, our method predicts the likelihood of gas source locations by analyzing gas sensor data, wind flow, and topography of the environment represented by a 2D occupancy map. We demonstrate the efficacy of our approach using a wheeled robot equipped with a photoionization detector, a LIDAR, and an anemometer, in various scenarios with dynamic wind fields and multiple obstacles. The results show that our approach can robustly locate gas sources, even in challenging environments with fluctuating wind directions, outperforming conventional methods by utilizing topography contextual information. This study underscores the importance of topographical context in gas source localization and offers a flexible and robust solution for real-world applications. Data and code are publicly available.
|
|
17:00-17:05, Paper TuDT12.6 | Add to My Program |
Gas Source Localization in Unknown Indoor Environments Using Dual-Mode Information-Theoretic Search |
|
Kim, Seunghwan | UNIST |
Seo, Jaemin | UNIST |
Jang, Hongro | UNIST |
Kim, Changseung | Ulsan National Institute of Science and Technology |
Kim, Murim | Korea Institute of Robot and Convergence |
Pyo, Juhyun | Korea Institute of Robotics & Technology Convergence |
Oh, Hyondong | UNIST |
Keywords: Planning under Uncertainty, Environment Monitoring and Management, Robotics in Hazardous Fields
Abstract: This paper proposes a dual-mode planner for localizing gas sources using a mobile sensor in unknown indoor spaces. The complexity of indoor environments creates constraints on search paths, leading to situations where no valid paths can be generated, which are termed dead ends in this paper. The proposed dual-mode planner is designed to effectively address the dead end problem while maintaining efficient search paths. In addition, the absence of analytical dispersion models that can be used in unknown indoor environments presents another critical issue for indoor gas source localization (GSL). To address this, we present an indoor Gaussian dispersion model (IGDM) that can analytically model indoor gas dispersion without a complete map. Finally, we establish a GSL framework for indoor environments along with real-time mapping, utilizing the dual-mode planner and IGDM. This framework is validated in indoor scenarios with a realistic gas dispersion simulator. The simulation results show the high success rate of the proposed method, its ability to reduce search time, and its computational efficiency. Furthermore, through real-world experiments, we demonstrate the potential of the proposed approach as a practical solution, evidenced by its satisfactory performance.
|
|
TuDT13 Regular Session, 316 |
Add to My Program |
Wearable Robotics 2 |
|
|
Chair: Rouse, Elliott | University of Michigan |
Co-Chair: Torielli, Davide | Humanoids and Human Centered Mechatronics (HHCM), Istituto Italiano Di Tecnologia |
|
16:35-16:40, Paper TuDT13.1 | Add to My Program |
Online Design Optimization of Passive Exoskeletons Using Fast Biomechanics Simulation and Reinforcement Learning |
|
Vatsal, Vighnesh | Tata Consultancy Services |
Keywords: Prosthetics and Exoskeletons, Reinforcement Learning, Modeling and Simulating Humans
Abstract: Exoskeletons are being adopted as assistive devices in industries such as manufacturing, logistics, and construction, aimed at reducing musculoskeletal loads in workers. Presently, their design process assumes the user to be quasi-static, optimizing the design parameters for reduction of human joint torques followed by fine-tuning through usability studies and physical prototyping. We present a method for optimizing passive exoskeleton designs before the physical prototyping stage for muscle effort reduction in dynamic tasks such as arm reaching and walking. We employ fast MuJoCo-based simulations of human biomechanics to compute the joint torques, muscle forces and muscle activations while executing task trajectories using pre-trained reinforcement learning models from the literature. We train another set of reinforcement learning models that minimize joint torques and muscle effort rates by varying the exoskeleton's design parameters online during the task motions. Baselines for comparison include the default designs of shoulder and walking assist exoskeletons from the literature, and designs obtained through conventional optimization techniques. In terms of muscle effort rates, the RL-based designs improved upon these baselines by an average of 3.42% and 1.96% respectively in the arm reaching task, and 6.28% and 5.81% in the walking task. Our method can be adapted to evaluate exoskeletons in real-time through motion capture, and for muscle-aware online control of powered exoskeletons.
|
|
16:40-16:45, Paper TuDT13.2 | Add to My Program |
Accurately Modeling the Output Torque and Stiffness of Ankle-Foot Orthoses with a Compliant Linkage Model |
|
Lam, David | University of Michigan - Ann Arbor |
Van Crey, Nikko | University of Michigan Ann Arbor |
Rouse, Elliott | University of Michigan |
Keywords: Wearable Robotics, Prosthetics and Exoskeletons, Physically Assistive Devices
Abstract: The stiffness of passive lower-limb exoskeletons and orthoses governs their assistance. A common practice in the design of these systems is to assume the stiffness of the device is determined only by the intended elastic element (e.g., spring), while the structural components, human attachments, and soft tissues are considered rigid. In practice, the mechanical behavior of orthoses is significantly affected by the compliance of these elements, which drastically impacts the assistance provided. In this work, we present a linkage model with compliant elements that can accurately predict the applied stiffness of ankle-foot orthoses, and retroactively estimate the stiffness of unintended spring elements from published data. The compliant model accurately predicted the torque trajectories of two published passive orthoses with modeled peak torques within 4% to 7% of measured values. In contrast, the rigid model greatly overestimated the peak torques, predicting 203% to 376% of the measured values. The compliant model also indicated that an onboard joint encoder could only measure 52% to 69% of the peak ankle angle recorded with motion capture. The compliant model was also used to reassess the stiffness range of a variable-stiffness orthosis, indicating that its adjustable range is likely 69% of rigid model predictions. Overall, this work highlights the need to consider how unmodeled compliance affects the mechanical behavior of orthoses and provides a foundation for further exploration.
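The central observation is that the intended spring acts in series with structural, attachment, and soft-tissue compliance, so compliances add, the delivered stiffness drops below the spring's nominal value, and a joint encoder across the spring sees only part of the total deflection. The short calculation below illustrates that arithmetic with invented stiffness values; it is not the paper's linkage model.

```python
k_spring = 250.0      # N·m/rad, the intended elastic element (assumed value)
k_structure = 600.0   # N·m/rad, frame and attachment compliance (assumed)
k_tissue = 400.0      # N·m/rad, soft-tissue interface compliance (assumed)

# Springs in series: compliances (1/k) add.
total_compliance = 1.0 / k_spring + 1.0 / k_structure + 1.0 / k_tissue
k_applied = 1.0 / total_compliance

# Under a common torque each element deflects by tau / k, so the fraction of
# the total deflection seen by an encoder spanning only the spring is:
encoder_fraction = (1.0 / k_spring) / total_compliance

print(f"applied stiffness: {k_applied:.1f} N·m/rad "
      f"(vs. {k_spring:.0f} if assumed rigid); "
      f"encoder sees {encoder_fraction:.0%} of the deflection")
```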
|
|
16:45-16:50, Paper TuDT13.3 | Add to My Program |
Towards Neurorobotic Interface for Finger Joint Angle Estimation: A Multi-Stage CNN-LSTM Network with Transfer Learning |
|
Chen, Yun | The University of Alabama |
Zhang, Xinyu | The University of Alabama |
Li, Hui | University of Alabama |
He, Hongsheng | The University of Alabama |
Shou, Wan | University of Arkansas |
Zhang, Qiang | The University of Alabama |
Keywords: Prosthetics and Exoskeletons, Rehabilitation Robotics, Sensor Fusion
Abstract: To maximize the autonomy of individuals with upper limb amputations in daily activities, leveraging forearm muscle information to infer movement intent is a promising research direction. While current prosthetic hand technologies can utilize forearm muscle data to achieve basic movements such as grasping, accurately estimating finger joint angles remains a significant challenge. Therefore, we propose a Multi-Stage Cascade Convolutional Neural Network with a Long Short-Term Memory Network, where an upsampling module is introduced before the downsampling module to enhance model generalization. Additionally, we designed a transfer learning (TL) framework based on parameter freezing, where the pre-trained downsampling module is fixed, and only the upsampling module is updated with a small amount of out-of-distribution data to achieve TL. Furthermore, we compared the performance of unimodal and multimodal models, collecting surface electromyography (sEMG) signals, brightness mode ultrasound images (B-mode US images), and motion capture data simultaneously. The results show that on the validation set, the US image had the lowest error, while on the prediction set, the four-channel sEMG achieved the lowest error. The performance of the multimodal model in both datasets was intermediate between the unimodal models. On the prediction set, the average normalized root mean square error values for the four-channel sEMG, US images, and sensor fusion models across three subjects were 0.170, 0.203, and 0.186, respectively. By utilizing advanced sensor fusion techniques and TL, our approach can reduce the need for extensive data collection and training for new users, making prosthetic control more accessible and adaptable to individual needs.
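The transfer-learning scheme freezes the pre-trained downsampling module and updates only the upsampling module on a small amount of new data. A minimal PyTorch sketch of that freezing pattern is given below; the module shapes, names, and the commented-out checkpoint path are placeholders rather than the paper's network.

```python
import torch
import torch.nn as nn

class AngleEstimator(nn.Module):
    """Placeholder stand-in for the multi-stage CNN-LSTM."""
    def __init__(self):
        super().__init__()
        self.upsample = nn.Sequential(nn.Linear(64, 128), nn.ReLU())
        self.downsample = nn.Sequential(nn.Linear(128, 32), nn.ReLU(),
                                        nn.Linear(32, 5))     # 5 joint angles
    def forward(self, x):
        return self.downsample(self.upsample(x))

model = AngleEstimator()
# model.load_state_dict(torch.load("pretrained.pt"))  # pre-trained weights (assumed path)

for p in model.downsample.parameters():       # freeze the pre-trained part
    p.requires_grad = False

optim = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

# One toy adaptation step on out-of-distribution data.
x, target = torch.randn(8, 64), torch.randn(8, 5)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
optim.step()
```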
|
|
16:50-16:55, Paper TuDT13.4 | Add to My Program |
Design, Characterization, and Validation of a Variable Stiffness Prosthetic Elbow |
|
Milazzo, Giuseppe | Istituto Italiano Di Tecnologia |
Lemerle, Simon | University of Pisa |
Grioli, Giorgio | Istituto Italiano Di Tecnologia |
Bicchi, Antonio | Fondazione Istituto Italiano Di Tecnologia |
Catalano, Manuel Giuseppe | Istituto Italiano Di Tecnologia |
Keywords: Prosthetics and Exoskeletons, Variable Stiffness Actuators, Compliant Joint/Mechanism, Mechanism Design
Abstract: Intuitively, prostheses with user-controllable stiffness could mimic the intrinsic behavior of the human musculoskeletal system, promoting safe and natural interactions and task adaptability in real-world scenarios. However, prosthetic design often disregards compliance because of the additional complexity, weight, and needed control channels. This article focuses on designing a variable stiffness actuator (VSA) with weight, size, and performance compatible with prosthetic applications, addressing its implementation for the elbow joint. While a direct biomimetic approach suggests adopting an agonist-antagonist (AA) layout to replicate the biceps and triceps brachii with elastic actuation, this solution is not optimal to accommodate the varied morphologies of residual limbs. Instead, we employed the AA layout to craft an elbow prosthesis fully contained in the user’s forearm, catering to individuals with distal transhumeral amputations. In addition, we introduce a variant of this design where the two motors are split between the upper arm and forearm to distribute mass and volume more evenly along the bionic limb, enhancing comfort for patients with more proximal amputation levels. We characterize and validate our approach, demonstrating that both architectures meet the target requirements for an elbow prosthesis. The system attains the desired 120° range of motion, achieves the target stiffness range of [2, 60] N·m/rad, and can actively lift up to 3 kg. Our novel design reduces weight by up to 50% compared to existing VSAs for elbow prostheses while achieving performance comparable to the state of the art. Case studies suggest that passive and variable compliance could enable robust and safe interactions and task adaptability in the real world.
|
|
16:55-17:00, Paper TuDT13.5 | Add to My Program |
Long-Term Upper-Limb Prosthesis Myocontrol Via High-Density sEMG and Incremental Learning |
|
Di Domenico, Dario | Italian Institute of Technology |
Boccardo, Nicolò | IIT - Istituto Italiano Di Tecnologia |
Marinelli, Andrea | University of Genova, Italian Institute of Technologies |
Canepa, Michele | Italian Institute of Technology |
Gruppioni, Emanuele | INAIL Prosthesis Center |
Laffranchi, Matteo | Istituto Italiano Di Tecnologia |
Camoriano, Raffaello | Politecnico Di Torino |
Keywords: Prosthetics and Exoskeletons, Intention Recognition, Incremental Learning
Abstract: Noninvasive human-machine interfaces such as surface electromyography (sEMG) have long been employed for controlling robotic prostheses. However, classical controllers are limited to a few degrees of freedom (DoF). More recently, machine learning methods have been proposed to learn personalized controllers from user data. While promising, they often suffer from distribution shift during long-term usage, requiring costly model re-training. Moreover, most prosthetic sEMG sensors have low spatial density, which limits accuracy and the number of controllable motions. In this work, we address both challenges by introducing a novel myoelectric prosthetic system integrating a high-density sEMG (HD-sEMG) setup and incremental learning methods to accurately control 7 motions of the Hannes prosthesis. First, we present a newly designed, compact HD-sEMG interface equipped with 64 dry electrodes positioned over the forearm. Then, we introduce an efficient incremental learning system enabling model adaptation on a stream of data. We thoroughly analyze multiple learning algorithms across 7 subjects, including one with limb absence, and 6 sessions held on different days covering an extended period of several months. The size and time span of the collected data represent a relevant contribution for studying long-term myocontrol performance. Therefore, we release the DELTA dataset together with our experimental code.
|
|
17:00-17:05, Paper TuDT13.6 | Add to My Program |
ChatEMG: Synthetic Data Generation to Control a Robotic Hand Orthosis for Stroke |
|
Xu, Jingxi | Columbia University |
Wang, Runsheng | Columbia University |
Shang, Siqi | Columbia University |
Chen, Ava | Columbia University |
Winterbottom, Lauren | Columbia University |
Hsu, To-Liang | Columbia University |
Chen, Wenxi | Columbia University |
Ahmed, Khondoker | Columbia University |
La Rotta, Pedro Leandro | Columbia University |
Zhu, Xinyue | Columbia University |
Nilsen, Dawn | Columbia University |
Stein, Joel | Columbia University |
Ciocarlie, Matei | Columbia University |
Keywords: Rehabilitation Robotics, Prosthetics and Exoskeletons, Wearable Robotics
Abstract: Intent inferral on a hand orthosis for stroke patients is challenging due to the difficulty of data collection. Additionally, EMG signals exhibit significant variations across different conditions, sessions, and subjects, making it hard for classifiers to generalize. Traditional approaches require a large labeled dataset from the new condition, session, or subject to train intent classifiers; however, this data collection process is burdensome and time-consuming. In this paper, we propose ChatEMG, an autoregressive generative model that can generate synthetic EMG signals conditioned on prompts (i.e., a given sequence of EMG signals). ChatEMG enables us to collect only a small dataset from the new condition, session, or subject and expand it with synthetic samples conditioned on prompts from this new context. ChatEMG leverages a vast repository of previous data via generative training while still remaining context-specific via prompting. Our experiments show that these synthetic samples are classifier-agnostic and can improve intent inferral accuracy for different types of classifiers. We demonstrate that our complete approach can be integrated into a single patient session, including the use of the classifier for functional orthosis-assisted tasks. To the best of our knowledge, this is the first time an intent classifier trained partially on synthetic data has been deployed for functional control of an orthosis by a stroke survivor.
|
|
TuDT14 Regular Session, 402 |
Add to My Program |
Large Models for Manipulation |
|
|
Chair: Ugur, Emre | Bogazici University |
Co-Chair: Mehr, Negar | University of California Berkeley |
|
16:35-16:40, Paper TuDT14.1 | Add to My Program |
Language-Grounded Dynamic Scene Graphs for Interactive Object Search with Mobile Manipulation |
|
Honerkamp, Daniel | Albert Ludwigs Universität Freiburg |
Büchner, Martin | University of Freiburg |
Despinoy, Fabien | Toyota Motor Europe |
Welschehold, Tim | Albert-Ludwigs-Universität Freiburg |
Valada, Abhinav | University of Freiburg |
Keywords: Mobile Manipulation, Integrated Planning and Learning, Domestic Robotics
Abstract: To fully leverage the capabilities of mobile manipulation robots, it is imperative that they are able to autonomously execute long-horizon tasks in large unexplored environments. While large language models (LLMs) have shown emergent reasoning skills on arbitrary tasks, existing work primarily concentrates on explored environments, typically focusing on either navigation or manipulation tasks in isolation. In this work, we propose MoMa-LLM, a novel approach that grounds language models within structured representations derived from open-vocabulary scene graphs, dynamically updated as the environment is explored. We tightly interleave these representations with an object-centric action space. The resulting approach, given object detections, is zero-shot, open-vocabulary, and readily extendable to a spectrum of mobile manipulation and household robotic tasks. We demonstrate the effectiveness of MoMa-LLM in a novel semantic interactive search task in large realistic indoor environments. In extensive experiments in both simulation and the real world, we show substantially improved search efficiency compared to conventional baselines and state-of-the-art approaches, as well as its applicability to more abstract tasks. We make the code publicly available at https://moma-llm.cs.uni-freiburg.de.
|
|
16:40-16:45, Paper TuDT14.2 | Add to My Program |
Here's Your PDDL Problem File! On Using VLMs for Generating Symbolic PDDL Problem Files |
|
Aregbede, Victor | Örebro University |
Forte, Paolo | Örebro University |
Gupta, Himanshu | Örebro University |
Andreasson, Henrik | Örebro University |
Köckemann, Uwe | Orebro Universitet |
Lilienthal, Achim J. | Orebro University |
Keywords: AI-Enabled Robotics, Task and Motion Planning
Abstract: Large Language Models (LLMs) excel at generating contextually relevant text but lack logical reasoning abilities. They rely on statistical patterns rather than logical inference, making them unreliable for structured decision-making. Integrating LLMs with task planning can address this limitation by combining their natural language understanding with the precise, goal-oriented reasoning of planners. This paper introduces ViPlan, a hybrid system that leverages Vision Language Models (VLMs) to extract high-level semantic information from visual and textual inputs while integrating classical planners for logical reasoning. ViPlan utilizes VLMs to generate syntactically correct and semantically meaningful PDDL problem files from images and natural language instructions, which are then processed by a task planner to generate an executable plan. The entire process is embedded within a behavior tree framework, enhancing efficiency, reactivity, replanning, modularity, and flexibility. The generation and planning capabilities of ViPlan are empirically evaluated with simulated and real-world experiments.
|
|
16:45-16:50, Paper TuDT14.3 | Add to My Program |
MuST: Multi-Head Skill Transformer for Long-Horizon Dexterous Manipulation with Skill Progress |
|
Gao, Kai | Rutgers University |
Wang, Fan | Amazon Robotics |
Aduh, Erica | Amazon Robotics |
Randle, Dylan Labatt | Amazon Robotics |
Shi, Jane | Amazon |
Keywords: Industrial Robots, Dexterous Manipulation, Learning from Demonstration
Abstract: Robot picking and packing tasks require dexterous manipulation skills, such as rearranging objects to establish a good grasping pose, or placing and pushing items to achieve tight packing. These tasks are challenging for robots due to the complexity and variability of the required actions. To tackle the difficulty of learning and executing long-horizon tasks, we propose a novel framework called the Multi-Head Skill Transformer (MuST). This model is designed to learn and sequentially chain together multiple motion primitives (skills), enabling robots to perform complex sequences of actions effectively. MuST introduces a "progress value" for each skill, guiding the robot on which skill to execute next and ensuring smooth transitions between skills. Additionally, our model is capable of expanding its skill set and managing various sequences of sub-tasks efficiently. Extensive experiments in both simulated and real-world environments demonstrate that MuST significantly enhances the robot's ability to perform long-horizon dexterous manipulation tasks. The accompanying video is available online.
|
|
16:50-16:55, Paper TuDT14.4 | Add to My Program |
CurricuLLM: Automatic Task Curricula Design for Learning Complex Robot Skills Using Large Language Models |
|
Ryu, Kanghyun | University of California, Berkeley |
Liao, Qiayuan | University of California, Berkeley |
Li, Zhongyu | University of California, Berkeley |
Delgosha, Payam | UIUC |
Sreenath, Koushil | University of California, Berkeley |
Mehr, Negar | University of California Berkeley |
Keywords: Incremental Learning, Continual Learning, Transfer Learning
Abstract: Curriculum learning is a training mechanism in reinforcement learning (RL) that facilitates the achievement of complex policies by progressively increasing the task difficulty during training. However, designing effective curricula for a specific task often requires extensive domain knowledge and human intervention, which limits its applicability across various domains. Our core idea is that large language models (LLMs), with their extensive training on diverse language data and ability to encapsulate world knowledge, present significant potential for efficiently breaking down tasks and decomposing skills across various robotics environments. Additionally, the demonstrated success of LLMs in translating natural language into executable code for RL agents strengthens their role in generating task curricula. In this work, we propose CurricuLLM, which leverages the high-level planning and programming capabilities of LLMs for curriculum design, thereby enhancing the efficient learning of complex target tasks. CurricuLLM consists of: (Step 1) generating a sequence of subtasks, expressed in natural language, that aid target task learning; (Step 2) translating the natural language descriptions of subtasks into executable task code, including the reward code and goal distribution code; and (Step 3) evaluating trained policies based on trajectory rollouts and the subtask descriptions. We evaluate CurricuLLM in various robotics simulation environments, spanning manipulation, navigation, and locomotion, to show that CurricuLLM can aid the learning of complex robot control tasks. In addition, we validate the humanoid locomotion policy learned through CurricuLLM in the real world. The project website is https://iconlab.negarmehr.com/CurricuLLM/
|
|
16:55-17:00, Paper TuDT14.5 | Add to My Program |
PUGS: Zero-Shot Physical Understanding with Gaussian Splatting |
|
Shuai, Yinghao | Tongji University |
Yu, Ran | Tsinghua University |
Chen, Yuantao | Xi'an University of Architecture and Technology |
Jiang, Zijian | Tongji University |
Song, Xiaowei | Tongji University |
Wang, Nan | Tongji University |
Zheng, Jv | Lightwheel.AI |
Ma, Jianzhu | Tsinghua University |
Yang, Meng | MGI |
Wang, Zhicheng | Tongji University |
Ding, Wenbo | Tsinghua University |
Zhao, Hao | Tsinghua University |
Keywords: Contact Modeling, Semantic Scene Understanding
Abstract: Current robotic systems can understand the categories and poses of objects well, but understanding physical properties such as mass, friction, and hardness in the wild remains challenging. We propose a new method that reconstructs 3D objects using the Gaussian splatting representation and predicts various physical properties in a zero-shot manner. We propose two techniques during the reconstruction phase: a geometry-aware regularization loss function to improve the shape quality and a region-aware feature contrastive loss function to promote region affinity. Two other new techniques are designed during inference: a feature-based property propagation module and a volume integration module tailored for the Gaussian representation. Our framework is named zero-shot physical understanding with Gaussian splatting, or PUGS. PUGS achieves new state-of-the-art results on the standard ABO-500 mass prediction benchmark. We provide extensive quantitative ablations and qualitative visualizations to demonstrate the mechanism of our designs. We show that the proposed methodology can help address challenging real-world grasping tasks. Our codes, data, and models are available at https://github.com/EverNorif/PUGS
|
|
17:00-17:05, Paper TuDT14.6 | Add to My Program |
ViewInfer3D: 3D Visual Grounding Based on Embodied Viewpoint Inference |
|
Geng, Liang | Beijing University of Posts and Telecommunications, Shijiazhuang |
Yin, Jianqin | Beijing University of Posts and Telecommunications |
Keywords: Embodied Cognitive Science, Human-Robot Collaboration, Intention Recognition
Abstract: 3D Visual Grounding (3D VG) is a fundamental task in embodied intelligence that entails robots interpreting natural language descriptions to locate objects within 3D environments. The complexity of this task emerges as robots perceive the spatial relationships of objects differently depending on their observational viewpoints. In this work, we propose ViewInfer3D, a framework that leverages Large Language Models (LLMs) to infer embodied viewpoints, thereby avoiding incorrect observational viewpoints. To enhance the reliability and speed of reasoning from embodied viewpoints, we have designed three sub-strategies: constructing a hierarchical 3D scene graph, implementing embodied viewpoint parsing, and applying scene graph reasoning. Through extensive experiments, we demonstrate that this framework can improve performance in 3D Visual Grounding tasks through embodied viewpoint reasoning. Our framework achieves the best performance among all zero-shot methods on the ScanRefer and Nr3D/Sr3D datasets, without significantly increasing inference time.
|
|
TuDT15 Regular Session, 403 |
Add to My Program |
Surgical Robotics: Planning |
|
|
Chair: Howe, Robert D. | Harvard University |
Co-Chair: Bano, Sophia | University College London |
|
16:35-16:40, Paper TuDT15.1 | Add to My Program |
Image-Guided Surgical Planning for Percutaneous Nephrolithotomy Using CTRs: A Phantom-Based Study |
|
Pedrosa, Filipe | Western University |
Feizi, Navid | Brigham and Women's Hospital |
Sacco, Dianne | Harvard University |
Patel, Rajni | University of Western Ontario |
Jayender, Jagadeesan | Harvard Medical School, Brigham and Women's Hospital |
Keywords: Surgical Robotics: Planning, Surgical Robotics: Steerable Catheters/Needles, Medical Robots and Systems
Abstract: In this paper, we validate the effectiveness of the optimal planning algorithms we have developed for devising surgical plans for Percutaneous Nephrolithotomy (PCNL) using patient-specific Concentric-Tube Robots (CTRs). To do so, we built a life-sized phantom model of the right hemithorax, replicating the anatomy of a patient who suffered from kidney stones and underwent conventional PCNL. Two-dimensional CT scans of the phantom model and its 3D reconstruction enabled the creation of a surgical plan using our planning algorithms based on a puncture into the mid-pole of the kidney. This plan was compared with two other percutaneous tracts involving punctures into the lower and upper calyces. The optimal mid-pole plan achieved 84% stone coverage, significantly outperforming the lower pole (58%) and upper pole (45%) plans. These results validate the effectiveness of the algorithms and align with simulation-based findings from previous studies, which reported an average volume coverage of 81.6±19.6% in clinical cases.
|
|
16:40-16:45, Paper TuDT15.2 | Add to My Program |
Vision-Based Automatic Control of a Surgical Robot for Posterior Segment Ophthalmic Surgery (I) |
|
Wang, Ning | Xi'an Jiaotong University |
Zhang, Xiaodong | Xi'an Jiaotong University |
Bano, Sophia | University College London |
Stoyanov, Danail | University College London |
Zhang, Hongbing | The First Affiliated Hospital of Northwestern University |
Stilli, Agostino | University College London |
Keywords: Medical Robots and Systems, Surgical Robotics: Planning, Vision-Based Navigation
Abstract: In ophthalmic surgery, especially in posterior segment procedures, clinicians face significant challenges, such as the inherent tremor of the surgeon’s arm, restricted visibility, and heavy reliance on the surgeon’s skills for precise control of hand-held tools during micro-surgical movements. Automatic control of robotic-assisted ophthalmic surgical systems has the potential to overcome these challenges, simplifying complex surgical procedures. This paper proposes a novel image-guided automatic control method for an Ophthalmic micro-Surgical Robot (OmSR), specifically designed for posterior segment eye surgery. The method relies on forceps shadow tracking. The paper introduces a tip detection network (Net-SR), which accurately calculates the coordinates of the Tips of Surgical Forceps (ToSF) and Tips of Shadow (ToS) to enable automatic navigation. Additionally, through Non-Uniform Rational B-Spline (NURBS) curve interpolation and a speed look-ahead algorithm, dense and time-continuous data points are obtained to improve control accuracy and smoothness. The accuracy of the Net-SR network, the motion of the ToSF, and the effectiveness of the proposed automatic controller are experimentally evaluated. Results demonstrate a significant 98.21% improvement in Net-SR network accuracy over a standard keypoint detection network. The use of the speed look-ahead algorithm leads to a notable 41.7% improvement in optimal speed, and the ToSF successfully reaches the target lesion.
|
|
16:45-16:50, Paper TuDT15.3 | Add to My Program |
ETSM: Automating Dissection Trajectory Suggestion and Confidence Map-Based Safety Margin Prediction for Robot-Assisted Endoscopic Submucosal Dissection |
|
Xu, Mengya | National University of Singapore |
Mo, Wenjin | Sun Yat-Sen University |
Wang, Guankun | The Chinese University of Hong Kong |
Gao, Huxin | National University of Singapore |
Wang, An | The Chinese University of Hong Kong |
Bai, Long | The Chinese University of Hong Kong |
Li, Zhen | Qilu Hospital of Shandong University |
Yang, Xiaoxiao | Qilu Hospital of Shandong University |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: Surgical Robotics: Planning, Data Sets for Robotic Vision, AI-Enabled Robotics
Abstract: Robot-assisted Endoscopic Submucosal Dissection (ESD) improves the surgical procedure by providing a more comprehensive view through advanced robotic instruments and bimanual operation, thereby enhancing dissection efficiency and accuracy. Accurate prediction of dissection trajectories is crucial for better decision-making, reducing intraoperative errors, and improving surgical training. Nevertheless, predicting these trajectories is challenging due to variable tumor margins and dynamic visual conditions. To address this issue, we create the ESD Trajectory and Confidence Map-based Safety Margin (ETSM) dataset with 1849 short clips, focusing on submucosal dissection with a dual-arm robotic system. We also introduce a framework that combines optimal dissection trajectory prediction with a confidence map-based safety margin, providing a more secure and intelligent decision-making tool to minimize surgical risks for ESD procedures. Additionally, we propose the Regression-based Confidence Map Prediction Network (RCMNet), which utilizes a regression approach to predict confidence maps for dissection areas, thereby delineating various levels of safety margins. We evaluate our RCMNet using three distinct experimental setups: in-domain evaluation, robustness assessment, and out-of-domain evaluation. Experimental results show that our approach excels in the confidence map-based safety margin prediction task, achieving a mean absolute error (MAE) of only 3.18. To the best of our knowledge, this is the first study to apply a regression approach for visual guidance concerning delineating varying safety levels of dissection areas. Our approach bridges gaps in current research by improving prediction accuracy and enhancing the safety of the dissection process, showing great clinical significance in practice. The dataset and code will be made available.
|
|
16:50-16:55, Paper TuDT15.4 | Add to My Program |
Partial-To-Full Registration Based on Gradient-SDF for Computer-Assisted Orthopedic Surgery |
|
Li, Tiancheng | University of Technology Sydney |
Walker, Peter | Concord Repatriation General Hospital |
Danial, Hammoud | Concord Repatriation General Hospital |
Zhao, Liang | The University of Edinburgh |
Huang, Shoudong | University of Technology, Sydney |
Keywords: Surgical Robotics: Planning
Abstract: In computer-assisted orthopedic surgery (CAOS), accurate pre-operative to intra-operative bone registration is an essential and critical requirement for providing navigational guidance. This registration process is challenging since the intra-operative 3D points are sparse, only partially overlapped with the pre-operative model, and disturbed by noise and outliers. The method commonly used in current state-of-the-art orthopedic robotic systems is bony landmark-based registration, but it is very time-consuming for surgeons. To address these issues, we propose a novel partial-to-full registration framework based on gradient-SDF for CAOS. Simulation experiments using bone models from publicly available datasets and phantom experiments performed under both optical tracking and electromagnetic tracking systems demonstrate that the proposed method provides more accurate results than standard benchmarks and is robust to 90% outliers. Importantly, our method achieves convergence in less than 1 second in real scenarios and mean target registration error values as low as 2.198 mm for the entire bone model. Finally, it only requires random acquisition of points for registration by moving a surgical probe over the bone surface, without correspondence with any specific bony landmarks, thus showing significant potential clinical value. The code of the framework is available.
|
|
16:55-17:00, Paper TuDT15.5 | Add to My Program |
Sampling-Based Model Predictive Control for Volumetric Ablation in Robotic Laser Surgery |
|
Wang, Vincent | Duke University |
Prakash, Ravi | Duke University |
Oca, Siobhan | Duke University |
LoCicero, Ethan | Duke University |
Codd, Patrick | Duke University |
Bridgeman, Leila | Duke University |
Keywords: Surgical Robotics: Planning, Constrained Motion Planning, Integrated Planning and Control
Abstract: Laser-based surgical ablation relies heavily on surgeon involvement, restricting precision to the limits of human error and perception. The interaction between laser and tissue is governed by various laser parameters that control the laser irradiance on the tissue, including the power, distance, spot size, orientation, and exposure time. This complex interaction lends itself to robotic automation, allowing the surgeon to focus on high-level tasks, such as choosing the region and method of ablation, while the lower-level ablation plan can be handled autonomously. This paper describes a sampling-based model predictive control (MPC) scheme to plan ablation sequences for arbitrary tissue volumes. Using a steady-state point ablation model to simulate a single laser-tissue interaction, a random search technique explores the reachable state space while preserving sensitive tissue regions. The sampled MPC strategy provides an ablation sequence that accounts for parameter uncertainty without violating constraints, such as avoiding nerve bundles.
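As a purely illustrative sketch of the random-shooting flavor of sampling-based MPC referred to in the abstract above, the Python fragment below samples candidate laser-power sequences, rolls them out through a model, discards infeasible ones, and applies the first action of the best sequence. The one-dimensional ablation-depth model, cost, and constraint here are invented placeholders, not the paper's steady-state laser-tissue model.

import numpy as np

rng = np.random.default_rng(0)

def step(depth, power, dt=0.1, k=0.05):
    # Placeholder point-ablation model: ablated depth grows with applied laser power.
    return depth + k * power * dt

def cost(depth, target_depth):
    # Penalize deviation from the desired ablation depth.
    return (depth - target_depth) ** 2

def violates(depth, max_depth):
    # Constraint placeholder, e.g. never ablate past sensitive tissue.
    return depth > max_depth

def random_shooting_mpc(depth0, target_depth, max_depth,
                        horizon=10, n_samples=256, power_max=5.0):
    # One receding-horizon step: sample power sequences, roll out the model,
    # discard infeasible sequences, and keep the lowest-cost feasible one.
    best_cost, best_seq = np.inf, None
    for _ in range(n_samples):
        powers = rng.uniform(0.0, power_max, size=horizon)
        depth, total, feasible = depth0, 0.0, True
        for p in powers:
            depth = step(depth, p)
            if violates(depth, max_depth):
                feasible = False
                break
            total += cost(depth, target_depth)
        if feasible and total < best_cost:
            best_cost, best_seq = total, powers
    # Apply only the first action, then replan at the next step.
    return None if best_seq is None else best_seq[0]

print(random_shooting_mpc(depth0=0.0, target_depth=1.0, max_depth=1.2))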
|
|
17:00-17:05, Paper TuDT15.6 | Add to My Program |
SuFIA-BC: Generating High Quality Demonstration Data for Visuomotor Policy Learning in Surgical Subtasks |
|
Moghani, Masoud | University of Toronto |
Nelson, Nigel | NVIDIA |
Ghanem, Mohamed | Georgia Institute of Technology |
Diaz-Pinto, Andres | NVIDIA |
Hari, Kush | UC Berkeley |
Azizian, Mahdi | Intuitive Surgical |
Goldberg, Ken | UC Berkeley |
Huver, Sean | NVIDIA |
Garg, Animesh | Georgia Institute of Technology |
Keywords: Surgical Robotics: Planning, Surgical Robotics: Laparoscopy, Medical Robots and Systems
Abstract: Behavior cloning facilitates the learning of dexterous manipulation skills, yet the complexity of surgical environments, the difficulty and expense of obtaining patient data, and robot calibration errors present unique challenges for surgical robot learning. We provide an enhanced surgical digital twin with photorealistic human anatomical organs, integrated into a comprehensive simulator designed to generate high-quality synthetic data to solve fundamental tasks in surgical autonomy. We present SuFIA-BC: visual Behavior Cloning policies for Surgical First Interactive Autonomy Assistants. We investigate visual observation spaces including multi-view cameras and 3D visual representations extracted from a single endoscopic camera view. Through systematic evaluation, we find that the diverse set of photorealistic surgical tasks introduced in this work enables a comprehensive evaluation of prospective behavior cloning models for the unique challenges posed by surgical environments. We observe that current state-of-the-art behavior cloning techniques struggle to solve the contact-rich and complex tasks evaluated in this work, regardless of their underlying perception or control architectures. These findings highlight the importance of customizing perception pipelines and control architectures, as well as curating larger-scale synthetic datasets that meet the specific demands of surgical tasks. Project website: orbit-surgical.github.io/sufia-bc/
|
|
TuDT16 Regular Session, 404 |
Add to My Program |
Manipulation 4 |
|
|
Chair: Agrawal, Pulkit | MIT |
Co-Chair: Bauza Villalonga, Maria | Massachusetts Institute of Technology |
|
16:35-16:40, Paper TuDT16.1 | Add to My Program |
Vegetable Peeling: A Case Study in Constrained Dexterous Manipulation |
|
Chen, Tao | Massachusetts Institute of Technology |
Cousineau, Eric | Toyota Research Institute |
Kuppuswamy, Naveen | Toyota Research Institute |
Agrawal, Pulkit | MIT |
Keywords: Dexterous Manipulation, In-Hand Manipulation, Reinforcement Learning
Abstract: Recent studies have made significant progress in addressing dexterous manipulation problems, particularly in in-hand object reorientation. However, there are few existing works that explore the potential utilization of developed dexterous manipulation controllers for downstream tasks. In this study, we focus on constrained dexterous manipulation for food peeling. Food peeling presents various constraints on the reorientation controller, such as the requirement for the hand to securely hold the object after reorientation for peeling. We propose a simple system for learning a reorientation controller that facilitates the subsequent peeling task.
|
|
16:40-16:45, Paper TuDT16.2 | Add to My Program |
Prompt-Responsive Object Retrieval with Memory-Augmented Student-Teacher Learning |
|
Mosbach, Malte | University of Bonn |
Behnke, Sven | University of Bonn |
Keywords: Reinforcement Learning, Dexterous Manipulation, Grasping
Abstract: Building models responsive to input prompts represents a transformative shift in machine learning. This paradigm holds significant potential for robotics problems, such as targeted manipulation amidst clutter. In this work, we present a novel approach to combine promptable foundation models with reinforcement learning (RL), enabling robots to perform dexterous manipulation tasks in a prompt-responsive manner. Existing methods struggle to link high-level commands with fine-grained dexterous control. We address this gap with a memory-augmented student-teacher learning framework. We use the Segment-Anything 2 model as a perception backbone to infer an object of interest from user prompts. While detections are imperfect, their temporal sequence provides rich information for implicit state estimation by memory-augmented models. Our approach successfully learns prompt-responsive policies, demonstrated in picking objects from cluttered scenes. Videos and code are available at https://memory-student-teacher.github.io.
|
|
16:45-16:50, Paper TuDT16.3 | Add to My Program |
Implicit Articulated Robot Morphology Modeling with Configuration Space Neural Signed Distance Functions |
|
Chen, Yiting | Rice University |
Gao, Xiao | École Polytechnique Fédérale De Lausanne |
Yao, Kunpeng | Massachusetts Institute of Technology |
Niederhauser, Loïc | EPFL |
Bekiroglu, Yasemin | Chalmers University of Technology, University College London |
Billard, Aude | EPFL |
Keywords: Manipulation Planning, Collision Avoidance, Grasping
Abstract: In this paper, we introduce a novel approach to implicitly encode precise robot morphology using forward kinematics based on a configuration space signed distance function. Our proposed Robot Neural Distance Function (RNDF) optimizes the balance between computational efficiency and accuracy for signed distance queries conditioned on the robot's configuration for each link. Compared to the baseline method, the proposed approach achieves an 81.1% reduction in distance error while utilizing only 47.6% of model parameters. Its parallelizable and differentiable nature provides direct access to joint-space derivatives, enabling a seamless connection between robot planning in Cartesian task space and configuration space. These features make RNDF an ideal surrogate model for general robot optimization and learning in 3D spatial planning tasks. Specifically, we apply RNDF to robotic arm-hand modeling and demonstrate its potential as a core platform for whole-arm, collision-free grasp planning in cluttered environments. The code and model are available at https://github.com/robotic-manipulation/RNDF.
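For readers unfamiliar with configuration-conditioned signed distance networks, the PyTorch sketch below shows the general input/output shape of such a model: joint angles and a Cartesian query point go in, one signed distance per link comes out, and joint-space gradients are available by backpropagation. The layer sizes and the 7-joint/8-link assumption are arbitrary; this is not the released RNDF architecture.

import torch
import torch.nn as nn

class ConfigSDF(nn.Module):
    # Toy configuration-conditioned SDF: (joint angles, query point) -> per-link distances.
    def __init__(self, n_joints=7, n_links=8, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_joints + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_links),   # one signed distance per link
        )

    def forward(self, q, p):
        # q: (B, n_joints) joint configuration, p: (B, 3) Cartesian query point
        return self.net(torch.cat([q, p], dim=-1))

model = ConfigSDF()
q = torch.zeros(1, 7, requires_grad=True)
p = torch.tensor([[0.3, 0.0, 0.5]])
d = model(q, p)                      # (1, 8) signed distances
d.min().backward()                   # differentiable w.r.t. the configuration
print(d.shape, q.grad.shape)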
|
|
16:50-16:55, Paper TuDT16.4 | Add to My Program |
A Data-Efficient Progressive Learning Framework for Robot Scooping Task |
|
Wang, Shuai | Tencent |
Entang, Wang | Saarland University |
Huang, Bidan | Tencent |
Zhang, Chong | Tencent |
Wang, Wei | Harbin Institute of Technology, Shenzhen |
Zheng, Yu | Tencent |
Keywords: Manipulation Planning, Grippers and Other End-Effectors
Abstract: Robot scooping is a challenging and important task in robotic tool manipulation research due to the complex relationship between the robot, the tool, and the target objects and environment. Across different tools, different target objects, and varying environments, the required scooping manipulation strategy usually varies greatly. Even for a specific type of spoon, the question of how to obtain a policy model that requires less demonstration data but shows better generalization capabilities deserves further exploration. In this paper, we propose a progressive learning framework for general robot scooping tasks, which requires a limited number of demonstrations but shows promising generalization capability. We first learn a scooping policy via human demonstrations with a specific setup. We then use this as a pre-trained model for reinforcement learning in a curriculum manner to achieve a scooping strategy that generalizes to different task setups. Finally, we evaluate the capabilities of the policy with a series of experiments both in simulation and on a real robot.
|
|
16:55-17:00, Paper TuDT16.5 | Add to My Program |
Manipulability Transfer and Tracking Control: Bridging Domain Adaptation with Predictive Feasibility |
|
Gong, Yuhe | University of Nottingham |
Xing, Hao | Technical University of Munich (TUM) |
Guo, Yu | Technical University of Munich |
Figueredo, Luis | University of Nottingham (UoN) |
Keywords: Manipulation Planning, Learning from Demonstration, Human Factors and Human-in-the-Loop
Abstract: This paper introduces a novel framework for improving human-to-robot manipulability transfer and tracking in Learning by Demonstration. Our approach addresses key challenges, including manipulability ellipsoid (ME) domain adaptation between different kinematic structures, ME-IK feasibility checks and optimization across trajectories accounting for the robot's redundancy, and introducing a manipulability-aware control strategy. Leveraging a unified quadratic programming control with vector-field inequalities, our method enables robust tracking and optimization of manipulability, accommodating multiple demonstrations and the inherent variability in task execution. Experimental results demonstrate superior performance in precise tracking and force generation compared to traditional methods, highlighting the advantages of incorporating human implicit information for more effective robot control.
|
|
17:00-17:05, Paper TuDT16.6 | Add to My Program |
Adaptive Contact-Rich Manipulation through Few-Shot Imitation Learning with Force-Torque Feedback and Pre-Trained Object Representations |
|
Tsuji, Chikaha | The University of Tokyo |
Coronado, Enrique | National Institute of Advanced Industrial Science and Technology |
Osorio, Pablo | Tokyo University of Agriculture and Technology |
Venture, Gentiane | The University of Tokyo |
Keywords: Deep Learning in Grasping and Manipulation, Imitation Learning, Force Control
Abstract: Imitation learning offers a pathway for robots to perform repetitive tasks, allowing humans to focus on more engaging and meaningful activities. However, challenges arise from the need for extensive demonstrations and the disparity between training and real-world environments. This paper focuses on contact-rich tasks like wiping with soft and deformable objects, requiring adaptive force control to handle variations in wiping surface height and the sponge's physical properties. To address these challenges, we propose a novel method that integrates real-time force-torque (FT) feedback with pre-trained object representations. This approach allows robots to dynamically adjust to previously unseen changes in surface heights and sponges' physical properties. In real-world experiments, our method achieved 96% accuracy in applying the average reference force, significantly outperforming the previous method that lacked an FT feedback loop, which only achieved 4% accuracy. To evaluate the adaptability of our approach, we conducted experiments under different conditions from the training setup, involving 40 scenarios using 10 sponges with varying physical properties and 4 types of wiping surface heights, demonstrating significant improvements in the robot's adaptability by analyzing force trajectories.
|
|
TuDT17 Regular Session, 405 |
Add to My Program |
Localization 2 |
|
|
Chair: Dellaert, Frank | Georgia Institute of Technology |
Co-Chair: Kim, Jinwhan | KAIST |
|
16:35-16:40, Paper TuDT17.1 | Add to My Program |
GS-EVT: Cross-Modal Event Camera Tracking Based on Gaussian Splatting |
|
Liu, Tao | ShanghaiTech University |
Yuan, Runze | Shanghaitech University |
Ju, Yiang | Shanghaitech |
Xu, Xun | ShanghaiTech University |
Yang, Jiaqi | ShanghaiTech University |
Meng, Xiangting | ShanghaiTech University |
Lagorce, Xavier | ShanghaiTech University |
Kneip, Laurent | ShanghaiTech University |
Keywords: Localization, SLAM, Deep Learning for Visual Perception
Abstract: Reliable self-localization is a foundational skill for many intelligent mobile platforms. This paper explores the use of event cameras for motion tracking, thereby providing a solution with inherent robustness under difficult dynamics and illumination. In order to circumvent the challenge of event camera-based mapping, the solution is framed in a cross-modal way: it tracks a map representation that comes directly from frame-based cameras. Specifically, the proposed method operates on top of Gaussian splatting, a state-of-the-art representation that permits highly efficient and realistic novel view synthesis. The key to our approach is a novel pose parametrization that uses a reference pose plus first-order dynamics for local differential image rendering. The latter is then compared against images of integrated events in a staggered coarse-to-fine optimization scheme. As demonstrated by our results, the realistic view rendering ability of Gaussian splatting leads to stable and accurate tracking across a variety of both publicly available and newly recorded data sequences.
|
|
16:40-16:45, Paper TuDT17.2 | Add to My Program |
A Coarse-To-Fine Event-Based Framework for Camera Pose Relocalization with Spatio-Temporal Retrieval and Refinement Network |
|
Song, Yuhang | Northeastern University - China |
Zhuang, Hao | Northeastern University |
Jiang, Junjie | Northeastern University |
Liu, Zuntao | Northeastern University of China |
Fang, Zheng | Northeastern University |
Keywords: Localization, SLAM, Deep Learning for Visual Perception
Abstract: Most existing event-based camera pose relocalization (CPR) learning methods implicitly encode environmental information into network parameters to achieve end-to-end mapping from event streams to poses. However, these end-to-end CPR methods fail to utilize prior environmental information effectively. As the scale of the environment increases, the difficulty of this mapping grows significantly, reducing the robustness of end-to-end methods across different scenarios. To address these issues, this paper proposes the first coarse-to-fine event-based CPR framework, which shifts the paradigm from end-to-end pose regression networks to a hierarchical approach. In the coarse localization stage, we effectively encode similarity features by incorporating fine-grained temporal information, achieving accurate retrieval of nearby event streams. In the pose refinement stage, we present an Event Spatio-temporal Pose Refinement Network (ESPR-Net) based on the Recurrent Convolutional Neural Network (RCNN) architecture, which is capable of learning more nuanced spatio-temporal features to achieve accurate regression of the relative pose. Finally, we conducted a comprehensive comparison on the IJRR and M3ED datasets, achieving state-of-the-art (SOTA) performance on both. Notably, our method attains a significant 83% performance improvement on the outdoor M3ED dataset.
|
|
16:45-16:50, Paper TuDT17.3 | Add to My Program |
Digital Beamforming Enhanced Radar Odometry |
|
Jiang, Jingqi | Imperial College London |
Xu, Shida | Imperial College London |
Zhang, Kaicheng | Imperial College London |
Wei, Jiyuan | Imperial College London |
Wang, Jingyang | Tsinghua University |
Wang, Sen | Imperial College London |
Keywords: Localization, Mapping, SLAM
Abstract: Radar has become an essential sensor for autonomous navigation, especially in challenging environments where camera and LiDAR sensors fail. 4D single-chip millimeter-wave radar systems, in particular, have drawn increasing attention thanks to their ability to provide spatial and Doppler information with low hardware cost and power consumption. However, most single-chip radar systems using traditional signal processing, such as Fast Fourier Transform, suffer from limited spatial resolution in radar detection, significantly limiting the performance of radar-based odometry and Simultaneous Localization and Mapping (SLAM) systems. In this paper, we develop a novel radar signal processing pipeline that integrates spatial domain beamforming techniques, and extend it to 3D Direction of Arrival estimation. Experiments using public datasets are conducted to evaluate and compare the performance of our proposed signal processing pipeline against traditional methodologies. These tests specifically focus on assessing structural precision across diverse scenes and measuring odometry accuracy in different radar odometry systems. This research demonstrates the feasibility of achieving more accurate radar odometry by simply replacing the standard FFT-based processing with the proposed pipeline. The codes are available at GitHub.
|
|
16:50-16:55, Paper TuDT17.4 | Add to My Program |
Fast Global Localization on Neural Radiance Field |
|
Kong, Mangyu | Yonsei University |
Lee, Jaewon | Yonsei University |
Lee, Seongwon | Kookmin University |
Kim, Euntai | Yonsei University |
Keywords: Localization, Mapping, SLAM
Abstract: Neural Radiance Fields (NeRF) presented a novel way to represent scenes, allowing for high-quality 3D reconstruction from 2D images. Following its remarkable achievements, global localization within NeRF maps is an essential task for enabling a wide range of applications. Recently, Loc-NeRF demonstrated a localization approach that combines traditional Monte Carlo Localization with NeRF, showing promising results for using NeRF as an environment map. However, despite its advancements, Loc-NeRF encounters the challenge of a time-intensive ray rendering process, which can be a significant limitation in practical applications. To address this issue, we introduce Fast Loc-NeRF, which enhances efficiency and accuracy in NeRF map-based global localization. We propose a particle rejection weighting strategy that estimates the uncertainty of particles by leveraging NeRF’s inherent characteristics and incorporates them into the particle weighting process to reject abnormal particles. Additionally, Fast Loc-NeRF employs a coarse-to-fine approach, matching rendered pixels and observed images across multiple resolutions from low to high. As a result, it speeds up the costly particle update process while enhancing precise localization results. Our Fast Loc-NeRF establishes new state-of-the-art localization performance on several benchmarks, demonstrating both its accuracy and efficiency.
|
|
16:55-17:00, Paper TuDT17.5 | Add to My Program |
Continuous-Time Radar-Inertial and Lidar-Inertial Odometry Using a Gaussian Process Motion Prior |
|
Burnett, Keenan | University of Toronto |
Schoellig, Angela P. | TU Munich |
Barfoot, Timothy | University of Toronto |
Keywords: Localization, Mapping, Range Sensing, Continuous-Time
Abstract: In this work, we demonstrate continuous-time radar-inertial and lidar-inertial odometry using a Gaussian process motion prior. Using a sparse prior, we demonstrate improved computational complexity during preintegration and interpolation. We use a white-noise-on-acceleration motion prior and treat the gyroscope as a direct measurement of the state while preintegrating accelerometer measurements to form relative velocity factors. Our odometry is implemented using sliding-window batch trajectory estimation. To our knowledge, our work is the first to demonstrate radar-inertial odometry with a spinning mechanical radar using both gyroscope and accelerometer measurements. We improve the performance of our radar odometry by 43% by incorporating an IMU. Our approach is efficient and we demonstrate real-time performance. Code for this paper can be found at: github.com/utiasASRL/steam_icp
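For reference, the white-noise-on-acceleration prior mentioned above has a simple closed form in a plain vector space: the state stacks position and velocity, acceleration is modeled as zero-mean white noise with power spectral density Qc, and the resulting transition matrix and process-noise covariance over a step dt are written out in the numpy sketch below. The paper itself applies the prior to the full SE(3) state with gyroscope and preintegrated accelerometer factors, so treat this only as the textbook vector-space form.

import numpy as np

def wnoa_prior(dt, Qc):
    # White-noise-on-acceleration prior for state x = [position, velocity].
    # Continuous model: d(position)/dt = velocity, d(velocity)/dt = w(t),
    # with w(t) zero-mean white noise of power spectral density Qc.
    # Returns the transition matrix Phi and process-noise covariance Q over dt.
    d = Qc.shape[0]                     # dimension of the position/velocity blocks
    I = np.eye(d)
    Phi = np.block([[I, dt * I],
                    [np.zeros((d, d)), I]])
    Q = np.block([[dt**3 / 3 * Qc, dt**2 / 2 * Qc],
                  [dt**2 / 2 * Qc, dt * Qc]])
    return Phi, Q

# Example: propagate a 3D position/velocity mean and covariance over 0.1 s.
Qc = 0.01 * np.eye(3)
Phi, Q = wnoa_prior(0.1, Qc)
x = np.concatenate([np.zeros(3), np.array([1.0, 0.0, 0.0])])   # moving at 1 m/s along x
P = 1e-4 * np.eye(6)
x_next = Phi @ x
P_next = Phi @ P @ Phi.T + Q
print(x_next[:3])   # predicted position after 0.1 s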
|
|
17:00-17:05, Paper TuDT17.6 | Add to My Program |
NV-LIOM: LiDAR-Inertial Odometry and Mapping Using Normal Vectors towards Robust SLAM in Multifloor Environments |
|
Chung, Dongha | Stradvision |
Kim, Jinwhan | KAIST |
Keywords: Localization, Mapping, SLAM
Abstract: Over the last few decades, numerous LiDAR-inertial odometry (LIO) algorithms have been developed, demonstrating satisfactory performance across diverse environments. Most of these algorithms have predominantly been validated in open outdoor environments; however, they often encounter challenges in confined indoor settings. In such indoor environments, reliable point cloud registration becomes problematic due to the rapid changes in LiDAR scans and repetitive structural features like walls and stairs, particularly in multifloor buildings. In this paper, we present NV-LIOM, a normal vector-based LiDAR-inertial odometry and mapping framework focused on robust point cloud registration designed for indoor multifloor environments. Our approach extracts the normal vectors from the LiDAR scans and utilizes them for correspondence search to enhance the point cloud registration performance. To ensure robust registration, the distribution of the normal vector directions is analyzed, and situations of degeneracy are examined to adjust the matching uncertainty. Additionally, a viewpoint-based loop closure module is implemented to avoid wrong correspondences that are blocked by the walls. The proposed method is tested through public datasets and our own dataset. To contribute to the community, the code will be made public on https://github.com/dhchung/nv_liom.
|
|
TuDT18 Regular Session, 406 |
Add to My Program |
Place Recognition 2 |
|
|
Chair: Smith, Stephen L. | University of Waterloo |
Co-Chair: Aravecchia, Stephanie | Georgia Tech Lorraine - IRL 2958 GT-CNRS |
|
16:35-16:40, Paper TuDT18.1 | Add to My Program |
SALSA: Swift Adaptive Lightweight Self-Attention for Enhanced LiDAR Place Recognition |
|
Goswami, Raktim | New York University |
Patel, Naman | New York University Tandon School of Engineering |
Krishnamurthy, Prashanth | New York University Tandon School of Engineering |
Khorrami, Farshad | New York University Tandon School of Engineering |
Keywords: Localization, Deep Learning for Visual Perception, Deep Learning Methods
Abstract: Large-scale LiDAR mapping and localization leverage place recognition techniques to mitigate odometry drift, ensuring accurate mapping. These techniques utilize scene representations from LiDAR point clouds to identify previously visited sites within a database. Local descriptors, assigned to each point within a point cloud, are aggregated to form a scene representation for the point cloud. These descriptors are also used to re-rank the retrieved point clouds based on geometric fitness scores. We propose SALSA, a novel, lightweight, and efficient framework for LiDAR place recognition. It consists of a Sphereformer backbone that uses radial window attention to enable information aggregation for sparse distant points, an adaptive self-attention layer to pool local descriptors into tokens, and a multi-layer-perceptron Mixer layer for aggregating the tokens to generate a scene descriptor. The proposed framework outperforms existing methods on various LiDAR place recognition datasets in terms of both retrieval and metric localization while operating in real time.
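Purely as an illustration of the "pool local descriptors into tokens, then aggregate them into a scene descriptor" pattern described above, and not SALSA's actual layers, a minimal PyTorch pooling head might look like the following sketch; the learned-query attention mechanism and all dimensions are assumptions.

import torch
import torch.nn as nn

class AttentionPoolDescriptor(nn.Module):
    # Toy pooling of per-point local descriptors into a fixed-size scene descriptor.
    def __init__(self, dim=64, n_tokens=8, out_dim=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_tokens, dim))   # learned pooling queries
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(n_tokens * dim, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, local_desc):
        # local_desc: (B, N_points, dim) per-point descriptors from a backbone
        B = local_desc.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        tokens, _ = self.attn(q, local_desc, local_desc)           # (B, n_tokens, dim)
        scene = self.mlp(tokens.flatten(1))                        # (B, out_dim)
        return nn.functional.normalize(scene, dim=-1)              # unit-norm scene descriptor

pool = AttentionPoolDescriptor()
desc = pool(torch.randn(2, 1024, 64))   # 2 scans, 1024 local descriptors each
print(desc.shape)                       # torch.Size([2, 256])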
|
|
16:40-16:45, Paper TuDT18.2 | Add to My Program |
HeRCULES: Heterogeneous Radar Dataset in Complex Urban Environment for Multi-Session Radar SLAM |
|
Kim, Hanjun | Seoul National University |
Jung, Minwoo | Seoul National University |
Noh, Chiyun | Seoul National University |
Jung, Sangwoo | Seoul National University |
Song, Hyunho | Seoul National University |
Yang, Wooseong | Seoul National University |
Jang, Hyesu | Seoul National University |
Kim, Ayoung | Seoul National University |
Keywords: Data Sets for SLAM, SLAM, Range Sensing
Abstract: Recently, radars have been widely featured in robotics for their robustness in challenging weather conditions. Two commonly used radar types are spinning radars and phased-array radars, each offering distinct sensor characteristics. Existing datasets typically feature only a single type of radar, leading to the development of algorithms limited to that specific kind. In this work, we highlight that combining different radar types offers complementary advantages, which can be leveraged through a heterogeneous radar dataset. Moreover, this new dataset fosters research in multi-session and multi-robot scenarios where robots are equipped with different types of radars. In this context, we introduce the HeRCULES dataset, a comprehensive, multi-modal dataset with heterogeneous radars, FMCW LiDAR, IMU, GPS, and cameras. This is the first dataset to integrate 4D radar and spinning radar alongside FMCW LiDAR, offering unparalleled localization, mapping, and place recognition capabilities. The dataset covers diverse weather and lighting conditions and a range of urban traffic scenarios, enabling a comprehensive analysis across various environments. The sequence paths with multiple revisits and ground truth pose for each sensor enhance its suitability for place recognition research. We expect the HeRCULES dataset to facilitate odometry, mapping, place recognition, and sensor fusion research. The dataset and development tools are available at https://sites.google.com/view/herculesdataset.
|
|
16:45-16:50, Paper TuDT18.3 | Add to My Program |
NYC-Event-VPR: A Large-Scale High-Resolution Event-Based Visual Place Recognition Dataset in Dense Urban Environments |
|
Pan, Taiyi | New York University |
He, Junyang | University of Virginia |
Chen, Chao | New York University |
Li, Yiming | New York University |
Feng, Chen | New York University |
Keywords: Data Sets for Robotic Vision, Localization, Computer Vision for Transportation
Abstract: Visual place recognition (VPR) enables autonomous robots to identify previously visited locations, which contributes to tasks like simultaneous localization and mapping (SLAM). VPR faces challenges such as accurate image neighbor retrieval and appearance change in scenery. Event cameras, also known as dynamic vision sensors, are a new sensor modality for VPR and offer a promising solution to the challenges with their unique attributes: high temporal resolution (1MHz clock), ultra-low latency (in μs), and high dynamic range (>120dB). These attributes make event cameras less susceptible to motion blur and more robust in variable lighting conditions, making them suitable for addressing VPR challenges. However, the scarcity of event-based VPR datasets, partly due to the novelty and cost of event cameras, hampers their adoption. To fill this data gap, our paper introduces the NYC-Event-VPR dataset to the robotics and computer vision communities, featuring the Prophesee IMX636 HD event sensor (1280x720 resolution), combined with RGB camera and GPS module. It encompasses over 13 hours of geotagged event data, spanning 260 kilometers across New York City, covering diverse lighting and weather conditions, day/night scenarios, and multiple visits to various locations. Furthermore, our paper employs three frameworks to conduct generalization performance assessments, promoting innovation in event-based VPR and its integration into robotics applications.
|
|
16:50-16:55, Paper TuDT18.4 | Add to My Program |
ZeroSCD: Zero-Shot Street Scene Change Detection |
|
Kannan, Shyam Sundar | Purdue University |
Min, Byung-Cheol | Purdue University |
Keywords: Mapping
Abstract: Scene Change Detection is a challenging task in computer vision and robotics that aims to identify differences between two images of the same scene captured at different times. Traditional change detection methods rely on training models that take these image pairs as input and estimate the changes, which requires large amounts of annotated data, a costly and time-consuming process. To overcome this, we propose ZeroSCD, a zero-shot scene change detection framework that eliminates the need for training. ZeroSCD leverages pre-existing models for place recognition and semantic segmentation, utilizing their features and outputs to perform change detection. In this framework, features extracted from the place recognition model are used to estimate correspondences and detect changes between the two images. These are then combined with segmentation results from the semantic segmentation model to precisely delineate the boundaries of the detected changes. Extensive experiments on benchmark datasets demonstrate that ZeroSCD outperforms several state-of-the-art methods in change detection accuracy, despite not being trained on any of the benchmark datasets, proving its effectiveness and adaptability across different scenarios.
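As a rough, training-free sketch of the recipe described above, with the place-recognition backbone and semantic segmenter stubbed out (they are not the models ZeroSCD actually uses) and the two views assumed to be already aligned (the paper additionally estimates correspondences from the features), one possible composition in Python is:

import numpy as np

def dense_features(image):
    # Stub for a pre-trained place-recognition backbone returning (H/8, W/8, C) features.
    h, w = image.shape[:2]
    return np.random.rand(h // 8, w // 8, 128)

def segment(image):
    # Stub for a pre-trained semantic segmenter returning an (H, W) label map.
    return np.zeros(image.shape[:2], dtype=np.int32)

def change_mask(img_t0, img_t1, sim_threshold=0.6, overlap=0.3):
    # Flag feature cells whose descriptors disagree across time, then refine with segments.
    f0, f1 = dense_features(img_t0), dense_features(img_t1)
    f0 = f0 / np.linalg.norm(f0, axis=-1, keepdims=True)
    f1 = f1 / np.linalg.norm(f1, axis=-1, keepdims=True)
    coarse = (f0 * f1).sum(axis=-1) < sim_threshold   # low cosine similarity -> candidate change
    # Upsample the coarse mask back to image resolution (features are at 1/8 scale here).
    full = np.kron(coarse, np.ones((8, 8), dtype=bool)).astype(bool)
    full = full[:img_t1.shape[0], :img_t1.shape[1]]
    # Keep whole segments whose area is sufficiently covered by the coarse change mask.
    labels = segment(img_t1)
    refined = np.zeros_like(full)
    for lab in np.unique(labels):
        region = labels == lab
        if full[region].mean() > overlap:
            refined |= region
    return refined

mask = change_mask(np.zeros((256, 256, 3)), np.ones((256, 256, 3)))
print(mask.shape, mask.sum())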
|
|
16:55-17:00, Paper TuDT18.5 | Add to My Program |
SPR: Single-Scan Radar Place Recognition |
|
Casado Herraez, Daniel | University of Bonn & CARIAD SE |
Chang, Le | University of Stuttgart |
Zeller, Matthias | CARIAD SE |
Wiesmann, Louis | University of Bonn |
Behley, Jens | University of Bonn |
Heidingsfeld, Michael | CARIAD SE |
Stachniss, Cyrill | University of Bonn |
Keywords: Localization, SLAM, Autonomous Vehicle Navigation
Abstract: Localization is a crucial component for the navigation of autonomous vehicles. It encompasses global localization and place recognition, allowing a system to identify locations that have been mapped or visited before. Place recognition is commonly approached using cameras or LiDARs. However, these sensors are affected by bad weather or low lighting conditions. In this paper, we exploit automotive radars to address the problem of localizing a vehicle within a map using single radar scans. The effectiveness of radars is not dependent on environmental conditions, and they provide additional information not present in LiDARs such as Doppler velocity and radar cross section. However, the sparse and noisy radar measurement makes place recognition a challenge. Recent research in automotive radars addresses the sensor's limitations by aggregating multiple radar scans and using high-dimensional scene representations. We, in contrast, propose a novel neural network architecture that focuses on each point of single radar scans, without relying on an additional odometry input for scan aggregation. We extract pointwise local and global features, resulting in a compact scene descriptor vector. Our model improves local feature extraction by estimating the importance of each point for place recognition and enhances the global descriptor by leveraging the radar cross section information provided by the sensor. We evaluate our model using nuScenes and the 4DRadarDataset, which involve 2D and 3D automotive radar sensors. Our findings illustrate that our approach achieves state-of-the-art results for single-scan place recognition using automotive radars.
|
|
17:00-17:05, Paper TuDT18.6 | Add to My Program |
Improving Visual Place Recognition Based Robot Navigation by Verifying Localization Estimates |
|
Claxton, Owen Thomas | Queensland University of Technology |
Malone, Connor | Queensland University of Technology |
Carson, Helen | Queensland University of Technology |
Ford, Jason | Queensland University of Technology |
Bolton, Gabriel Joseph | Australian National University |
Shames, Iman | The Australian National University |
Milford, Michael J | Queensland University of Technology |
Keywords: Localization, Acceptability and Trust, Vision-Based Navigation
Abstract: Visual Place Recognition (VPR) systems often have imperfect performance, affecting the `integrity' of position estimates and subsequent robot navigation decisions. Previously, SVM classifiers have been used to monitor VPR integrity. This research introduces a novel Multi-Layer Perceptron (MLP) integrity monitor which demonstrates improved performance and generalizability, removing per-environment training and reducing manual tuning requirements. We test our proposed system in extensive real-world experiments, presenting two real-time integrity-based VPR verification methods: a single-query rejection method for robot navigation to a goal zone (Experiment 1); and a history-of-queries method that takes a best, verified, match from its recent trajectory and uses an odometer to extrapolate a current position estimate (Experiment 2). Noteworthy results for Experiment 1 include a decrease in aggregate mean along-track goal error from ≈9.8m to ≈3.1m, and an increase in the aggregate rate of successful mission completion from ≈41% to ≈55%. Experiment 2 showed a decrease in aggregate mean along-track localization error from ≈2.0m to ≈0.5m, and an increase in the aggregate localization precision from ≈97% to ≈99%. Overall, our results demonstrate the practical usefulness of a VPR integrity monitor in real-world robotics to improve VPR localization and consequent navigation performance.
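A minimal PyTorch sketch of an MLP-style integrity monitor as described above: a small classifier maps per-query match statistics to a probability that the VPR match is correct, and low-confidence estimates are rejected. The input feature set, network size, and 0.5 acceptance threshold are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class IntegrityMonitor(nn.Module):
    """Small MLP predicting whether a VPR localization estimate is trustworthy.
    Inputs are per-query statistics (e.g. best-match distance, ratio to the
    second-best match, spatial-consistency scores); these are placeholders."""
    def __init__(self, n_features=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))  # probability the match is correct

# Usage: reject a localization estimate when the monitor is not confident.
monitor = IntegrityMonitor()
features = torch.randn(1, 8)               # per-query match statistics
accept = monitor(features).item() > 0.5
```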
|
|
TuDT19 Regular Session, 407 |
Add to My Program |
Tactile Sensing and Manipulation |
|
|
Chair: Roehrbein, Florian | Chemnitz University of Technology |
Co-Chair: Wang, Chunpeng | The Robotics and AI Institute |
|
16:35-16:40, Paper TuDT19.1 | Add to My Program |
Shared Control for Cable Routing with Tactile Sensing |
|
Bao, Ange | Zhejiang University |
Zheng, Haoran | Zhejiang University |
Shi, Xiaohang | Zhejiang University |
Zhao, Pei | Zhejiang University |
Keywords: Telerobotics and Teleoperation, Dexterous Manipulation, Force and Tactile Sensing
Abstract: Multi-stage deformable linear object manipulation, such as cable routing, is a common and necessary part of daily life and industry. However, autonomous robots still lack the dexterity and generalization required for these complex tasks. Direct teleoperation is an alternative approach, but the absence of reliable force and haptic feedback methods undermines its robustness and efficiency. This paper proposes a shared control method based on tactile sensing to address a multi-stage, contact-rich cable routing task. The proposed method allows human and robotic autonomy to share control of the robot platform. An action primitive vocabulary is constructed, incorporating adaptive authority allocation between human and autonomy, to generate motions for specific task stages. These allocations modulate the control weights of human and autonomy in accordance with the requirements of each task stage. The method selects primitives from this vocabulary based on tactile data and human intention. The effectiveness of our approach is demonstrated through a task involving straightening a cable and slotting it into a clip. We compare its performance with alternative methods and show that our method achieves a higher success rate and takes less time than direct teleoperation.
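A minimal Python sketch of stage-dependent authority blending, the general mechanism behind adaptive allocation between human and autonomy. The stage names, weights, and command format below are illustrative assumptions rather than the authors' primitive vocabulary.

```python
import numpy as np

# Authority allocation per task stage: alpha is the weight on the human command.
STAGE_ALPHA = {"approach": 0.8, "straighten": 0.4, "slot_into_clip": 0.2}

def shared_command(stage, u_human, u_auto):
    """Blend the human teleoperation command and the autonomous command
    according to the authority assigned to the current task stage."""
    alpha = STAGE_ALPHA[stage]
    return alpha * np.asarray(u_human) + (1.0 - alpha) * np.asarray(u_auto)

# During insertion the autonomy, guided by tactile feedback, dominates.
u = shared_command("slot_into_clip",
                   u_human=[0.02, 0.0, 0.0],
                   u_auto=[0.01, 0.0, -0.005])
```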
|
|
16:40-16:45, Paper TuDT19.2 | Add to My Program |
Whisker-Based Active Tactile Perception for Contour Reconstruction |
|
Dang, Yixuan | Technische Universität München |
Xu, Qinyang | TU München |
Zhang, Yu | Technical University of Munich |
Yao, Xiangtong | Technical University of Munich |
Zhang, Liding | Technical University of Munich |
Bing, Zhenshan | Technical University of Munich |
Roehrbein, Florian | Chemnitz University of Technology |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: Biologically-Inspired Robots, Force and Tactile Sensing, Sensor-based Control
Abstract: Perception using whisker-inspired tactile sensors currently faces a major challenge: the lack of active control in robots based on direct contact information from the whisker. To accurately reconstruct object contours, it is crucial for the whisker sensor to continuously follow and maintain an appropriate relative touch pose on the surface. This is especially important for localization based on tip contact, which has a low tolerance for sharp surfaces and must avoid slipping into tangential contact. In this paper, we first construct a magnetically transduced whisker sensor featuring a compact and robust suspension system composed of three flexible spiral arms. We develop a method that leverages a characterized whisker deflection profile to directly extract the tip contact position using gradient descent, with a Bayesian filter applied to reduce fluctuations. We then propose an active motion control policy to maintain the optimal relative pose of the whisker sensor against the object surface. A B-Spline curve is employed to predict the local surface curvature and determine the sensor orientation. Results demonstrate that our algorithm can effectively track objects and reconstruct contours with sub-millimeter accuracy. Finally, we validate the method in simulations and real-world experiments where a robot arm drives the whisker sensor to follow the surfaces of three different objects.
|
|
16:45-16:50, Paper TuDT19.3 | Add to My Program |
CDM: Contact Diffusion Model for Multi-Contact Point Localization |
|
Han, Seo Wook | Korean Advanced Institute of Science and Technology |
Kim, Min Jun | KAIST |
Keywords: Physical Human-Robot Interaction, Probabilistic Inference
Abstract: In this paper, we propose a Contact Diffusion Model (CDM), a novel learning-based approach for multi-contact point localization. We consider a robot equipped with joint torque sensors and a force/torque sensor at the base. By leveraging a diffusion model, CDM addresses the singularity where multiple pairs of contact points and forces produce identical sensor measurements. We formulate CDM to be conditioned on past model outputs to account for the time-dependent characteristics of the multi-contact scenarios. Moreover, to effectively address the complex shape of the robot surfaces, we incorporate the signed distance field in the denoising process. Consequently, CDM can localize contacts at arbitrary locations with high accuracy. Simulation and real-world experiments demonstrate the effectiveness of the proposed method. In particular, CDM operates at 15.97ms and, in the real world, achieves an error of 0.44cm in single-contact scenarios and 1.24cm in dual-contact scenarios.
|
|
16:50-16:55, Paper TuDT19.4 | Add to My Program |
Force Admittance Control of an Underactuated Gripper with Full-State Feedback |
|
Wang, Chunpeng | Northeastern University |
Nguyen, David | Massachusetts Institute of Technology |
Teoh, Zhi Ern | Harvard University |
O'Neill, Ciarán Tomás | Massachusetts Institute of Technology |
Odhner, Lael | Boston Dynamics AI Institute |
Whitney, John Peter | Northeastern University |
Estrada, Matthew | École Polytechnique Fédérale De Lausanne |
Keywords: Haptics and Haptic Interfaces, Grippers and Other End-Effectors, Force Control
Abstract: We present admittance control and fingertip contact detection with a linkage gripper remotely driven by a pneumatic rolling diaphragm actuator. The gripper is driven by underactuated mechanisms sensorized by joint encoders in order to fully determine the gripper state. We present the modelling of the linkage and fluidic transmission, validate its ability to regulate pinch force within an RMS error well under 0.5 Newtons via admittance control, and show the ability to detect contact at targeted locations on the linkage. In addition, we demonstrate simple grasping behaviors: blindly searching for an unobstructed object and detecting object loss. Our results show that an integrative approach of instrumenting underactuated gripper mechanisms can result in a lightweight gripper that is not only mechanically adaptive but sensitive enough to react to contact events without distal sensors or vision.
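A minimal Python sketch of a force-regulating admittance loop of the kind described above: a virtual mass-damper moves the commanded fingertip position until the measured pinch force reaches the target. The gains, loop rate, and the linear-spring object model used in the closed-loop example are illustrative assumptions, not the paper's identified transmission model.

```python
def admittance_step(x, v, f_meas, f_des, dt=0.001, m=0.5, b=20.0):
    """One step of a mass-damper admittance law regulating pinch force:
    the commanded fingertip position x moves until f_meas reaches f_des."""
    a = (f_des - f_meas - b * v) / m
    v += a * dt
    x += v * dt
    return x, v

# Closed-loop example against a simulated object of stiffness k (N/m).
k, f_des = 500.0, 2.0
x_cmd, v = 0.0, 0.0
for _ in range(2000):
    f_meas = max(0.0, k * x_cmd)     # stand-in for the real force sensing
    x_cmd, v = admittance_step(x_cmd, v, f_meas, f_des)
print(f"steady-state force: {max(0.0, k * x_cmd):.2f} N")   # approx. 2.00 N
```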
|
|
16:55-17:00, Paper TuDT19.5 | Add to My Program |
GenTact Toolbox: A Computational Design Pipeline to Procedurally Generate Context-Driven 3D Printed Whole-Body Artificial Skins |
|
Kohlbrenner, Carson | University of Colorado Boulder |
Escobedo, Caleb | University of Colorado - Boulder |
Bae, S. Sandra | CU Boulder |
Dickhans, Alexander | University of Colorado Boulder |
Roncone, Alessandro | University of Colorado Boulder |
Keywords: Physical Human-Robot Interaction, Touch in HRI, Multi-Contact Whole-Body Motion Planning and Control
Abstract: Developing whole-body tactile skins for robots remains a challenging task, as existing solutions often prioritize modular, one-size-fits-all designs, which, while versatile, fail to account for the robot’s specific shape and the unique demands of its operational context. In this work, we introduce GenTact Toolbox, a computational pipeline for creating versatile whole-body tactile skins tailored to both robot shape and application domain. Our method includes procedural mesh generation for conforming to a robot’s topology, task-driven simulation to refine sensor distribution, and multi-material 3D printing for shape-agnostic fabrication. We validate our approach by creating and deploying six capacitive sensing skins on a Franka Research 3 robot arm in a human-robot interaction scenario. This work represents a shift from “one-size-fits-all” tactile sensors toward context-driven, highly adaptable designs that can be customized for a wide range of robotic systems and applications. The project website is available at https://hiro-group.ronc.one/gentacttoolbox
|
|
17:00-17:05, Paper TuDT19.6 | Add to My Program |
Human-Robot Collaborative Cable-Suspended Manipulation with Contact Distinction |
|
Cortigiani, Giovanni | University of Siena |
Malvezzi, Monica | University of Siena |
Prattichizzo, Domenico | University of Siena |
Pozzi, Maria | University of Siena |
Keywords: Human-Centered Robotics, Human-Robot Collaboration, Physical Human-Robot Interaction
Abstract: The collaborative transportation of objects between humans and robots is a fundamental task in physical human-robot interaction. Most of the literature considers the rigid co-grasping of non-deformable items in which both the human and the robot directly hold the transported object with their hands. In this paper, we implement a control strategy for the collaborative manipulation of a cable-suspended platform. The latter is an articulated and partially deformable object that can be used as a base on which to place the transported object. In this way, the human and the robot are not rigidly coupled, ensuring greater flexibility in the partners' motions and a safer interaction. However, the uncertain dynamics of the platform introduces a greater possibility of unintended collisions with external objects, which must be distinguished from contacts arising when a load is placed on or removed from the platform. This paper proposes a contact detection and distinction strategy to address this challenge. The proposed cable-suspended manipulation framework is based only on force sensing at the robot end-effector, and was tested with ten users.
|
|
TuDT20 Regular Session, 408 |
Add to My Program |
Robot Interaction Interfaces |
|
|
Chair: Kazanzides, Peter | Johns Hopkins University |
Co-Chair: Rettinger, Maximilian | Technical University of Munich |
|
16:35-16:40, Paper TuDT20.1 | Add to My Program |
Interactive Motion Planning for a 7-DOF Robot |
|
Greene, Nicholas | Johns Hopkins University |
Pryor, Will | Johns Hopkins University |
Wang, Liam | Johns Hopkins University |
Kazanzides, Peter | Johns Hopkins University |
Keywords: Telerobotics and Teleoperation, Motion and Path Planning, Human Factors and Human-in-the-Loop
Abstract: The use of robots in high-risk and extreme environments is crucial for tasks that are dangerous or inaccessible to humans and require high precision. Particularly in scenarios where the cost of failure is high, remote human teleoperation can be the preferred method of robot control due to the adaptability and high-level decision making of humans. Teleoperation brings many challenges, including a lack of accurate prior knowledge about the environment, limited views of the environment from on-board sensors, and especially inconsistent latency. 7-DOF (degrees of freedom) manipulators provide redundancy which can be utilized for increased flexibility in manipulation, and may be preferred to 6-DOF manipulators in many scenarios. The redundancy, however, must be considered by the teleoperation system. We present an extension to an existing Interactive Planning and Supervised Execution (IPSE) system that enables full teleoperation of a 7-DOF robot by encoding the redundant degree of freedom with a Shoulder-Elbow-Wrist (SEW) angle, which is user-manipulable via an SEW angle graph. Additionally, we introduce a novel user interface feature that encodes robot state information into a 2D image which is displayed directly on the SEW angle graph. We conduct a user study which demonstrates that the addition of this SEW graph significantly reduces task completion time.
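A minimal NumPy sketch of one common way to compute a shoulder-elbow-wrist angle from joint positions: the rotation of the elbow about the shoulder-to-wrist axis, measured from a fixed reference direction. The reference convention and sign may differ from the paper's definition; this is an illustrative assumption.

```python
import numpy as np

def sew_angle(S, E, W, ref=np.array([0.0, 0.0, 1.0])):
    """Shoulder-Elbow-Wrist (SEW) angle parameterizing the self-motion of a
    7-DOF arm. S, E, W are 3D positions of shoulder, elbow and wrist; ref is
    a fixed reference direction that must not be parallel to the S->W axis."""
    u = W - S
    u = u / np.linalg.norm(u)                  # shoulder->wrist axis
    p = (E - S) - np.dot(E - S, u) * u         # elbow offset normal to the axis
    r = ref - np.dot(ref, u) * u               # reference direction in that plane
    p, r = p / np.linalg.norm(p), r / np.linalg.norm(r)
    # Signed angle from r to p about u.
    return np.arctan2(np.dot(np.cross(r, p), u), np.dot(r, p))

# Example: elbow rotated out of the vertical reference plane.
print(sew_angle(S=np.array([0.0, 0.0, 0.3]),
                E=np.array([0.2, 0.1, 0.3]),
                W=np.array([0.4, 0.0, 0.3])))
```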
|
|
16:40-16:45, Paper TuDT20.2 | Add to My Program |
A Hybrid User Interface Combining AR, Desktop, and Mobile Interfaces for Enhanced Industrial Robot Programming |
|
Krieglstein, Jan | University of Stuttgart |
Kolberg, Jan | Fraunhofer IPA |
Sousa Calepso, Aimée | University of Stuttgart |
Kraus, Werner | Fraunhofer IPA |
Sedlmair, Michael | University of Stuttgart |
Keywords: Virtual Reality and Interfaces, Software Tools for Robot Programming, Assembly
Abstract: Robot programming for complex assembly tasks is challenging and demands expert knowledge. With Augmented Reality (AR), immersive 3D visualization can be placed in the robot’s intrinsic coordinate system to support robot programming. However, AR interfaces introduce usability challenges. To address these, we introduce a hybrid user interface (HUI) that combines a 2D desktop, a smartphone, and an AR head-mounted display (HMD) application, enabling operators to choose the most suitable device for each sub-task. The evaluation with an expert user study shows that an HUI can enhance efficiency and user experience by selecting the appropriate device for each sub-task. Generally, the HMD is preferred for tasks involving 3D content, the desktop for creating the program structure and parametrization, and the smartphone for mobile parametrization. However, the device selection depends on individual user characteristics and their familiarity with the devices.
|
|
16:45-16:50, Paper TuDT20.3 | Add to My Program |
Enhancing AR-To-Robot Registration Accuracy: A Comparative Study of Marker Detection Algorithms and Registration Parameters |
|
Mielke, Tonia | Otto-Von-Guericke University Magdeburg |
Heinrich, Florian | Otto-Von-Guericke University Magdeburg |
Hansen, Christian | Otto-Von-Guericke University Magdeburg |
Keywords: Virtual Reality and Interfaces, Visual Tracking
Abstract: Augmented Reality (AR) offers potential for enhancing human-robot collaboration by enabling intuitive interaction and real-time feedback. A crucial aspect of AR-robot integration is accurate spatial registration to align virtual content with the physical robotic workspace. This paper systematically investigates the effects of different tracking techniques and registration parameters on AR-to-robot registration accuracy, focusing on paired-point methods. We evaluate four marker detection algorithms - ARToolkit, Vuforia, ArUco, and retroreflective tracking - analyzing the influence of viewing distance, angle, marker size, point distance, distribution, and quantity. Our results show that ARToolkit provides the highest registration accuracy. While larger markers and positioning registration point centroids close to target locations consistently improved accuracy, other factors such as point distance and quantity were highly dependent on the tracking techniques used. Additionally, we propose an effective refinement method using point cloud registration, significantly improving accuracy by integrating data from points recorded between registration locations. These findings offer practical guidelines for enhancing AR-robot registration, with future work needed to assess the transferability to other AR devices and robots.
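For context, a paired-point registration of the kind evaluated above is typically solved with the SVD-based least-squares rigid alignment (Kabsch/Horn). The sketch below is a generic NumPy implementation of that classic solution, not the paper's specific pipeline; the refinement via point clouds mentioned in the abstract is not shown.

```python
import numpy as np

def paired_point_registration(P, Q):
    """Least-squares rigid transform (R, t) mapping points P (e.g. AR/HMD
    frame) onto corresponding points Q (robot frame): q_i ~ R @ p_i + t."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflection
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t

# Sanity check with a known transform (90-degree rotation about z).
rng = np.random.default_rng(0)
P = rng.random((6, 3))
R_true = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], float)
Q = P @ R_true.T + np.array([0.1, 0.2, 0.3])
R, t = paired_point_registration(P, Q)
print(np.allclose(R, R_true), np.allclose(t, [0.1, 0.2, 0.3]))
```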
|
|
16:50-16:55, Paper TuDT20.4 | Add to My Program |
Sketch-MoMa: Teleoperation for Mobile Manipulator Via Interpretation of Hand-Drawn Sketches |
|
Tanada, Kosei | Toyota Motor Corporation |
Iwanaga, Yuka | Toyota Motor Corporation |
Tsuchinaga, Masayoshi | Toyota Motor Corporation |
Nakamura, Yuji | Toyota Motor Corporation |
Mori, Takemitsu | Toyota Motor Corporation |
Sakai, Remi | Aichi Institute of Technology |
Yamamoto, Takashi | Aichi Institute of Technology |
Keywords: Telerobotics and Teleoperation, Mobile Manipulation, Task Planning
Abstract: To use assistive robots in everyday life, a remote control system with common devices, such as 2D devices, is helpful to control the robots anytime and anywhere as intended. Hand-drawn sketches are one of the intuitive ways to control robots with 2D devices. However, since similar sketches have different intentions from scene to scene, existing work requires additional modalities to set the sketches’ semantics. This requires complex operations for users and leads to decreasing usability. In this paper, we propose Sketch-MoMa, a teleoperation system using user-given hand-drawn sketches as instructions to control a robot. We use Vision-Language Models (VLMs) to understand the user-given sketches superimposed on an observation image and infer drawn shapes and low-level tasks of the robot. We utilize sketches and the generated shapes for recognition and motion planning of the generated low-level tasks for precise and intuitive operations. We validate our approach using state-of-the-art VLMs with 7 tasks and 5 sketch shapes. We also demonstrate that our approach effectively specifies more detailed intentions, such as how to grasp and how much to rotate. Moreover, we show the competitive usability of our approach compared with the existing 2D interface through a user experiment with 14 participants.
|
|
16:55-17:00, Paper TuDT20.5 | Add to My Program |
Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V |
|
Zhi, Peiyuan | Beijing Institute for General Artificial Intelligence |
Zhang, Zhiyuan | Tsinghua University |
Zhao, Yu | Beijing Institute for General Artificial Intelligence |
Han, Muzhi | Hillbot, Inc |
Zhang, Zeyu | Beijing Institute for General Artificial Intelligence |
Li, Zhitian | Beijing Institute for General Artificial Intelligence |
Jiao, Ziyuan | Beijing Institute for General Artificial Intelligence |
Jia, Baoxiong | Beijing Institute for General Artificial Intelligence |
Huang, Siyuan | Beijing Institute for General Artificial Intelligence |
Keywords: Domestic Robotics, Task Planning, Failure Detection and Recovery
Abstract: Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. In this work, we present COME-robot, the first closed-loop robotic system utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot incorporates two key innovative modules: (i) a multi-level open-vocabulary perception and situated reasoning module that enables effective exploration of the 3D environment and target object identification using commonsense knowledge and situated information, and (ii) an iterative closed-loop feedback and restoration mechanism that verifies task feasibility, monitors execution success, and traces failure causes across different modules for robust failure recovery. Through comprehensive experiments involving 8 challenging real-world mobile and tabletop manipulation tasks, COME-robot demonstrates a significant improvement in task success rate (~25%) compared to state-of-the-art methods. We further conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning.
|
|
17:00-17:05, Paper TuDT20.6 | Add to My Program |
Optimizing Robot Programming: Mixed Reality Gripper Control |
|
Rettinger, Maximilian | Technical University of Munich |
Hacker, Leander | Technical University of Munich (TUM) |
Wolters, Philipp | Technical University of Munich |
Rigoll, Gerhard | Technische Universität München |
Keywords: Virtual Reality and Interfaces, Industrial Robots, Design and Human Factors
Abstract: Conventional robot programming methods are complex and time-consuming for users. In recent years, alternative approaches such as mixed reality have been explored to address these challenges and optimize robot programming. While the findings of the mixed reality robot programming methods are convincing, most existing methods rely on gesture interaction for robot programming. Since controller-based interactions have proven to be more reliable, this paper examines three controller-based programming methods within a mixed reality scenario: 1) Classical Jogging, where the user positions the robot's end effector using the controller's thumbsticks, 2) Direct Control, where the controller's position and orientation directly correspond to the end effector's, and 3) Gripper Control, where the controller is enhanced with a 3D-printed gripper attachment to grasp and release objects. A within-subjects study (n=30) was conducted to compare these methods. The findings indicate that the Gripper Control condition outperforms the others in terms of task completion time, user experience, mental demand, and task performance, while also being the preferred method. Therefore, it demonstrates promising potential as an effective and efficient approach for future robot programming. Video available at https://youtu.be/83kWr8zUFIQ.
|
|
TuDT21 Regular Session, 410 |
Add to My Program |
Reinforcement Learning 4 |
|
|
Chair: Biza, Ondrej | Robotics and AI Institute |
Co-Chair: Scheutz, Matthias | Tufts University |
|
16:35-16:40, Paper TuDT21.1 | Add to My Program |
MJPR: Multi-Modal Joint Predictive Representation in Deep Reinforcement Learning |
|
Wang, Zehan | Northwestern Polytechnical University |
He, Ziming | Northwestern Polytechnical University |
Wang, ZiJia | Northwestern Polytechnical University |
He, Hua | Northwestern Polytechnical University |
Yang, Beiya | University of Strathclyde |
Shi, Hao-Bin | Northwestern Polytechnical University, School of Computer Science |
Keywords: Reinforcement Learning, Representation Learning, Sensor Fusion
Abstract: Multi-modal reinforcement learning (RL) has been brought into focus due to its ability to provide complementary information from different sensors, enriching observations of agents. However, the introduction of multi-modal high-dimensional observations brings challenges to sample efficiency. There is a lack of research on how to efficiently obtain multi-modal latent states while encouraging them to generate complementary information. To address this, we propose a representation learning method, Multi-modal Joint Predictive Representation (MJPR), which utilizes multi-modal interactive information to predict future latent states. The joint prediction method achieves the representation training for modalities and promotes each modality to generate complementary information related to predictions of each other. In addition, we introduce multi-modal loss balancing to promote training equilibrium and cross-modal contrastive learning (CMCL) to align the modalities for effective modal interaction. We establish multi-modal environments in the DeepMind Control Suite (DMC) and Webots and compare our method with current RL representation methods. Experimental results show that MJPR outperforms state-of-the-art methods by an average of 12.0% on six subtasks in DMC environments. It outperforms advanced methods by 16.7% and 55.4% in simple and complex tasks of the Webots environment, respectively. Moreover, ablation experiments are conducted in the DMC environment to verify the importance of each module to MJPR.
|
|
16:40-16:45, Paper TuDT21.2 | Add to My Program |
FLEX: A Framework for Learning Robot-Agnostic Force-Based Skills Involving Sustained Contact Object Manipulation |
|
Fang, Shijie | Tufts University |
Gao, Wenchang | Tufts University |
Goel, Shivam | Tufts University |
Thierauf, Christopher | Woods Hole Oceanographic Institution |
Scheutz, Matthias | Tufts University |
Sinapov, Jivko | Tufts University |
Keywords: Machine Learning for Robot Control, Reinforcement Learning, Deep Learning in Grasping and Manipulation
Abstract: Learning to manipulate objects efficiently, particularly those involving sustained contact (e.g., pushing, sliding) and articulated parts (e.g., drawers, doors), presents significant challenges. Traditional methods, such as robot-centric reinforcement learning (RL), imitation learning, and hybrid techniques, require massive training and often struggle to generalize across different objects and robot platforms. We propose a novel framework for learning object-centric manipulation policies in force space, decoupling the robot from the object. By directly applying forces to selected regions of the object, our method simplifies the action space, reduces unnecessary exploration, and decreases simulation overhead. This approach, trained in simulation on a small set of representative objects, captures object dynamics—such as joint configurations—allowing policies to generalize effectively to new, unseen objects. Decoupling these policies from robot-specific dynamics enables direct transfer to different robotic platforms (e.g., Kinova, Panda, UR5) without retraining. Our evaluations demonstrate that the method significantly outperforms baselines, achieving over an order of magnitude improvement in training efficiency compared to other state-of-the-art methods. Additionally, operating in force space enhances policy transferability across diverse robot platforms and object types. We further showcase the applicability of our method in a real-world robotic setting. Link: https://tufts-ai-robotics-group.github.io/FLEX/
|
|
16:45-16:50, Paper TuDT21.3 | Add to My Program |
FLoRA: Sample-Efficient Preference-Based RL Via Low-Rank Style Adaptation of Reward Functions |
|
Marta, Daniel | KTH Royal Institute of Technology |
Holk, Simon | KTH Royal Institute of Technology |
Vasco, Miguel | KTH Royal Institute of Technology |
Lundell, Jens | Royal Institute of Technology |
Homberger, Timon | KTH Royal Institute of Technology |
Busch, Finn Lukas | KTH Royal Institute of Technology |
Andersson, Olov | KTH Royal Institute |
Kragic, Danica | KTH |
Leite, Iolanda | KTH Royal Institute of Technology |
Keywords: Reinforcement Learning, Human Factors and Human-in-the-Loop, Learning from Demonstration
Abstract: Preference-based reinforcement learning (PbRL) is a suitable approach for style adaptation of pre-trained robotic behavior: adapting the robot's policy to follow human user preferences while still being able to perform the original task. However, collecting preferences for the adaptation process in robotics is often challenging and time-consuming. In this work we explore the adaptation of pre-trained robots in the low-preference-data regime. We show that, in this regime, recent adaptation approaches suffer from catastrophic reward forgetting (CRF), where the updated reward model overfits to the new preferences, leading the agent to become unable to perform the original task. To mitigate CRF, we propose to enhance the original reward model with a small number of parameters (low-rank matrices) responsible for modeling the preference adaptation. Our evaluation shows that our method can efficiently and effectively adjust robotic behavior to human preferences across simulation benchmark tasks and multiple real-world robotic tasks. We provide videos of our results and source code at https://sites.google.com/view/preflora/.
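A minimal PyTorch sketch of the low-rank adaptation idea referenced above: a frozen, pre-trained reward layer is augmented with a small trainable low-rank residual so that new preference data can be absorbed without overwriting the original reward model. Layer sizes, rank, and optimizer settings are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank residual (A @ B),
    used here as an adapter for preference-based reward fine-tuning."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # keep the original reward
        self.A = nn.Parameter(torch.zeros(base.out_features, rank))
        self.B = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)

    def forward(self, x):
        # A is zero-initialized, so the adapter starts as an identity change.
        return self.base(x) + x @ (self.A @ self.B).T

# Wrap the reward head and train only the adapter on the new preferences.
reward_head = LoRALinear(nn.Linear(128, 1), rank=4)
opt = torch.optim.Adam([reward_head.A, reward_head.B], lr=1e-3)
```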
|
|
16:50-16:55, Paper TuDT21.4 | Add to My Program |
On-Robot Reinforcement Learning with Goal-Contrastive Rewards |
|
Biza, Ondrej | Robotics and AI Institute |
Weng, Thomas | Boston Dynamics AI Institute |
Sun, Lingfeng | University of California, Berkeley |
Schmeckpeper, Karl | University of Pennslyvania |
Kelestemur, Tarik | Northeastern University |
Ma, Yecheng Jason | University of Pennsylvania |
Platt, Robert | Northeastern University |
van de Meent, Jan-Willem | University of Amsterdam |
Wong, Lawson L.S. | Northeastern University |
Keywords: Reinforcement Learning
Abstract: Reinforcement Learning (RL) has the potential to enable robots to learn from their own actions in the real world. Unfortunately, RL can be prohibitively expensive, in terms of on-robot runtime, due to inefficient exploration when learning from a sparse reward signal. Designing dense reward functions is labour-intensive and requires domain expertise. In our work, we propose Goal-Contrastive Rewards (GCR), a dense reward function learning method that can be trained on passive video demonstrations. By using videos without actions, our method is easier to scale, as we can use arbitrary videos. GCR combines two loss functions, an implicit value loss function that models how the reward increases when traversing a successful trajectory, and a goal-contrastive loss that discriminates between successful and failed trajectories. We perform experiments in simulated manipulation environments across RoboMimic and MimicGen tasks, as well as in the real world using a Franka arm and a Spot quadruped. We find that GCR leads to more sample-efficient RL, enabling model-free RL to solve about twice as many tasks as our baseline reward learning methods. We also demonstrate positive cross-embodiment transfer from videos of people and of other robots performing a task.
|
|
16:55-17:00, Paper TuDT21.5 | Add to My Program |
Watch Less, Feel More: Sim-To-Real RL for Generalizable Articulated Object Manipulation Via Motion Adaptation and Impedance Control |
|
Do, Tan-Dzung | Peking University |
Nandiraju, Gireesh | Peking University |
Wang, Jilong | Galaxy General Robot Co., Ltd |
Wang, He | Peking University |
Keywords: Machine Learning for Robot Control, Reinforcement Learning, Compliance and Impedance Control
Abstract: Articulated object manipulation poses a unique challenge compared to rigid object manipulation, as the object itself represents a dynamic environment. In this work, we present a novel RL-based pipeline equipped with variable impedance control and motion adaptation leveraging observation history for generalizable articulated object manipulation, focusing on smooth and dexterous motion during zero-shot sim-to-real transfer. To mitigate the sim-to-real gap, our pipeline reduces reliance on vision by not feeding raw vision data (RGB-D/point clouds) directly to the policy, but instead first extracting useful low-dimensional quantities via off-the-shelf modules. Additionally, we further reduce the sim-to-real gap by inferring object motion and its intrinsic properties from observation history, as well as by utilizing impedance control both in simulation and in the real world. Furthermore, we develop a well-designed training setting with extensive randomization and a specialized reward system (task-aware and motion-aware) that enables multi-staged, end-to-end manipulation without heuristic motion planning. To the best of our knowledge, our policy is the first to report an 84% success rate in the real world via extensive experiments with various unseen objects. Webpage: https://watch-less-feel-more.github.io/
|
|
17:00-17:05, Paper TuDT21.6 | Add to My Program |
From Imitation to Refinement -- Residual RL for Precise Assembly |
|
Ankile, Lars | Massachusetts Institute of Technology |
Simeonov, Anthony | Massachusetts Institute of Technology |
Shenfeld, Idan | MIT |
Torne Villasevil, Marcel | Stanford University |
Agrawal, Pulkit | MIT |
Keywords: Reinforcement Learning, Learning from Demonstration, Deep Learning in Grasping and Manipulation
Abstract: Recent advances in Behavior Cloning (BC) have made it easy to teach robots new tasks. However, we find that the ease of teaching comes at the cost of unreliable performance that saturates with increasing data for tasks requiring precision. The performance saturation can be attributed to two critical factors: (a) distribution shift resulting from the use of offline data and (b) the lack of closed-loop corrective control caused by action chunking (predicting a set of future actions executed open-loop), which is critical for BC performance. Our key insight is that by predicting action chunks, BC policies function more like trajectory "planners" than closed-loop controllers necessary for reliable execution. To address these challenges, we devise a simple yet effective method, ResiP (Residual for Precise Manipulation), that overcomes the reliability problem while retaining BC’s ease of teaching and long-horizon capabilities. ResiP augments a frozen, chunked BC model with a fully closed-loop residual policy trained with reinforcement learning (RL) that addresses distribution shifts and introduces closed-loop corrections over open-loop execution of action chunks predicted by the BC trajectory planner.
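A minimal PyTorch sketch of the residual-over-chunks structure described above: a frozen chunk predictor proposes an open-loop plan and a closed-loop residual network corrects each step from the current observation. The tiny linear networks, dimensions, and interfaces below are stand-ins for illustration, not the actual BC or RL policies.

```python
import torch
import torch.nn as nn

class ResidualController(nn.Module):
    """Frozen BC chunk 'planner' plus a closed-loop residual corrector."""
    def __init__(self, obs_dim=16, act_dim=7, chunk_len=8):
        super().__init__()
        self.chunk_len, self.act_dim = chunk_len, act_dim
        self.bc = nn.Linear(obs_dim, chunk_len * act_dim)        # frozen planner
        self.residual = nn.Linear(obs_dim + act_dim, act_dim)    # RL-trained corrector
        for p in self.bc.parameters():
            p.requires_grad = False

    def plan(self, obs):
        # Open-loop action chunk predicted once per observation.
        return self.bc(obs).view(self.chunk_len, self.act_dim)

    def act(self, obs_now, planned_action):
        # Closed-loop correction applied at every execution step.
        return planned_action + self.residual(torch.cat([obs_now, planned_action]))

ctrl = ResidualController()
chunk = ctrl.plan(torch.randn(16))
a0 = ctrl.act(torch.randn(16), chunk[0])
```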
|
|
TuDT22 Regular Session, 411 |
Add to My Program |
Imitation Learning 1 |
|
|
Chair: Kuo, Yen-Ling | University of Virginia |
Co-Chair: Ramirez-Amaro, Karinne | Chalmers University of Technology |
|
16:35-16:40, Paper TuDT22.1 | Add to My Program |
Fast Policy Synthesis with Variable Noise Diffusion Models |
|
Høeg, Sigmund Hennum | Norwegian University of Science and Technology |
Du, Yilun | MIT |
Egeland, Olav | NTNU |
Keywords: Imitation Learning, Learning from Demonstration, AI-Based Methods
Abstract: Diffusion models have seen rapid adoption in robotic imitation learning, enabling autonomous execution of complex dexterous tasks. However, action synthesis is often slow, requiring many steps of iterative denoising, limiting the extent to which models can be used in tasks that require fast reactive policies. To sidestep this, recent works have explored how the distillation of the diffusion process can be used to accelerate policy synthesis. However, distillation is computationally expensive and can hurt both the accuracy and diversity of synthesized actions. We propose SDP (Streaming Diffusion Policy), an alternative method to accelerate policy synthesis, leveraging the insight that generating a partially denoised action trajectory is substantially faster than a full output action trajectory. At each observation, our approach outputs a partially denoised action trajectory with variable levels of noise corruption, where the immediate action to execute is noise-free, with subsequent actions having increasing levels of noise and uncertainty. The partially denoised action trajectory for a new observation can then be quickly generated by applying a few steps of denoising to the previously predicted noisy action trajectory (rolled over by one timestep). We illustrate the efficacy of this approach, dramatically speeding up policy synthesis while preserving performance across both simulated and real-world settings. Project website: https://streaming-diffusion-policy.github.io
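A schematic Python sketch of the rolling partial-denoising idea described above: the policy keeps a buffer of future actions whose noise level increases along the horizon, executes the clean head action, rolls the buffer by one step, appends fresh noise, and applies a few denoising steps conditioned on the new observation. The dummy denoiser and noise schedule below are placeholders for the learned model, included only so the loop runs.

```python
import torch

def stream_step(buffer, noise_levels, denoise, obs, steps=1):
    """One control cycle of a streaming-style diffusion policy.

    buffer:       (H, action_dim) partially denoised action trajectory;
                  entry 0 is (nearly) noise-free, later entries are noisier.
    noise_levels: (H,) per-entry noise scales, increasing along the horizon.
    denoise:      callable (actions, noise_levels, obs) -> cleaner actions,
                  standing in for the learned denoiser.
    """
    action = buffer[0].clone()                                   # execute the clean head
    buffer = torch.cat([buffer[1:], torch.randn(1, buffer.shape[1])])  # roll + fresh noise
    for _ in range(steps):                                       # a few cheap updates
        buffer = denoise(buffer, noise_levels, obs)
    return action, buffer

denoise = lambda a, s, o: a * 0.8        # toy stand-in for the learned denoiser
buf = torch.randn(8, 7)
levels = torch.linspace(0.0, 1.0, 8)
for _ in range(5):
    act, buf = stream_step(buf, levels, denoise, obs=None)
```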
|
|
16:40-16:45, Paper TuDT22.2 | Add to My Program |
Adaptive Compliance Policy: Learning Approximate Compliance for Diffusion Guided Control |
|
Hou, Yifan | Stanford University |
Liu, Zeyi | Stanford University |
Chi, Cheng | Columbia University |
Cousineau, Eric | Toyota Research Institute |
Kuppuswamy, Naveen | Toyota Research Institute |
Feng, Siyuan | Toyota Research Institute |
Burchfiel, Benjamin | Toyota Research Institute |
Song, Shuran | Stanford University |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation, Bimanual Manipulation
Abstract: Compliance plays a crucial role in manipulation, as it balances between the concurrent control of position and force under uncertainties. Yet compliance is often overlooked by today's visuomotor policies that solely focus on position control. This paper introduces Adaptive Compliance Policy (ACP), a novel framework that learns to dynamically adjust system compliance both spatially and temporally for given manipulation tasks from human demonstrations, improving upon previous approaches that rely on pre-selected compliance parameters or assume uniform constant stiffness. However, computing full compliance parameters from human demonstrations is an ill-defined problem. Instead, we estimate an approximate compliance profile with two useful properties: avoiding large contact forces and encouraging accurate tracking. Our approach enables robots to handle complex contact-rich manipulation tasks and achieves over 50% performance improvement compared to state-of-the-art visuomotor policy methods.
|
|
16:45-16:50, Paper TuDT22.3 | Add to My Program |
Learning Wheelchair Tennis Navigation from Broadcast Videos with Domain Knowledge Transfer and Diffusion Motion Planning |
|
Wu, Zixuan | Georgia Institute of Technology |
Zaidi, Zulfiqar | Georgia Institute of Technology |
Patil, Adithya | Georgia Institute of Technology |
Xiao, Qingyu | Georgia Institute of Technology |
Gombolay, Matthew | Georgia Institute of Technology |
Keywords: Learning from Demonstration, Transfer Learning, Vision-Based Navigation
Abstract: In this paper, we propose a novel and generalizable zero-shot knowledge transfer framework that distills expert sports navigation strategies from web videos into robotic systems with adversarial constraints and out-of-distribution image trajectories. Our pipeline enables diffusion-based imitation learning by reconstructing the full 3D task space from multiple partial views, warping it into 2D image space, closing the planning loop within this 2D space, and transfer constrained motion of interest back to task space. Additionally, we demonstrate that the learned policy can serve as a local planner in conjunction with position control. We apply this framework in the wheelchair tennis navigation problem to guide the wheelchair into the ball-hitting region. Our pipeline achieves a navigation success rate of 97.67% in reaching real-world recorded tennis ball trajectories with a physical robot wheelchair, and achieve a success rate of 68.49% in a real-world, real-time experiment on a full-sized tennis court.
|
|
16:50-16:55, Paper TuDT22.4 | Add to My Program |
Diff-DAgger: Uncertainty Estimation with Diffusion Policy for Robotic Manipulation |
|
Lee, Sung-Wook | University of Virginia |
Kang, Xuhui | University of Virginia |
Kuo, Yen-Ling | University of Virginia |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation, Learning from Demonstration
Abstract: Recently, diffusion policy has shown impressive results in handling multi-modal tasks in robotic manipulation. However, it has fundamental limitations in out-of-distribution failures that persist due to compounding errors and its limited capability to extrapolate. One way to address these limitations is robot-gated DAgger, an interactive imitation learning with a robot query system to actively seek expert help during policy rollout. While robot-gated DAgger has high potential for learning at scale, existing methods like Ensemble-DAgger struggle with highly expressive policies: They often misinterpret policy disagreements as uncertainty at multi-modal decision points. To address this problem, we introduce Diff-DAgger, an efficient robot-gated DAgger algorithm that leverages the training objective of diffusion policy. We evaluate Diff-DAgger across different robot tasks including stacking, pushing, and plugging, and show that Diff-DAgger improves the task failure prediction by 39.0%, the task completion rate by 20.6%, and reduces the wall-clock time by a factor of 7.8. We hope that this work opens up a path for efficiently incorporating expressive yet data-hungry policies into interactive robot learning settings. The project website is available at: https://diffdagger.github.io
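A hedged Python sketch of how a diffusion policy's own training objective can serve as an uncertainty signal for robot-gated querying, as described above: the denoising loss is estimated by Monte Carlo for the current observation-action pair, and the expert is queried when it exceeds a threshold. The `add_noise`, `noise_pred`, and `sample_action` callables are assumed interfaces, not the authors' API.

```python
import torch

def diffusion_uncertainty(add_noise, noise_pred, obs, action,
                          num_timesteps=100, n_samples=16):
    """Monte-Carlo estimate of the diffusion training loss for one
    (obs, action) pair; a high value is treated as high uncertainty."""
    losses = []
    for _ in range(n_samples):
        t = torch.randint(0, num_timesteps, (1,))
        eps = torch.randn_like(action)
        pred = noise_pred(add_noise(action, eps, t), t, obs)   # forward process + predictor
        losses.append(torch.mean((pred - eps) ** 2))
    return torch.stack(losses).mean()

def gated_action(sample_action, add_noise, noise_pred, obs, threshold):
    """Execute the policy's action when its own loss is low; otherwise
    flag the step so control can be handed to the expert."""
    action = sample_action(obs)
    uncertain = diffusion_uncertainty(add_noise, noise_pred, obs, action) > threshold
    return action, bool(uncertain)
```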
|
|
16:55-17:00, Paper TuDT22.5 | Add to My Program |
SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation |
|
Hsu, Cheng-Chun | The University of Texas at Austin |
Wen, Bowen | NVIDIA |
Xu, Jie | NVIDIA |
Narang, Yashraj | NVIDIA |
Wang, Xiaolong | UC San Diego |
Zhu, Yuke | The University of Texas at Austin |
Biswas, Joydeep | University of Texas at Austin |
Birchfield, Stan | NVIDIA Corporation |
Keywords: Learning from Demonstration, Imitation Learning
Abstract: We introduce SPOT, an object-centric imitation learning framework. The key idea is to capture each task by an object-centric representation, specifically the SE(3) object pose trajectory relative to the target. This approach decouples embodiment actions from sensory inputs, facilitating learning from various demonstration types, including both action-based and action-less human hand demonstrations, as well as cross-embodiment generalization. Additionally, object pose trajectories inherently capture planning constraints from demonstrations without the need for manually crafted rules. To guide the robot in executing the task, the object trajectory is used to condition a diffusion policy. We show improvement compared to prior work on RLBench simulated tasks. In real-world evaluation, using only eight demonstrations shot on an iPhone, our approach completed all tasks while fully complying with task constraints. Project page: https://nvlabs.github.io/object_centric_diffusion
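A minimal NumPy sketch of the object-centric representation named above: the object's SE(3) pose at each timestep expressed relative to the target, which decouples the demonstration from the embodiment. Frame names and the toy example values are illustrative.

```python
import numpy as np

def relative_pose_trajectory(T_obj_list, T_target):
    """Object pose trajectory in the target's frame: T_rel = T_target^{-1} T_obj.
    All poses are 4x4 homogeneous matrices in a common (e.g. camera) frame."""
    T_target_inv = np.linalg.inv(T_target)
    return [T_target_inv @ T for T in T_obj_list]

# One-step example: object 10 cm above the target, same orientation.
T_obj = np.eye(4); T_obj[2, 3] = 0.5
T_tgt = np.eye(4); T_tgt[2, 3] = 0.4
print(relative_pose_trajectory([T_obj], T_tgt)[0][:3, 3])   # -> [0. 0. 0.1]
```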
|
|
17:00-17:05, Paper TuDT22.6 | Add to My Program |
Imitation Learning with Limited Actions Via Diffusion Planners and Deep Koopman Controllers |
|
Bi, Jianxin | National University of Singapore |
Lin, Kelvin | National University of Singapore |
Chen, Kaiqi | National University of Singapore |
Huang, Yifei | National University of Singapore |
Soh, Harold | National University of Singapore |
Keywords: Imitation Learning, Learning from Demonstration, Machine Learning for Robot Control
Abstract: Recent advances in diffusion-based robot policies have demonstrated significant potential in imitating multi-modal behaviors. However, these approaches typically require large quantities of demonstration data paired with corresponding robot action labels, creating a substantial data collection burden. In this work, we propose a plan-then-control framework aimed at improving the action-data efficiency of inverse dynamics controllers by leveraging observational demonstration data. Specifically, we adopt a Deep Koopman Operator framework to model the dynamical system and utilize observation-only trajectories to learn a latent action representation. This latent representation can then be effectively mapped to real high-dimensional continuous actions using a linear action decoder, requiring minimal action-labeled data. Through experiments on simulated robot manipulation tasks and a real robot experiment with multi-modal expert demonstrations, we demonstrate that our approach significantly enhances action-data efficiency and achieves high task success rates with limited action data.
|
|
TuDT23 Regular Session, 412 |
Add to My Program |
Autonomous Vehicle Perception 2 |
|
|
Chair: Steckel, Jan | University of Antwerp |
Co-Chair: Waslander, Steven | University of Toronto |
|
16:35-16:40, Paper TuDT23.1 | Add to My Program |
H3O: Hyper-Efficient 3D Occupancy Prediction with Heterogeneous Supervision |
|
Shi, Yunxiao | Qualcomm AI Research |
Cai, Hong | Qualcomm Technologies Inc |
Ansari, Amin | Qualcomm Technologies, Inc |
Porikli, Fatih | Australian National University |
Keywords: Semantic Scene Understanding, Deep Learning for Visual Perception, Computer Vision for Automation
Abstract: 3D occupancy prediction has recently emerged as a new paradigm for holistic 3D scene understanding and provides valuable information for downstream planning in autonomous driving. Most existing methods, however, are computationally expensive, requiring costly attention-based 2D-3D transformation and 3D feature processing. In this paper, we present a novel 3D occupancy prediction approach, named H3O, which features highly efficient architecture designs and incurs a significantly lower computational cost as compared to the current state-of-the-art methods. In addition, to compensate for the ambiguity in ground-truth 3D occupancy labels, we advocate leveraging auxiliary tasks to complement the direct 3D supervision. In particular, we integrate multi-camera depth estimation, semantic segmentation, and surface normal estimation via differentiable volume rendering, supervised by corresponding 2D labels, which introduces rich and heterogeneous supervision signals. We conduct extensive experiments on the Occ3D-nuScenes and SemanticKITTI benchmarks that demonstrate the superiority of our proposed H3O.
|
|
16:40-16:45, Paper TuDT23.2 | Add to My Program |
TrackOcc: Camera-Based 4D Panoptic Occupancy Tracking |
|
Chen, Zhuoguang | Tsinghua University |
Li, Kenan | New York University |
Yang, Xiuyu | Tsinghua University |
Jiang, Tao | Tsinghua |
Li, Yiming | New York University |
Zhao, Hang | Tsinghua University |
Keywords: Autonomous Agents, Deep Learning for Visual Perception, Semantic Scene Understanding
Abstract: Comprehensive and consistent dynamic scene understanding from camera input is essential for advanced autonomous systems. Traditional camera-based perception tasks like 3D object tracking and semantic occupancy prediction lack either spatial comprehensiveness or temporal consistency. In this work, we introduce a brand-new task, Camera-based 4D Panoptic Occupancy Tracking, which simultaneously addresses panoptic occupancy segmentation and object tracking from camera-only input. Furthermore, we propose TrackOcc, a cutting-edge approach that processes image inputs in a streaming, end-to-end manner with 4D panoptic queries to address the proposed task. Leveraging the localization-aware loss, TrackOcc enhances the accuracy of 4D panoptic occupancy tracking without bells and whistles. Experimental results demonstrate that our method achieves state-of-the-art performance on the Waymo dataset. The code will be released for future research.
|
|
16:45-16:50, Paper TuDT23.3 | Add to My Program |
OCCUQ: Exploring Efficient Uncertainty Quantification for 3D Occupancy Prediction |
|
Heidrich, Severin | RWTH Aachen University |
Beemelmanns, Till | RWTH Aachen University |
Nekrasov, Alexey | RWTH Aachen University |
Leibe, Bastian | RWTH Aachen University |
Eckstein, Lutz | Institute for Automotive Engineering, RWTH Aachen University |
Keywords: Semantic Scene Understanding, Computer Vision for Transportation, Deep Learning for Visual Perception
Abstract: Autonomous driving has the potential to significantly enhance productivity and provide numerous societal benefits. Ensuring robustness in these safety-critical systems is essential, particularly when vehicles must navigate adverse weather conditions and sensor corruptions that may not have been encountered during training. Current methods often overlook uncertainties arising from adversarial conditions or distributional shifts, limiting their real-world applicability. We propose an efficient adaptation of an uncertainty estimation technique for 3D occupancy prediction. Our method dynamically calibrates model confidence using epistemic uncertainty estimates. Our evaluation under various camera corruption scenarios, such as fog or missing cameras, demonstrates that our approach effectively quantifies epistemic uncertainty by assigning higher uncertainty values to unseen data. We introduce region-specific corruptions to simulate defects affecting only a single camera and validate our findings through both scene-level and region-level assessments. Our results show superior performance in Out-of-Distribution (OoD) detection and confidence calibration compared to common baselines such as Deep Ensembles and MC-Dropout. Our approach consistently demonstrates reliable uncertainty measures, indicating its potential for enhancing the robustness of autonomous driving systems in real-world scenarios. Code and dataset are available at https://github.com/ika-rwth-aachen/OCCUQ.
|
|
16:50-16:55, Paper TuDT23.4 | Add to My Program |
RadarMask: A Novel End-To-End Sparse Millimeter-Wave Radar Sequence Panoptic Segmentation and Tracking Method |
|
Guo, Yubo | School of Artificial Intelligence and Automation, Huazhong University of Science and Technology |
Peng, Gang | Huazhong University of Science and Technology |
Gao, Qiang | Huazhong University of Science and Technology |
|
|
16:55-17:00, Paper TuDT23.5 | Add to My Program |
LiDAR-BIND: Multi-Modal Sensor Fusion through Shared Latent Embeddings |
|
Balemans, Niels | University of Antwerp - Imec, Faculty of Applied Engineering - I |
Anwar, Ali | University of Antwerp-Imec |
Steckel, Jan | University of Antwerp |
Mercelis, Siegfried | University of Antwerp - Imec IDLab |
Keywords: Deep Learning Methods, Sensor Fusion, SLAM
Abstract: This paper presents LiDAR-BIND, a novel sensor fusion framework aimed at enhancing the reliability and safety of autonomous vehicles (AVs) through a shared latent embedding space. With this method, the addition of different modalities, such as sonar and radar, into existing navigation setups becomes possible. These modalities offer robust performance even in challenging scenarios where optical sensors fail. Leveraging a shared latent representation space, LiDAR-BIND enables accurate modality prediction, allowing for the translation of one sensor's observations into another, thereby overcoming the limitations of depending solely on LiDAR for dense point-cloud generation. Through this, the framework facilitates the alignment of multiple sensor modalities without the need for large synchronized datasets across all sensors. We demonstrate its usability in SLAM applications, outperforming traditional LiDAR-based approaches under degraded optical conditions.
|
|
17:00-17:05, Paper TuDT23.6 | Add to My Program |
Enhancing Autonomous Navigation by Imaging Hidden Objects Using Single-Photon LiDAR |
|
Young, Aaron | MIT |
Batagoda Mudiyanselage, Nevindu | University of Wisconsin - Madison |
Zhang, Harry | University of Wisconsin-Madison |
Dave, Akshat | MIT |
Pediredla, Adithya | Dartmouth College |
Negrut, Dan | University of Wisconsin |
Raskar, Ramesh | MIT |
Keywords: Deep Learning for Visual Perception
Abstract: Robust autonomous navigation in environments with limited visibility remains a critical challenge in robotics. We present a novel approach that leverages Non-Line-of-Sight (NLOS) sensing using single-photon LiDAR to improve visibility and enhance autonomous navigation. Our method enables mobile robots to "see around corners" by utilizing multi-bounce light information, effectively expanding their perceptual range without additional infrastructure. We propose a three-module pipeline: (1) Sensing, which captures multi-bounce histograms using SPAD-based LiDAR; (2) Perception, which estimates occupancy maps of hidden regions from these histograms using a convolutional neural network; and (3) Control, which allows a robot to follow safe paths based on the estimated occupancy. We evaluate our approach through simulations and real-world experiments on a mobile robot navigating an L-shaped corridor with hidden obstacles. Our work represents the first experimental demonstration of NLOS imaging for autonomous navigation, paving the way for safer and more efficient robotic systems operating in complex environments. We also contribute a novel dynamics-integrated transient rendering framework for simulating NLOS scenarios, facilitating future research in this domain.
|
|
TuDT24 Regular Session, 401 |
Add to My Program |
Industrial Robots |
|
|
Chair: Vanderborght, Bram | VUB |
Co-Chair: Larranaga Amilibia, Jon | Mondragon Unibertsitatea |
|
16:35-16:40, Paper TuDT24.1 | Add to My Program |
Visual-Based Forklift Learning System Enabling Zero-Shot Sim2Real without Real-World Data |
|
Oishi, Koshi | Toyota Central R&d Labs., Inc |
Kato, Teruki | Toyota Central R&D Labs., Inc |
Makino, Hiroya | Toyota Central R&D Labs., Inc |
Ito, Seigo | Toyota Central R&D Labs., Inc |
Keywords: Industrial Robots, AI-Enabled Robotics, Vision-Based Navigation
Abstract: Forklifts are used extensively in various industrial settings and are in high demand for automation. In particular, counterbalance forklifts are highly versatile and are employed in diverse scenarios. However, efforts to automate these processes are lacking, primarily owing to the absence of a safe and performance-verifiable development environment. This study proposes a learning system that combines a photorealistic digital learning environment with a 1/14-scale robotic forklift environment to address this challenge. Inspired by the training-based learning approach adopted by forklift operators, we employ an end-to-end vision-based deep reinforcement learning approach. The learning is conducted in a digitalized environment created from CAD data, making it safe and eliminating the need for real-world data. In addition, we safely validate the method in a physical setting using a 1/14-scale robotic forklift with a configuration similar to that of a real forklift. We achieved a 60% success rate in pallet loading tasks in real experiments using a robotic forklift. Our approach demonstrates zero-shot sim2real with a simple method that does not require heuristic additions. This learning-based approach is considered a first step towards the automation of counterbalance forklifts.
|
|
16:40-16:45, Paper TuDT24.2 | Add to My Program |
Strategic System Design for High Precision in Assembly Processes of CPU |
|
Yiu, Cheuk Tung Shadow | The Hong Kong University of Science and Technology |
Woo, Kam Tim | The Hong Kong University of Science and Technology |
Keywords: Computer Vision for Automation, Computer Vision for Manufacturing, Industrial Robots
Abstract: Robotic picking and placing play an essential role in Industry 4.0 and have long been recognized as significant contributions to industrial processes. Various scenarios involve picking and placing parts for assembly in industrial production, such as assembling different electronic components in the manufacturing process. These tasks require high precision. However, achieving high precision in the assembly of CPUs poses a significant challenge, particularly when dealing with reflective surfaces. This paper presents a strategic system design tailored to address these challenges effectively. We focus on system device choice and on optimizing the key parameters of the sensor system to strike a balance between device cost and the required precision. We construct the whole robot manipulation system using methods such as geometric segmentation, binocular vision with structured light projection, and 6D pose estimation based on the resulting 3D information. The results of our study demonstrate the practical applicability and benefits of this strategic system design in industrial settings. By meeting strict system accuracy requirements, our approach contributes to advancing industry practices and growing its impact on society.
|
|
16:45-16:50, Paper TuDT24.3 | Add to My Program |
The Influence of Counterbalance System on the Dynamic Characterization of Heavy Industrial Robots |
|
Urrutia, Julen | Aldakin Automation S.L |
Izquierdo, Mikel | Mondragon Unibertsitatea |
Ulacia Garmendia, Ibai | Mondragon Unibertsitatea |
Agirre, Nora | Aldakin Automation S.L |
Inziarte, Ibai | Aldakin Automation |
Larranaga Amilibia, Jon | Mondragon Unibertsitatea |
Keywords: Industrial Robots, Dynamics, Hydraulic/Pneumatic Actuators
Abstract: The precision of industrial robots is often limited by the relatively low stiffness of their joints, leading to positioning errors influenced by factors such as the mass and inertia of robotic links, external forces, and the counterbalance system (CBS). Counterbalance systems, typically consisting of hydropneumatic cylinders, are designed to reduce motor torque and assist in supporting heavier links. Traditionally, positioning errors in industrial robots have been corrected statically by determining pose-dependent stiffness values. However, recent numerical models incorporate inertial effects to improve positioning error correction, making accurate inertial parameter identification essential. These parameters are typically unknown and must be determined experimentally. While methodologies for inertial parameter estimation have been extensively studied, none have accounted for the effect of the counterbalance system in this process. To address this gap, a methodology for estimating inertial parameters was applied to a heavy industrial robot, considering the influence of the counterbalance system. A comparative analysis with and without the counterbalance system showed that its inclusion improved joint torque calculation accuracy, demonstrating the necessity of considering it in dynamic parameter characterization methodologies.
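For context, inertial parameter identification typically exploits the fact that rigid-body robot dynamics are linear in the parameters, tau = Y(q, dq, ddq) theta, and solves for theta by least squares; a counterbalance torque model can be subtracted from the measured torques so it does not bias the estimate. The NumPy sketch below illustrates that generic procedure with a synthetic regressor and is not the paper's specific methodology.

```python
import numpy as np

def identify_inertial_parameters(Y, tau_measured, tau_cbs=None):
    """Least-squares identification of inertial parameters theta from the
    linear dynamics tau = Y @ theta.

    Y:            stacked regressor matrices over all samples, (N*dof, p)
    tau_measured: stacked measured joint torques, (N*dof,)
    tau_cbs:      optional counterbalance-system torque model, subtracted
                  so the CBS does not bias the estimated parameters.
    """
    tau = tau_measured if tau_cbs is None else tau_measured - tau_cbs
    theta, *_ = np.linalg.lstsq(Y, tau, rcond=None)
    return theta

# Toy example with a known parameter vector and measurement noise.
rng = np.random.default_rng(1)
Y = rng.standard_normal((600, 10))
theta_true = rng.standard_normal(10)
tau = Y @ theta_true + 0.01 * rng.standard_normal(600)
print(np.allclose(identify_inertial_parameters(Y, tau), theta_true, atol=0.01))
```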
|
|
16:50-16:55, Paper TuDT24.4 | Add to My Program |
Deep Learning-Based Friction Compensation in Low Velocity for Enhanced Direct Teaching in Collaborative Manipulators |
|
Choi, Seohyun | UMass Amherst |
Kim, Jonghyeok | POSTECH |
Chung, Wan Kyun | POSTECH |
Keywords: Industrial Robots
Abstract: Direct teaching in collaborative manipulators, an essential method for intuitive trajectory control, faces significant challenges due to friction in robot joints. To address this, we present a novel friction compensation framework to improve direct teaching methods for robots. Our approach focuses on mitigating friction in the joints most susceptible to frictional effects, ensuring smoother and more precise motion. The proposed framework uses deep neural networks (DNN) to model the complex friction behavior. This approach circumvents the difficulties associated with traditional friction compensation model selection. We develop specific data input preprocessing algorithms that optimize friction estimation when paired with the standard encoders commonly used in collaborative robots. In addition, our custom loss function is specifically designed to improve DNN training in the low-velocity region, where friction effects are most pronounced. To evaluate the effectiveness of our framework, we conduct comprehensive ablation studies assessing the impact of two critical components: the preprocessing algorithms and the custom loss function. These studies provide insight into the contributions of each element to overall performance. Experimental validation using two 6-DoF collaborative robots demonstrates the practical applicability and effectiveness of our approach.
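As an illustration of a loss that emphasizes the low-velocity region, the following is an assumed form, not the authors' loss function: a velocity-weighted squared error in which samples near zero velocity receive larger weights. The weighting scale is a placeholder value.

```python
# Minimal sketch (assumed loss form): up-weight friction-estimation error
# at low joint velocities, where direct teaching is most affected.
import torch

def low_velocity_weighted_loss(tau_pred, tau_true, velocity, scale=0.1):
    # Samples near zero velocity receive weights approaching 2;
    # fast-motion samples decay toward a weight of 1.
    weights = 1.0 + torch.exp(-(velocity / scale) ** 2)
    return torch.mean(weights * (tau_pred - tau_true) ** 2)

# Toy usage with placeholder torque predictions and targets
vel = torch.tensor([0.01, 0.05, 0.5, 1.0])
pred = torch.tensor([0.9, 1.1, 2.0, 2.1])
true = torch.tensor([1.0, 1.0, 2.0, 2.0])
print(low_velocity_weighted_loss(pred, true, vel))
```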
|
|
16:55-17:00, Paper TuDT24.5 | Add to My Program |
Fixture-Free 2D Sewing Using a Dual-Arm Manipulator System (I) |
|
Tokuda, Fuyuki | Centre for Transformative Garment Production |
Murakami, Ryo | Tohoku University |
Seino, Akira | Centre for Transformative Garment Production |
Kobayashi, Akinari | Centre for Transformative Garment Production |
Hayashibe, Mitsuhiro | Tohoku University |
Kosuge, Kazuhiro | The University of Hong Kong |
Keywords: Industrial Robots, Sensor-based Control, Dual Arm Manipulation
Abstract: We propose a fixture-free 2D sewing system using a dual-arm manipulator; in 2D sewing, the seam lines of the top and bottom fabric parts are identical. The proposed system sews two stacked fabric parts together along a desired seam line printed on the top fabric part without the use of a fixture. In the proposed system, the set of aligned and stacked fabric parts is held by the end-effectors of the dual-arm manipulator in coordination. The dual-arm manipulator controls the motion of the fabric parts on the flat sewing table stitch by stitch, while keeping the manipulated fabric parts flat using the internal force applied to the set of fabric parts. A novel vision-based seam line tracking control is proposed to control the motion of the set of fabric parts along the printed seam line on the top fabric part. The convergence of the tracking error is analyzed for sewing along both straight and curved seam lines and is shown to be specified by the control parameters. Sewing experiments show that the tracking error converges to zero as analyzed. The sewing experiments also show that the newly proposed trajectory generation method, which synchronizes the coordinated motion of the manipulators and the motion of the sewing needle, is essential for achieving accurate sewing.
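The stitch-by-stitch correction loop can be illustrated with a minimal sketch, not the paper's control law: a proportional correction of the fabric motion from the measured offset between the printed seam line and the needle, whose error decays geometrically at a rate set by the gain. The gain, initial offset, and stitch count are assumed values.

```python
# Minimal sketch (illustrative controller): per-stitch proportional correction
# of the fabric position based on the seam line's pixel offset at the needle.
import numpy as np

def simulate_seam_tracking(initial_offset_px=20.0, gain=0.4, stitches=30):
    offset = initial_offset_px
    history = []
    for _ in range(stitches):
        correction = -gain * offset        # lateral fabric correction per stitch
        offset += correction               # offset re-measured at the next stitch
        history.append(offset)
    return np.array(history)

errors = simulate_seam_tracking()
print("tracking error decays geometrically:", np.round(errors[:5], 2), "...")
```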
|
|
17:00-17:05, Paper TuDT24.6 | Add to My Program |
Improving the Collision Tolerance of High-Speed Industrial Robots Via Impact-Aware Path Planning and Series Clutched Actuation |
|
Ostyn, Frederik | Ghent University |
Vanderborght, Bram | VUB |
Crevecoeur, Guillaume | Ghent University |
Keywords: Collision tolerance assessment, Motion and Path Planning, Compliant Joint/Mechanism, Industrial Robots
Abstract: Robots are increasingly deployed in unstructured or unpredictable environments. Collisions at high speed, in particular, can severely damage the drivetrains and joint bearings of robots. To avoid such collisions, path planners exist that adapt the robot's original trajectory online if a collision hazard is detected. These methods require additional sensors such as cameras, are computationally costly, and are never flawless due to occlusions. Another approach is to incorporate a cost function that promotes collision tolerance while planning the initial trajectory. The resulting impact-aware path plan minimizes the chance of robot hardware damage should a collision occur. Two algorithms are presented to assess collision tolerance in high-speed robots, taking into account factors such as robot pose, impact direction, and the maximum intermittent loading of the gearboxes and bearings. The first algorithm is more general, while the second assumes the presence of joint overload clutches that decouple upon impact. These algorithms are applied to plan an impact-aware path for a custom 6-axis series-clutched actuated robot that serves as a use case.
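As a rough illustration of an impact-aware planning cost, and not the paper's two algorithms, the following sketch scores candidate joint-space paths by length plus a penalty whenever an assumed pose- and direction-dependent impact load exceeds a permissible intermittent load; the load model, rating, and weights are placeholders.

```python
# Minimal sketch (assumed cost structure): score candidate paths by length
# plus a penalty on overload of an assumed gearbox/bearing rating.
import numpy as np

MAX_INTERMITTENT_LOAD = 100.0   # assumed permissible peak load [Nm]

def impact_load(q, impact_direction):
    """Toy pose- and direction-dependent impact load at the most exposed joint."""
    lever = 1.0 + 0.5 * np.cos(q[1])           # effective lever arm vs. pose
    return 80.0 * lever * abs(impact_direction)

def path_cost(waypoints, impact_direction=1.0, w_tolerance=5.0):
    length = sum(np.linalg.norm(b - a) for a, b in zip(waypoints, waypoints[1:]))
    overloads = [max(0.0, impact_load(q, impact_direction) - MAX_INTERMITTENT_LOAD)
                 for q in waypoints]
    return length + w_tolerance * sum(overloads)

# Two candidate paths between the same start and goal joint configurations:
# candidate A passes a low-tolerance pose, candidate B stays in tolerant poses.
start, goal = np.array([0.0, 3.0]), np.array([1.0, 3.0])
candidate_a = [start, np.array([0.5, 0.0]), goal]
candidate_b = [start, np.array([0.5, 2.0]), goal]
print("cost A:", round(path_cost(candidate_a), 2),
      " cost B:", round(path_cost(candidate_b), 2))
```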
|
| |