| |
Last updated on May 13, 2025. This conference program is tentative and subject to change.
Technical Program for Thursday May 22, 2025
|
ThAT1 |
302 |
Planning and Large Language Models |
Regular Session |
Chair: Ikeuchi, Katsushi | Microsoft |
Co-Chair: Paulius, David | Brown University |
|
08:30-08:35, Paper ThAT1.1 | |
DELTA: Decomposed Efficient Long-Term Robot Task Planning Using Large Language Models |
|
Liu, Yuchen | Robert Bosch GmbH |
Palmieri, Luigi | Robert Bosch GmbH |
Koch, Sebastian | Ulm University, Robert Bosch GmbH |
Georgievski, Ilche | University of Stuttgart |
Aiello, Marco | University of Stuttgart |
Keywords: Task Planning, AI-Based Methods, Planning, Scheduling and Coordination
Abstract: Recent advancements in Large Language Models (LLMs) have sparked a revolution across many research fields. In robotics, the integration of common-sense knowledge from LLMs into task and motion planning has drastically advanced the field by unlocking unprecedented levels of context awareness. Despite their vast collection of knowledge, large language models may generate infeasible plans due to hallucinations or missing domain information. To address these challenges and improve plan feasibility and computational efficiency, we introduce DELTA, a novel LLM-informed task planning approach. By using scene graphs as environment representations within LLMs, DELTA achieves rapid generation of precise planning problem descriptions. To enhance planning performance, DELTA decomposes long-term task goals with LLMs into an autoregressive sequence of sub-goals, enabling automated task planners to efficiently solve complex problems. In our extensive evaluation, we show that DELTA enables an efficient and fully automatic task planning pipeline, achieving higher planning success rates and significantly shorter planning times compared to the state of the art.
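A minimal Python sketch of the goal-decomposition loop described in the abstract follows; the scene-graph text, LLM call, and PDDL planner interfaces are passed in as placeholders because the authors' actual interfaces are not given here.

def decompose_and_plan(scene_graph, goal, domain_pddl,
                       query_llm, build_problem, solve_pddl, apply_plan):
    """Ask an LLM to split a long-term goal into ordered sub-goals, then plan each.
    query_llm / build_problem / solve_pddl / apply_plan are hypothetical callables."""
    prompt = ("Environment (scene graph):\n" + scene_graph +
              "\nLong-term goal: " + goal +
              "\nList an ordered sequence of sub-goals, one per line.")
    sub_goals = [s for s in query_llm(prompt).splitlines() if s.strip()]
    full_plan, state = [], scene_graph
    for sub_goal in sub_goals:
        problem_pddl = build_problem(state, sub_goal)   # small sub-problem keeps the planner fast
        plan = solve_pddl(domain_pddl, problem_pddl)
        if plan is None:                                # infeasible sub-goal: caller can re-prompt
            return None
        full_plan.extend(plan)
        state = apply_plan(state, plan)                 # advance the symbolic state
    return full_plan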
|
|
08:35-08:40, Paper ThAT1.2 | |
Hey Robot! Personalizing Robot Navigation through Model Predictive Control with a Large Language Model |
|
Martinez-Baselga, Diego | University of Zaragoza |
de Groot, Oscar | Delft University of Technology |
Knoedler, Luzia | Delft University of Technology |
Alonso-Mora, Javier | Delft University of Technology |
Riazuelo, Luis | Instituto de Investigación en Ingeniería de Aragón, University of Zaragoza |
Montano, Luis | Universidad De Zaragoza |
Keywords: Motion and Path Planning, Human-Centered Robotics, Human-Aware Motion Planning
Abstract: Robot navigation methods allow mobile robots to operate in applications such as warehouses or hospitals. While the environment in which the robot operates imposes requirements on its navigation behavior, most existing methods do not allow the end-user to configure the robot's behavior and priorities, possibly leading to undesirable behavior (e.g., fast driving in a hospital). We propose a novel approach to adapt robot motion behavior based on natural language instructions provided by the end-user. Our zero-shot method uses an existing Visual Language Model to interpret a user text query or an image of the environment. This information is used to generate the cost function and reconfigure the parameters of a Model Predictive Controller, translating the user's instruction to the robot's motion behavior. This allows our method to safely and effectively navigate in dynamic and challenging environments. We extensively evaluate our method's individual components and demonstrate the effectiveness of our method on a ground robot in simulation and real-world experiments, and across a variety of environments and user specifications.
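To make the idea concrete, here is a hedged Python sketch of translating a user instruction into validated controller parameters; the parameter names, JSON schema, and query_vlm callable are illustrative assumptions rather than the authors' interface.

import json

DEFAULT_PARAMS = {"w_speed": 1.0, "w_clearance": 1.0, "v_max": 1.5}  # assumed names and units

def personalize_mpc(user_text, query_vlm, params=DEFAULT_PARAMS):
    """Ask a VLM for parameter updates and apply only validated, positive values."""
    prompt = ("User instruction: " + user_text + "\n"
              "Return JSON updating any of the keys " + ", ".join(params) +
              "; values must be positive numbers.")
    try:
        updates = json.loads(query_vlm(prompt))
    except (TypeError, ValueError):
        return dict(params)                       # malformed output: keep safe defaults
    new_params = dict(params)
    for key, value in updates.items():
        if key in new_params and isinstance(value, (int, float)) and value > 0:
            new_params[key] = float(value)        # only known keys pass through to the controller
    return new_params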
|
|
08:40-08:45, Paper ThAT1.3 | |
Large Language Model Based Autonomous Task Planning for Abstract Commands |
|
Kwon, Seokjoon | Korea Advanced Institute of Science and Technology |
Park, Jae-Hyeon | Samsung Display |
Jang, Hee-Deok | Korea Advanced Institute of Science and Technology |
Roh, CheolLae | Samsung Display Co |
Chang, Dong Eui | KAIST |
Keywords: Task Planning, Computer Vision for Automation, Robotics and Automation in Life Sciences
Abstract: Recent advances in large language models (LLMs) have demonstrated exceptional reasoning capabilities in natural language processing, sparking interest in applying LLMs to task planning problems in robotics. Most studies focused on task planning for clear natural language commands that specify target objects and their locations. However, for more user-friendly task execution, it is crucial for robots to autonomously plan and carry out tasks based on abstract natural language commands that may not explicitly mention target objects or locations, such as ‘Put the food ingredients in the same place.’ In this study, we propose an LLM-based autonomous task planning framework that generates task plans for abstract natural language commands. This framework consists of two phases: an environment recognition phase and a task planning phase. In the environment recognition phase, a large vision-language model generates a hierarchical scene graph that captures the relationships between objects and spaces in the environment surrounding a robot agent. During the task planning phase, an LLM uses the scene graph and the abstract user command to formulate a plan for the given task. We validate the effectiveness of the proposed framework in the AI2THOR simulation environment, demonstrating its superior performance in task execution when handling abstract commands.
|
|
08:45-08:50, Paper ThAT1.4 | |
Self-Corrective Task Planning by Inverse Prompting with Large Language Models |
|
Lee, Jiho | Chung-Ang University |
Lee, Hayun | Chung-Ang University |
Kim, Jonghyeon | Chung-Ang University |
Lee, Kyungjae | Korea University |
Kim, Eunwoo | Chung-Ang University |
Keywords: Task Planning
Abstract: In robot task planning, large language models (LLMs) have shown significant promise in generating complex and long-horizon action sequences. However, it is observed that LLMs often produce responses that sound plausible but are not accurate. To address these problems, existing methods typically employ predefined error sets or external knowledge sources, requiring human efforts and computation resources. Recently, self-correction approaches have emerged, where LLM generates and refines plans, identifying errors by itself. Despite their effectiveness, they are more prone to failures in correction due to insufficient reasoning. In this paper, we propose a novel self-corrective planning of tasks with inverse prompting, named InversePrompt, which contains reasoning steps to provide interpretable groundings for feedback. It generates the inverse actions corresponding to generated actions and verifies if these inverse actions can restore the system to its original state, thereby explicitly validating the logical flow of the generated plans. The results on benchmark datasets show an average 16.3% higher success rate over existing LLM-based task planning methods. Our approach offers clearer justifications for feedback in real-world environments, resulting in more successful task completion than existing self-correction approaches across various scenarios.
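A minimal Python sketch of the inverse-action check described above; the LLM, simulator, and state-comparison callables are hypothetical placeholders, since the paper's concrete interfaces are not reproduced here.

def verify_with_inverse_actions(init_state, plan, query_llm, simulate, states_equal):
    """Return (ok, feedback). simulate(state, actions) -> new state; all callables are assumed."""
    end_state = simulate(init_state, plan)
    prompt = ("For each action below, give the inverse action that undoes it, "
              "listed in reverse order:\n" + "\n".join(plan))
    inverse_plan = [a for a in query_llm(prompt).splitlines() if a.strip()]
    restored = simulate(end_state, inverse_plan)
    if states_equal(restored, init_state):
        return True, "inverse actions restore the initial state"
    return False, "plan is not invertible; preconditions or effects are likely inconsistent"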
|
|
08:50-08:55, Paper ThAT1.5 | |
Traffic Regulation-Aware Path Planning with Regulation Databases and Vision-Language Models |
|
Han, Xu | University of California Los Angeles |
Wu, Zhiwen | University of California, Los Angeles |
Xia, Xin | University of California, Los Angeles |
Ma, Jiaqi | University of California, Los Angeles |
Keywords: Motion and Path Planning, Integrated Planning and Control, Planning under Uncertainty
Abstract: This paper introduces and tests a framework that integrates traffic regulation compliance into automated driving systems (ADS). The framework enables ADS to follow traffic laws and make informed decisions based on the driving environment. Using RGB camera inputs and a vision-language model (VLM), the system generates descriptive text to support a regulation-aware decision-making process, ensuring legal and safe driving practices. This information is combined with a machine-readable ADS regulation database to guide future driving plans within legal constraints. Key features include: 1) a regulation database supporting ADS decision-making, 2) an automated process using sensor input for regulation-aware path planning, and 3) validation in both simulated and real-world environments. Particularly, the real-world vehicle tests not only assess the framework's performance but also evaluate the potential and challenges of VLMs to solve complex driving problems by integrating detection, reasoning, and planning. This work enhances the legality, safety, and public trust in ADS, representing a significant step forward in the field.
|
|
08:55-09:00, Paper ThAT1.6 | |
DrPlanner: Diagnosis and Repair of Motion Planners for Automated Vehicles Using Large Language Models |
|
Lin, Yuanfei | Technical University of Munich |
Li, Chenran | University of California, Berkeley |
Ding, Mingyu | UC Berkeley |
Tomizuka, Masayoshi | University of California |
Zhan, Wei | University of California, Berkeley |
Althoff, Matthias | Technische Universität München |
Keywords: Integrated Planning and Learning, Motion and Path Planning, Intelligent Transportation Systems
Abstract: Motion planners are essential for the safe operation of automated vehicles across various scenarios. However, no motion planning algorithm has achieved perfection in the literature, and improving its performance is often time-consuming and labor-intensive. To tackle the aforementioned issues, we present DrPlanner, the first framework designed to automatically diagnose and repair motion planners using large language models. Initially, we generate a structured description of the planner and its planned trajectories from both natural and programming languages. Leveraging the profound capabilities of large language models in addressing reasoning challenges, our framework returns repaired planners with detailed diagnostic descriptions. Furthermore, the framework advances iteratively with continuous feedback from the evaluation of the repaired outcomes. Our approach is validated using both search- and sampling-based motion planners for automated vehicles; experimental results highlight the need for demonstrations in the prompt and show the ability of our framework to effectively identify and rectify elusive issues.
|
|
ThAT2 |
301 |
SLAM 5 |
Regular Session |
Chair: Zelek, John S. | University of Waterloo |
Co-Chair: Younès, Raoui | University Mohammed V in Rabat |
|
08:30-08:35, Paper ThAT2.1 | |
MGS-SLAM: Monocular Sparse Tracking and Gaussian Mapping with Depth Smooth Regularization |
|
Zhu, Pengcheng | Northeastern University |
Zhuang, Yaoming | Northeastern University |
Chen, Baoquan | Northeastern University |
Li, Li | Northeastern University |
Wu, Chengdong | Northeastern University |
Liu, Zhanlin | University of Washington |
Keywords: SLAM, Mapping
Abstract: This letter introduces a novel framework for dense Visual Simultaneous Localization and Mapping (VSLAM) based on Gaussian Splatting. Recently, SLAM based on Gaussian Splatting has shown promising results. However, in monocular scenarios, the reconstructed Gaussian maps lack geometric accuracy and exhibit weaker tracking capability. To address these limitations, we jointly optimize sparse visual odometry tracking and the 3D Gaussian Splatting scene representation for the first time. Depth maps are estimated over the visual odometry keyframe window using a fast Multi-View Stereo (MVS) network to provide geometric supervision for the Gaussian maps. Furthermore, we propose a depth smooth loss and a Sparse-Dense Adjustment Ring (SDAR) to reduce the negative effect of the estimated depth maps and preserve the consistency in scale between the visual odometry and the Gaussian maps. We have evaluated our system across various synthetic and real-world datasets. Our pose estimation accuracy surpasses existing methods and achieves state-of-the-art performance. Additionally, our system outperforms previous monocular methods in terms of novel view synthesis and geometric reconstruction fidelity.
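For illustration, a common edge-aware formulation of a depth smoothness term is sketched below in PyTorch; the paper's exact loss may differ.

import torch

def depth_smooth_loss(depth: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """depth: (B,1,H,W), image: (B,3,H,W); penalize depth gradients except at image edges."""
    d_dx = torch.abs(depth[:, :, :, 1:] - depth[:, :, :, :-1])
    d_dy = torch.abs(depth[:, :, 1:, :] - depth[:, :, :-1, :])
    i_dx = torch.mean(torch.abs(image[:, :, :, 1:] - image[:, :, :, :-1]), 1, keepdim=True)
    i_dy = torch.mean(torch.abs(image[:, :, 1:, :] - image[:, :, :-1, :]), 1, keepdim=True)
    # Down-weight the penalty where the image itself has strong gradients (likely true edges).
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()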
|
|
08:35-08:40, Paper ThAT2.2 | |
GARAD-SLAM: 3D GAussian Splatting for Real-Time Anti Dynamic SLAM |
|
Li, Mingrui | Dalian University of Technology |
Chen, Weijian | Sun Yat-Sen University |
Cheng, Na | Dalian University of Technology |
Xu, Jingyuan | Dalian University of Technology |
Li, Dong | University of Macau |
Wang, Hongyu | Dalian University of Technology |
Keywords: SLAM, Mapping, Localization
Abstract: The 3D Gaussian Splatting (3DGS)-based SLAM system has garnered widespread attention due to its excellent performance in real-time high-fidelity rendering. However, in real-world environments filled with dynamic objects, existing 3DGS-based SLAM systems often face mapping errors and tracking drift issues. To address this, we propose GARAD-SLAM, a real-time 3DGS-based SLAM system tailored for dynamic scenes. In terms of tracking, unlike traditional methods, we directly perform dynamic segmentation on Gaussians and map them back to the front end to obtain dynamic point labels through a Gaussian pyramid network, achieving precise dynamic removal and robust tracking. For mapping, we impose rendering penalties on dynamically labeled Gaussians updated through the network to avoid irreversible erroneous removal caused by simple pruning. Our results on real-world datasets demonstrate that our method is competitive in tracking compared to baseline methods, generating fewer artifacts and higher-quality reconstructions in rendering.
|
|
08:40-08:45, Paper ThAT2.3 | |
Optimizing NeRF-Based SLAM with Trajectory Smoothness Constraints |
|
He, Yicheng | Southern University of Science and Technology |
Chen, Guangcheng | Southern University of Science and Technology |
Zhang, Hong | SUSTech |
Keywords: SLAM, Localization, Mapping
Abstract: The joint optimization of Neural Radiance Fields (NeRF) and camera trajectories has been widely applied in SLAM tasks due to its superior dense mapping quality and consistency. NeRF-based SLAM learns camera poses using constraints from the implicit map representation. A widely observed phenomenon that results from constraints of this form is jerky and physically unrealistic estimated camera motion, which in turn affects the map quality. To address this deficiency of current NeRF-based SLAM, we propose in this paper TS-SLAM (TS for Trajectory Smoothness). It introduces smoothness constraints on camera trajectories by representing them with uniform cubic B-splines with continuous acceleration, which guarantees smooth camera motion. Benefiting from the differentiability and local control properties of B-splines, TS-SLAM can incrementally learn the control points end-to-end using a sliding window paradigm. Additionally, we regularize camera trajectories by exploiting a dynamics prior to further smooth the trajectories. Experimental results demonstrate that TS-SLAM achieves superior trajectory accuracy and improves mapping quality versus NeRF-based SLAM that does not employ the above smoothness constraints.
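As a small illustration of the representation, the sketch below evaluates a uniform cubic B-spline from control points (translations only; TS-SLAM works with full SE(3) poses).

import numpy as np

# Standard uniform cubic B-spline basis matrix.
M = (1.0 / 6.0) * np.array([[1, 4, 1, 0],
                             [-3, 0, 3, 0],
                             [3, -6, 3, 0],
                             [-1, 3, -3, 1]])

def bspline_position(ctrl_pts: np.ndarray, i: int, u: float) -> np.ndarray:
    """ctrl_pts: (N,3) control points; segment i uses points i..i+3; u in [0,1)."""
    U = np.array([1.0, u, u * u, u ** 3])
    return U @ M @ ctrl_pts[i:i + 4]          # (3,) interpolated position

# Example: sample a short trajectory densely between control points.
ctrl = np.array([[0, 0, 0], [1, 0, 0], [2, 1, 0], [3, 1, 0], [4, 0, 0]], float)
samples = [bspline_position(ctrl, i, u) for i in range(len(ctrl) - 3)
           for u in np.linspace(0, 1, 10, endpoint=False)]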
|
|
08:45-08:50, Paper ThAT2.4 | |
MGSO: Monocular Real-Time Photometric SLAM with Efficient 3D Gaussian Splatting |
|
Hu, Kevin | University of Waterloo |
Abboud, Nicolas | American University of Beirut |
Ali, Muhammad Q. | University of Waterloo |
Yang, Adam Srebrnjak | University of Waterloo |
Elhajj, Imad | American University of Beirut |
Asmar, Daniel | American University of Beirut |
Chen, Yuhao | University of Waterloo |
Zelek, John S. | University of Waterloo |
Keywords: SLAM, Mapping, Vision-Based Navigation
Abstract: Real-time SLAM with dense 3D mapping is computationally challenging, especially on resource-limited devices. The recent development of 3D Gaussian Splatting (3DGS) offers a promising approach for real-time dense 3D reconstruction. However, existing 3DGS-based SLAM systems struggle to balance hardware simplicity, speed, and map quality. Most systems excel in one or two of the aforementioned aspects but rarely achieve all. A key issue is the difficulty of initializing 3D Gaussians while concurrently conducting SLAM. To address these challenges, we present Monocular GSO (MGSO), a novel real-time SLAM system that integrates photometric SLAM with 3DGS. Photometric SLAM provides dense structured point clouds for 3DGS initialization, accelerating optimization and producing more efficient maps with fewer Gaussians. As a result, experiments show that our system generates reconstructions with a balance of quality, memory efficiency, and speed that outperforms the state of the art. Furthermore, our system achieves all results using RGB inputs. We evaluate on the Replica, TUM-RGBD, and EuRoC datasets against current live dense reconstruction systems. Not only do we surpass contemporary systems, but experiments also show that we maintain our performance on laptop hardware, making it a practical solution for robotics, AR, and other real-time applications.
|
|
08:50-08:55, Paper ThAT2.5 | |
RGB-Only Gaussian Splatting SLAM for Unbounded Outdoor Scenes |
|
Yu, Sicheng | HKUST(gz) |
Cheng, Chong | HKUST(GZ) |
Zhou, Yifan | The Hong Kong University of Science and Technology (Guangzhou) |
Yang, Xiaojun | The Hong Kong University of Science and Technology (Guangzhou) |
Wang, Hao | HKUST(GZ) |
Keywords: Deep Learning for Visual Perception, Visual Learning, SLAM
Abstract: 3D Gaussian Splatting (3DGS) has become a popular solution in SLAM, as it can produce high-fidelity novel views. However, previous GS-based methods primarily target indoor scenes and rely on RGB-D sensors or pre-trained depth estimation models, hence underperforming in outdoor scenarios. To address this issue, we propose an RGB-only Gaussian Splatting SLAM method for unbounded outdoor scenes—OpenGS-SLAM. Technically, we first employ a pointmap regression network to generate consistent pointmaps between frames for pose estimation. Compared to commonly used depth maps, pointmaps include spatial relationships and scene geometry across multiple views, enabling robust camera pose estimation. Then, we propose integrating the estimated camera poses with 3DGS rendering as an end-to-end differentiable pipeline. Our method achieves simultaneous optimization of camera poses and 3DGS scene parameters, significantly enhancing system tracking accuracy. In addition, we design an adaptive scale mapper for the pointmap regression network, which provides more accurate pointmap mapping to the 3DGS map representation. Our experiments on the Waymo dataset demonstrate that OpenGS-SLAM reduces tracking error to 9.8% of previous 3DGS methods, and achieves state-of-the-art results in novel view synthesis. Project page: https://opengsslam.github.io/.
|
|
08:55-09:00, Paper ThAT2.6 | |
FGO-SLAM: Enhancing Gaussian SLAM with Globally Consistent Opacity Radiance Field |
|
Zhu, Fan | University of Science and Technology of China |
Zhao, Yifan | University of Science and Technology of China |
Chen, Ziyu | University of Science and Technology of China |
Yu, Biao | Hefei Institutes of Physical Science, Chinese Academy of Sciences |
Zhu, Hui | Hefei Institutes of Physical Science, Chinese Academy of Sciences |
Keywords: Mapping, SLAM, Embodied Cognitive Science
Abstract: Visual SLAM has regained attention due to its ability to provide perception capabilities and simulation test data for Embodied AI. However, traditional SLAM systems struggle to meet the demands of high-quality scene reconstruction, and Gaussian SLAM systems, despite their rapid rendering and high-quality mapping capabilities, lack effective pose optimization methods and face challenges in geometric reconstruction. To address these issues, we introduce FGO-SLAM, a Gaussian SLAM system that employs an opacity radiance field as the scene representation to enhance geometric mapping performance. After initial pose estimation, we apply global adjustment to optimize camera poses and sparse point cloud, ensuring robust tracking of our system. Additionally, we maintain a globally consistent opacity radiance field based on 3D Gaussians and introduce depth distortion and normal consistency terms to refine the scene representation. Furthermore, after constructing tetrahedral grids, we identify level sets to directly extract surfaces from 3D Gaussians. Results across various real-world and large-scale synthetic datasets demonstrate that our method achieves state-of-the-art tracking accuracy and mapping performance.
|
|
ThAT3 |
303 |
Point Cloud Registration |
Regular Session |
Chair: Fraundorfer, Friedrich | Graz University of Technology |
Co-Chair: Lim, Hyungtae | Massachusetts Institute of Technology |
|
08:30-08:35, Paper ThAT3.1 | |
Multi-View Registration of Partially Overlapping Point Clouds for Robotic Manipulation |
|
Xie, Yuzhen | Southeast University |
Song, Aiguo | Southeast University |
Keywords: RGB-D Perception, Computer Vision for Automation, Data Sets for Robotic Vision
Abstract: Point cloud registration is a fundamental task in intelligent robots, aiming to achieve globally consistent geometric structures and providing data support for robotic manipulation. Due to the limited view of measurement devices, it is necessary to collect point clouds from multiple views to construct a complete model. Previous multi-view registration methods rely on sufficient overlap and on registering all pairs of point clouds, resulting in slow convergence and high cumulative errors. To address these challenges, we present a multi-view registration method based on the point-to-plane model and a pose graph. We introduce a robust kernel into the objective function to diminish registration errors caused by mismatched points. Additionally, an enhanced Euclidean clustering method is proposed for extracting object point clouds. Subsequently, by establishing pose constraints on non-adjacent frames of point clouds, the cumulative error is reduced, achieving global optimization based on the pose graph. Experimental results demonstrate the robustness of our method with respect to overlap ratios, successfully registering point clouds with overlap ratios exceeding 30%. In comparison to other techniques, our method reduces the E(R) of multi-view registration by 13.54% and E(t) by 18.72%, effectively reducing the cumulative error.
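A brief sketch of a robust-kernel point-to-plane cost of the kind referred to above; Huber is one possible kernel, and the threshold here is illustrative rather than the paper's choice.

import numpy as np

def huber_weight(r: np.ndarray, delta: float = 0.02) -> np.ndarray:
    """IRLS weights for the Huber loss: 1 inside the inlier band, delta/|r| outside."""
    a = np.abs(r)
    return np.where(a <= delta, 1.0, delta / np.maximum(a, 1e-12))

def weighted_point_to_plane_error(src, tgt, normals, delta=0.02):
    """src, tgt, normals: (N,3) matched points and target normals in a common frame."""
    r = np.einsum("ij,ij->i", src - tgt, normals)   # signed point-to-plane residuals
    w = huber_weight(r, delta)
    return float(np.sum(w * r * r)), w              # weighted cost and per-point weights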
|
|
08:35-08:40, Paper ThAT3.2 | |
Kinematic-ICP: Enhancing LiDAR Odometry with Kinematic Constraints for Wheeled Mobile Robots Moving on Planar Surfaces |
|
Guadagnino, Tiziano | University of Bonn |
Mersch, Benedikt | University of Bonn |
Vizzo, Ignacio | Dexory |
Gupta, Saurabh | University of Bonn |
Malladi, Meher Venkata Ramakrishna | University of Bonn |
Lobefaro, Luca | University of Bonn |
Doisy, Guillaume | Dexory |
Stachniss, Cyrill | University of Bonn |
Keywords: Localization, Mapping
Abstract: LiDAR odometry is essential for many robotics applications, including 3D mapping, navigation, and simultaneous localization and mapping. LiDAR odometry systems are usually based on some form of point cloud registration to compute the ego-motion of a mobile robot. Yet, few of today's LiDAR odometry systems consider domain-specific knowledge or the kinematic model of the mobile platform during the point cloud alignment. In this paper, we present Kinematic-ICP, a LiDAR odometry system that focuses on wheeled mobile robots equipped with a 3D LiDAR and moving on a planar surface, which is a common assumption for warehouses, offices, hospitals, etc. Our approach introduces kinematic constraints within the optimization of a traditional point-to-point iterative closest point scheme. In this way, the resulting motion follows the kinematic constraints of the platform, effectively exploiting the robot's wheel odometry and the 3D LiDAR observations. We dynamically adjust the influence of LiDAR measurements and wheel odometry in our optimization scheme, allowing the system to handle degenerate scenarios such as feature-poor corridors. We evaluate our approach on robots operating in large-scale warehouse environments, but also outdoors. The experiments show that our approach achieves top performances and is more accurate than wheel odometry and common LiDAR odometry systems. Kinematic-ICP has been recently deployed in the Dexory fleet of robots operating in warehouses worldwide at their customers' sites, showing that our method can run in the real world alongside a complete navigation stack.
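To illustrate the flavour of such a kinematic constraint, the sketch below projects an unconstrained planar pose increment onto a constant-velocity unicycle arc; this is a simplified stand-in, not the authors' optimization scheme.

import numpy as np

def project_to_unicycle(dx, dy, dtheta, dt=0.1):
    """Project an unconstrained SE(2) increment onto a constant (v, omega) arc over dt."""
    omega = dtheta / dt
    if abs(dtheta) < 1e-9:
        return (dx, 0.0, 0.0), (dx / dt, 0.0)        # straight motion; lateral slip dy is dropped
    s, c = np.sin(dtheta), 1.0 - np.cos(dtheta)
    # Least-squares fit of the turning radius r to dx = r*sin(dtheta), dy = r*(1 - cos(dtheta)).
    r = (dx * s + dy * c) / (s * s + c * c)
    v = r * omega                                     # executable forward velocity
    return (r * s, r * c, dtheta), (v, omega)         # increment the platform can actually realize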
|
|
08:40-08:45, Paper ThAT3.3 | |
GERA: Geometric Embedding for Efficient Point Registration Analysis |
|
Li, Geng | Shandong University |
Cao, Haozhi | Nanyang Technological University |
Liu, Mingyang | Shandong University |
Yuan, Shenghai | Nanyang Technological University |
Yang, Jianfei | Nanyang Technological University |
Keywords: Computer Vision for Medical Robotics, Representation Learning, Medical Robots and Systems
Abstract: Point cloud registration aims to provide estimated transformations to align 3D point clouds, which plays a crucial role in pose estimation of various navigation systems, such as surgical guidance systems and autonomous vehicles. Despite the impressive performance of recent models on benchmark datasets, many rely on complex modules like KPConv and Transformers, which impose significant computational and memory demands. These requirements hinder their practical application, particularly in resource-constrained environments such as mobile robotics. In this paper, we propose a novel point cloud registration network that leverages a pure MLP architecture, constructing geometric information offline. This approach eliminates the computational and memory burdens associated with traditional complex feature extractors and significantly reduces training time and resource consumption. Our method is the first to replace 3D coordinate inputs with offline-constructed geometric encoding, improving generalization and stability, as demonstrated by Maximum Mean Discrepancy (MMD) comparisons. This efficient and accurate geometric representation marks a significant advancement in point cloud analysis, particularly for applications requiring fast and reliable processing.
|
|
08:45-08:50, Paper ThAT3.4 | |
KISS-Matcher: Fast and Robust Point Cloud Registration Revisited |
|
Lim, Hyungtae | Massachusetts Institute of Technology |
Kim, Daebeom | Korea Advanced Institute of Science and Technology |
Shin, Gunhee | KAIST |
Shi, Jingnan | Massachusetts Institute of Technology |
Vizzo, Ignacio | Dexory |
Myung, Hyun | KAIST (Korea Advanced Institute of Science and Technology) |
Park, Jaesik | Seoul National University |
Carlone, Luca | Massachusetts Institute of Technology |
Keywords: Mapping, Localization, SLAM
Abstract: While global point cloud registration systems have advanced significantly in all aspects, many studies have focused on specific components, such as feature extraction, graph-theoretic pruning, or pose solvers. In this paper, we take a holistic view on the registration problem and develop an open-source and versatile C++ library for point cloud registration, called KISS-Matcher. KISS-Matcher combines a novel feature detector, Faster-PFH, that improves over the classical fast point feature histogram (FPFH). Moreover, it adopts a k-core-based graph-theoretic pruning to reduce the time complexity of rejecting outlier correspondences. Finally, it combines these modules in a complete, user-friendly, and ready-to-use pipeline. As verified by extensive experiments, KISS-Matcher has superior scalability and broad applicability, achieving a substantial speed-up compared to state-of-the-art outlier-robust registration pipelines while preserving accuracy. Our code will be available at https://github.com/MIT-SPARK/KISS-Matcher.
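The sketch below illustrates k-core-based pruning of a correspondence consistency graph using networkx; the pairwise distance-consistency test and its tolerance are assumptions for illustration, not the library's actual implementation.

import itertools
import networkx as nx
import numpy as np

def kcore_prune(src, tgt, k=5, tol=0.05):
    """src, tgt: (N,3) putative correspondences. Return indices surviving the k-core."""
    G = nx.Graph()
    G.add_nodes_from(range(len(src)))
    for i, j in itertools.combinations(range(len(src)), 2):
        # Two correspondences are mutually consistent if they preserve pairwise distance.
        if abs(np.linalg.norm(src[i] - src[j]) - np.linalg.norm(tgt[i] - tgt[j])) < tol:
            G.add_edge(i, j)
    return sorted(nx.k_core(G, k).nodes)   # O(N^2) graph construction; fine for a sketch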
|
|
08:50-08:55, Paper ThAT3.5 | |
SANDRO: A Robust Solver with a Splitting Strategy for Point Cloud Registration |
|
Adlerstein, Michael | Italian Institute of Technology |
Soares, João Carlos Virgolino | Istituto Italiano Di Tecnologia |
Bratta, Angelo | Istituto Italiano Di Tecnologia |
Semini, Claudio | Istituto Italiano Di Tecnologia |
Keywords: RGB-D Perception, Mapping
Abstract: Point cloud registration is a critical problem in computer vision and robotics, especially in the field of navigation. Current methods often fail when faced with high outlier rates or take a long time to converge to a suitable solution. In this work, we introduce a novel algorithm for point cloud registration called SANDRO (Splitting strategy for point cloud Alignment using Non-convex anD Robust Optimization), which combines an Iteratively Reweighted Least Squares (IRLS) framework with a robust loss function with graduated non-convexity. This approach is further enhanced by a splitting strategy designed to handle high outlier rates and skewed distributions of outliers. SANDRO is capable of addressing important limitations of existing methods, such as challenging scenarios where the presence of high outlier rates and point cloud symmetries significantly hinders convergence. SANDRO achieves superior performance in terms of success rate when compared to state-of-the-art methods, demonstrating a 20% improvement over the current state of the art when tested on the Redwood real dataset and a 60% improvement when tested on synthetic data.
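A compact sketch of graduated non-convexity weights for an IRLS solver, using the Geman-McClure surrogate common in the GNC literature; the initialization and annealing factor below are illustrative choices, not necessarily the paper's.

import numpy as np

def gnc_gm_weights(residuals: np.ndarray, mu: float, c: float) -> np.ndarray:
    """Per-correspondence IRLS weights for the GNC surrogate of the Geman-McClure loss."""
    return (mu * c * c / (residuals ** 2 + mu * c * c)) ** 2

def gnc_schedule(r0_max: float, c: float, factor: float = 1.4):
    """Start nearly convex (large mu) and anneal mu toward 1 between IRLS iterations."""
    mu = max(1.0, r0_max ** 2 / (c * c))
    while mu > 1.0:
        yield mu
        mu = max(1.0, mu / factor)
    yield 1.0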
|
|
08:55-09:00, Paper ThAT3.6 | |
Bridging In-Situ and Satellite Data: Enhancing Gas Concentration Estimation through Integration of Data-Driven and Physics-Based Modeling |
|
Lu, Guoyu | University of Georgia |
Keywords: RGB-D Perception, Vision-Based Navigation, Visual Tracking
Abstract: Gas concentration estimation is crucial for understanding and mitigating climate change. While most research and monitoring efforts focus on major greenhouse gases such as CO2, significantly less attention has been given to trace gases like NO2, which play a critical role in atmospheric chemistry and air quality. This paper aims to enhance trace gas concentration estimation by integrating physics-based models into data-driven neural network frameworks. Furthermore, to improve large-scale estimation accuracy, we incorporate in-situ measurements to refine neural network models trained on satellite observations. The resulting model can provide reliable large-scale gas concentration estimates, particularly for locations lacking precise in-situ measurements. This approach offers a novel pathway to enhance the accuracy and applicability of gas monitoring for climate and environmental research. While NO2 serves as the target trace gas in this study, the proposed framework is potentially applicable to the prediction of other atmospheric gas concentrations.
|
|
ThAT4 |
304 |
Image and 3D Segmentation 1 |
Regular Session |
Chair: Koppal, Sanjeev | University of Florida |
Co-Chair: Matteucci, Matteo | Politecnico Di Milano |
|
08:30-08:35, Paper ThAT4.1 | |
A Novel Decomposed Feature-Oriented Framework for Open-Set Semantic Segmentation on LiDAR Data |
|
Deng, Wenbang | National University of Defense Technology |
Chen, Xieyuanli | National University of Defense Technology |
Yu, Qinghua | National University of Defense Technology |
He, Yunze | Hunan University |
Xiao, Junhao | National University of Defense Technology |
Lu, Huimin | National University of Defense Technology |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Computer Vision for Transportation
Abstract: Semantic segmentation is a key technique that enables mobile robots to understand and navigate surrounding environments autonomously. However, most existing works focus on segmenting known objects, overlooking the identification of unknown classes, which is common in real-world applications. In this paper, we propose a feature-oriented framework for open-set semantic segmentation on LiDAR data, capable of identifying unknown objects while retaining the ability to classify known ones. We design a decomposed dual-decoder network to simultaneously perform closed-set semantic segmentation and generate distinctive features for unknown objects. The network is trained with multi-objective loss functions to capture the characteristics of known and unknown objects. Using the extracted features, we introduce an anomaly detection mechanism to identify unknown objects. By integrating the results of closed-set semantic segmentation and anomaly detection, we achieve effective feature-driven LiDAR open-set semantic segmentation. Evaluations on both SemanticKITTI and nuScenes datasets demonstrate that our proposed framework significantly outperforms state-of-the-art methods. The source code will be made publicly available at https://github.com/nubot-nudt/DOSS.
|
|
08:35-08:40, Paper ThAT4.2 | |
SAM-Guided Pseudo Label Enhancement for Multi-Modal 3D Semantic Segmentation |
|
Yang, Mingyu | University of Michigan |
Lu, Jitong | University of Michigan |
Kim, Hun-Seok | University of Michigan |
Keywords: Deep Learning for Visual Perception, Sensor Fusion
Abstract: Multi-modal 3D semantic segmentation is vital for applications such as autonomous driving and virtual reality (VR). To effectively deploy these models in real-world scenarios, it is essential to employ cross-domain adaptation techniques that bridge the gap between training data and real-world data. Recently, self-training with pseudo-labels has emerged as a predominant method for cross-domain adaptation in multi-modal 3D semantic segmentation. However, generating reliable pseudo-labels necessitates stringent constraints, which often result in sparse pseudo-labels after pruning. This sparsity can potentially hinder performance improvement during the adaptation process. We propose an image-guided pseudo-label enhancement approach that leverages the complementary 2D prior knowledge from the Segment Anything Model (SAM) to introduce more reliable pseudo-labels, thereby boosting domain adaptation performance. Specifically, given a 3D point cloud and the SAM masks from its paired image data, we collect all 3D points covered by each SAM mask that potentially belong to the same object. Then our method refines the pseudo-labels within each SAM mask in two steps. First, we determine the class label for each mask using majority voting and employ various constraints to filter out unreliable mask labels. Next, we introduce Geometry-Aware Progressive Propagation (GAPP) which propagates the mask label to all 3D points within the SAM mask while avoiding outliers caused by 2D-3D misalignment. Experiments conducted across multiple datasets and domain adaptation scenarios demonstrate that our proposed method significantly increases the quantity of high-quality pseudo-labels and enhances the adaptation performance over baseline methods.
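A simplified Python sketch of the per-mask majority vote described above; the geometry-aware propagation step (GAPP) is omitted, and the confidence threshold is an assumption.

import numpy as np

def refine_mask_labels(pseudo_labels, mask_point_idx, min_ratio=0.6, ignore=-1):
    """pseudo_labels: (N,) per-point labels (ignore for unlabeled);
    mask_point_idx: list of index arrays, one per SAM mask."""
    refined = pseudo_labels.copy()
    for idx in mask_point_idx:
        labels = pseudo_labels[idx]
        valid = labels[labels != ignore]
        if valid.size == 0:
            continue
        values, counts = np.unique(valid, return_counts=True)
        winner, votes = values[np.argmax(counts)], counts.max()
        if votes / valid.size >= min_ratio:          # only trust sufficiently confident masks
            refined[idx] = winner                    # propagate the label to all points in the mask
    return refined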
|
|
08:40-08:45, Paper ThAT4.3 | |
Robot Manipulation in Salient Vision through Referring Image Segmentation and Geometric Constraints |
|
Jiang, Chen | University of Alberta |
Wang, Allie | University of Alberta |
Jagersand, Martin | University of Alberta |
Keywords: Deep Learning for Visual Perception, Learning Categories and Concepts, Visual Servoing
Abstract: In this paper, we perform robot manipulation activities in real-world environments with language contexts by integrating a compact referring image segmentation model into the robot's perception module. First, we propose CLIPU^2Net, a lightweight referring image segmentation model designed for fine-grain boundary and structure segmentation from language expressions. Then, we deploy the model in an eye-in-hand visual servoing system to enact robot control in the real world. The key to our system is the representation of salient visual information as geometric constraints, linking the robot’s visual perception to actionable commands. Experimental results on 46 real-world robot manipulation tasks demonstrate that our method outperforms traditional visual servoing methods relying on labor-intensive feature annotations, excels in fine-grain referring image segmentation with a compact decoder size of 6.6 MB, and supports robot control across diverse contexts.
|
|
08:45-08:50, Paper ThAT4.4 | |
Boosting Cross-Spectral Unsupervised Domain Adaptation for Thermal Semantic Segmentation |
|
Kwon, SeokJun | Sejong University |
Shin, Jeongmin | Sejong University |
Kim, Namil | NAVER LABS |
Hwang, Soonmin | Hanyang University |
Choi, Yukyung | Sejong University |
Keywords: Deep Learning for Visual Perception, Deep Learning Methods, Recognition
Abstract: In autonomous driving, thermal image semantic segmentation has emerged as a critical research area, owing to its ability to provide robust scene understanding under adverse visual conditions. In particular, unsupervised domain adaptation (UDA) for thermal image segmentation can be an efficient solution to address the lack of labeled thermal datasets. Nevertheless, since these methods do not effectively utilize the complementary information between RGB and thermal images, they significantly decrease performance during domain adaptation. In this paper, we present a comprehensive study on cross-spectral UDA for thermal image semantic segmentation. We first propose a novel masked mutual learning strategy that promotes complementary information exchange by selectively transferring results between each spectral model while masking out uncertain regions. Additionally, we introduce a novel prototypical self-supervised loss designed to enhance the performance of the thermal segmentation model in nighttime scenarios. This approach addresses the limitations of RGB pre-trained networks, which cannot effectively transfer knowledge under low illumination due to the inherent constraints of RGB sensors. In experiments, our method achieves higher performance over previous UDA methods and comparable performance to state-of-the-art supervised methods.
|
|
08:50-08:55, Paper ThAT4.5 | |
VideoSAM: Open-World Video Segmentation |
|
Guo, Pinxue | Fudan University |
Zhao, Zixu | Amazon Web Services |
Gao, Jianxiong | Fudan University |
Wu, Chongruo | UC Davis |
He, Tong | Amazon.com |
Zhang, Zheng | AWS |
Xiao, Tianjun | AWS |
Zhang, Wenqiang | Fudan University |
Keywords: Recognition, Object Detection, Segmentation and Categorization, Computer Vision for Automation
Abstract: Video segmentation is essential for advancing robotics and autonomous driving, particularly in open-world settings where continuous perception and object association across video frames are critical. While the Segment Anything Model (SAM) has excelled in static image segmentation, extending its capabilities to video segmentation poses significant challenges. We tackle two major hurdles: a) SAM’s embedding limitations in associating objects across frames, and b) granularity inconsistencies in object segmentation. To this end, we introduce VideoSAM, an end-to-end framework designed to address these challenges by improving object tracking and segmentation consistency in dynamic environments. VideoSAM integrates an agglomerated backbone, RADIO, enabling object association through similarity metrics and introduces Cycle-ack-Pairs Propagation with a memory mechanism for stable object tracking. Additionally, we incorporate an autoregressive object-token mechanism within the SAM decoder to maintain consistent granularity across frames. Our experiments on the UVO and BURST benchmark, and also robotic videos, demonstrate VideoSAM’s effectiveness and robustness in real-world scenarios. All codes will be available.
|
|
08:55-09:00, Paper ThAT4.6 | |
Monocular Depth Estimation and Segmentation for Transparent Object with Iterative Semantic and Geometric Fusion |
|
Liu, Jiangyuan | University of Chinese Academy of Sciences |
Ma, Hongxuan | Institute of Automation, Chinese Academy of Sciences |
Guo, Yuxin | University of Chinese Academy of Sciences |
Zhao, Yuhao | Institute of Automation, Chinese Academy of Sciences |
Zhang, Chi | Shijiazhuang Tiedao University |
Sui, Wei | Soochow University |
Zou, Wei | Chinese Academy of Sciences, University of Chinese Academy of Sciences |
Keywords: Deep Learning for Visual Perception, Perception for Grasping and Manipulation, Object Detection, Segmentation and Categorization
Abstract: Transparent object perception is indispensable for numerous robotic tasks. However, accurately segmenting and estimating the depth of transparent objects remain challenging due to complex optical properties. Existing methods primarily delve into only one task using extra inputs or specialized sensors, neglecting the valuable interactions among tasks and the subsequent refinement process, leading to suboptimal and blurry predictions. To address these issues, we propose a monocular framework, which is the first to excel in both segmentation and depth estimation of transparent objects, with only a single-image input. Specifically, we devise a novel semantic and geometric fusion module, effectively integrating the multi-scale information between tasks. In addition, drawing inspiration from human perception of objects, we further incorporate an iterative strategy, which progressively refines initial features for clearer results. Experiments on two challenging synthetic and real-world datasets demonstrate that our model surpasses state-of-the-art monocular, stereo, and multi-view methods by a large margin of about 38.8%-46.2% with only a single RGB input. Codes and models are publicly available at https://github.com/L-J-Yuan/MODEST.
|
|
ThAT5 |
305 |
Planning and Control for Legged Robots 3 |
Regular Session |
Chair: Qian, Feifei | University of Southern California |
Co-Chair: Marchionni, Luca | Pal Robotics SL |
|
08:30-08:35, Paper ThAT5.1 | |
Obstacle-Aided Trajectory Control of a Quadrupedal Robot through Sequential Gait Composition |
|
Hu, Haodi | University of Southern California |
Qian, Feifei | University of Southern California |
Keywords: Legged Robots, Biologically-Inspired Robots, Dynamics, Rough Terrain Locomotion
Abstract: Modeling and controlling legged robot locomotion on terrains with densely distributed large rocks and boulders are fundamentally challenging. Unlike traditional methods which often consider these rocks and boulders as obstacles and attempt to find a clear path to circumvent them, in this study we aim to develop methods for robots to actively utilize interaction forces with these "obstacles" for locomotion and navigation. To do so, we studied the locomotion of a quadrupedal robot as it traversed a simplified obstacle field, and discovered that with different gaits, the robot could passively converge to distinct orientations. A compositional return map explained this observed passive convergence, and enabled theoretical prediction of the steady-state orientation angles for any given quadrupedal gait. We experimentally demonstrated that with these predictions, a legged robot could effectively generate desired shape of trajectories amongst large, slippery obstacles, simply by switching between different gaits. Our study offered a novel method for robots to exploit traditionally-considered "obstacles" to achieve agile movements on challenging terrains.
|
|
08:35-08:40, Paper ThAT5.2 | |
Enhancing Navigation Efficiency of Quadruped Robots Via Leveraging Personal Transportation Platforms |
|
Yoon, Minsung | Korea Advanced Institute of Science and Technology (KAIST) |
Yoon, Sung-eui | KAIST |
Keywords: Reinforcement Learning, Legged Robots
Abstract: Quadruped robots face limitations in long-range navigation efficiency due to their reliance on legs. To ameliorate the limitations, we introduce a Reinforcement Learning-based Active Transporter Riding method (RL-ATR), inspired by humans' utilization of personal transporters, including Segways. The RL-ATR features a transporter riding policy and two state estimators. The policy devises adequate maneuvering strategies according to transporter-specific control dynamics, while the estimators resolve sensor ambiguities in non-inertial frames by inferring unobservable robot and transporter states. Comprehensive evaluations in simulation validate proficient command tracking abilities across various transporter-robot models and reduced energy consumption compared to legged locomotion. Moreover, we conduct ablation studies to quantify individual component contributions within the RL-ATR. This riding ability could broaden the locomotion modalities of quadruped robots, potentially expanding the operational range and efficiency.
|
|
08:40-08:45, Paper ThAT5.3 | |
Continuous Control of Diverse Skills in Quadruped Robots without Complete Expert Datasets |
|
Tu, Jiaxin | FuDan University |
Wei, Xiaoyi | Fudan University |
Zhang, Yueqi | Fudan University |
Hou, Taixian | FuDan University |
Gao, Xiaofei | Beijing Zhitong Robot Technology Co., Ltd |
Dong, Zhiyan | Fudan University |
Zhai, Peng | Fudan University |
Zhang, Lihua | Fudan University |
Keywords: Legged Robots, Reinforcement Learning
Abstract: Learning diverse skills for quadruped robots presents significant challenges, such as mastering complex transitions between different skills and handling tasks of varying difficulty. Existing imitation learning methods, while successful, rely on expensive datasets to reproduce expert behaviors. Inspired by introspective learning, we propose Progressive Adversarial Self-Imitation Skill Transition (PASIST), a novel method that eliminates the need for complete expert datasets. PASIST autonomously explores and selects high-quality trajectories based on predefined target poses instead of demonstrations, leveraging the Generative Adversarial Self-Imitation Learning (GASIL) framework. To further enhance learning, we develop a skill selection module to mitigate mode collapse by balancing the weights of skills with varying levels of difficulty. Through these methods, PASIST is able to reproduce skills corresponding to the target pose while achieving smooth and natural transitions between them. Evaluations on both simulation platforms and the Solo 8 robot confirm the effectiveness of PASIST, offering an efficient alternative to expert-driven learning.
|
|
08:45-08:50, Paper ThAT5.4 | |
PIP-Loco: A Proprioceptive Infinite Horizon Planning Framework for Quadrupedal Robot Locomotion |
|
Shirwatkar, Aditya | Indian Institute of Science Bengaluru |
Saxena, Naman | Indian Institute of Science, Bengaluru |
Chandra, Kishore P | Visvesvaraya National Institute of Technology, Nagpur |
Kolathaya, Shishir | Indian Institute of Science |
Keywords: Legged Robots, Reinforcement Learning, Machine Learning for Robot Control
Abstract: A core strength of Model Predictive Control (MPC) for quadrupedal locomotion has been its ability to enforce constraints and provide interpretability of the sequence of commands over the horizon. However, despite being able to plan, MPC struggles to scale with task complexity, often failing to achieve robust behavior on rapidly changing surfaces. On the other hand, model-free Reinforcement Learning (RL) methods have outperformed MPC on multiple terrains, showing emergent motions but inherently lack any ability to handle constraints or perform planning. To address these limitations, we propose a framework that integrates proprioceptive planning with RL, allowing for agile and safe locomotion behaviors through the horizon. Inspired by MPC, we incorporate an internal model that includes a velocity estimator and a Dreamer module. During training, the framework learns an expert policy and an internal model that are co-dependent, facilitating exploration for improved locomotion behaviors. During deployment, the Dreamer module solves an infinite-horizon MPC problem, adapting actions and velocity commands to respect the constraints. We validate the robustness of our training framework through ablation studies on internal model components and demonstrate improved robustness to training noise. Finally, we evaluate our approach across multi-terrain scenarios in both simulation and hardware.
|
|
08:50-08:55, Paper ThAT5.5 | |
Whole-Body End-Effector Pose Tracking |
|
Portela, Tifanny | ETH |
Cramariuc, Andrei | ETHZ |
Mittal, Mayank | ETH Zurich |
Hutter, Marco | ETH Zurich |
Keywords: Whole-Body Motion Planning and Control, Reinforcement Learning, Legged Robots
Abstract: Combining manipulation with the mobility of legged robots is essential for a wide range of robotic applications. However, integrating an arm with a mobile base significantly increases the system’s complexity, making precise end-effector control challenging. Existing model-based approaches are often constrained by their modeling assumptions, leading to limited robustness. Meanwhile, recent Reinforcement Learning (RL) implementations restrict the arm’s workspace to be in front of the robot or track only the position to obtain decent tracking accuracy. In this work, we address these limitations by introducing a whole-body RL formulation for end-effector pose tracking in a large workspace on rough, unstructured terrains. Our proposed method involves a terrain-aware sampling strategy for the robot’s initial configuration and end-effector pose commands, as well as a game-based curriculum to extend the robot’s operating range. We validate our approach on the ANYmal quadrupedal robot with a six DoF robotic arm. Through our experiments, we show that the learned controller achieves precise command tracking over a large workspace and adapts across varying terrains such as stairs and slopes. On deployment, it achieves a pose-tracking error of 2.64 cm and 3.64°, outperforming existing competitive baselines.
|
|
08:55-09:00, Paper ThAT5.6 | |
MoRE : Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models |
|
Zhao, Han | Westlake University |
Song, Wenxuan | Westlake University |
Wang, Donglin | Westlake University |
Tong, Xinyang | Westlake University |
Ding, Pengxiang | Westlake University |
Cheng, Xuelian | Monash University |
Ge, Zongyuan | Monash University |
Keywords: Perception-Action Coupling, Legged Robots, Reinforcement Learning
Abstract: Developing versatile quadruped robots that can smoothly perform various actions and tasks in real-world environments remains a significant challenge. This paper introduces a novel vision-language-action (VLA) model, mixture of robotic experts (MoRE), for quadruped robots, which aims to introduce reinforcement learning (RL) for fine-tuning large-scale VLA models with a large amount of mixed-quality data. MoRE integrates multiple low-rank adaptation modules as distinct experts within a dense multi-modal large language model (MLLM), forming a sparse-activated mixture of experts model. This design enables the model to effectively adapt to a wide array of downstream tasks. Moreover, we employ a reinforcement learning-based training objective to train our model as a Q-function after deeply exploring the structural properties of our tasks. Effective learning from automatically collected mixed-quality data enhances data efficiency and model performance. Extensive experiments demonstrate that MoRE outperforms all baselines across six different skills and exhibits superior generalization capabilities in out-of-distribution scenarios. We further validate our method in real-world scenarios, confirming the practicality of our approach and laying a solid foundation for future research on multi-task learning in quadruped robots.
|
|
ThAT6 |
307 |
Perception for Human-Robot Interaction |
Regular Session |
Chair: Alami, Rachid | CNRS |
Co-Chair: Fu, Di | University of Surrey |
|
08:30-08:35, Paper ThAT6.1 | |
From Seeing to Recognising -- an Extended Self-Organizing Map for Human Postures Identification |
|
He, Xin | Graduate School of Information, Production and Systems, Waseda University |
Zielinska, Teresa | Warsaw University of Technology |
Dutta, Vibekananda | Warsaw University of Technology |
Matsumaru, Takafumi | Waseda University |
Sitnik, Robert | Warsaw University of Technology |
Keywords: Human-Centered Robotics, Human-Aware Motion Planning, Human and Humanoid Motion Analysis and Synthesis
Abstract: The article presents a dedicated method for recognizing human postures using classification and clustering options. The ultimate goal of the research is to recognise human actions based on posture sequences. Such a task imposes expectations on the developed method. For this purpose, a Sparse Autoencoder combined with a Self-Organizing Map (SOM) is proposed. The SOM is equipped with an additional layer for post-labeling or clustering. This entire structure is called the extended SOM. Two task-oriented modifications are applied to improve SOM performance -- a dedicated angular distance measure and a neighbourhood function for updating the SOM weights. The research contribution is the concept of the extended SOM, which is trained using unlabeled data and classifies or clusters the human postures. The Sparse Autoencoder preserves the characteristics of the data while reducing its dimensionality. Better classification efficiency of the developed method is demonstrated compared to other representative methods. Ablation studies illustrate how the introduced modifications improve classification results. The developed method is characterised by good resolution in distinguishing postures. A discussion of the concept's usefulness is provided at the end of the article.
|
|
08:35-08:40, Paper ThAT6.2 | |
MmDEAR: MmWave Point Cloud Density Enhancement for Accurate Human Body Reconstruction |
|
Yang, Jiarui | Shanghai Jiao Tong University |
Xia, Songpengcheng | Shanghai Jiao Tong University |
Lai, Zengyuan | Shanghai Jiao Tong University |
Sun, Lan | Shanghai Jiao Tong University |
Wu, Qi | Shanghai Jiao Tong University |
Yu, Wenxian | Shanghai Jiao Tong University |
Pei, Ling | Shanghai Jiao Tong University |
Keywords: Human Detection and Tracking, Human and Humanoid Motion Analysis and Synthesis
Abstract: Millimeter-wave (mmWave) radar offers robust sensing capabilities in diverse environments, making it a highly promising solution for human body reconstruction due to its privacy-friendly and non-intrusive nature. However, the significant sparsity of mmWave point clouds limits the estimation accuracy. To overcome this challenge, we propose a two-stage deep learning framework that enhances mmWave point clouds and improves human body reconstruction accuracy. Our method includes a mmWave point cloud enhancement module that densifies the raw data by leveraging temporal features and a multi-stage completion network, followed by a 2D-3D fusion module that extracts both 2D and 3D motion features to refine SMPL parameters. The mmWave point cloud enhancement module learns the detailed shape and posture information from 2D human masks in single-view images. However, image-based supervision is involved only during the training phase, and the inference relies solely on sparse point clouds to maintain privacy. Experiments on multiple datasets demonstrate that our approach outperforms state-of-the-art methods, with the enhanced point clouds further improving performance when integrated into existing models.
|
|
08:40-08:45, Paper ThAT6.3 | |
Human Activity Recognition by Using Enhanced Radar Point Cloud 2D Histograms and Doppler Feature Fusion |
|
Liao, Guanghang | Great Bay University |
Ma, Jieming | Harbin Institute of Technology, Shenzhen |
Luo, Fei | Great Bay University |
Keywords: Human-Centered Robotics, Gesture, Posture and Facial Expressions, Multi-Modal Perception for HRI
Abstract: Human activity recognition (HAR) based on millimeter wave (mmWave) radar has recently attracted significant interest due to its diverse applications in intelligent robots and human-computer interaction (HCI), including healthcare monitoring robots. 2-dimensional (2D) histogram features of radar point clouds have demonstrated high accuracy in HAR, but further expansion and refinement of this technique are needed. This paper presents a new precise non-invasive HAR framework based on radar point cloud 2D histograms. Our method enhances conventional 2D histograms by integrating fixed radar sensing boundaries into the histograms, which captures the relative spatial position changes of the target points detected by the radar. Additionally, we concatenate Doppler features (i.e., range-Doppler and angle-Doppler histograms) with the point cloud histograms, resulting in a more comprehensive feature representation than conventional point cloud histograms. We investigated the overfitting issue in stacked hybrid networks and established a multi-layer hybrid network with an optimal number of stacked layers for HAR. In the evaluation, our approach achieves state-of-the-art accuracy, with 99.72% on the mmWaveRadarWalking dataset and 98.67% on the CI4R-Human-Activity-Recognition dataset. The proposed method can be applied in the fields of robotics and HCI.
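A minimal sketch of building fixed-boundary point cloud histograms concatenated with a Doppler histogram; the bounds, bin counts, and normalization below are illustrative, not the paper's settings.

import numpy as np

def radar_frame_features(points, bounds=((-3, 3), (0, 6), (-2, 2)), bins=32, v_max=5.0):
    """points: (N,4) array of (x, y, z, doppler). Fixed bounds keep relative spatial position."""
    x, y, z, v = points.T
    h_xy, _, _ = np.histogram2d(x, y, bins=bins, range=(bounds[0], bounds[1]))
    h_xz, _, _ = np.histogram2d(x, z, bins=bins, range=(bounds[0], bounds[2]))
    h_v, _ = np.histogram(v, bins=bins, range=(-v_max, v_max))
    feats = np.concatenate([h_xy.ravel(), h_xz.ravel(), h_v])
    return feats / max(feats.sum(), 1.0)             # simple normalization for network input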
|
|
08:45-08:50, Paper ThAT6.4 | |
Estimating User Engagement in Human Robot Interaction Using a Dynamic Bayesian Network |
|
Hei, Xiaoxuan | ENSTA Paris, Institut Polytechnique De Paris |
Zhang, Heng | ENSTA Paris, Institut Polytechnique De Paris |
Tapus, Adriana | ENSTA Paris, Institut Polytechnique De Paris |
Keywords: Multi-Modal Perception for HRI, Robot Companions, Social HRI
Abstract: Engagement is a key concept in Human-Robot Interaction (HRI), as high engagement often leads to improved user experience and task performance. However, accurately estimating engagement during interactions is challenging. In this study, we propose a Dynamic Bayesian Network (DBN) to infer user engagement from various modalities, including head rotation, eye movements, facial expressions captured through visual sensors, as well as facial temperature variations measured by a thermal camera. Data was gathered from a human-robot interaction (HRI) experiment, where a robot guided participants and encouraged them to share their thoughts and insights on environmental issues. Our approach successfully combines these diverse features to offer a thorough assessment of user engagement. The network was tested on its capacity to classify participants as either engaged or not engaged, achieving an accuracy of 0.83 and an Area Under the Curve (AUC) of 0.82. These findings underscore the strength of our DBN in detecting user engagement during interactions.
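A minimal sketch of a two-state dynamic Bayesian filtering step is given below to illustrate how several per-frame observation likelihoods can be fused into an engagement belief. The transition matrix and likelihood values are illustrative assumptions, not the learned parameters of the paper.

```python
import numpy as np

# states: 0 = not engaged, 1 = engaged
T = np.array([[0.9, 0.1],    # hypothetical transition model P(s_t | s_{t-1}), rows = previous state
              [0.2, 0.8]])

def step(belief, likelihoods):
    """One DBN-style filtering step: predict with the transition model,
    then multiply in the per-modality observation likelihoods and renormalize."""
    pred = T.T @ belief
    for lik in likelihoods:          # e.g. head rotation, gaze, expression, temperature
        pred = pred * lik
    return pred / pred.sum()

belief = np.array([0.5, 0.5])
# illustrative per-frame likelihoods P(obs | state) for two modalities
belief = step(belief, [np.array([0.3, 0.7]), np.array([0.4, 0.6])])
print(belief)
```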
|
|
08:50-08:55, Paper ThAT6.5 | |
HRI-Free: Cognitive Robotic Simulation for Evaluating Embodied Social Attention Models |
|
Abawi, Fares | Universität Hamburg |
Fu, Di | University of Surrey |
Keywords: Cognitive Modeling, Embodied Cognitive Science, Social HRI
Abstract: Scaling social robot studies is constrained by the need for human interaction, which makes large-scale participant recruitment impractical. Robotics simulators help mitigate this limitation but generally lack the realism to accurately simulate social cues. We introduce a cognitive robotic simulation scheme to evaluate social attention models in physical environments. By projecting ground-truth priority maps into a simulated environment, we can directly compare predicted maps using common saliency metrics. Using the iCub robot, we assess a dynamic scanpath model that predicts attention targets, simulating human scanpaths. Evaluations with the FindWho and MVVA datasets show strong correlations between robot-captured metrics and direct-streamed video metrics. Our results indicate robustness of the social attention model to noise and real-world conditions, suggesting its practical usability for predicting personalized scanpaths in real settings. This approach reduces the need for extensive human-robot interaction studies in the early stages of study design, enabling the scalability and reproducibility of social robot evaluations.
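As an illustration of the kind of saliency metric such a comparison relies on, the sketch below computes the Pearson correlation coefficient (CC) between a predicted priority map and a ground-truth map. The toy maps are placeholders; the paper's evaluation protocol is not reproduced.

```python
import numpy as np

def correlation_coefficient(pred_map, gt_map):
    """Pearson correlation (CC), a standard saliency metric, between a
    predicted priority map and a ground-truth one of the same shape."""
    p = (pred_map - pred_map.mean()) / (pred_map.std() + 1e-8)
    g = (gt_map - gt_map.mean()) / (gt_map.std() + 1e-8)
    return float((p * g).mean())

# toy maps standing in for the projected ground truth and the model prediction
gt = np.random.rand(60, 80)
pred = gt + 0.1 * np.random.randn(60, 80)   # noisy version of the ground truth
print(correlation_coefficient(pred, gt))
```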
|
|
08:55-09:00, Paper ThAT6.6 | |
An EEG Conformer Model for Error Feedback During Human-Robot Interaction |
|
Han, Jinpei | Imperial College London |
Li, Yinxuan | Imperial College London |
Gu, Xiao | University of Oxford |
Faisal, Aldo | Imperial College London |
Keywords: Brain-Machine Interfaces, Human Factors and Human-in-the-Loop, Intention Recognition
Abstract: Identifying a brain signal that enables the detection of incorrect execution in human-robot interaction (HRI) is considered a holy grail for real-time systems. A major challenge in achieving this is the inherent imbalance caused by the sparsity of error-related potential (ErrP) events in streaming electroencephalogram (EEG) data, which often leads models to learn irrelevant features and perform poorly. Moreover, while deep learning-based ErrP detection has seen considerable advancements, the variability in individual user reaction times introduces labelling errors, complicating model adaptation to new subjects. In addition, most deep learning methods are developed and validated on discrete, offline experiments using pre-defined windows, which fail to translate effectively to continuous, real-time HRI. Addressing these challenges is crucial to improving the robustness and adaptability of real-time ErrP detection in practical HRI applications. Here, we develop a causal EEG conformer framework, combining a convolutional neural network (CNN) encoder and a transformer with causal attention for real-time prediction of ErrP signals during HRI. We evaluated our ErrP model in a pseudo-online environment in both inter-session and inter-subject cross-validation settings for exoskeleton assistive robotics. Our model demonstrated superior performance in decoding accuracy and efficiency, showcasing better generalization for real-world dynamic HRI applications.
|
|
ThAT7 |
309 |
Marine Robotics 5 |
Regular Session |
Chair: Kelasidi, Eleni | NTNU |
Co-Chair: Chavez, Jalil | Purdue |
|
08:30-08:35, Paper ThAT7.1 | |
Cross-Platform Learning-Based Fault Tolerant Surfacing Controller for Underwater Robots |
|
Hamamatsu, Yuya | The University of Tokyo |
Remmas, Walid | Tallinn University of Technology / Université De Montpellier |
Rebane, Jaan | Tallinna Tehnikaülikool |
Kruusmaa, Maarja | Tallinn University of Technology (TalTech) |
Ristolainen, Asko | Tallinn University of Technology |
Keywords: Marine Robotics, Field Robots, Model Learning for Control
Abstract: In this paper, we propose a novel cross-platform fault-tolerant surfacing controller for underwater robots, based on reinforcement learning (RL). Unlike conventional approaches, which require explicit identification of malfunctioning actuators, our method allows the robot to surface using only the remaining operational actuators without needing to pinpoint the failures. The proposed controller learns a robust policy capable of handling diverse failure scenarios across different actuator configurations. Moreover, we introduce a transfer learning mechanism that shares a part of the control policy across various underwater robots with different actuators, thus improving learning efficiency and generalization across platforms. To validate our approach, we conduct simulations on three different types of underwater robots: a hovering-type AUV, a torpedo-shaped AUV, and a turtle-shaped robot (U-CAT). Additionally, real-world experiments are performed, successfully transferring the learned policy from simulation to a physical U-CAT in a controlled environment. Our RL-based controller demonstrates superior performance in terms of stability and success rate compared to a baseline controller, achieving an 85.7 percent success rate in real-world tests compared to 57.1 percent with the baseline. This research provides a scalable and efficient solution for fault-tolerant control for diverse underwater platforms, with potential applications in real-world aquatic missions.
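A rough sketch of how actuator failures might be randomized during training is shown below, assuming a generic environment with a reset/step interface. The wrapper interface, the failure counts, and the masking scheme are assumptions for illustration only, not the authors' training setup.

```python
import numpy as np

class ActuatorFailureWrapper:
    """Wraps an underwater-robot environment and zeroes out a random subset of
    thruster commands each episode, so the policy learns to surface with
    whatever actuators remain (no explicit fault identification)."""

    def __init__(self, env, n_actuators, max_failures=2, rng=None):
        self.env = env
        self.n = n_actuators
        self.max_failures = max_failures
        self.rng = rng or np.random.default_rng()
        self.mask = np.ones(self.n)

    def reset(self):
        k = self.rng.integers(0, self.max_failures + 1)
        failed = self.rng.choice(self.n, size=k, replace=False)
        self.mask = np.ones(self.n)
        self.mask[failed] = 0.0        # these thrusters produce no force this episode
        return self.env.reset()

    def step(self, action):
        return self.env.step(np.asarray(action) * self.mask)
```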
|
|
08:35-08:40, Paper ThAT7.2 | |
Optimizing Underwater Robot Navigation: A Study of DRL Algorithms and Multi-Modal Sensor Fusion |
|
Deowan, Md Ether | University of Toulon |
Yousha, Md Shamin Yeasher | Norwegian University of Science and Technology - NTNU |
Hossain, Tihan Mahmud | Norwegian University of Science and Technology - NTNU |
Hassan, Shahriar | Instituto Superior Técnico |
Marxer, Ricard | Université De Toulon, Aix Marseille Univ, CNRS, LIS |
Keywords: Marine Robotics, Autonomous Agents, Reinforcement Learning
Abstract: Autonomous underwater navigation faces significant challenges due to the complexity of the environment, limited localization methods, and poor visibility. This paper investigates the performance of various reinforcement learning (RL) algorithms—Proximal Policy Optimization (PPO), Trust Region Policy Optimization (TRPO), Soft Actor-Critic (SAC), Twin Delayed DDPG (TD3), and Advantage Actor-Critic (A2C)—to improve navigation capabilities of low-cost underwater robots equipped with multi-modal sensors. Advanced depth estimation models such as MiDaS and Depth Anything, combined with domain randomization techniques, are employed to enhance the system's robustness and generalization across varying underwater conditions. The proposed approach integrates real-time sensor data and historical actions to enable 3D maneuvering in simulated environments, leading to significant improvements in sensor fusion, depth perception, and obstacle avoidance. Simulation results demonstrate that the combination of RL techniques with sensor fusion considerably improves mapless autonomous underwater exploration, providing a robust solution for navigating unstructured aquatic environments.
|
|
08:40-08:45, Paper ThAT7.3 | |
PUGS: Perceptual Uncertainty for Grasp Selection in Underwater Environments |
|
Bagoren, Onur | University of Michigan |
Micatka, Marc | University of Washington |
Skinner, Katherine | University of Michigan |
Marburg, Aaron | University of Washington |
Keywords: Marine Robotics, Perception for Grasping and Manipulation
Abstract: When navigating and interacting in challenging environments where sensory information is imperfect and incomplete, robots must make decisions that account for these shortcomings. We propose a novel method for quantifying and representing such perceptual uncertainty in 3D reconstruction through occupancy uncertainty estimation. We develop a framework to incorporate it into grasp selection for autonomous manipulation in underwater environments. Instead of treating each measurement equally when deciding which location to grasp from, we present a framework that propagates uncertainty inherent in the multi-view reconstruction process into the grasp selection. We evaluate our method with both simulated and the real world data, showing that by accounting for uncertainty, the grasp selection becomes robust against partial and noisy measurements. Code will be made available at https://onurbagoren.github.io/PUGS/
|
|
08:45-08:50, Paper ThAT7.4 | |
Learning to Swim: Reinforcement Learning for 6-DOF Control of Thruster-Driven Autonomous Underwater Vehicles |
|
Cai, Levi | Massachusetts Institute of Technology |
Chang, Kevin | Oregon State University |
Girdhar, Yogesh | Woods Hole Oceanographic Institution |
Keywords: Field Robots, Marine Robotics, Reinforcement Learning
Abstract: Controlling AUVs can be challenging because of the effect of complex non-linear hydrodynamic forces acting on the robot, which are significant in water and cannot be ignored. The problem is exacerbated for small AUVs for which the dynamics can change significantly with payload changes and deployments under different hydrodynamic conditions. The common approach to AUV control is a combination of passive stabilization with added buoyancy on top and weights on the bottom, and a PID controller tuned for simple and smooth motion primitives. However, the approach comes at the cost of sluggish controls and often the need to re-tune controllers with configuration changes. In this paper, we propose a fast (trainable in minutes), reinforcement learning-based approach for full 6-degree-of-freedom (DOF) control of thruster-driven AUVs, mapping 6-DOF command-conditioned inputs directly to thruster outputs. We present a new, highly parallelized simulator for underwater vehicle dynamics. We demonstrate this approach through zero-shot sim-to-real (with no tuning) transfer onto a real AUV that produces comparable results to hand-tuned PID controllers. Furthermore, we show that domain randomization on the simulator produces policies that are robust to small variations in the vehicle's physical parameters.
|
|
08:50-08:55, Paper ThAT7.5 | |
Underwater Motions Analysis and Control of a Coupling-Tiltable Unmanned Aerial-Aquatic Vehicle |
|
Huang, Dongyue | The Chinese University of Hong Kong |
Dou, Minghao | The Chinese University of Hong Kong |
Liu, Xuchen | The Chinese University of Hong Kong |
Sun, Tao | Shenzhen Institute of Artificial Intelligence and Robotics for S |
Zhang, Jianguo | Shenzhen Institute of Artificial Intelligence and Robotics for S |
Ding, Ning | The Chinese University of Hong Kong, Shenzhen |
Chen, Xinlei | Tsinghua University |
Chen, Ben M. | Chinese University of Hong Kong |
Keywords: Marine Robotics, Aerial Systems: Mechanics and Control, Motion Control
Abstract: Coupling-Tiltable Unmanned Aerial-Aquatic Vehicles (UAAVs) have gained increasing importance, yet lack comprehensive analysis and suitable controllers. This paper analyzes the underwater motion characteristics of a self-designed UAAV, Mirs-Alioth, and designs a controller for it. The effectiveness of the controller is validated through experiments. The singularities of Mirs-Alioth are derived as the Singular Thrust Tilt Angle (STTA), which serves as an essential tool for analyzing its underwater motion characteristics. The analysis reveals several key factors for designing the controller. These include the need for logic switching, using a Nussbaum function to compensate for control direction uncertainty in the auxiliary channel, and employing an auxiliary controller to mitigate coupling effects. Based on these key points, a control scheme is designed. It consists of a controller that regulates the thrust tilt angle to the singular value, an auxiliary controller incorporating a Saturated Nussbaum function, and a logic switch. Finally, two sets of experiments are conducted to validate the effectiveness of the controller and demonstrate the necessity of the Nussbaum function.
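To illustrate the role of a Nussbaum function when the sign of the control gain is unknown, the sketch below uses the common choice N(k) = k^2 cos(k) with the standard adaptation k_dot = e^2 on a toy scalar plant. The saturated variant and the full control scheme of the paper are not reproduced; the gains and the plant are illustrative assumptions.

```python
import numpy as np

def nussbaum(k):
    """A commonly used Nussbaum-type function, N(k) = k^2 * cos(k); its running
    average over k takes arbitrarily large positive and negative values, which
    lets the controller probe both control directions when the sign of the
    input gain is unknown."""
    return k**2 * np.cos(k)

# illustrative adaptation loop for a scalar error channel
dt, k, e = 0.01, 0.0, 1.0
for _ in range(1000):
    k += dt * e**2            # typical Nussbaum parameter update, k_dot = e^2
    u = nussbaum(k) * e       # control with unknown-direction compensation
    e += dt * (-0.8 * u)      # toy plant: e_dot = b*u with unknown-sign b = -0.8
print(e)                      # the error settles near zero despite the unknown sign
```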
|
|
08:55-09:00, Paper ThAT7.6 | |
Adaptive Integral Sliding Mode Control for Attitude Tracking of Underwater Robots with Large Range Pitch Variations in Confined Spaces |
|
Wang, Xiaorui | Peking University |
Sha, Zeyu | Peking University |
Zhang, Feitian | Peking University |
Keywords: Marine Robotics, Motion Control, Robust/Adaptive Control
Abstract: Underwater robots play a crucial role in exploring aquatic environments. The ability to flexibly adjust attitude, especially pitch, is essential for underwater robots to accomplish tasks effectively in confined spaces. However, the highly coupled six-degree-of-freedom dynamics caused by attitude changes and the complex turbulence in confined regions pose significant challenges. To address the attitude control problem of underwater robots, this paper studies large-range pitch angle tracking during station keeping, together with simultaneous roll and yaw angle control, to achieve versatile attitude adjustment. Based on dynamic modeling, an adaptive integral sliding mode controller (AISMC) is proposed, which integrates an integral module into conventional sliding mode control (SMC) and adaptively adjusts the switching gain to improve tracking accuracy, reduce chattering, and enhance robustness.
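A minimal sketch of an adaptive integral sliding mode law for a single attitude channel is given below, assuming toy double-integrator error dynamics. The sliding surface s = e_dot + lam*e + ki*int(e), the tanh switching term, and all gains are illustrative choices, not the controller derived in the paper.

```python
import numpy as np

# integral sliding surface for one attitude channel (e.g., pitch error):
#   s = e_dot + lam*e + ki*integral(e)
# adaptive switching gain: rho_dot = gamma*|s|, control u = -k*s - rho*switch(s)
lam, ki, k, gamma, dt = 2.0, 1.0, 3.0, 0.5, 0.002
e, e_dot, e_int, rho = 0.6, 0.0, 0.0, 0.1

for i in range(5000):
    e_int += e * dt
    s = e_dot + lam * e + ki * e_int
    u = -k * s - rho * np.tanh(s / 0.05)     # tanh as a chattering-reducing switch
    rho += gamma * abs(s) * dt               # adaptive gain grows with |s|
    d = 0.3 * np.sin(0.5 * i * dt)           # bounded disturbance (illustrative)
    e_ddot = u + d                           # toy double-integrator error dynamics
    e_dot += e_ddot * dt
    e += e_dot * dt

print(round(e, 4), round(rho, 3))            # small residual error, adapted gain
```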
|
|
ThAT8 |
311 |
Aerial Robots: Learning 1 |
Regular Session |
Chair: Yim, Mark | University of Pennsylvania |
Co-Chair: Jagannatha Sanket, Nitin | Worcester Polytechnic Institute |
|
08:30-08:35, Paper ThAT8.1 | |
Learning Local Urban Wind Flow Fields from Range Sensing |
|
Folk, Spencer | University of Pennsylvania |
Melton, John | NASA Ames Research Center |
Margolis, Benjamin W. L. | NASA Ames Research Center |
Yim, Mark | University of Pennsylvania |
Kumar, Vijay | University of Pennsylvania |
Keywords: Aerial Systems: Perception and Autonomy, Deep Learning Methods, Automation Technologies for Smart Cities
Abstract: Obtaining accurate and timely predictions of the wind through an urban environment is a challenging task, but has wide-ranging implications for the safety and efficiency of autonomous aerial vehicles in future urban airspaces. Prior work relies strongly on global information about the environment, such as a precise map of the city and in-situ wind measurements at various locations, to run expensive computational fluid dynamics solvers to predict the entire wind flow field. In contrast, this paper introduces a new method to estimate the wind flow field in a region around the robot in real time, utilizing on-board range measurements to sense nearby buildings and sparse wind measurements to infer windspeed and direction. We propose that this information sufficiently characterizes the structure of the wind flow field in the local region of interest. To that end, we introduce a deep learning-based approach to predict local flow fields from range measurements. Our results indicate that a neural network trained on numerous simulated winds through small randomized maps is capable of reconstructing local wind flows while generalizing to larger environments with over 200 buildings. This contribution empowers computationally-constrained aerial robots to reason about the structure of local wind flow fields, thereby enabling new planning, control, and estimation strategies in windy urban environments without a priori knowledge of the map.
|
|
08:35-08:40, Paper ThAT8.2 | |
Whole-Body Control through Narrow Gaps from Pixels to Action |
|
Wu, Tianyue | Zhejiang University |
Chen, Yeke | Zhejiang University |
Chen, Tianyang | Zhejiang University |
Zhao, Guangyu | Zhejiang University |
Gao, Fei | Zhejiang University |
Keywords: Aerial Systems: Applications, Sensorimotor Learning, Reinforcement Learning
Abstract: Flying through body-size narrow gaps in the environment is one of the most challenging moments for an underactuated multirotor. We explore a purely data-driven method to master this flight skill in simulation, where a neural network directly maps pixels and proprioception to continuous low-level control commands. This learned policy enables whole-body control through gaps with different geometries demanding sharp attitude changes (e.g., near-vertical roll angle). The policy is achieved by successive model-free reinforcement learning (RL) and online observation space distillation. The RL policy receives (virtual) point clouds of the gaps' edges for scalable simulation and is then distilled into the high-dimensional pixel space. However, this flight skill is fundamentally expensive to learn by exploration in RL due to the restricted feasible solution space. We propose to reset the agent to states on trajectories generated by a model-based trajectory optimizer to alleviate this problem. The presented training pipeline is compared with baseline methods, and ablation studies are conducted to identify the key ingredients of the method. The immediate next step is to scale up the variation of gap sizes and geometries in anticipation of emergent policies and to demonstrate sim-to-real transfer.
|
|
08:40-08:45, Paper ThAT8.3 | |
VisFly: An Efficient and Versatile Simulator for Training Vision-Based Flight |
|
Li, Fanxing | Shanghai Jiao Tong University |
Sun, Fangyu | Shanghai Jiaotong University |
Zhang, Tianbao | Shanghai Jiao Tong University |
Zou, Danping | Shanghai Jiao Tong University |
Keywords: Aerial Systems: Perception and Autonomy, Simulation and Animation, Visual Learning
Abstract: We present VisFly, a quadrotor simulator designed to efficiently train vision-based flight policies using reinforcement learning algorithms. VisFly offers a user-friendly framework and interfaces, leveraging Habitat-Sim's rendering engines to achieve frame rates exceeding 10,000 frames per second for rendering motion and sensor data. The simulator incorporates differentiable physics and is seamlessly wrapped with the Gym environment, facilitating the straightforward implementation of various learning algorithms. It supports directly importing open-source scene datasets compatible with Habitat-Sim, enabling training on diverse real-world environments simultaneously. To validate our simulator, we also provide three reinforcement learning examples for typical flight tasks relying on visual observations. The simulator is now available at [https://github.com/SJTU-ViSYS-team/VisFly].
|
|
08:45-08:50, Paper ThAT8.4 | |
Environment As Policy: Learning to Race in Unseen Tracks |
|
Wang, Hongze | ETH Zurich |
Xing, Jiaxu | University of Zurich |
Messikommer, Nico | University of Zurich |
Scaramuzza, Davide | University of Zurich |
Keywords: Aerial Systems: Perception and Autonomy, Reinforcement Learning, AI-Enabled Robotics
Abstract: Reinforcement learning (RL) has achieved outstanding success in complex robot control tasks, such as drone racing, where the RL agents have outperformed human champions in a known racing track. However, these agents fail in unseen track configurations, always requiring complete retraining when presented with new track layouts. This work aims to develop RL agents that generalize effectively to novel track configurations without retraining. The naive solution of training directly on a diverse set of track layouts can overburden the agent, resulting in suboptimal policy learning as the increased complexity of the environment impairs the agent’s ability to learn to fly. To enhance the generalizability of the RL agent, we propose an adaptive environment-shaping framework that dynamically adjusts the training environment based on the agent’s performance. We achieve this by leveraging a secondary RL policy to design environments that strike a balance between being challenging and achievable, allowing the agent to adapt and improve progressively. Using our adaptive environment shaping, one single racing policy efficiently learns to race in diverse and challenging tracks. Experimental results validated in both simulation and the real world show that our method enables drones to successfully fly complex and unseen race tracks, outperforming existing environment-shaping techniques.
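The sketch below illustrates the general idea of environment shaping with a simple proportional difficulty rule; in the paper this adjustment is itself learned by a secondary RL policy, so the rule, thresholds, and the simulated success rates here are purely illustrative assumptions.

```python
import random

def shape_track(success_rate, difficulty, step=0.05):
    """Toy environment-shaping rule: make the track harder when the racing policy
    succeeds often, easier when it fails often, keeping training near the edge
    of the agent's ability."""
    if success_rate > 0.7:
        difficulty = min(1.0, difficulty + step)
    elif success_rate < 0.3:
        difficulty = max(0.0, difficulty - step)
    return difficulty

difficulty = 0.2
for epoch in range(50):
    # hypothetical evaluation of the current policy on tracks of this difficulty
    success_rate = max(0.0, min(1.0, random.gauss(0.8 - 0.5 * difficulty, 0.1)))
    difficulty = shape_track(success_rate, difficulty)
print(difficulty)
```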
|
|
08:50-08:55, Paper ThAT8.5 | |
UAV-Assisted Self-Supervised Terrain Awareness for Off-Road Navigation |
|
Fortin, Jean-Michel | Université Laval |
Gamache, Olivier | Université Laval |
Fecteau, William | Université Laval |
Daum, Effie | Université Laval |
Larrivée-Hardy, William | Laval University |
Pomerleau, Francois | Université Laval |
Giguère, Philippe | Université Laval |
Keywords: Field Robots, Learning from Experience, Multi-Robot Systems
Abstract: Terrain awareness is an essential milestone to enable truly autonomous off-road navigation. Accurately predicting terrain characteristics allows optimizing a vehicle's path against potential hazards. Recent methods use deep neural networks to predict traversability-related terrain properties in a self-supervised manner, relying on proprioception as a training signal. However, onboard cameras are inherently limited by their point-of-view relative to the ground, suffering from occlusions and vanishing pixel density with distance. This paper introduces a novel approach for self-supervised terrain characterization using an aerial perspective from a hovering drone. We capture terrain-aligned images while sampling the environment with a ground vehicle, effectively training a simple predictor for vibrations, bumpiness, and energy consumption. Our dataset includes 2.8 km of off-road data collected in a forest environment, comprising 13 484 ground-based images and 12 935 aerial images. Our findings show that drone imagery improves terrain property prediction by 21.37% on the whole dataset and 37.35% in high vegetation, compared to ground images. We conduct ablation studies to identify the main causes of these performance improvements. We also demonstrate the real-world applicability of our approach by scouting an unseen area with a drone, planning and executing an optimized path on the ground.
|
|
08:55-09:00, Paper ThAT8.6 | |
EdgeFlowNet: 100FPS@1W Dense Optical Flow for Tiny Mobile Robots |
|
Pinnama Raju, Sai Ramana Kiran | Worcester Polytechnic Institute |
Singh, Rishabh | Worcester Polytechnic Institute |
Velmurugan, Manoj | Worcester Polytechnic Institute |
Jagannatha Sanket, Nitin | Worcester Polytechnic Institute |
Keywords: Aerial Systems: Perception and Autonomy, Deep Learning for Visual Perception, Vision-Based Navigation
Abstract: Optical flow estimation is a critical task for tiny mobile robotics to enable safe and accurate navigation, obstacle avoidance, and other functionalities. However, optical flow estimation on tiny robots is challenging due to limited onboard sensing and computation capabilities. In this paper, we propose EdgeFlowNet, a high-speed, low-latency dense optical flow approach for tiny autonomous mobile robots by harnessing the power of edge computing. We demonstrate the efficacy of our approach by deploying EdgeFlowNet on a tiny quadrotor to perform static obstacle avoidance, flight through unknown gaps and dynamic obstacle dodging. EdgeFlowNet is about 20X faster than the previous state-of-the-art approaches while improving accuracy by over 20% and using only 1.08W of power enabling advanced autonomy on palm-sized tiny mobile robots.
|
|
ThAT9 |
312 |
Multi-Robot Formation Control |
Regular Session |
Chair: Agarwal, Saurav | University of Pennsylvania |
Co-Chair: Parasuraman, Ramviyas | University of Georgia |
|
08:30-08:35, Paper ThAT9.1 | |
GMF: Gravitational Mass-Force Framework for Parametric Multi-Level Coordination in Multi-Robot and Swarm Robotic Systems |
|
Starks, Michael | University of Georgia Heterogeneous Robotics Research Lab |
Parasuraman, Ramviyas | University of Georgia |
Keywords: Multi-Robot Systems, Swarm Robotics, Cooperating Robots
Abstract: Distributed multi-robot coordination is critical to achieving reliable robotic missions that exploit the collective capability of swarm robots. In particular, the consensus and formation control problems have been extensively studied, resulting in distributed controllers that enable robots to rely only on information from themselves and their immediate neighbors. However, these algorithms are usually designed for specific objectives (e.g., cooperative object transportation, environmental coverage, etc.), requiring the controllers to be re-designed for domain variations. Therefore, we propose a new parametric framework inspired by gravitational fields that allows simultaneous coordination of robots at multiple levels, enabling generalization and domain adaptation. Our approach is built on top of a connectivity-preserving formation controller, with need-based and task-based ad hoc coordination at the private, local, and global layers of a swarm robot team. We demonstrate the remarkable potential of our framework through extensive simulations and real-world swarm robot experiments in three representative multi-robot tasks involving tight coordination: 1) robot-initiated rendezvous at different coordination layers, 2) coordinated boundary tracking and coverage of environmental processes, and 3) accommodating task executions and motion control while satisfying the coordination laws.
|
|
08:35-08:40, Paper ThAT9.2 | |
Leader-Follower Formation Control of Perturbed Nonholonomic Agents Along Parametric Curves with Directed Communication |
|
Zhang, Bin | The Hong Kong Polytechnic University |
Shao, Xiaodong | Beihang University |
Zhi, Hui | The Hong Kong Polytechnic University |
Qiu, Liuming | The Hong Kong Polytechnic University |
Romero Velazquez, Jose Guadalupe | ITAM |
Navarro-Alarcon, David | The Hong Kong Polytechnic University |
Keywords: Multi-Robot Systems, Motion Control, Nonholonomic Motion Planning
Abstract: In this letter, we propose a novel formation controller for nonholonomic agents to form general parametric curves. First, we derive a unified parametric representation for both open and closed curves. Then, a leader-follower formation controller is designed to drive agents to form the desired parametric curves using the curve coefficients as feedbacks. We consider directed communications and constant input disturbances rejection in the controller design. Rigorous Lyapunov-based stability analysis proves the asymptotic stability of the proposed controller. The convergence of the orientations of agents to some constant values is also guaranteed. The method has the potential to be extended to deal with various real-world applications, such as object enclosing. Detailed numerical simulations and experimental studies are conducted to verify the performance of the proposed method.
|
|
08:40-08:45, Paper ThAT9.3 | |
Versatile Distributed Maneuvering with Generalized Formations Using Guiding Vector Fields |
|
Lu, Yang | National University of Defense Technology |
Luo, Sha | University of Groningen |
Zhu, Pengming | National University of Defense Technology |
Yao, Weijia | Hunan University |
Garcia de Marina, Hector | Universidad De Granada |
Zhang, Xinglong | National University of Defense Technology |
Xu, Xin | National University of Defense Technology |
Keywords: Multi-Robot Systems, Motion Control, Distributed Robot Systems
Abstract: This paper presents a unified approach to realize versatile distributed maneuvering with generalized formations. Specifically, we decompose the robots' maneuvers into two independent components, i.e., interception and enclosing, which are parameterized by two independent virtual coordinates. Treating these two virtual coordinates as dimensions of an abstract manifold, we derive the corresponding singularity-free guiding vector field (GVF), which, along with a distributed coordination mechanism based on the consensus theory, guides robots to achieve various motions (i.e., versatile maneuvering), including (a) formation tracking, (b) target enclosing, and (c) circumnavigation. Additional motion parameters can generate more complex cooperative robot motions. Based on GVFs, we design a controller for a nonholonomic robot model. Besides the theoretical results, extensive simulations and experiments are performed to validate the effectiveness of the approach.
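A minimal sketch of a guiding vector field for a circular path is shown below: the commanded direction is the path tangent plus a correction proportional to the level-set error. The circle, gains, and the point-robot integration are illustrative; the manifold construction and coordination mechanism of the paper are not reproduced.

```python
import numpy as np

def gvf_circle(p, R=1.0, k=1.0):
    """Guiding vector field for following a circle of radius R: the field is the
    rotated gradient (path tangent) minus a correction proportional to the
    level-set error phi, i.e. v = E*grad(phi) - k*phi*grad(phi)."""
    x, y = p
    phi = x**2 + y**2 - R**2                 # zero on the desired path
    grad = np.array([2 * x, 2 * y])
    E = np.array([[0.0, -1.0], [1.0, 0.0]])  # 90-degree rotation
    v = E @ grad - k * phi * grad
    n = np.linalg.norm(v)
    return v / n if n > 1e-9 else v

# integrate a point robot along the field; it converges to and circulates the circle
p = np.array([2.0, 0.5])
for _ in range(2000):
    p = p + 0.01 * gvf_circle(p)
print(np.linalg.norm(p))   # close to R = 1.0
```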
|
|
08:45-08:50, Paper ThAT9.4 | |
Cooperative Distributed Model Predictive Control for Embedded Systems: Experiments with Hovercraft Formations |
|
Stomberg, Gösta | Hamburg University of Technology |
Schwan, Roland | EPFL |
Grillo, Andrea | EPFL |
Jones, Colin | École Polytechnique Fédérale De Lausanne (EPFL) |
Faulwasser, Timm | Hamburg University of Technology |
Keywords: Multi-Robot Systems, Optimization and Optimal Control, Cooperating Robots
Abstract: This paper presents experiments for embedded cooperative distributed model predictive control applied to a team of hovercraft floating on an air hockey table. The hovercraft collectively solve a centralized optimal control problem in each sampling step via a stabilizing decentralized real-time iteration scheme using the alternating direction method of multipliers. The efficient implementation does not require a central coordinator, executes onboard the hovercraft, and facilitates sampling intervals in the millisecond range. The formation control experiments showcase the flexibility of the approach on scenarios with point-to-point transitions, trajectory tracking, collision avoidance, and moving obstacles.
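To illustrate the coordinator-free consensus mechanism, the sketch below runs consensus ADMM on a toy scalar agreement problem. The per-agent data, penalty parameter, and quadratic costs are assumptions for illustration; the paper solves a full optimal control problem at every sampling step.

```python
import numpy as np

# Consensus ADMM on a toy problem: each agent i holds a private target a_i and
# all agents must agree on a common value z minimizing sum_i 0.5*(x_i - a_i)^2.
a = np.array([1.0, 4.0, 2.5, 3.5])        # hypothetical per-agent data
rho = 1.0
x = np.zeros_like(a)
u = np.zeros_like(a)
z = 0.0

for _ in range(50):
    x = (a + rho * (z - u)) / (1.0 + rho)  # local (per-agent) minimization
    z = np.mean(x + u)                     # agreement step (an average over agents)
    u = u + x - z                          # dual update, kept locally by each agent

print(z, a.mean())   # z converges to the average of the private targets
```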
|
|
08:50-08:55, Paper ThAT9.5 | |
Coordinated Multi-Robot Navigation with Formation Adaptation |
|
Deng, Zihao | University of Massachusetts Amherst |
Gao, Peng | North Carolina State University |
Jose, Williard Joshua | University of Massachusetts Amherst |
Reardon, Christopher M. | MITRE |
Wigness, Maggie | U.S. Army Research Laboratory |
Rogers III, John G. | US Army Research Laboratory |
Zhang, Hao | University of Massachusetts Amherst |
Keywords: Multi-Robot Systems, Machine Learning for Robot Control
Abstract: Coordinated multi-robot navigation is an essential ability for a team of robots operating in diverse environments. Robot teams often need to maintain specific formations, such as wedge formations, to enhance visibility, positioning, and efficiency during fast movement. However, complex environments such as narrow corridors challenge rigid team formations, which makes effective formation control difficult in real-world environments. To address this challenge, we introduce a novel Adaptive Formation with Oscillation Reduction (AFOR) approach to improve coordinated multi-robot navigation. We develop AFOR under the theoretical framework of hierarchical learning and integrate a spring-damper model with hierarchical learning to enable both team coordination and individual robot control. At the upper level, a graph neural network facilitates formation adaptation and information sharing among the robots. At the lower level, reinforcement learning enables each robot to navigate and avoid obstacles while maintaining the formations. We conducted extensive experiments using Gazebo in the Robot Operating System (ROS), a high-fidelity Unity3D simulator with ROS, and real robot teams. Results demonstrate that AFOR enables smooth navigation with formation adaptation in complex scenarios and outperforms previous methods. More details of this work are provided on the project website: https://hcrlab.gitlab.io/project/afor.
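The sketch below illustrates the lower-level spring-damper formation term with fixed wedge offsets; in AFOR the offsets and coordination are adapted by a graph neural network and reinforcement learning, so the gains and offsets here are illustrative assumptions only.

```python
import numpy as np

def formation_force(pos, vel, desired_offsets, leader_pos, k=2.0, c=1.5):
    """Spring-damper formation term: each robot is pulled toward its desired
    offset from the leader by a virtual spring and damped by its own velocity."""
    target = leader_pos + desired_offsets
    return -k * (pos - target) - c * vel

# wedge formation offsets for three followers behind a leader at the origin
offsets = np.array([[-1.0, -1.0], [-1.0, 1.0], [-2.0, 0.0]])
pos = np.random.randn(3, 2) * 3.0
vel = np.zeros((3, 2))
for _ in range(500):
    f = formation_force(pos, vel, offsets, leader_pos=np.zeros(2))
    vel += 0.02 * f
    pos += 0.02 * vel
print(np.round(pos, 2))   # approaches the desired wedge offsets
```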
|
|
ThAT10 |
313 |
Multi-Robot Systems 3 |
Regular Session |
Chair: Guo, Jia | Cornell University |
Co-Chair: Kim, Woojun | Carnegie Mellon University |
|
08:30-08:35, Paper ThAT10.1 | |
MARVEL: Multi-Agent Reinforcement Learning for Constrained Field-Of-View Multi-Robot Exploration in Large-Scale Environments |
|
Chiun, Jimmy | National University of Singapore |
Zhang, Shizhe | National University of Singapore |
Wang, Yizhuo | National University of Singapore |
Cao, Yuhong | National University of Singapore |
Sartoretti, Guillaume Adrien | National University of Singapore (NUS) |
Keywords: Multi-Robot Systems, Reinforcement Learning, Motion and Path Planning
Abstract: In multi-robot exploration, a team of mobile robots is tasked with efficiently mapping an unknown environment. While most exploration planners assume omnidirectional sensors like LiDAR, this is impractical for small robots such as drones, where lightweight, directional sensors like cameras may be the only option due to payload constraints. These sensors have a constrained field-of-view (FoV), which adds complexity to the exploration problem, requiring not only optimal robot positioning but also sensor orientation during movement. In this work, we propose MARVEL, a neural framework that leverages graph attention networks, together with a novel frontier and orientation feature fusion technique, to develop a collaborative, decentralized policy using multi-agent reinforcement learning (MARL) for robots with constrained FoV. To handle the large action space of viewpoint planning, we further introduce a novel information-driven action pruning strategy. MARVEL improves multi-robot coordination and decision-making in challenging large-scale indoor environments, while adapting to various team sizes and sensor configurations (i.e., FoV and sensor range) without additional training. Our extensive evaluation shows that MARVEL’s learned policies exhibit effective coordinated behaviors, outperforming state-of-the-art exploration planners across multiple metrics. We experimentally demonstrate MARVEL’s generalizability in large-scale environments, of up to 90m by 90m, and validate its practical applicability through successful deployment on a team of real drones.
|
|
08:35-08:40, Paper ThAT10.2 | |
RACE: A Fast and Lightweight Urban Exploration and Search Strategy for Multi-Robot Systems |
|
Leong, Jabez Kit | Singapore University of Technology and Design |
Soh, Gim Song | Singapore University of Technology and Design |
Keywords: Multi-Robot Systems, Search and Rescue Robots, Swarm Robotics
Abstract: Multi-Robot Systems (MRS) are increasingly deployed for hazardous tasks in urban environments. Among these tasks, search and rescue remains challenging, as it requires exploration of unknown, constrained indoor environments. For example, without global knowledge of the map of a building floor, it is not advantageous to choose one path over another at a corridor junction. Also, if the assigned frontiers are far from the robot, backtracking along a corridor will cost more than moving forward. Since exploration along corridors is similar to solving a maze, this paper examines classical maze-solving algorithms that are known to be computationally fast and lightweight, such as the Right Hand Rule (RHR) and Random Mouse (RM). The authors identify two gaps that need to be addressed before these algorithms can be applied to physical MRS. Firstly, these algorithms are not designed for cooperative exploration by multiple agents. Secondly, they are often applied only in low-fidelity simulation environments, and some work is needed to transfer them to the commonly used occupancy grid map representation. In this paper, the authors introduce RACE, a fast and lightweight collective urban exploration and search algorithm based on a modified and condensed version of the Ant Colony Optimization (ACO) algorithm. The proposed solution is verified in a low-fidelity simulation and evaluated against other exploration and search algorithms such as RHR and RM. An approach for transferring RACE from simulation to a physical implementation is presented, and a physical system evaluation is performed to compare RACE against a Rapidly-exploring Random Tree algorithm. Finally, the proposed solution is further verified with a physical experiment in which a quadrupedal robot is assigned to explore part of a floor of SUTD spanning approximately 55 m x 40 m. RACE also shows potential in handling challenging closed-loop and dead-end environments.
|
|
08:40-08:45, Paper ThAT10.3 | |
Reinforcement Learning Driven Multi-Robot Exploration Via Explicit Communication and Density-Based Frontier Search |
|
Calzolari, Gabriele | Luleå Tekniska Universitet |
Sumathy, Vidya | Luleå University of Technology |
Kanellakis, Christoforos | LTU |
Nikolakopoulos, George | Luleå University of Technology |
Keywords: Reinforcement Learning, Multi-Robot Systems, Cooperating Robots
Abstract: Collaborative multi-agent exploration of unknown environments is crucial for search and rescue operations. Effective real-world deployment must address challenges such as limited inter-agent communication and static and dynamic obstacles. This paper introduces a novel decentralized collaborative framework based on Reinforcement Learning to enhance multi-agent exploration in unknown environments. Our approach enables agents to decide their next action using an agent-centered field-of-view occupancy grid, and features extracted from A* algorithm-based trajectories to frontiers in the reconstructed global map. Furthermore, we propose a constrained communication scheme that enables agents to share their environmental knowledge efficiently, minimizing exploration redundancy. The decentralized nature of our framework ensures that each agent operates autonomously, while contributing to a collective exploration mission. Extensive simulations in Gymnasium and real-world experiments demonstrate the robustness and effectiveness of our system, while all the results highlight the benefits of combining autonomous exploration with inter-agent map sharing, advancing the development of scalable and resilient robotic exploration systems.
|
|
08:45-08:50, Paper ThAT10.4 | |
Integrating Multi-Robot Adaptive Sampling and Informative Path Planning for Spatiotemporal Natural Environment Prediction |
|
Kailas, Siva | Georgia Institute of Technology |
Deolasee, Srujan | Carnegie Mellon University |
Luo, Wenhao | University of Illinois Chicago |
Kim, Woojun | Carnegie Mellon University |
Sycara, Katia | Carnegie Mellon University |
Keywords: Path Planning for Multiple Mobile Robots or Agents
Abstract: Learning to predict spatiotemporal (ST) environmental processes from a sparse set of samples collected autonomously is a difficult task from both a sampling perspective (collecting the best sparse samples) and from a learning perspective (predicting the next timestep). In this work, we focus on investigating the sample collection process via multi-robot informative path planning. We present an approach for incorporating multi-robot informative path planning into a spatiotemporal adaptive sampling framework while considering path length constraints for sampling location selection. We also incorporate informative path planning to determine the best path to collect samples along while en route to collecting the desired sample. We achieve this in a decentralized manner by decoupling the process into two stages: the first stage uses our spatiotemporal mixture of Gaussian Processes (STMGP) model to determine the most informative sampling location via a mutual information lower bound heuristic and the second stage plans an informative path to collect the desired sample and other additional informative samples via submodular function optimization. Moreover, we effectively leverage peer-to-peer communication to enable coordination. Simulation results on real-world spatiotemporal data are provided to validate the effectiveness of our proposed approach.
|
|
08:50-08:55, Paper ThAT10.5 | |
D-PBS: Dueling Priority-Based Search for Multiple Nonholonomic Robots Motion Planning in Congested Environments |
|
Zhang, Xiaotong | Chinese Academy of Sciences |
Xiong, Gang | Institute of Automation, Chinese Academy of Sciences |
Wang, Yuanjing | Durham University |
Teng, Siyu | HKBU; UIC |
Chen, Long | Chinese Academy of Sciences |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents, Nonholonomic Motion Planning
Abstract: This letter focuses on the multiple nonholonomic robots motion planning (MRMP) problem in congested and complex environments, where the complexity escalates dramatically with the increase in the number of robots, frequently leading to deadlocks. We present the Dueling Priority-Based Search (D-PBS), an efficient and scalable priority-based motion planner for multiple nonholonomic car-like robots, capable of enabling robots to move safely to destinations in spatially-constrained settings. We achieve this by adopting the alternate dueling collision resolution approach, coupled with the exploration of comprehensive priority relationships, effectively addressing the deadlock situations. We also introduce a novel priority-binding algorithm to enhance the scalability of our planner in restricted spaces densely populated with robots. Experimental evaluations in various scenarios demonstrate that D-PBS outperforms standard approaches to MRMP, offering superior path quality and scalability for larger robot swarms.
|
|
ThAT11 |
314 |
Haptics 1 |
Regular Session |
Chair: Moore, Carl A. | FAMU-FSU College of Engineering |
Co-Chair: Chen, Cheng-Wei | National Taiwan University |
|
08:30-08:35, Paper ThAT11.1 | |
Vision-Based Haptic Rendering with Self-Occlusion Resilience Using Shadow Correspondence |
|
Mao, Mu-Ting | National Taiwan University |
Chen, Cheng-Wei | National Taiwan University |
Keywords: Haptics and Haptic Interfaces, Telerobotics and Teleoperation, RGB-D Perception
Abstract: Vision-based haptic feedback provides cost-effective preemptive protection and real-time guidance, enhancing teleoperation with reduced system complexity. However, challenges arise as the instrument approaches the target object, leading to occlusion of the point cloud behind the remote instrument, known as the self-occlusion issue. Prior solutions relying on historical point clouds or multiple viewpoints to refill the occluded region encounter adaptability issues under prolonged occlusion and limited space, thus hindering practical implementation. This paper introduces a novel non-refilling-based method for haptic force rendering, leveraging the correspondence between the tool-tip position and the tip position of the shadow-like occluded region. Experimental results demonstrate the proposed method's resilience across self-occlusion and dynamic environments, highlighting its practical applicability in robotic teleoperation.
|
|
08:35-08:40, Paper ThAT11.2 | |
A New Expression for the Passivity Bound for a Class of Sampled-Data Systems |
|
Roberts, Rodney | Florida State University |
Moore, Carl A. | FAMU-FSU College of Engineering |
Colgate, Edward | Northwestern University |
Keywords: Haptics and Haptic Interfaces, Telerobotics and Teleoperation, Force Control, Passivity
Abstract: In this article, we characterize the passivity of a class of haptic systems modeled as a simple sampled-data system. Passivity is guaranteed by ensuring that there is enough damping in the haptic interface. A necessary and sufficient bound was determined in earlier work, but the corresponding mathematical expressions were complicated, and the derivations were not completely rigorous. In this article, a more tractable expression is derived. Based on the improved expression, passivity conditions are obtained for several classes of transfer functions representing virtual environments.
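For context, the sketch below checks the widely cited classical special case of such a bound for a virtual wall with stiffness K and damping B rendered at sampling period T by a device with physical damping b, namely b > KT/2 + |B|. The numerical values are illustrative; the more general and more tractable expression derived in the article is not reproduced here.

```python
def classical_passivity_ok(b, T, K, B):
    """Classical sampled-data passivity condition for a virtual wall
    (Colgate and Schenkel): physical damping b must satisfy b > K*T/2 + |B|.
    This is the well-known special case, not the article's new expression."""
    return b > K * T / 2.0 + abs(B)

# example: 1 kHz servo rate, 2 N*s/m of device damping
print(classical_passivity_ok(b=2.0, T=0.001, K=3000.0, B=0.3))   # True
print(classical_passivity_ok(b=2.0, T=0.001, K=5000.0, B=0.5))   # False
```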
|
|
08:40-08:45, Paper ThAT11.3 | |
A Haptic Feedback Device Actuated by Electromagnetic Torque |
|
Luo, Xionghuan | Hong Kong Institute of Science & Innovation, Chinese Academy Of |
Huang, Yuanrui | Xi'an Jiaotong-Liverpool University |
Zhao, Wenda | Institute of Automation,Chinese Academy of Sciences |
Liu, Hongbin | Hong Kong Institute of Science & Innovation, Chinese Academy Of |
Keywords: Haptics and Haptic Interfaces, Wearable Robotics, Virtual Reality and Interfaces
Abstract: Haptic feedback enhances user interaction with systems by adding the sense of touch, thereby improving immersion and realism in applications like virtual reality (VR), augmented reality (AR), video games, education, and robotic surgery. To address the challenges in mechanically actuated haptic feedback devices, such as limited mobility, mechanical wear, and complex mechanical structures, several research efforts have sought to develop electromagnetic haptic feedback systems. However, these systems also suffer from the rapid decay of magnetic force with distance, which restricts their workspace size and application potential. In this paper, we propose a novel electromagnetic haptic feedback device that is actuated by magnetic torque instead of magnetic force. By controlling the magnetic torque, which decays with distance only at a third-order rate, our device achieves a large workspace (a 200-mm-diameter hemisphere) while still delivering perceptible real-time haptic feedback within the hemisphere. While using the device, the user wears a lightweight haptic thimble housing a permanent magnet on their finger, which enables 2 degree-of-freedom (DoF) haptic feedback. A 13-coil electromagnet array serves as the source of the magnetic field. A mathematical model is proposed to determine the currents in the electromagnet array to generate the desired amount of haptic feedback torque. We conducted two experiments to prove the viability of the device. A haptic feedback accuracy experiment validated the device's ability to generate sufficient torque within a large workspace. A user evaluation experiment showed that the device achieved an overall accuracy of 77.86% in a virtual enclosure exploration task, indicating its effectiveness and usability in haptic feedback applications.
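A small sketch of the underlying physics is given below: the torque on the thimble magnet is tau = m x B, and a dipole field falls off with the cube of distance, consistent with the third-order decay mentioned in the abstract. The dipole approximation of a coil and all numerical values are illustrative assumptions, not the paper's field model.

```python
import numpy as np

MU0 = 4e-7 * np.pi  # vacuum permeability (T*m/A)

def dipole_field(m_src, r_vec):
    """Magnetic field of a point dipole m_src at displacement r_vec (falls off as 1/r^3)."""
    r = np.linalg.norm(r_vec)
    r_hat = r_vec / r
    return MU0 / (4 * np.pi * r**3) * (3 * np.dot(m_src, r_hat) * r_hat - m_src)

def torque_on_thimble(m_thimble, B):
    """Torque on the thimble magnet: tau = m x B."""
    return np.cross(m_thimble, B)

# illustrative numbers: a coil approximated as a dipole 0.1 m below the finger magnet
B = dipole_field(m_src=np.array([0.0, 0.0, 50.0]), r_vec=np.array([0.02, 0.0, 0.10]))
tau = torque_on_thimble(m_thimble=np.array([0.05, 0.0, 0.05]), B=B)
print(tau)
```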
|
|
08:45-08:50, Paper ThAT11.4 | |
Vibrotactile Haptics with Soft Magnetoresponsive Surface Interface |
|
Rimer, Evan | Queen's University, Ingenuity Labs Research Institute |
Hashtrudi-Zaad, Keyvan | Queen's University |
Robertson, Matthew | Queen's University |
Keywords: Haptics and Haptic Interfaces, Soft Robot Materials and Design, Wearable Robotics
Abstract: This paper explores the feasibility of using magnetoresponsive silicone as the primary mechanism for generating vibrotactile feedback in haptic interfaces. The distinctive feature of this research lies in the integration of magnetoresponsive silicone, a flexible material that responds to electromagnetic fields to produce localized vibrations. Preliminary experiments evaluate the performance of these actuators, focusing on their ability to produce controlled vibrations across a range of frequencies and amplitudes relevant to human tactile perception. Building on this foundation, we introduce the VibroFlex Pad, a haptic interface featuring a magnetoresponsive silicone sheet and an array of electromagnets. The VibroFlex Pad demonstrates its versatility in generating varied tactile effects and simulating dynamic wave-like movements across its surface. To assess the VibroFlex Pad's effectiveness, a user study was conducted, separately evaluating tactile accuracy, overall performance, and user comfort. The findings suggest that the VibroFlex Pad offers reliable and precise vibrotactile feedback, highlighting its potential to enhance wearable haptic technologies and improve the user experience in a variety of applications.
|
|
08:50-08:55, Paper ThAT11.5 | |
Haptic Shoulder for Rendering Biomechanically Accurate Joint Limits for Human-Robot Physical Interactions |
|
Peiros, Lizzie | University of California, San Diego |
Joyce, Calvin | University of California, San Diego |
Murugesan, Tarun | University of California, San Diego |
Nguyen, Roger | University of California, San Diego |
Fiorini, Isabella | University of California, San Diego |
Galibut, Rizzi | University of California, San Diego |
Yip, Michael C. | University of California, San Diego |
Keywords: Physical Human-Robot Interaction, Safety in HRI, Biologically-Inspired Robots
Abstract: Human-robot physical interaction (pHRI) is a rapidly evolving research field with significant implications for physical therapy, search and rescue, and telemedicine. However, a major challenge lies in accurately understanding human constraints and safety when IRB-approved physical experiments with human subjects are not available. Concerns regarding human studies include safety, repeatability, and scalability in the number and diversity of participants. This paper examines whether a physical approximation can serve as a stand-in for human subjects to enhance robot autonomy for physical assistance. It introduces SHULDRD (Shoulder Haptic Universal Limb Dynamic Repositioning Device), an economical and anatomically similar device designed for real-time testing and deployment of pHRI planning tasks onto robots in the real world. SHULDRD replicates human shoulder motion, providing crucial force feedback and safety data. The device's open-source CAD and software facilitate easy construction and use, ensuring broad accessibility for researchers. By providing a flexible platform able to emulate an unlimited number of human subjects, ensure repeatable trials, and provide quantitative metrics to assess the effectiveness of the robotic intervention, SHULDRD aims to improve the safety and efficacy of human-robot physical interactions.
|
|
08:55-09:00, Paper ThAT11.6 | |
Experimental Evaluation of Haptic Shared Control for Multiple Electromagnetic Untethered Microrobots (I) |
|
Ferro, Marco | CNRS |
Pinan Basualdo, Franco Nicolas | Katholieke Universiteit Leuven |
Robuffo Giordano, Paolo | Irisa Cnrs Umr6074 |
Misra, Sarthak | University of Twente |
Pacchierotti, Claudio | Centre National De La Recherche Scientifique (CNRS) |
Keywords: Haptics and Haptic Interfaces, Telerobotics and Teleoperation, Micro/Nano Robots
Abstract: The precise manipulation of microrobots presents challenges arising from their small size and susceptibility to external disturbances. To address these challenges, we present the experimental evaluation of a haptic shared control teleoperation framework for the locomotion of multiple microrobots, relying on a kinesthetic haptic interface and a custom electromagnetic system. Six combinations of haptic and shared control strategies are evaluated during a safe 3D navigation scenario in a cluttered environment. Eighteen participants are asked to steer two spherical magnetic microrobots among obstacles to reach a predefined goal, under different conditions. For each condition, participants are provided with different obstacle avoidance and navigation guidance cues. Results show that providing assistance in avoiding obstacles guarantees safer performance, regardless of whether the assistance is autonomous or delivered through a haptic repulsive force. Moreover, autonomous obstacle avoidance also reduces the completion time by 30% compared to haptic obstacle avoidance and no obstacle avoidance cases, although haptic feedback is preferred by the users. Finally, providing haptic guidance towards the target improves the positioning accuracy of the microrobots by 65% compared to not providing this guidance.
|
|
ThAT12 |
315 |
Assembly |
Regular Session |
Chair: Liu, Changliu | Carnegie Mellon University |
Co-Chair: Bahar, Iris | Colorado School of Mines |
|
08:30-08:35, Paper ThAT12.1 | |
StableLego: Stability Analysis of Block Stacking Assembly |
|
Liu, Ruixuan | Carnegie Mellon University |
Deng, Kangle | Carnegie Mellon University |
Wang, Ziwei | Tsinghua University |
Liu, Changliu | Carnegie Mellon University |
Keywords: Assembly, Performance Evaluation and Benchmarking, Robotics and Automation in Construction
Abstract: Structural stability is a necessary condition for successful construction of an assembly. However, designing a stable assembly requires a non-trivial effort since a slight variation in the design could significantly affect the structural stability. To address the challenge, this paper studies the stability of assembly structures, in particular, block stacking assembly. The paper proposes a new optimization formulation, which optimizes over force balancing equations, for inferring the structural stability of 3D block stacking structures. The proposed stability analysis is verified on hand-crafted Lego examples. The experiment results demonstrate that the proposed method can correctly predict whether the structure is stable. In addition, it outperforms the existing methods since it can accurately locate the weakest parts in the design and, more importantly, solve any given assembly structure. To further validate the proposed method, we provide StableLego: a comprehensive dataset including 50k+ 3D objects with their Lego layouts. We test the proposed stability analysis and include the stability inference for each corresponding object in StableLego. Our code and the dataset are available at https://github.com/intelligent-control-lab/StableLego.
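As a drastically simplified illustration of inferring stability from force-balance equations, the sketch below solves the static equilibrium of a single block resting on two support points and checks that both contact forces are non-negative. This 1D toy is not the optimization formulation proposed in the paper.

```python
import numpy as np

def block_on_two_supports(x1, x2, x_com, mass, g=9.81):
    """Force-balance feasibility for one rigid block on two support points at x1, x2:
    solve f1 + f2 = m*g and f1*x1 + f2*x2 = m*g*x_com, then check both forces are
    non-negative (i.e., the block does not tip)."""
    A = np.array([[1.0, 1.0],
                  [x1,  x2 ]])
    b = np.array([mass * g, mass * g * x_com])
    f = np.linalg.solve(A, b)       # [f1, f2]
    return f, bool(np.all(f >= -1e-9))

print(block_on_two_supports(x1=0.0, x2=0.4, x_com=0.1, mass=0.05))   # stable
print(block_on_two_supports(x1=0.0, x2=0.4, x_com=0.6, mass=0.05))   # tips over
```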
|
|
08:35-08:40, Paper ThAT12.2 | |
Component Selection for Craft Assembly Tasks |
|
Isume, Vitor Hideyo | Osaka University |
Kiyokawa, Takuya | Osaka University |
Yamanobe, Natsuki | Advanced Industrial Science and Technology |
Domae, Yukiyasu | The National Institute of Advanced Industrial Science and Techno |
Wan, Weiwei | Osaka University |
Harada, Kensuke | Osaka University |
Keywords: Assembly, Visual Learning, Computer Vision for Automation
Abstract: Inspired by traditional handmade crafts, where a person improvises assemblies based on the available objects, we formally introduce the Craft Assembly Task. It is a robotic assembly task that involves building an accurate representation of a given target object using the available objects, which do not directly correspond to its parts. In this work, we focus on selecting the subset of available objects for the final craft, when the given input is an RGB image of the target in the wild. We use a mask segmentation neural network to identify visible parts, followed by retrieving labeled template meshes. These meshes undergo pose optimization to determine the most suitable template. Then, we propose to simplify the parts of the transformed template mesh to primitive shapes like cuboids or cylinders. Finally, we design a search algorithm to find correspondences in the scene based on local and global proportions. We develop baselines for comparison that consider all possible combinations, and choose the highest scoring combination for common metrics used in foreground maps and mask accuracy. Our approach achieves comparable results to the baselines for two different scenes, and we show qualitative results for an implementation in a real-world scenario.
|
|
08:40-08:45, Paper ThAT12.3 | |
Assembly Order Planning for Modular Structures by Autonomous Multi-Robot Systems |
|
Peters, Tom | TU Eindhoven |
Cheung, Kenneth C. | National Aeronautics and Space Administration (NASA) |
Kostitsyna, Irina | KBR at NASA Ames Research Center |
Keywords: Assembly, Path Planning for Multiple Mobile Robots or Agents, Parallel Robots
Abstract: Coordinated multi-agent robotic construction provides a means to build infrastructure in extreme environments and improve efficiency in high performance applications. Planning methods are key to understanding and achieving the scope of such applications, and are typically tailored to specific models of construction material and a consideration of passivity or activity thereof. Here, we focus on the NASA Automated Reconfigurable Mission Adaptive Digital Assembly Systems (ARMADAS) model, which includes passive lightweight structural modules and small robots that traverse the structure. We present an algorithm for calculating a build plan for robots under the constraints of this type of system. We then evaluate the quality of this plan experimentally. Many of the techniques we use can be applied to any robotic assembly system whose robots perform locomotion over the structure that they are building.
|
|
08:45-08:50, Paper ThAT12.4 | |
Master Rules from Chaos: Learning to Reason, Plan, and Interact from Chaos for Tangram Assembly |
|
Zhao, Chao | Hong Kong University of Science and Technology |
Jiang, Chunli | The Hong Kong University of Science and Technology |
Luo, Lifan | The Hong Kong University of Science and Technology |
Zhang, Guanlan | The Hong Kong University of Science and Technology |
Yu, Hongyu | The Hong Kong University of Science and Technology |
Wang, Michael Yu | Mywang@gbu.edu.cn |
Chen, Qifeng | HKUST |
Keywords: Grasping, Assembly
Abstract: Tangram assembly, the art of human intelligence and manipulation dexterity, is a new challenge for robotics and reveals the limitations of the state of the art. Here, we describe our initial exploration and highlight key problems in reasoning, planning, and manipulation for robotic tangram assembly. We present MRChaos (Master Rules from Chaos), a robust and general solution for learning assembly policies that can generalize to novel objects. In contrast to conventional methods based on prior geometric and kinematic models, MRChaos learns to assemble randomly generated objects through self-exploration in simulation without prior experience in assembling target objects. The reward signal is obtained from the visual observation change without manually designed models or annotations. MRChaos retains its robustness in assembling various novel tangram objects that have never been encountered during training, with only silhouette prompts. We show the potential of MRChaos in wider applications such as cutlery combinations. The presented work indicates that radical generalization in robotic assembly can be achieved by learning in much simpler domains.
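The reward "obtained from the visual observation change" is not detailed here; one plausible instantiation, sketched under that assumption, rewards the increase in silhouette overlap (IoU) between the current assembly and the target silhouette computed from binary masks.

    import numpy as np

    def iou(mask_a, mask_b):
        """Intersection-over-union of two boolean silhouette masks."""
        inter = np.logical_and(mask_a, mask_b).sum()
        union = np.logical_or(mask_a, mask_b).sum()
        return inter / union if union > 0 else 0.0

    def visual_change_reward(prev_obs, curr_obs, target_silhouette):
        """Reward = improvement in overlap with the target silhouette between
        consecutive observations (one reading of 'reward from visual observation
        change'; the paper's exact signal may differ)."""
        return iou(curr_obs, target_silhouette) - iou(prev_obs, target_silhouette)

    # Toy example: placing a piece that fills more of the target outline.
    target = np.zeros((8, 8), bool); target[2:6, 2:6] = True
    before = np.zeros((8, 8), bool); before[2:4, 2:6] = True
    after  = before.copy();          after[4:6, 2:6] = True
    print(round(visual_change_reward(before, after, target), 3))  # positive reward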
|
|
08:50-08:55, Paper ThAT12.5 | |
Robot Planning under Uncertainty for Object Assembly and Troubleshooting Using Human Causal Models |
|
Basu, Semanti | Brown University |
Tatlidil, Semir | Brown University |
Kim, Moon Hwan | Brown University |
Tran, Tiffany | Brown University |
Saxena, Serena | Brown University |
Williams, Tom | Colorado School of Mines |
Sloman, Steven | Brown University |
Bahar, Iris | Colorado School of Mines |
Keywords: Human-Centered Robotics, Embodied Cognitive Science, Planning under Uncertainty
Abstract: In this paper we explore if human mental models of objects, even when flawed, can be integrated with a collaborative robot's decision making framework to allow it to make smarter choices under partial observability for different object-related tasks such as assembly and troubleshooting. We demonstrate how (1) these informative causal models can be extracted from humans through crowdsourcing, (2) object assembly and troubleshooting can be formulated as Partially Observable Markov Decision Processes (POMDPs) and (3) our extracted causal models can be incorporated into those models in the form of approximate priors. Finally, (4) we use systematic experimentation in simulation to demonstrate the success of this approach, with 2X average improvement in reward observed for object assembly tasks, and 1.4X average improvement in reward observed for troubleshooting tasks.
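How a crowdsourced causal model might enter the planner as an approximate prior can be sketched as a simple Bayesian belief update; the fault names, prior probabilities, and observation likelihoods below are hypothetical placeholders, not values from the study.

    import numpy as np

    # Hypothetical crowdsourced causal model for a troubleshooting task:
    # prior probability that each hidden fault explains "lamp does not turn on".
    fault_names = ["bulb burnt out", "loose wiring", "dead power outlet"]
    crowd_prior = np.array([0.6, 0.25, 0.15])   # elicited from humans (illustrative numbers)

    # Observation model: P(observation | fault), e.g. the robot tests the outlet
    # and reads "outlet has power".
    p_obs_given_fault = np.array([0.95, 0.9, 0.05])

    # Bayesian belief update: the crowdsourced prior plays the role of the POMDP's
    # initial belief, refined by the robot's own sensing.
    posterior = crowd_prior * p_obs_given_fault
    posterior /= posterior.sum()

    for name, p in zip(fault_names, posterior):
        print(f"{name}: {p:.2f}")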
|
|
08:55-09:00, Paper ThAT12.6 | |
Robotic Dry-Stacking of Clocháin with Irregular Stones |
|
Liu, Yifang | Oak Ridge National Laboratory |
Napp, Nils | Cornell University |
Keywords: Robotics and Automation in Construction, Assembly
Abstract: This paper explores automated robotic construction of clocháin, a type of corbelled rock shelter, traditionally crafted by skilled workers. While robots have been employed for simple dry-stacking tasks in the past, such as construction of stone walls or vertical stone towers, the question of whether robots possess the capacity to construct more functional structures remains unanswered. This study presents a significant step forward in robotic dry-stacking of functional structures: the assembly of natural stones into freestanding clocháin structures. We also present a set of stackability measures to aid stone selection, which significantly improves the stability of the planned structures. Our sequential filtering approach, originally designed for planning stone walls, plays a foundational role in achieving stable clochán construction. Experimental results validate the effectiveness of the stackability measures and demonstrate the physical execution of dry-stacking clocháin. The progress demonstrated in this paper opens the door to robotic construction of a wide range of utility structures in unstructured environments.
|
|
ThAT13 |
316 |
Reinforcement Learning Applications |
Regular Session |
Chair: Ekenna, Chinwe | University at Albany |
Co-Chair: Roveda, Loris | SUPSI-IDSIA |
|
08:30-08:35, Paper ThAT13.1 | |
Synthesizing Depowdering Trajectories for Robot Arms Using Deep Reinforcement Learning |
|
Maurer, Maximilian | Festo SE & Co. KG |
Seefeldt, Simon | University of Tübingen |
Seyler, Jan Reinke | Festo SE & Co. KG |
Eivazi, Shahram | University of Tübingen |
Keywords: Reinforcement Learning, Task and Motion Planning, Representation Learning
Abstract: Research into robotics applications of deep reinforcement learning (DRL) has increasingly been focussed on learning precise object manipulation and trajectory planning. Extending these tasks to continuous robot-object interactions with the surface of complex geometries remains an open problem. In this paper we investigate end-to-end DRL solutions for depowdering tasks that work by directing a pressurized air stream onto the object's surfaces using a blast nozzle head mounted on a robotic arm. We develop a GPU accelerated vectorized cleaning effect for integration into RL training and consider ways to expose vision-less trajectory synthesis for surface treatment applications to the RL agent based on UV mapping. Our experimental evaluation demonstrates that DRL has the potential to be used for generating object-specific agents for depowdering tasks on a variety of 3D objects without requiring intermediate path planners even in a full 3D motion setup. Finally, we show that DRL-generated trajectories can be transferred to a real-world setup. Our task formulation lends itself to approximate a wide range of surface treatment applications (e.g., cleaning and spray painting) with various effects.
|
|
08:35-08:40, Paper ThAT13.2 | |
World Model-Based Perception for Visual Legged Locomotion |
|
Lai, Hang | Shanghai Jiao Tong University |
Cao, Jiahang | Shanghai Jiao Tong University |
Xu, Jiafeng | ByteDance |
Wu, Hongtao | Bytedance |
Lin, Yunfeng | Shanghai Jiao Tong University |
Kong, Tao | ByteDance |
Yu, Yong | Shanghai Jiao Tong University |
Zhang, Weinan | Shanghai Jiao Tong University |
Keywords: Reinforcement Learning, Legged Robots
Abstract: Legged locomotion over various terrains is challenging and requires precise perception of the robot and its surroundings from both proprioception and vision. However, learning directly from high-dimensional visual input is often data-inefficient and intricate. To address this issue, traditional methods attempt to learn a teacher policy with access to privileged information first and then learn a student policy to imitate the teacher's behavior with visual input. Despite some progress, this imitation framework prevents the student policy from achieving optimal performance due to the information gap between inputs. Furthermore, the learning process is unnatural since animals intuitively learn to traverse different terrains based on their understanding of the world without privileged knowledge. Inspired by this natural ability, we propose a simple yet effective method, World Model-based Perception (WMP), which builds a world model of the environment and learns a policy based on the world model. We illustrate that though completely trained in simulation, the world model can make accurate predictions of real-world trajectories, thus providing informative signals for the policy controller. Extensive simulated and real-world experiments demonstrate that WMP outperforms state-of-the-art baselines in traversability and robustness. Videos and Code are available at: https://wmp-loco.github.io/.
|
|
08:40-08:45, Paper ThAT13.3 | |
V-Pilot: A Velocity Vector Control Agent for Fixed-Wing UAVs from Imperfect Demonstrations |
|
Gong, Xudong | National University of Defense Technology |
Dawei, Feng | National University of Defense Technology |
Xu, Kele | National University of Defense Technology |
Zhou, Xing | National University of Defense Technology |
Zheng, Si | Qiyuan Lab |
Ding, Bo | National University of Defense Technology |
Wang, Huaimin | National University of Defense Technology |
Keywords: Reinforcement Learning, Learning from Demonstration, Aerial Systems: Applications
Abstract: This paper addresses the challenge of Velocity Vector Control (VVC) for fixed-wing UAVs using Reinforcement Learning (RL) in the presence of imperfect demonstrations. The multi-objective and long-horizon nature of VVC introduces significant spatial and temporal complexities, complicating RL's exploration. While demonstration-based RL methods can help mitigate exploration challenges, their effectiveness is often limited by the quality of the provided demonstrations. To tackle this, we propose V-Pilot, a novel approach that integrates: (1) a controller equipped with a control law model to reduce action oscillation, thus alleviating temporal exploration issues, and (2) a VVC-specific training workflow for iterative policy refinement and demonstration quality improvement. This framework is designed to enhance the performance of demonstration-based RL under imperfect demonstrations. We evaluate V-Pilot on the fixed-wing UAV RL environment, FlyCraft. Experimental results demonstrate that V-Pilot outperforms PID and Behavioral Cloning across multiple performance metrics.
|
|
08:45-08:50, Paper ThAT13.4 | |
Efficiently Generating Expressive Quadruped Behaviors Via Language-Guided Preference Learning |
|
Clark, Jaden | Stanford University |
Hejna, Donald | Stanford University |
Sadigh, Dorsa | Stanford University |
Keywords: Reinforcement Learning, Social HRI, Emotional Robotics
Abstract: Expressive robotic behavior is essential for the widespread acceptance of robots in social environments. Recent advancements in learned legged locomotion controllers have enabled more dynamic and versatile robot behaviors. However, determining the optimal behavior for interactions with different users across varied scenarios remains a challenge. Current methods either rely on natural language input, which is efficient but low-resolution, or learn from human preferences, which, although high-resolution, is sample inefficient. This paper introduces a novel approach that leverages priors generated by pre-trained LLMs alongside the precision of preference learning. Our method, termed Language-Guided Preference Learning (LGPL), uses LLMs to generate initial behavior samples, which are then refined through preference-based feedback to learn behaviors that closely align with human expectations. Our core insight is that LLMs can guide the sampling process for preference learning, leading to a substantial improvement in sample efficiency. We demonstrate that LGPL can quickly learn accurate and expressive behaviors with as few as four queries, outperforming both purely language-parameterized models and traditional preference learning approaches. Website with videos: lgpl-gaits.github.io/
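A hedged sketch of the LGPL idea of letting LLM-proposed samples seed preference learning: the candidate gait parameter sets stand in for LLM output, and the pairwise-preference oracle stands in for the human; neither reflects the authors' actual prompts or interface.

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-in for LLM-proposed gait parameter sets (e.g. [step_height, bounce, head_tilt]);
    # in LGPL these would come from prompting a pre-trained language model.
    llm_candidates = np.array([[0.8, 0.2, 0.1],
                               [0.5, 0.6, 0.3],
                               [0.9, 0.1, 0.5],
                               [0.3, 0.8, 0.2]])

    target = np.array([0.7, 0.3, 0.2])          # hidden "user taste" for the mock oracle

    def user_prefers(a, b):
        """Simulated pairwise preference query (a real system asks a human)."""
        return np.linalg.norm(a - target) < np.linalg.norm(b - target)

    # Tournament over the LLM-seeded candidates, then local refinement around
    # the winner using a handful of additional preference queries.
    best, queries = llm_candidates[0], 0
    for cand in llm_candidates[1:]:
        queries += 1
        if user_prefers(cand, best):
            best = cand
    for _ in range(4):                           # small refinement budget
        proposal = np.clip(best + rng.normal(scale=0.1, size=3), 0, 1)
        queries += 1
        if user_prefers(proposal, best):
            best = proposal
    print("selected behaviour:", np.round(best, 2), "after", queries, "queries")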
|
|
08:50-08:55, Paper ThAT13.5 | |
Learning Multi-Agent Coordination for Replenishment at Sea |
|
Han, Byeolyi | Georgia Institute of Technology |
Cho, Minwoo | Georgia Institute of Technology |
Chen, Letian | Georgia Institute of Technology |
Paleja, Rohan | MIT Lincoln Laboratory |
Wu, Zixuan | Georgia Institute of Technology |
Ye, Sean | Zoox |
Seraj, Esmaeil | Georgia Institute of Technology |
Sidoti, David | US Naval Research Laboratory |
Gombolay, Matthew | Georgia Institute of Technology |
Keywords: Planning, Scheduling and Coordination, Multi-Robot Systems, Reinforcement Learning
Abstract: Optimizing large-scale logistics is computationally challenging due to its scale and requirement to be robust to stochastic and time-varying weather disturbances. However, prior research in multi-agent reinforcement learning (MARL) does not address scenarios that capture the complexity of logistics operations influenced by dynamic weather patterns. To address this gap, we introduce a new MARL environment, MARINE, which has two types of agents equipped with limited resources and integrates real wave data to model the influences of weather on the replenishment at sea (RAS) operation. To this end, we propose SchedHGNN, a novel MARL algorithm that incorporates a heterogeneous graph neural network and an intrinsic reward scheme to enhance agent coordination and mitigate challenges induced by environment non-stationarity. Our results show that the combination of effective RAS scheduling and improved communication enables our model to outperform competitive baselines by up to 37.8%. This achievement marks a significant advancement in applying MARL to complex, real-world logistics scenarios.
|
|
ThAT14 |
402 |
Exoskeletons |
Regular Session |
Chair: Sharma, Nitin | North Carolina State University |
Co-Chair: Zarrouk, David | Ben Gurion University |
|
08:30-08:35, Paper ThAT14.1 | |
Real-Time Ultrasound Imaging of a Human Muscle to Optimize Shared Control in a Hybrid Exoskeleton |
|
Iyer, Ashwin | North Carolina State University |
Sun, Ziyue | NCSU |
Lambeth, Krysten | North Carolina State University |
Singh, Mayank | North Carolina State University |
Cleveland, Christine | University of North Carolina-Chapel Hill |
Sharma, Nitin | North Carolina State University |
Keywords: Prosthetics and Exoskeletons, Optimization and Optimal Control, Rehabilitation Robotics, Ultrasound Imaging
Abstract: A hybrid exoskeleton is a class of wearable robotic technology that simultaneously uses a powered exoskeleton and functional electrical stimulation (FES) to generate assistive joint torques for people with impaired mobility due to neurological disorders such as spinal cord injury (SCI). The hybrid assistive technology benefits from FES that actively elicits force from paralyzed muscles via their neural excitation, leading to muscle strengthening. The main technical barrier to realizing the hybrid technology is to attain stable coordination between FES and the exoskeleton despite the quick onset of FES-induced muscle fatigue, which causes a rapid decline in the muscle force. Current methods to measure the induced fatigue lack direct muscle state measurements and may be ineffective at capturing the muscle force decay due to FES. Instead, ultrasound (US) imaging accurately quantifies FES-related muscle contractility and fatigue due to the direct visualization of muscle fibers. In this paper, we use real-time US imaging-derived muscle strain changes as biomarkers of FES-induced fatigue in an optimal controller that modulates exoskeleton assistance and FES dosage. To demonstrate that real-time US imaging is a promising muscle-machine interface technology that can optimize shared control in a hybrid exoskeleton, we perform experiments involving continuous seated knee extension and over-ground walking tasks on two participants with SCI and four participants without disabilities. Furthermore, this work helps design a novel and unprecedented robotic gait technology with the capability to impart FES-associated therapeutic benefits while assisting the gait of neurologically impaired individuals, including those with SCI, stroke, multiple sclerosis, etc.
|
|
08:35-08:40, Paper ThAT14.2 | |
Design and Control of a Novel Semi-Passive Knee Exoskeleton |
|
Sade, Alon | Ben Gurion University of the Negev |
Coifman, Itay | Ben Gurion University of the Negev |
Riemer, Raziel | Ben-Gurion University of the Negev |
Zarrouk, David | Ben Gurion University |
Keywords: Mechanism Design, Prosthetics and Exoskeletons, Wearable Robotics
Abstract: This paper presents a novel semi-passive knee exoskeleton designed to provide running assistance. It incorporates an energy-efficient clutch mechanism activated by a mini servomotor, which engages and disengages the spring that supports the leg during running. The exoskeleton extracts energy during the running phase when the muscles are acting as brakes (negative power), stores it in the spring, and then returns this energy during the positive power phase (when the muscles are acting as motors). The exoskeleton controller uses an inertial measurement unit (IMU) sensor to estimate the shank orientation, which determines when to engage and disengage the spring. Two experiments designed to probe the functionality of the exoskeleton were conducted to evaluate its control performance and actuation, and the exoskeleton's biomechanical impact on three subjects. The findings showed that the control mechanism could be engaged and disengaged in real time. The maximum moment created on the knee muscles was 17 Nm, although the device could supply 28 Nm. The ratio of servo energy consumption to the energy saved by the subjects was 1:160 (0.1 W input to 16 W saved). This study thus paves the way for the development of lightweight, inexpensive exoskeletons that can contribute to their greater availability for a broader range of individuals.
|
|
08:40-08:45, Paper ThAT14.3 | |
Model-Based Control Strategies Comparison of One Bionic Ankle Tensegrity Exoskeleton: BATE |
|
Wei, Dunwen | University of Electronic Science and Technology of China |
Mao, Shiyu | University of Electronic Science and Technology of China |
Zhang, Zhichao | University of Electronic Science and Technology of China |
Wei, Ximing | University of Electronic Science and Technology of China |
Gao, Tao | University of Electronic Science and Technology of China |
Ficuciello, Fanny | Università Di Napoli Federico II |
Keywords: Rehabilitation Robotics, Wearable Robotics, Biologically-Inspired Robots
Abstract: This paper presents a comparative analysis of model-based control strategies for a Bionic Ankle Tensegrity Exoskeleton (BATE). The BATE is designed to mimic the self-stress equilibrium and self-supporting characteristics of the human ankle biotensegrity structure. Model-based control strategies are conventional methods that can help discover the principles of complex tensegrity systems. The high dimensions and non-linearity of the BATE pose challenges for physical modelling and require unique model-based control strategies. In this study, we propose a modelling method that considers interaction force and explore the trajectory tracking performance and robustness of the ankle exoskeleton under three power-assisted control methods: position control, force control, and hybrid force-position control. The experimental results suggest that the position control (PC) method offers superior tracking performance and robustness compared to the other two methods. This method can be used for early rehabilitation training to improve flexibility. The control concept emphasizes its advantages over current wearable exoskeletons and introduces new ideas for high-performance exoskeletons.
|
|
08:45-08:50, Paper ThAT14.4 | |
Human-Like Walking Motion Generation for Self-Balancing Lower Limb Rehabilitation Exoskeletons |
|
Yang, Ming | University of Science and Technology of China |
Chen, Ziqiang | Shenzhen Institute of Advanced Technology, Chinese Academy of Sc |
Li, Wentao | Shenzhen Institute of Advanced Technology, Chinese Academy of Sc |
Li, Feng | Shenzhen Institute of Advanced Technology, Chinese Academy of Sci |
Shang, Weiwei | University of Science and Technology of China |
Tian, Dingkui | Shenzhen Advanced Technology Research Institute, Chinese Academy |
Wu, Xinyu | CAS |
Keywords: Prosthetics and Exoskeletons, Rehabilitation Robotics, Wearable Robotics
Abstract: Self-balancing lower limb rehabilitation exoskeletons (SLLREs) allow individuals with lower limb dysfunction to walk without the use of crutches. Stable and human-like walking motions are crucial for SLLREs because achieving a close imitation of healthy human walking is a key goal in rehabilitation therapy. Existing SLLREs can realize stable walking but lack human-like features. This paper designs a walking motion generator based on hierarchical optimization to generate a human-like walking motion with variable hip height, heel-strike, toe-off, and knee-stretched features. This hierarchically optimized human-like walking motion generator consists of a knee-stretched optimizer and an optimization-based stabilizing filter. Specifically, the knee-stretched optimizer realizes the stretched-knee feature by optimizing the hip trajectory with varying heights, and the stabilizing filter realizes stable walking by optimizing the hip trajectory in the sagittal plane direction. To validate the effectiveness of the proposed human-like walking motion generator, walking experiments were conducted on the SLLRE AutoLEE-G3 both in a simulation environment and in the real world. The experimental results show that the human-like walking motions look more natural and reduce the required torque for the knee joint compared with knee-bent walking.
|
|
08:50-08:55, Paper ThAT14.5 | |
Kinematic Benefits of a Cable-Driven Exosuit for Head-Neck Mobility |
|
Bales, Ian | University of Utah |
Zhang, Haohan | University of Utah |
Keywords: Prosthetics and Exoskeletons, Wearable Robotics, Tendon/Wire Mechanism
Abstract: This letter presents a novel cable-driven exosuit intended for head-neck support and movement assistance. Mobility limitations in the head-neck, such as dropped head syndrome, can result from various neurological disorders. Current solutions, ranging from static neck collars to rigid-link robotic neck exoskeletons, are unsatisfactory. Neck collars are the most used clinically but fail to restore head-neck motion. Rigid-link neck exoskeletons can enable head movement but are bulky and restrictive. In this letter, we present the design of this exosuit, an analysis of its ability to balance the gravitational moment of the head in simulation, and the results of a user study comparing its kinematic performance to a state-of-the-art rigid-link neck exoskeleton. The exosuit is able to support the head across its full range of motion according to simulation results. It fits users of different sizes and participants exhibited more natural head-neck movement wearing the exosuit as compared to wearing the rigid-link exoskeleton. The exosuit allowed more head rotations than the rigid-link neck exoskeleton and required less compensatory torso movement for three daily tasks (looking for traffic, drinking from a bottle, and picking up an object from the floor). Its absolute range of motion was also much larger than the one allowed by the rigid-link neck exoskeleton. These results demonstrate the kinematic benefits of a cable-driven neck exosuit and provide justification for studying the use of such an exosuit for head-neck movement assistance in patient groups.
|
|
ThAT15 |
403 |
Continuum Robots 1 |
Regular Session |
Chair: Morimoto, Tania K. | University of California San Diego |
Co-Chair: Yuan, Sichen | The University of Alabama |
|
08:30-08:35, Paper ThAT15.1 | |
PH-Gauss-Lobatto Reduced-Order-Model for Shape Control of Soft-Continuum Manipulators |
|
Mbakop, Steeve | Junia |
Tagne, Gilles | Yncréa Hauts De France / ISEN Lille |
Chevillon, Tanguy | Junia |
Drakunov, Sergey | IHMC |
Merzouki, Rochdi | CRIStAL, CNRS UMR 9189, University of Lille1 |
Keywords: Modeling, Control, and Learning for Soft Robots, Kinematics, Motion and Path Planning, Soft Robot Applications
Abstract: Soft and hyper-elastic materials possess properties of resilience and flexibility, characterizing a class of Soft-Continuum Manipulators (SCM). The latter describes a robot structure with an infinite number of degrees of freedom (DoFs), useful for mobility and manipulation. However, these geometric characteristics are a source of modeling and control problems. In this paper, a Pythagorean Hodograph (PH) curve based Reduced-Order-Model (ROM) relying on the Gauss-Lobatto quadrature is investigated for the modeling and the control of SCM. This allows, first, reducing the dimension of the SCM kinematics based on the PH parametric curves with a predefined length and, second, developing the shape kinematics control from its control polygon. The use of the Gauss-Lobatto quadrature allows the PH curve control points to be moved independently while preserving the PH features of length and minimum curve energy. These features are important for controlling the shape of the SCM in real time. The proposed approach has been validated numerically and experimentally on a bio-inspired Soft Continuum Elephant Trunk Robot.
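Gauss-Lobatto quadrature itself can be illustrated on the simpler task of evaluating the arc length of a parametric curve; the sketch below uses a cubic Bezier rather than the paper's PH parametrization, so it shows only the quadrature rule, not the reduced-order model.

    import numpy as np

    # 5-point Gauss-Lobatto nodes and weights on [-1, 1].
    GL_NODES   = np.array([-1.0, -np.sqrt(3/7), 0.0, np.sqrt(3/7), 1.0])
    GL_WEIGHTS = np.array([1/10, 49/90, 32/45, 49/90, 1/10])

    def bezier_speed(ctrl, t):
        """|dB/dt| of a cubic Bezier with control points ctrl (4 x dim)."""
        p0, p1, p2, p3 = ctrl
        d = 3*(1-t)**2*(p1-p0) + 6*(1-t)*t*(p2-p1) + 3*t**2*(p3-p2)
        return np.linalg.norm(d)

    def arc_length_gauss_lobatto(ctrl):
        """Arc length of the curve on t in [0, 1] via Gauss-Lobatto quadrature."""
        total = 0.0
        for u, w in zip(GL_NODES, GL_WEIGHTS):
            t = 0.5 * (u + 1.0)          # map node from [-1, 1] to [0, 1]
            total += w * bezier_speed(ctrl, t)
        return 0.5 * total               # Jacobian of the interval mapping

    # Straight-line control polygon: the exact length is 3.0.
    ctrl = np.array([[0, 0], [1, 0], [2, 0], [3, 0]], float)
    print(arc_length_gauss_lobatto(ctrl))   # ~3.0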
|
|
08:35-08:40, Paper ThAT15.2 | |
Towards Contact-Aided Motion Planning for Tendon-Driven Continuum Robots |
|
Rao, Priyanka | University of Toronto |
Salzman, Oren | Technion |
Burgner-Kahrs, Jessica | University of Toronto |
Keywords: Modeling, Control, and Learning for Soft Robots, Motion and Path Planning, Soft Robot Applications
Abstract: Tendon-driven continuum robots (TDCRs), with their flexible backbones, offer the advantage of being used for navigating complex, cluttered environments. However, to do so, they typically require multiple segments, often leading to complex actuation and control challenges. To this end, we propose a novel approach to navigate cluttered spaces effectively for a single-segment long TDCR, which is the simplest topology from a mechanical point of view. Our key insight is that by leveraging contact with the environment we can achieve multiple curvatures without mechanical alterations to the robot. Specifically, we propose a search-based motion planner for a single-segment TDCR. This planner, guided by a specially designed heuristic, discretizes the configuration space and employs a best-first search. The heuristic, crucial for efficient navigation, provides an effective cost-to-go estimation while respecting the kinematic constraints of the TDCR and environmental interactions. We empirically demonstrate the efficiency of our planner: over 525 queries in environments with both convex and non-convex obstacles, it achieves a success rate of about 80%, while the baselines do not exceed a success rate of 30%. The difference is attributed to our novel heuristic, which is shown to significantly reduce the required search space.
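A generic best-first search of the kind described above, sketched on a toy 2D grid; the neighbor function and Manhattan heuristic are placeholders for the paper's TDCR-specific configuration discretization and contact-aware cost-to-go estimate.

    import heapq

    def best_first_search(start, goal, neighbors, heuristic):
        """Greedy best-first search: always expand the node whose heuristic
        cost-to-go estimate is lowest (the paper's heuristic additionally
        encodes TDCR kinematic and contact constraints)."""
        frontier = [(heuristic(start, goal), start)]
        came_from = {start: None}
        while frontier:
            _, node = heapq.heappop(frontier)
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = came_from[node]
                return path[::-1]
            for nxt in neighbors(node):
                if nxt not in came_from:
                    came_from[nxt] = node
                    heapq.heappush(frontier, (heuristic(nxt, goal), nxt))
        return None

    # Toy 2D grid with one obstacle cell; Manhattan distance as the heuristic.
    obstacles = {(1, 1)}
    def neighbors(c):
        x, y = c
        cand = [(x+1, y), (x-1, y), (x, y+1), (x, y-1)]
        return [p for p in cand if 0 <= p[0] < 4 and 0 <= p[1] < 4 and p not in obstacles]
    manhattan = lambda a, b: abs(a[0]-b[0]) + abs(a[1]-b[1])
    print(best_first_search((0, 0), (3, 3), neighbors, manhattan))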
|
|
08:40-08:45, Paper ThAT15.3 | |
A Simple Dynamics Model for Cable Driven Continuum Robots with Actuator Coupling |
|
Watson, Connor | Morimoto Lab, UCSD |
Morimoto, Tania K. | University of California San Diego |
Keywords: Modeling, Control, and Learning for Soft Robots, Surgical Robotics: Steerable Catheters/Needles, Tendon/Wire Mechanism
Abstract: The flexibility and dexterity of cable-driven continuum robots (CDCRs) make them well-suited for intricate tasks such as minimally invasive surgery. However, the complexity of accurately modeling their dynamics has limited their broader adoption and effective control. Current models either oversimplify the dynamics by assuming quasi-static conditions or overcomplicate them, making real-time application challenging. Additionally, many existing models neglect the critical coupling between the robot's body and actuator dynamics, a factor essential for accurate control. In this paper, we propose a new, minimal dynamics model for CDCRs that strikes a balance between simplicity and accuracy. Our model captures the essential dynamics of both the robot and its actuators, providing a practical tool for control design. We also establish connections between our model and those used for other robotic systems, enabling the transfer of well-established control strategies to CDCRs. The model is validated through hardware experiments, demonstrating its capability to effectively address complex control challenges in CDCR applications.
|
|
08:45-08:50, Paper ThAT15.4 | |
A Novel Tendon-Driven Articulated Continuum Robot with Stabilized Self-Locking Joints |
|
Ren, Jiankun | Fudan University |
Qi, Lizhe | Fudan University |
Jia, Yu | Fudan University |
Wang, Hecheng | Fudan University |
Wang, Ziheng | Academy for Engineering & Technology, Fudan University |
Sun, Yunquan | Fudan University |
Keywords: Mechanism Design, Tendon/Wire Mechanism, Actuation and Joint Mechanisms
Abstract: Articulated continuum robots (ACRs) are characterized by flexibility, controllability, and adaptability and perform excellently in complex and constrained environments. However, the large number of motor drives limit the ACRs' portability and make them cumbersome to control. This paper presents a novel tendon-driven ACR composed of stabilized self-locking joints (SLJs) connected in series. After triggering the mechanical constraints with shape memory alloy coils, each joint can be maintained in either a self-locking or release state with zero power consumption. Consequently, even with a single set of drive units, the ACR can operate in multiple modes, enabling variable motion performance and workspace adaptability, effectively reducing the number of motors. The ACR's stiffness also varies with the locking state of its SLJs, and no motor drive is required to maintain its shape when all SLJs are self-locking. The performance and reliability of the SLJ prototype were validated. The workspace of the ACR prototype model was analyzed, and its partial motion performance, motion error, and variable stiffness were verified.
|
|
08:50-08:55, Paper ThAT15.5 | |
Tensiworm: A Novel Tensegrity Robot with Enhanced Peristaltic Locomotion Efficiency |
|
Kazoleas, Christian | The University of Alabama |
Zhang, Jiajun | The University of Alabama |
Yuan, Sichen | The University of Alabama |
Keywords: Actuation and Joint Mechanisms, Biomimetics, Soft Robot Materials and Design
Abstract: Tensegrity structures have been widely explored for their lightweight, high-stiffness, and foldable properties. These unique characteristics have enabled their application in various fields including robotics. Tensegrity robots have demonstrated diverse locomotion modes offering versatile solutions for navigation in complex environments. Recent efforts in bio-inspired robotics have led to designs mimicking the movement of natural organisms, such as earthworms. However, existing designs, particularly those utilizing motor-pulley mechanisms for robot body contraction, face significant challenges due to their bulky actuation systems that reduce locomotion efficiency. This paper introduces a novel tensegrity robot, "Tensiworm," inspired by the peristaltic locomotion of an earthworm. Composed of three icosahedron tensegrity unit cells connected in series, the Tensiworm robot employs a sequential contraction and relaxation mechanism driven by active cable members made of shape memory actuators. This innovative design achieves a 59.13% folding ratio and weighs only 46.9 grams. The robot can travel a distance equal to its body length in approximately ten cycles with an average speed of 10.01 mm per minute. Furthermore, the use of thinner, flexible structural members broadens possibilities for development of millimeter-scale tensegrity robots, which hold significant potential for biomedical applications, including in-vivo testing and targeted drug delivery.
|
|
08:55-09:00, Paper ThAT15.6 | |
Accelerated Quasi-Static FEM for Real-Time Modeling of Continuum Robots with Multiple Contacts and Large Deformation |
|
Chen, Hao | University of Chinese Academy of Sciences |
Chen, Jian | Hong Kong Institute of Science and Innovation, Chinese Academy O |
Liu, Xinran | University of Chinese Academy of Sciences |
Zhang, Zihui | Institute of Automation, Chinese Academy of Sciences |
Huang, Yuanrui | Xi'an Jiaotong-Liverpool University |
Zhang, Zhongkai | University of Montpellier, CNRS |
Liu, Hongbin | Institute of Automation,Chinese Academy of Sciences |
Keywords: Simulation and Animation, Contact Modeling, Medical Robots and Systems
Abstract: Continuum robots offer high flexibility and many degrees of freedom, making them ideal for navigating narrow lumens. However, accurately simulating their behavior under large deformation and frequent environmental contact remains challenging. Current methods for solving the deformation of these robots, such as model order reduction and the Gauss-Seidel (GS) method, have notable drawbacks: their computation slows down as the number of contact points grows, and they struggle to balance speed and model accuracy. To overcome these limitations, we introduce a novel finite element method (FEM) named Acc-FEM. Acc-FEM adopts a large-deformation quasi-static finite element model and integrates an accelerated solver scheme to efficiently handle multi-contact simulations. Furthermore, it leverages parallel computing on graphics processing units (GPUs) to update the finite element model in real time and
|
|
ThAT16 |
404 |
Grasping 3 |
Regular Session |
Chair: Sun, Yu | University of South Florida |
Co-Chair: Natale, Lorenzo | Istituto Italiano Di Tecnologia |
|
08:30-08:35, Paper ThAT16.1 | |
Multi-Object Grasping -- Experience Forest for Robotic Finger Movement Strategies |
|
Chen, Tianze | University of South Florida |
Sun, Yu | University of South Florida |
Keywords: Logistics, Grasping
Abstract: This paper introduces a novel Experience Forest algorithm designed for multi-object grasping (MOG). Unlike single-object grasping, in MOG the hand poses during the last few steps before the grasp completes play an important role in its success. But, similar to single-object grasping, the hand poses that are far away from the end grasping pose are not as relevant. Therefore, the proposed approach introduces the Experience Forest structure, which organizes the finger-movement sequences collected by naive MOG approaches as a set of trees instead of a single tree. The algorithm propagates the success or failure result of each trial from the end-pose node only to the nodes representing several preceding hand poses. When using the trees to generate a grasping sequence, the algorithm produces a finger-movement policy that follows a MOG synergy at the beginning, then transitions to a tree in the Experience Forest, and finally employs a breadth-first search to reach a more reliable solution. Tested on various objects using a UR5e robotic arm and Barrett hand in both simulated and real environments, the strategy significantly boosts efficiency in object transfer tasks by up to 60%, marking a 10% improvement over our previous methods.
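The credit-propagation idea, with outcomes flowing back only to the last few hand poses before the grasp, can be sketched as follows; the node attributes and the three-pose horizon are illustrative choices, not the paper's parameters.

    class PoseNode:
        """One hand pose in a grasping sequence, stored as a node of an experience tree."""
        def __init__(self, pose, parent=None):
            self.pose, self.parent = pose, parent
            self.successes, self.trials = 0, 0

        def success_rate(self):
            return self.successes / self.trials if self.trials else None

    def record_trial(end_node, success, horizon=3):
        """Propagate a trial outcome only to the last `horizon` poses before the end
        pose -- earlier poses are treated as irrelevant to the grasp outcome."""
        node, depth = end_node, 0
        while node is not None and depth < horizon:
            node.trials += 1
            node.successes += int(success)
            node, depth = node.parent, depth + 1

    # Toy sequence of four poses ending in a successful multi-object grasp:
    # only the last three poses receive credit, the earliest one does not.
    approach = PoseNode("approach")
    spread = PoseNode("spread-fingers", parent=approach)
    curl = PoseNode("curl-fingers", parent=spread)
    close = PoseNode("close-grasp", parent=curl)
    record_trial(close, success=True)
    print([n.success_rate() for n in (approach, spread, curl, close)])  # [None, 1.0, 1.0, 1.0]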
|
|
08:35-08:40, Paper ThAT16.2 | |
VMF-Contact: Uncertainty-Aware Evidential Learning for Probabilistic Contact-Grasp in Noisy Clutter |
|
Shi, Yitian | Karlsruhe Institute of Technology |
Welte, Edgar | Karlsruhe Institute of Technology (KIT) |
Gilles, Maximilian | Karlsruhe Institute of Technology |
Rayyes, Rania | Karlsruhe Institute for Technology (KIT) |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Grasping
Abstract: Grasp learning in noisy environments, such as occlusions, sensor noise, and out-of-distribution (OOD) objects, poses significant challenges. Recent learning-based approaches focus primarily on capturing aleatoric uncertainty from inherent data noise. The epistemic uncertainty, which represents the OOD recognition, is often addressed by ensembles with multiple forward paths, limiting real-time application. In this paper, we propose an uncertainty-aware approach for 6-DoF grasp detection using evidential learning to comprehensively capture both uncertainties in real-world robotic grasping. As a key contribution, we introduce vMF-Contact, a novel architecture for learning hierarchical contact grasp representations with probabilistic modeling of directional uncertainty as von Mises–Fisher (vMF) distribution. To achieve this, we analyze the theoretical formulation of the second-order objective on the posterior parametrization, providing formal guarantees for the model's ability to quantify uncertainty and improve grasp prediction performance. Moreover, we enhance feature expressiveness by applying partial point reconstructions as an auxiliary task, improving the comprehension of uncertainty quantification as well as the generalization to unseen objects. In the real-world experiments, our method demonstrates a significant improvement by 39% in the overall clearance rate compared to the baselines. The code is available under: https://github.com/YitianShi/vMF-Contact/tree/main
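To relate the vMF parametrization to a concrete uncertainty value, the sketch below estimates the concentration kappa from sampled unit directions with the standard Banerjee et al. (2005) approximation, where a small kappa indicates high directional uncertainty; this is a textbook estimator, not the paper's evidential network head.

    import numpy as np

    def vmf_concentration(unit_dirs):
        """Estimate the von Mises-Fisher concentration kappa from unit vectors using
        the Banerjee et al. (2005) approximation; small kappa means the predicted
        grasp/contact direction is highly uncertain."""
        d = unit_dirs.shape[1]
        r_bar = np.linalg.norm(unit_dirs.mean(axis=0))
        return r_bar * (d - r_bar**2) / (1.0 - r_bar**2)

    rng = np.random.default_rng(0)
    # Tightly clustered directions around +z vs. nearly isotropic directions.
    tight = rng.normal([0, 0, 1], 0.05, size=(200, 3))
    loose = rng.normal(0, 1.0, size=(200, 3))
    tight /= np.linalg.norm(tight, axis=1, keepdims=True)
    loose /= np.linalg.norm(loose, axis=1, keepdims=True)
    print("kappa (confident direction):", round(vmf_concentration(tight), 1))
    print("kappa (uncertain direction):", round(vmf_concentration(loose), 1))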
|
|
08:40-08:45, Paper ThAT16.3 | |
QuadWBG: Generalizable Quadrupedal Whole-Body Grasping |
|
Wang, Jilong | Galaxy General Robot Co., Ltd |
Rajabov, Javokhirbek | Peking University |
Xu, Chaoyi | Beijing University of Posts and Telecommunications |
Zheng, Yiming | University of Toronto |
Wang, He | Peking University |
Keywords: Mobile Manipulation, Legged Robots, Whole-Body Motion Planning and Control
Abstract: Legged robots with advanced manipulation capabilities have the potential to significantly improve household duties and urban maintenance. Despite considerable progress in developing robust locomotion and precise manipulation methods, seamlessly integrating these into cohesive whole-body control for real-world applications remains challenging. In this paper, we present a modular framework for robust and generalizable whole-body loco-manipulation controller based on a single arm-mounted camera. By using reinforcement learning (RL), we enable a robust low-level policy for command execution over 5 dimensions (5D) and a grasp-aware high-level policy guided by a novel metric, Generalized Oriented Reachability Map (GORM). The proposed system achieves state-of-the-art one-time grasping accuracy of 89% in real world, including challenging tasks such as grasping transparent objects. Through extensive simulations and real-world experiments, we demonstrate that our system can effectively manage a large workspace, from floor level to above body height, and perform diverse whole-body loco-manipulation tasks.
|
|
08:45-08:50, Paper ThAT16.4 | |
Composing Dextrous Grasping and In-Hand Manipulation Via Scoring with a Reinforcement Learning Critic |
|
Röstel, Lennart | Technical University of Munich |
Winkelbauer, Dominik | DLR |
Pitz, Johannes | Technical University of Munich |
Sievers, Leon | German Aerospace Center |
Bäuml, Berthold | Technical University of Munich |
Keywords: Deep Learning in Grasping and Manipulation, In-Hand Manipulation, Dexterous Manipulation
Abstract: In-hand manipulation and grasping are fundamental yet often separately addressed tasks in robotics. For deriving in-hand manipulation policies, reinforcement learning has recently shown great success. However, the derived controllers are not yet useful in real-world scenarios because they often require a human operator to place the objects in suitable initial (grasping) states. Finding stable grasps that also promote the desired in-hand manipulation goal is an open problem. In this work, we propose a method for bridging this gap by leveraging the critic network of a reinforcement learning agent trained for in-hand manipulation to score and select initial grasps. Our experiments show that this method significantly increases the success rate of in-hand manipulation without requiring additional training. We also present an implementation of a full grasp manipulation pipeline on a real-world system, enabling autonomous grasping and reorientation even of unwieldy objects.
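Scoring candidate grasps with a manipulation critic and keeping the best one can be sketched as below; the critic here is a hand-written stand-in rather than a trained value network, and the grasp features are hypothetical.

    import numpy as np

    def critic_value(grasp, manipulation_goal):
        """Stand-in for the RL critic Q(s, a): in the paper this is the trained
        in-hand manipulation value network evaluated at the candidate grasp state.
        Here we simply favor grasps whose approach yaw is aligned with the goal."""
        return float(grasp["quality"] - 0.5 * abs(grasp["yaw"] - manipulation_goal["yaw"]))

    def select_initial_grasp(candidates, manipulation_goal):
        """Score every stable grasp candidate with the critic and keep the best."""
        scores = [critic_value(g, manipulation_goal) for g in candidates]
        return candidates[int(np.argmax(scores))], max(scores)

    candidates = [{"quality": 0.9, "yaw": 1.2},
                  {"quality": 0.7, "yaw": 0.1},
                  {"quality": 0.75, "yaw": 0.3}]
    goal = {"yaw": 0.0}   # reorientation target
    best, score = select_initial_grasp(candidates, goal)
    print(best, round(score, 2))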
|
|
08:50-08:55, Paper ThAT16.5 | |
Bring Your Own Grasp Generator: Leveraging Robot Grasp Generation for Prosthetic Grasping |
|
Stracquadanio, Giuseppe | Italian Institute of Technology |
Vasile, Federico | Istituto Italiano Di Tecnologia |
Maiettini, Elisa | Humanoid Sensing and Perception, Istituto Italiano Di Tecnologia |
Boccardo, Nicolò | IIT - Istituto Italiano Di Tecnologia |
Natale, Lorenzo | Istituto Italiano Di Tecnologia |
Keywords: Deep Learning in Grasping and Manipulation, Sensor Fusion, Prosthetics and Exoskeletons
Abstract: One of the most important research challenges in upper-limb prosthetics is enhancing the user-prosthesis communication to closely resemble the experience of a natural limb. As prosthetic devices become more complex, users often struggle to control the additional degrees of freedom. In this context, leveraging shared-autonomy principles can significantly improve the usability of these systems. In this paper, we present a novel eye-in-hand prosthetic grasping system that follows these principles. Our system initiates the approach-to-grasp action based on user's command and automatically configures the DoFs of a prosthetic hand. First, it reconstructs the 3D geometry of the target object without the need of a depth camera. Then, it tracks the hand motion during the approach-to-grasp action and finally selects a candidate grasp configuration according to user's intentions. We deploy our system on the Hannes prosthetic hand and test it on able-bodied subjects and amputees to validate its effectiveness. We compare it with a multi-DoF prosthetic control baseline and find that our method enables faster grasps, while simplifying the user experience. Code and demo videos are available online at this https URL.
|
|
ThAT17 |
405 |
Localization 5 |
Regular Session |
Chair: Lu, Guoyu | University of Georgia |
Co-Chair: Jiao, Jianhao | University College London |
|
08:30-08:35, Paper ThAT17.1 | |
AIR-HLoc: Adaptive Retrieved Images Selection for Efficient Visual Localisation |
|
Liu, Changkun | The Hong Kong University of Science and Technology |
Jiao, Jianhao | University College London |
Huang, Huajian | The Hong Kong University of Science and Technology |
Ma, Zhengyang | The Hong Kong University of Science and Technology |
Kanoulas, Dimitrios | University College London |
Braud, Tristan | HKUST |
Keywords: Localization, SLAM, Visual Learning
Abstract: State-of-the-art hierarchical localisation pipelines (HLoc) employ image retrieval (IR) to establish 2D-3D correspondences by selecting the top-k most similar images from a reference database. While increasing k improves localisation robustness, it also linearly increases computational cost and runtime, creating a significant bottleneck. This paper investigates the relationship between global and local descriptors, showing that greater similarity between the global descriptors of query and database images increases the proportion of feature matches. Low similarity queries significantly benefit from increasing k, while high similarity queries rapidly experience diminishing returns. Building on these observations, we propose an adaptive strategy that adjusts k based on the similarity between the query's global descriptor and those in the database, effectively mitigating the feature-matching bottleneck. Our approach reduces computational costs and processing time without sacrificing accuracy. Experiments on three indoor and outdoor datasets show that AIR-HLoc reduces feature matching time by up to 30% while preserving state-of-the-art accuracy. The results demonstrate that AIR-HLoc facilitates a latency-sensitive localisation system.
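The adaptive choice of k from global-descriptor similarity could be implemented as a simple thresholded rule, sketched below with illustrative thresholds and random descriptors rather than the paper's tuned values.

    import numpy as np

    def adaptive_top_k(query_desc, db_descs, k_small=5, k_large=20, thresholds=(0.7, 0.5)):
        """Retrieve fewer database images for 'easy' queries (high global-descriptor
        similarity) and more for 'hard' ones, in the spirit of the adaptive strategy
        described above; the threshold values here are illustrative."""
        q = query_desc / np.linalg.norm(query_desc)
        db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
        sims = db @ q                                  # cosine similarity to every DB image
        best = float(sims.max())
        if best >= thresholds[0]:
            k = k_small                                # confident query: few images suffice
        elif best >= thresholds[1]:
            k = (k_small + k_large) // 2
        else:
            k = k_large                                # low similarity: retrieve more images
        top_idx = np.argsort(-sims)[:k]
        return top_idx, k

    rng = np.random.default_rng(0)
    db = rng.normal(size=(100, 128))
    query = db[3] + 0.05 * rng.normal(size=128)        # near-duplicate of DB image 3
    idx, k = adaptive_top_k(query, db)
    print("retrieved", k, "images; best match index:", idx[0])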
|
|
08:35-08:40, Paper ThAT17.2 | |
NeuraLoc: Visual Localization in Neural Implicit Map with Dual Complementary Features |
|
Zhai, Hongjia | Zhejiang University |
Boming, Zhao | Zhejiang University |
Li, Hai | Zhejiang University |
Pan, Xiaokun | Zhejiang University |
He, Yijia | TCL RayNeo |
Cui, Zhaopeng | Zhejiang University |
Bao, Hujun | Zhejiang University |
Zhang, Guofeng | Zhejiang University |
Keywords: Localization, Mapping, RGB-D Perception
Abstract: Recently, neural radiance fields (NeRF) have gained significant attention in the field of visual localization. However, existing NeRF-based approaches either lack geometric constraints or require extensive storage for feature matching, limiting their practical applications. To address these challenges, we propose an efficient and novel visual localization approach based on the neural implicit map with complementary features. Specifically, to enforce geometric constraints and reduce storage requirements, we implicitly learn a 3D keypoint descriptor field, avoiding the need to explicitly store point-wise features. To further address the semantic ambiguity of descriptors, we introduce additional semantic contextual feature fields, which enhance the quality and reliability of 2D-3D correspondences. Besides, we propose descriptor similarity distribution alignment to minimize the domain gap between 2D and 3D feature spaces during matching. Finally, we construct the matching graph using both complementary descriptors and contextual features to establish accurate 2D-3D correspondences for 6-DoF pose estimation. Compared with the recent NeRF-based approach, our method achieves a 3x faster training speed and a 45x reduction in model storage. Extensive experiments on two widely used datasets demonstrate that our approach outperforms or is highly competitive with other state-of-the-art NeRF-based visual localization methods.
|
|
08:40-08:45, Paper ThAT17.3 | |
LiftFeat: 3D Geometry-Aware Local Feature Matching |
|
Liu, Yepeng | Wuhan University |
Lai, Wenpeng | SFMAP Technology |
Zhao, Zhou | Central China Normal University |
Xiong, Yuxuan | Wuhan University |
Zhu, Jinchi | Wuhan University |
Cheng, Jun | Institute for Infocomm Research, A*STAR |
Xu, Yongchao | Wuhan University |
Keywords: Deep Learning for Visual Perception, Visual Learning, Localization
Abstract: Robust and efficient local feature matching plays a crucial role in applications such as SLAM and visual localization for robotics. Despite great progress, it is still very challenging to extract robust and discriminative visual features in scenarios with drastic lighting changes, low texture areas, or repetitive patterns. In this paper, we propose a new lightweight network called LiftFeat, which lifts the robustness of raw descriptor by aggregating 3D geometric feature. Specifically, we first adopt a pre-trained monocular depth estimation model to generate pseudo surface normal label, supervising the extraction of 3D geometric feature in terms of predicted surface normal. We then design a 3D geometry-aware feature lifting module to fuse surface normal feature with raw 2D descriptor feature. Integrating such 3D geometric feature enhances the discriminative ability of 2D feature description in extreme conditions. Extensive experimental results on relative pose estimation, homography estimation, and visual localization tasks, demonstrate that our LiftFeat outperforms some lightweight state-of-the-art methods. Code will be released at : https://github.com/lyp-deeplearning/LiftFeat.
|
|
08:45-08:50, Paper ThAT17.4 | |
DVS-Aware Visual Perception for Mobile Robots with Neuromorphic Hardware |
|
Zhong, Hanzhong | Tsinghua University |
Jin, YingJie | Lenovo Research |
Li, Guangbin | Lenovo Research |
Li, Xiang | Tsinghua University |
Wang, Zhepeng | Lenovo Research |
Keywords: Neurorobotics, Deep Learning for Visual Perception, Sensor-based Control
Abstract: The Dynamic Vision Sensor (DVS) is a distinctive visual sensor that exclusively responds to alterations in pixel brightness, enabling the real-time capture of swift and subtle movements with reduced power consumption and data bandwidth requirements. This paper proposes a DVS-aware visual perception method and presents its application to pose estimation of mobile robots. Specifically, a new marker is designed to provide pose reference data that leverages the inherent advantages of DVS more effectively. Moreover, we formulate a pose recognition system incorporating DVS, an algorithm based on Spiking Convolutional Neural Networks (SCNN), and a neuromorphic computing accelerator (Lynxi HS110). Such a formulation fully exploits the DVS's advantages, as its event-triggered nature matches that of SCNNs, while the neuromorphic hardware enables efficient, low-power execution, making the system highly suitable for real-time embedded applications. Comparative analysis with traditional ARcode-based pose recognition methods reveals that our innovative approach demonstrates significant advantages in recognition speed and energy efficiency. The whole system is deployed on mobile robots and evaluated in real-world scenarios.
|
|
08:50-08:55, Paper ThAT17.5 | |
Feedback RoI Features Improve Aerial Object Detection |
|
Ren, Botao | Tsinghua University |
Xu, Botian | Tsinghua University |
Wang, Jingyi | Tsinghua University |
Gao, Hanwei | SAIC AILab |
Yu, Qiankun | Tsinghua University |
Deng, Zhidong | Tsinghua University |
Keywords: Object Detection, Segmentation and Categorization, Aerial Systems: Applications, Aerial Systems: Perception and Autonomy
Abstract: Research in visual perception has shown that the human visual system utilizes high-level feedback information to guide lower-level processing, enabling adaptation to signals of varying characteristics. Inspired by this, we propose the Feedback multi-Level feature EXtractor (Flex) to dynamically adjust feature selection in object detection based on image-wise and instance-level feedback information. This is particularly beneficial for applications such as aerial object detection, UAV-based target recognition and autonomous vehicle navigation, where global image quality issues like sensor degradation, foggy, or rainy conditions can impact detection performance. Flex adapts to variations in image quality, refining the feature extraction process to improve robustness against these challenges. Experimental results demonstrate that Flex consistently enhances a range of state-of-the-art methods on challenging aerial object detection datasets, including DOTA-v1.0, DOTA-v1.5, and HRSC2016. Furthermore, additional experiments on MS COCO confirm the module's effectiveness in general object detection tasks. Our quantitative and qualitative analyses reveal that the improvements are strongly correlated with image quality, aligning with our original motivation to address global image quality issues in real-world scenarios.
|
|
08:55-09:00, Paper ThAT17.6 | |
Keypoint Detection and Description for Raw Bayer Images |
|
Lin, Jiakai | University of Georgia |
Zhang, Jinchang | University of Georgia |
Lu, Guoyu | University of Georgia |
Keywords: Visual Tracking, Vision-Based Navigation, Visual Learning
Abstract: Keypoint detection and local feature description are fundamental tasks in robotic perception, critical for applications such as SLAM, robot localization, feature matching, pose estimation, and 3D mapping. While existing methods predominantly operate on RGB images, we propose a novel network that directly processes raw images, bypassing the need for the Image Signal Processor (ISP). This approach significantly reduces hardware requirements and memory consumption, which is crucial for robotic vision systems. Our method introduces two custom-designed convolutional kernels capable of performing convolutions directly on raw images, preserving inter-channel information without converting to RGB. Experimental results show that our network outperforms existing algorithms on raw images, achieving higher accuracy and stability under large rotations and scale variations. This work represents the first attempt to develop a keypoint detection and feature description network specifically for raw images, offering a more efficient solution for resource-constrained environments.
|
|
ThAT18 |
406 |
Planning under Uncertainty 1 |
Regular Session |
Chair: Tariq, Faizan M. | Honda Research Institute USA, Inc |
Co-Chair: Kennedy, Monroe | Stanford University |
|
08:30-08:35, Paper ThAT18.1 | |
Delayed-Decision Motion Planning in the Presence of Multiple Predictions |
|
Isele, David | University of Pennsylvania, Honda Research Institute USA |
Anon, Alexandre Miranda | Honda Research Institute USA |
Tariq, Faizan M. | Honda Research Institute USA, Inc |
Yeh, Zheng-Hang | Honda Research Institute |
Singh, Avinash | Honda Research Institute, USA |
Bae, Sangjae | Honda Research Institute, USA |
Keywords: Motion and Path Planning, Planning under Uncertainty, Autonomous Agents
Abstract: Reliable automated driving technology is challenged by various sources of uncertainties, in particular, behavioral uncertainties of traffic agents. It is not uncommon for traffic agents to hold multiple possible intentions, each followed by a distinguishable maneuver, and the automated driving car must account for this uncertainty. This paper formalizes a behavior planning scheme in the presence of multiple possible futures with corresponding probabilities. In essence, we present a maximum entropy formulation and show how, under certain assumptions, this allows delayed decision-making to improve safety. The general formulation is then turned into a model predictive control formulation, which is solved as a quadratic program or a set of quadratic programs. We discuss implementation details for improving computation and present validation results in simulation and on a mobile robot.
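One hedged reading of the maximum entropy treatment of multiple predicted futures is an entropy-regularized (soft-minimum) aggregation of per-branch plan costs, sketched below; this is an illustrative surrogate for intuition, not the paper's exact MPC objective.

    import numpy as np

    def soft_min_cost(branch_costs, branch_probs, beta=5.0):
        """Entropy-regularized aggregation of per-prediction plan costs:
        beta -> 0 recovers the expected cost, beta -> inf the best-case cost.
        A moderate beta keeps several futures 'alive', which is what allows
        the planner to delay committing to a single predicted intention."""
        c = np.asarray(branch_costs, float)
        p = np.asarray(branch_probs, float)
        return -np.log(np.sum(p * np.exp(-beta * c))) / beta

    # Two predicted futures for another driver: yields (cheap plan) or cuts in (costly plan).
    costs, probs = [1.0, 4.0], [0.6, 0.4]
    for beta in (0.1, 5.0, 50.0):
        print(f"beta={beta:>4}: aggregated cost = {soft_min_cost(costs, probs, beta):.3f}")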
|
|
08:35-08:40, Paper ThAT18.2 | |
Stochastic Trajectory Prediction under Unstructured Constraints |
|
Ma, Hao | Institute of Automation, Chinese Academy of Sciences |
Pu, Zhiqiang | University of Chinese Academy of Sciences; Institute of Automati |
Wang, Shijie | Institute of Automation, Chinese Academy of Sciences |
Liu, Boyin | University of Chinese Academy of Sciences School of Artificial I |
Wang, Huimu | University of Chinese Academy of Sciences |
Liang, Yanyan | Macau University of Science and Technology |
Yi, Jianqiang | Chinese Academy of Sciences |
Keywords: Constrained Motion Planning, Motion and Path Planning, Task and Motion Planning
Abstract: Trajectory prediction facilitates effective planning and decision-making, while constrained trajectory prediction integrates regulation into prediction. Recent advances in constrained trajectory prediction focus on structured constraints by constructing optimization objectives. However, handling unstructured constraints is challenging due to the lack of differentiable formal definitions. To address this, we propose a novel method for constrained trajectory prediction using a conditional generative paradigm, named Controllable Trajectory Diffusion (CTD). The key idea is that any trajectory corresponds to a degree of conformity to a constraint. By quantifying this degree and treating it as a condition, a model can implicitly learn to predict trajectories under unstructured constraints. CTD employs a pre-trained scoring model to predict the degree of conformity (i.e., a score), and uses this score as a condition for a conditional diffusion model to generate trajectories. Experimental results demonstrate that CTD achieves high accuracy on the ETH/UCY and SDD benchmarks. Qualitative analysis confirms that CTD ensures adherence to unstructured constraints and can predict trajectories that satisfy combinatorial constraints.
|
|
08:40-08:45, Paper ThAT18.3 | |
A Control Barrier Function for Safe Navigation with Online Gaussian Splatting Maps |
|
Chen, Timothy | Stanford University |
Swann, Aiden | Stanford |
Yu, Javier | Stanford University |
Shorinwa, Ola | Stanford University |
Murai, Riku | Imperial College London |
Kennedy, Monroe | Stanford University |
Schwager, Mac | Stanford University |
Keywords: Collision Avoidance, Robot Safety, Mapping
Abstract: SAFER-Splat (Simultaneous Action Filtering and Environment Reconstruction) is a real-time, scalable, and minimally invasive action filter, based on control barrier functions, for safe robotic navigation in a detailed map constructed at runtime using Gaussian Splatting (GSplat). We propose a novel Control Barrier Function (CBF) that not only induces safety with respect to all Gaussian primitives in the scene, but when synthesized into a controller, is capable of processing hundreds of thousands of Gaussians while maintaining a minimal memory footprint and operating at 15 Hz during online Splat training. Of the total compute time, a small fraction of it consumes GPU resources, enabling uninterrupted training. The safety layer is minimally invasive, correcting robot actions only when they are unsafe. To showcase the safety filter, we also introduce SplatBridge, an open-source software package built with ROS for real-time GSplat mapping for robots. We demonstrate the safety and robustness of our pipeline first in simulation, where our method is 20-50x faster, safer, and less conservative than competing methods based on neural radiance fields. Further, we demonstrate simultaneous GSplat mapping and safety filtering on a drone hardware platform using only on-board perception. We verify that under teleoperation a human pilot cannot invoke a collision. Our videos and codebase can be found at https://chengine.github.io/safer-splat.
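A minimal single-integrator version of such a CBF action filter, assuming spherical obstacle primitives in place of Gaussians and using scipy's SLSQP as a generic stand-in for a QP solver: the desired command is modified as little as possible so that the barrier condition holds for every obstacle.

    import numpy as np
    from scipy.optimize import minimize

    def cbf_filter(x, u_des, centers, radii, alpha=1.0):
        """Minimally invasive safety filter for single-integrator dynamics:
        solve  min ||u - u_des||^2  s.t.  2(x - c_i)^T u + alpha * h_i(x) >= 0,
        where h_i(x) = ||x - c_i||^2 - r_i^2 is a barrier per obstacle primitive
        (a Gaussian in the GSplat map, approximated here by a sphere)."""
        cons = []
        for c, r in zip(centers, radii):
            h = np.dot(x - c, x - c) - r**2
            grad = 2.0 * (x - c)
            cons.append({"type": "ineq",
                         "fun": lambda u, g=grad, h=h: g @ u + alpha * h})
        res = minimize(lambda u: np.sum((u - u_des)**2), u_des, constraints=cons)
        return res.x

    x = np.array([0.0, 0.0, 1.0])                 # robot position
    u_des = np.array([1.0, 0.0, 0.0])             # pilot command: fly straight at obstacle
    centers = [np.array([1.5, 0.0, 1.0])]         # one spherical obstacle ahead
    radii = [0.5]
    print("filtered command:", np.round(cbf_filter(x, u_des, centers, radii), 3))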
|
|
08:45-08:50, Paper ThAT18.4 | |
A Skeleton-Based Topological Planner for Exploration in Complex Unknown Environments |
|
Niu, Haochen | Shanghai Jiao Tong University |
Ji, Xingwu | Shanghai Jiao Tong University |
Zhang, Lantao | Shanghai Jiao Tong University |
Wen, Fei | Shanghai Jiao Tong University |
Ying, Rendong | Shanghai Jiao Tong University |
Liu, Peilin | Shanghai Jiao Tong University |
Keywords: Motion and Path Planning, Reactive and Sensor-Based Planning
Abstract: The capability of autonomous exploration in complex, unknown environments is important in many robotic applications. While recent research on autonomous exploration has achieved much progress, there are still limitations, e.g., existing methods relying on greedy heuristics or optimal path planning are often hindered by repetitive paths and high computational demands. To address such limitations, we propose a novel exploration framework that utilizes the global topology information of the observed environment to improve exploration efficiency while reducing computational overhead. Specifically, global information is utilized based on a skeletal topological graph representation of the environment geometry. We first propose an incremental skeleton extraction method based on wavefront propagation, based on which we then design an approach to generate a lightweight topological graph that can effectively capture the environment's structural characteristics. Building upon this, we introduce a finite state machine that leverages the topological structure to efficiently plan coverage paths, which can substantially mitigate the back-and-forth maneuvers (BFMs) problem. Experimental results demonstrate the superiority of our method in comparison with state-of-the-art methods. The source code will be made publicly available at: https://github.com/Haochen-Niu/STGPlanner.
|
|
08:50-08:55, Paper ThAT18.5 | |
Safety-Critical Online Quadrotor Trajectory Planner for Agile Flights in Unknown Environments |
|
Yuan, Jiazhe | Zhejiang University |
Cao, Dongcheng | Zhejiang University |
Mei, Jiahao | Zhejiang University of Technology |
Chen, Jiming | Zhejiang University |
Li, Shuo | Zhejiang University |
Keywords: Motion and Path Planning, Collision Avoidance, Aerial Systems: Mechanics and Control
Abstract: Autonomous high-speed flight in unknown, cluttered environments is essential for a variety of quadrotor applications, such as inspection, search, and rescue. In this study, we propose a novel trajectory planner designed to achieve efficient, high-speed, collision-free flights in such environments. The proposed approach begins by generating a safe flight corridor based on the path found by Lazy Theta*, representing the safe regions with polytopic sets. These sets are then used to define a discrete-time control barrier function (DCBF), ensuring the quadrotor stays within safe bounds during flight. By selecting a single point on the path ahead of the quadrotor as the next waypoint, the trajectory is optimized by considering both the total flight time and safety constraints. Extensive simulations and real-world experiments have confirmed our method's feasibility, demonstrating its capability for high-speed performance and reliable obstacle avoidance.
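The generic discrete-time CBF condition referenced above is h(x_{k+1}) >= (1 - gamma) h(x_k) with gamma in (0, 1]. A minimal feasibility check for a polytopic safe region follows (hypothetical names and numbers; not the authors' corridor generator or optimizer):

    import numpy as np

    def polytope_barrier(p, A, b):
        """h(p) = min_i (b_i - a_i^T p), positive inside the polytope {A p <= b}."""
        return float(np.min(b - A @ p))

    def dcbf_satisfied(p_k, p_next, A, b, gamma=0.3):
        """Discrete-time CBF condition: h(x_{k+1}) >= (1 - gamma) * h(x_k)."""
        return polytope_barrier(p_next, A, b) >= (1.0 - gamma) * polytope_barrier(p_k, A, b)

    A = np.vstack([np.eye(3), -np.eye(3)])   # unit box as the safe polytope
    b = np.ones(6)
    print(dcbf_satisfied(np.zeros(3), np.array([0.2, 0.0, 0.0]), A, b))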
|
|
08:55-09:00, Paper ThAT18.6 | |
Anytime Replanning of Robot Coverage Paths for Partially Unknown Environments |
|
Ramesh, Megnath | University of Waterloo |
Imeson, Frank | Avidbots |
Fidan, Baris | University of Waterloo |
Smith, Stephen L. | University of Waterloo |
Keywords: Coverage Path Planning, Motion and Path Planning, Reactive and Sensor-Based Planning, Service Robots
Abstract: In this paper, we propose a method to replan coverage paths for a robot operating in an environment with initially unknown static obstacles. Existing coverage approaches reduce coverage time by covering along the minimum number of coverage lines (straight-line paths). However, recomputing such paths online can be computationally expensive, resulting in robot stoppages that increase coverage time. A naive alternative is greedy detour replanning, i.e., replanning with minimum deviation from the initial path, which is efficient to compute but may result in unnecessary detours. In this work, we propose an anytime coverage replanning approach named OARP-Replan that performs near-optimal replanning of an interrupted coverage path within a given time budget. We do this by solving linear relaxations of integer linear programs (ILPs) to identify sections of the interrupted path that can be optimally replanned within the time budget. We validate OARP-Replan in simulation and perform comparisons against a greedy detour replanner and other state-of-the-art coverage planners. We also demonstrate OARP-Replan in experiments using an industrial-level autonomous robot.
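To illustrate the general idea of relaxing an ILP (the cost vector and constraints below are invented and are not OARP-Replan's formulation), one can drop the integrality requirement, solve the resulting LP, and use its value as a fast lower bound when deciding what to replan:

    import numpy as np
    from scipy.optimize import linprog

    # toy set-cover-style ILP: minimize c^T x  subject to  A x >= 1,  x in {0, 1}
    c = np.array([3.0, 2.0, 4.0])
    A = np.array([[1, 1, 0],
                  [0, 1, 1],
                  [1, 0, 1]], dtype=float)

    # LP relaxation: replace x in {0, 1} with 0 <= x <= 1 (linprog expects A_ub x <= b_ub)
    res = linprog(c, A_ub=-A, b_ub=-np.ones(3), bounds=[(0, 1)] * 3, method="highs")
    print("LP lower bound:", res.fun, "fractional solution:", res.x)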
|
|
ThAT19 |
407 |
Tactile Sensing 3 |
Regular Session |
Chair: She, Yu | Purdue University |
Co-Chair: Hipwell, M Cynthia | Texas A&M University |
|
08:30-08:35, Paper ThAT19.1 | |
LeTac-MPC: Learning Model Predictive Control for Tactile-Reactive Grasping |
|
Xu, Zhengtong | Purdue University |
She, Yu | Purdue University |
Keywords: Force and Tactile Sensing, Grasping, Perception for Grasping and Manipulation, Sensor-based Control
Abstract: Grasping is a crucial task in robotics, necessitating tactile feedback and reactive grasping adjustments for robust grasping of objects under various conditions and with differing physical properties. In this article, we introduce LeTac-MPC, a learning-based model predictive control (MPC) for tactile-reactive grasping. Our approach enables the gripper to grasp objects with different physical properties in dynamic and force-interactive tasks. We utilize a vision-based tactile sensor, GelSight (Yuan et al. 2017), which is capable of perceiving high-resolution tactile feedback that contains information on the physical properties and states of the grasped object. LeTac-MPC incorporates a differentiable MPC layer designed to model the embeddings extracted by a neural network from tactile feedback. This design facilitates convergent and robust grasping control at a frequency of 25 Hz. We propose a fully automated data collection pipeline and collect a dataset using only standardized blocks with different physical properties. Nevertheless, our trained controller generalizes to daily objects with different sizes, shapes, materials, and textures. The experimental results demonstrate the effectiveness and robustness of the proposed approach. We compare LeTac-MPC with two purely model-based tactile-reactive controllers (MPC and PD) and open-loop grasping. Our results show that LeTac-MPC achieves the best performance in dynamic and force-interactive tasks as well as the best generalizability.
|
|
08:35-08:40, Paper ThAT19.2 | |
The Role of Tactile Sensing for Learning Reach and Grasp |
|
Zhang, Boya | University of Tübingen |
Andrussow, Iris | Max-Planck-Institute for Intelligent Systems |
Zell, Andreas | University of Tübingen |
Martius, Georg | Max Planck Institute for Intelligent Systems |
Keywords: Reinforcement Learning, Force and Tactile Sensing, Deep Learning in Grasping and Manipulation
Abstract: Stable and robust robotic grasping is essential for current and future robot applications. In recent works, the use of large datasets and supervised learning has enhanced speed and precision in antipodal grasping. However, these methods struggle with perception and calibration errors due to large planning horizons. To obtain more robust and reactive grasping motions, leveraging reinforcement learning combined with tactile sensing is a promising direction. Yet, there is no systematic evaluation of how the complexity of force-based tactile sensing affects the learning behavior for grasping tasks. This paper compares various tactile and environmental setups using two model-free reinforcement learning approaches for antipodal grasping. Our findings suggest that under imperfect visual perception, various tactile features improve learning outcomes, while complex tactile inputs complicate training.
|
|
08:40-08:45, Paper ThAT19.3 | |
Task-Specific Embodied Tactile Sensing for Dexterous Hand |
|
Wei, Qi | Nanchang University |
Xiong, Pengwen | Nanchang University |
Song, Aiguo | Southeast University |
Li, Qiang | Shenzhen Technology University |
Keywords: Haptics and Haptic Interfaces, Embodied Cognitive Science, Behavior-Based Systems
Abstract: To obtain good tactile sensing, traditional dexterous hands keep all of their installed sensing units enabled all the time, even when only a few sensor units are actually used, which makes the tactile sensing system wasteful of resources and energy. To reduce this complexity by placing tactile sensing units only at critical locations, this work proposes an embodied tactile dexterous hand (ET-Hand) and a novel multimodal sensor placement framework that learns multiple tasks to generate optimal placement proposals. Furthermore, our ET-Hand can dynamically adjust the positions, types, and numbers of the perceived tactile sensors during robotic manipulation, providing novel tools and methods for investigating the tactile channels and placement scale required for robot exploration. In object recognition and slip detection tasks, the results show that our proposed method performs close to, or even better than, the traditional large-scale placement approach.
|
|
08:45-08:50, Paper ThAT19.4 | |
TacDiffusion: Force-Domain Diffusion Policy for Precise Tactile Manipulation |
|
Wu, Yansong | Technische Universität München |
Chen, Zongxie | Technical University of Munich |
Wu, Fan | Technical University of Munich |
Chen, Lingyun | Technical University of Munich |
Zhang, Liding | Technical University of Munich |
Bing, Zhenshan | Technical University of Munich |
Swikir, Abdalla | Mohamed Bin Zayed University of Artificial Intelligence |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: Dexterous Manipulation, Assembly, Learning from Demonstration
Abstract: Assembly is a crucial skill for robots in both modern manufacturing and service robotics. However, mastering transferable insertion skills that can handle a variety of high-precision assembly tasks remains a significant challenge. This paper presents a novel framework that utilizes diffusion models to generate 6D wrenches for high-precision tactile robotic insertion tasks. It learns from demonstrations performed on a single task and achieves a zero-shot transfer success rate of 95.7% across various novel high-precision tasks. Our method effectively inherits the self-adaptability demonstrated by our previous work. In this framework, we address the frequency misalignment between the diffusion policy and the real-time control loop with a dynamic system-based filter, significantly improving the task success rate by 9.15%. Furthermore, we provide a practical guideline regarding the trade-off between diffusion models' inference ability and speed.
|
|
08:50-08:55, Paper ThAT19.5 | |
UpViTaL: Unpaired Visual-Tactile Self-Supervised Representation Learning for Dexterous Robotic Manipulation |
|
Han, Guwen | Zhejiang University |
Liu, Qingtao | Zhejiang University |
Cui, Yu | Zhejiang University |
Chen, Anjun | Zhejiang University |
Chen, Jiming | Zhejiang University |
Ye, Qi | Zhejiang University |
Keywords: Dexterous Manipulation, Representation Learning, Reinforcement Learning
Abstract: Visual and tactile pretraining have been extensively studied in dexterous robot manipulation tasks. However, existing methods typically require the simultaneous acquisition of visual and tactile data, making it difficult to utilize low-cost, unpaired visual-tactile datasets. Moreover, these methods often rely on tactile sensors to provide input data for reinforcement learning (RL) during the physical deployment of robotic dexterous hands, which greatly increases deployment costs. To address these challenges, we propose UpViTaL, an unpaired visual-tactile self-supervised representation learning method for RL-based robot dexterous manipulation. Specifically, we collect low-cost unpaired visual and tactile datasets for manipulation skill learning using a camera and tactile gloves on three robot manipulation tasks. The temporal tactile self-supervised representation learning module of UpViTaL is used to explore efficient tactile representations from time-series tactile data. In parallel, the visual pretraining module of UpViTaL helps to extract efficient visual representations from visual data. In addition, we fuse unpaired visual-tactile representations through an RL reward mechanism, which does not require tactile sensors on the robotic dexterous hand for practical deployment. We validate our approach on three dexterous robot manipulation tasks. Experimental results demonstrate that UpViTaL can efficiently learn robot manipulation skills. Compared to existing approaches for visual pretraining, our method significantly improves the success rate by more than 30%.
|
|
ThAT20 |
408 |
Acceptability and Trust |
Regular Session |
Chair: de Graaf, Maartje | Utrecht University |
Co-Chair: Doshi, Prashant | University of Georgia |
|
08:30-08:35, Paper ThAT20.1 | |
Trust-Preserved Human-Robot Shared Autonomy Enabled by Bayesian Relational Event Modeling |
|
Li, Yingke | Massachusetts Institute of Technology |
Zhang, Fumin | Hong Kong University of Science and Technology |
Keywords: Acceptability and Trust, Human-Robot Teaming, Probability and Statistical Methods
Abstract: Shared autonomy functions as a flexible framework that empowers robots to operate across a spectrum of autonomy levels, allowing for efficient task execution with minimal human oversight. However, humans might be intimidated by the autonomous decision-making capabilities of robots due to perceived risks and a lack of trust. This paper proposes a trust-preserved shared autonomy strategy that allows robots to seamlessly adjust their autonomy level, striving to optimize team performance and enhance their acceptance among human collaborators. By enhancing the relational event modeling framework with Bayesian learning techniques, this paper enables dynamic inference of human trust based solely on time-stamped relational events communicated within human-robot teams. Adopting a longitudinal perspective on trust development and calibration in human-robot teams, the proposed trust-preserved shared autonomy strategy enables robots to actively establish, maintain, and repair human trust, rather than merely passively adapting to it. We validate the effectiveness of the proposed approach through a user study on a human-robot collaborative search and rescue scenario. The objective and subjective evaluations demonstrate its merits in both task execution and user acceptability over the baseline approach that does not consider the preservation of trust.
|
|
08:35-08:40, Paper ThAT20.2 | |
Fostering Trust through Gesture and Voice-Controlled Robot Trajectories in Industrial Human-Robot Collaboration |
|
Campagna, Giulio | Aalborg University |
Frommel, Christoph | German Aerospace Center |
Haase, Tobias | German Aerospace Center (DLR) |
Gottardi, Alberto | University of Padova |
Villagrossi, Enrico | Italian National Research Council |
Chrysostomou, Dimitrios | Aalborg University |
Rehm, Matthias | Aalborg University |
Keywords: Human Factors and Human-in-the-Loop, Acceptability and Trust, Human-Robot Collaboration
Abstract: In the Industry 5.0 era, the focus shifts from basic automation to fostering collaboration between humans and robots. Trust is crucial in this new paradigm, enabling smooth interaction, especially for users with limited robotics knowledge. This study presents a novel framework that uses human hand gestures and voice commands to control robot movements, aiming to enhance trust, reduce cognitive workload, and minimize task execution time—key for efficient manufacturing. In automated systems, swift completion of micromanagement tasks is essential to prevent process disruption. To evaluate this framework, we devised a testbed scenario within an automated carbon fiber transportation and draping process, focusing on a maintenance task as the micromanagement challenge. Participants inspected the gripper, guided the robot along a defined path, and performed maintenance, such as attaching cables. Two conditions were tested: gestures and voice commands versus a smartPAD. The results showed that gestures and voice commands increased trust, lowered cognitive load, and shortened execution times, improving overall manufacturing efficiency.
|
|
08:40-08:45, Paper ThAT20.3 | |
Would You Trust Me Now? a Study on Trust Repair Strategies in Human-Robot Collaboration |
|
Mélot-Chesnel, Joséphine | Utrecht University |
de Graaf, Maartje | Utrecht University |
Keywords: Acceptability and Trust, Design and Human Factors, Human-Robot Collaboration
Abstract: As robots are prone to making errors that undermine trust, effective trust repair strategies are essential for human-robot collaboration. Our lab study evaluates three trust repair strategies --apology, denial, and compensation-- following two types of trust violations: competence-based and integrity-based. Consistent with prior research, integrity-based violations reduced moral trust more, while competence-based violations impacted performance trust. Denial caused greater discomfort than apology or compensation across both violation types. Dispositional trust influenced the effectiveness of repair strategies, particularly in willingness to engage and re-engage. Notably, individuals with high dispositional trust were more receptive to apologies. These findings underscore the need to consider individual trust differences, suggesting robots should assess human trust disposition to effectively foster continued collaboration.
|
|
08:45-08:50, Paper ThAT20.4 | |
Using Physiological Measures, Gaze, and Facial Expressions to Model Human Trust in a Robot Partner |
|
Green, Haley N. | University of Virginia |
Iqbal, Tariq | University of Virginia |
Keywords: Acceptability and Trust
Abstract: With robots becoming increasingly prevalent in various domains, it has become crucial to equip them with tools to achieve greater fluency in interactions with humans. One of the promising areas for further exploration lies in human trust. A real-time, objective model of human trust could be used to maximize productivity, preserve safety, and mitigate failure. In this work, we attempt to use physiological measures, gaze, and facial expressions to model human trust in a robot partner. We are the first to design an in-person, human-robot supervisory interaction study to create a dedicated trust dataset. Using this dataset, we train machine learning algorithms to identify the objective measures that are most indicative of trust in a robot partner, advancing trust prediction in human-robot interactions. Our findings indicate that a combination of sensor modalities (blood volume pulse, electrodermal activity, skin temperature, and gaze) can enhance the accuracy of detecting human trust in a robot partner. Furthermore, the Extra Trees, Random Forest, and Decision Trees classifiers exhibit consistently better performance in measuring the person's trust in the robot partner. These results lay the groundwork for constructing a real-time trust model for human-robot interaction, which could foster more efficient interactions between humans and robots.
|
|
08:50-08:55, Paper ThAT20.5 | |
A Novel Computational Framework of Robot Trust for Human-Robot Teams |
|
Nare, Bhavana | University of Georgia |
Frericks, John Bradley | University of Georgia |
Challa, Anusha | University of Georgia |
Doshi, Prashant | University of Georgia |
Johnsen, Kyle | University of Georgia |
Keywords: Acceptability and Trust, Human-Robot Teaming
Abstract: When humans collaborate, they form positive or negative experiences with each other. These experiences depend on various factors such as the individual's skills, abilities, and agency. In this paper, we consider human-robot collaborations and present a novel model of an autonomous robot's trust in humans based on the probability of the robot having a positive experience with the human. The model defines a dynamic trust-building process that translates into a computationally-accessible implementation. We hypothesize predictors of a positive experience with human teammates and derive trust in individual humans. As the interactions continue, team members develop an affinity toward each other. The robot's affinity towards humans can be viewed as kinship, and we also investigate how kinship affects trust and distrust. We present an algorithm for how the robot may use kinship-mediated trust in its decision-making, and demonstrate its use in simulated missions truly requiring human-robot collaboration.
|
|
08:55-09:00, Paper ThAT20.6 | |
Modeling Trust Dynamics in Robot-Assisted Delivery: Impact of Trust Repair Strategies |
|
Mangalindan, Dong Hae | Michigan State University |
Kandikonda, Karthik | Michigan State University |
Rovira, Ericka | United States Military Academy, West Point, NY |
Srivastava, Vaibhav | Michigan State University |
Keywords: Acceptability and Trust, Human-Robot Collaboration, Design and Human Factors
Abstract: With increasing efficiency and reliability, autonomous systems are becoming valuable assistants to humans in various tasks. In the context of robot-assisted delivery, we investigate how robot performance and trust repair strategies impact human trust. In this task, humans can choose to either send the robot to deliver autonomously or manually control it while handling a secondary task. The trust repair strategies examined include short and long explanations, apology and promise, and denial. Using data from human participants, we model human behavior using an Input-Output Hidden Markov Model (IOHMM) to capture the dynamics of trust and human action probabilities. Our findings indicate that humans are more likely to deploy the robot autonomously when their trust is high. Furthermore, state transition estimates show that long explanations are the most effective at repairing trust following a failure, while denial is most effective at preventing trust loss. We also demonstrate that the trust estimates generated by our model are isomorphic to self-reported trust values, making them interpretable. This model lays the groundwork for developing optimal policies that facilitate real-time adjustment of human trust in autonomous systems.
|
|
ThAT21 |
410 |
Manipulation Planning and Control 1 |
Regular Session |
Chair: Kim, Keehoon | POSTECH, Pohang University of Science and Technology |
Co-Chair: Pang, Tao | Boston Dynamics AI Institute |
|
08:30-08:35, Paper ThAT21.1 | |
Planning for Tabletop Object Rearrangement |
|
Hu, Jiaming | UC San Diego |
Szczekulski, Jan | University of California San Diego |
Peddabomma, Sudhansh | University of California San Diego |
Christensen, Henrik Iskov | UC San Diego |
Keywords: Manipulation Planning, Mobile Manipulation
Abstract: Finding a high-quality solution for tabletop object rearrangement planning is a challenging problem. Compared to determining a goal arrangement, rearrangement planning is challenging due to the dependencies between objects and the buffer capacity available to hold objects. Although ORLA* proposed an A*-based search strategy with lazy evaluation for finding optimal solutions, it is not scalable, with the success rate decreasing as the number of objects increases. Additionally, for noisy state representations, ORLA* provides only suboptimal solutions. To overcome these limitations, we propose an enhanced A*-based algorithm that improves state representation and employs incremental goal attempts with lazy evaluation at each iteration. This approach aims to enhance scalability while maintaining solution quality. Our evaluation demonstrates that our algorithm provides better solutions than ORLA* in a shorter time, for both stationary and mobile robots.
|
|
08:35-08:40, Paper ThAT21.2 | |
DA-VIL: Adaptive Dual-Arm Manipulation with Reinforcement Learning and Variable Impedance Control |
|
Karim, Md Faizal | IIIT Hyderabad |
Bollimuntha, Shreya | International Institute of Information Technology Hyderabad |
Hashmi, Mohammed Saad | International Institute of Information Technology Hyderabad |
Das, Autrio | International Institute of Information Technology Hyderabad |
Singh, Gaurav | IIIT Hyderabad |
Sridhar, Srinath | Brown University |
Singh, Arun Kumar | University of Tartu |
Govindan, Nagamanikandan | IIITDM Kancheepuram |
Krishna, Madhava | IIIT Hyderabad |
Keywords: Dual Arm Manipulation, Compliance and Impedance Control
Abstract: Dual-arm manipulation is an area of growing interest in the robotics community. Enabling robots to perform tasks that require the coordinated use of two arms is essential for complex manipulation tasks such as handling large, complex objects, assembling components, and performing human-like interactions. However, achieving effective dual-arm manipulation is challenging due to the need for precise coordination, dynamic adaptability, and the ability to manage interaction forces between the arms and the objects being manipulated. We propose a novel pipeline that combines the advantages of policy learning based on environment feedback and gradient-based optimization to learn controller gains as well as the control outputs. This allows the robotic system to dynamically modulate its impedance in response to task demands, ensuring stability and dexterity in dual-arm operations. We evaluate our pipeline on a trajectory-tracking task involving a variety of large, complex objects with different masses and geometries. The performance is then compared to three other established methods for controlling dual-arm robots, demonstrating superior results.
|
|
08:40-08:45, Paper ThAT21.3 | |
Goal-Driven Robotic Pushing Manipulation under Uncertain Object Properties |
|
Lee, Yongseok | Pohang University of Science and Technology |
Kim, Keehoon | POSTECH, Pohang University of Science and Technology |
Keywords: Dexterous Manipulation, Manipulation Planning, Model Learning for Control
Abstract: Robotic pushing is one of the intuitive non-prehensile manipulation skills that can handle ungraspable objects without any complex task-specific tools. In this paper, we propose an accurate, goal-driven robotic pushing framework that can accomplish pushing tasks in practice under uncertain object properties. Building upon our prior work, we employ a model predictive path integral (MPPI) controller as the goal-driven pushing controller operating under uncertain object properties. Unlike our prior work, the proposed framework can push the object toward the goal pose without predefined trajectories. The results of the numerical experiments demonstrate that the proposed framework can accomplish the pushing task with a significantly shorter total path length, fewer total steps, and a higher success rate even though the model parameters are unknown. Moreover, real-robot demonstrations show that the proposed framework also works well in the real world.
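MPPI itself is compact enough to sketch directly; the point-mass dynamics and quadratic cost below are placeholders rather than the authors' pusher-slider model, and all parameter values are arbitrary:

    import numpy as np

    def mppi_step(x, u_nom, dynamics, cost, rng, samples=256, sigma=0.3, lam=1.0):
        """One MPPI update: perturb the nominal control sequence, roll out the
        dynamics, and re-weight the perturbations with a softmin of trajectory costs."""
        horizon, udim = u_nom.shape
        noise = rng.normal(scale=sigma, size=(samples, horizon, udim))
        costs = np.zeros(samples)
        for k in range(samples):
            xk = x.copy()
            for t in range(horizon):
                xk = dynamics(xk, u_nom[t] + noise[k, t])
                costs[k] += cost(xk)
        weights = np.exp(-(costs - costs.min()) / lam)
        weights /= weights.sum()
        return u_nom + np.einsum("k,kth->th", weights, noise)

    rng = np.random.default_rng(0)
    goal = np.array([1.0, 0.5])
    dynamics = lambda x, u: x + 0.05 * u                     # point-mass placeholder
    cost = lambda x: float(np.sum((x - goal) ** 2))
    u_nom, x = np.zeros((15, 2)), np.zeros(2)
    for _ in range(30):
        u_nom = mppi_step(x, u_nom, dynamics, cost, rng)
        x = dynamics(x, u_nom[0])                            # execute the first action
        u_nom = np.roll(u_nom, -1, axis=0); u_nom[-1] = 0.0  # receding-horizon shift
    print("final state:", x)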
|
|
08:45-08:50, Paper ThAT21.4 | |
Synthesizing Grasps and Regrasps for Complex Manipulation Tasks |
|
Patankar, Aditya | Stony Brook University |
Mahalingam, Dasharadhan | Stony Brook University |
Chakraborty, Nilanjan | Stony Brook University |
Keywords: Grasping, Manipulation Planning
Abstract: In complex manipulation tasks, e.g., manipulation by pivoting, the motion of the object being manipulated has to satisfy path constraints that can change during the motion. Therefore, a single grasp may not be sufficient for the entire path, and the object may need to be regrasped. Additionally, geometric data for objects from a sensor are usually available in the form of point clouds. The problem of computing grasps and regrasps from point-cloud representations of objects for complex manipulation tasks is a key problem in endowing robots with manipulation capabilities beyond pick-and-place. In this paper, we formalize the problem of grasping/regrasping for complex manipulation tasks with objects represented by (partial) point clouds and present an algorithm to solve it. We represent a complex manipulation task as a sequence of constant screw motions. Given a manipulation plan skeleton expressed as a sequence of constant screw motions, we use a grasp metric to find graspable regions on the object for every constant screw segment. The overlap of the graspable regions for contiguous screws is then used to determine when and how many times the object needs to be regrasped. We present experimental results on point cloud data collected from RGB-D sensors to illustrate our approach.
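A constant screw motion of the kind used in the plan skeleton can be written as the matrix exponential of a twist; the short numeric sketch below is generic rigid-body math (not the authors' grasp-metric code), with the unit twist chosen arbitrarily for illustration:

    import numpy as np
    from scipy.linalg import expm

    def twist_matrix(omega, v):
        """4x4 se(3) matrix of a twist (omega, v)."""
        W = np.array([[0.0, -omega[2], omega[1]],
                      [omega[2], 0.0, -omega[0]],
                      [-omega[1], omega[0], 0.0]])
        xi = np.zeros((4, 4))
        xi[:3, :3] = W
        xi[:3, 3] = v
        return xi

    # unit rotation about the z-axis through the origin, sampled along the screw
    omega, v = np.array([0.0, 0.0, 1.0]), np.zeros(3)
    for theta in (0.0, np.pi / 4, np.pi / 2):
        T = expm(twist_matrix(omega, v) * theta)    # SE(3) pose at screw parameter theta
        print(np.round(T, 3))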
|
|
08:50-08:55, Paper ThAT21.5 | |
A Helping (Human) Hand in Kinematic Structure Estimation |
|
Pfisterer, Adrian | Technische Universitaet Berlin |
Li, Xing | TU Berlin |
Mengers, Vito | Technische Universität Berlin |
Brock, Oliver | Technische Universität Berlin |
Keywords: RGB-D Perception, Probability and Statistical Methods, Learning from Demonstration
Abstract: Visual uncertainties such as occlusions, lack of texture, and noise present significant challenges in obtaining accurate kinematic models for safe robotic manipulation. We introduce a probabilistic real-time approach that leverages the human hand as a prior to mitigate these uncertainties. By tracking the constrained motion of the human hand during manipulation and explicitly modeling uncertainties in visual observations, our method reliably estimates an object’s kinematic model online. We validate our approach on a novel dataset featuring challenging objects that are occluded during manipulation and offer limited articulations for perception. The results demonstrate that by incorporating an appropriate prior and explicitly accounting for uncertainties, our method produces accurate estimates, outperforming two recent baselines by 195% and 140%, respectively. Furthermore, we demonstrate that our approach's estimates are precise enough to allow a robot to manipulate even small objects safely.
|
|
08:55-09:00, Paper ThAT21.6 | |
Is Linear Feedback on Smoothed Dynamics Sufficient for Stabilizing Contact-Rich Plans? |
|
Shirai, Yuki | Mitsubishi Electric Research Laboratories |
Zhao, Tong | Massachusetts Institute of Technology |
Suh, Hyung Ju Terry | Massachusetts Institute of Technology |
Zhu, Huaijiang | New York University |
Ni, Xinpei | Georgia Institute of Technology |
Wang, Jiuguang | Boston Dynamics AI Institute |
Simchowitz, Max | MIT |
Pang, Tao | Boston Dynamics AI Institute |
Keywords: Dexterous Manipulation, Multi-Contact Whole-Body Motion Planning and Control, Optimization and Optimal Control
Abstract: Designing planners and controllers for contact-rich manipulation is extremely challenging as contact violates the smoothness conditions that many gradient-based controller synthesis tools assume. Contact smoothing approximates a non-smooth system with a smooth one, allowing one to use these synthesis tools more effectively. However, applying classical control synthesis methods to smoothed contact dynamics remains relatively under-explored. This paper analyzes the efficacy of linear controller synthesis using differentiable simulators based on contact smoothing. We introduce natural baselines for leveraging contact smoothing to compute (a) open-loop plans robust to uncertain conditions and/or dynamics, and (b) feedback gains to stabilize around open-loop plans. Using robotic bimanual whole-body manipulation as a testbed, we perform extensive empirical experiments on over 300 trajectories and analyze why LQR seems insufficient for stabilizing contact-rich plans.
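For context, the feedback-gain step analyzed here reduces, in its simplest form, to discrete-time LQR around a linearization of the smoothed dynamics. The sketch below finite-differences a placeholder smooth dynamics function (standing in for a contact-smoothed differentiable simulator) and solves the discrete algebraic Riccati equation; none of it is the authors' implementation:

    import numpy as np
    from scipy.linalg import solve_discrete_are

    def smoothed_dynamics(x, u):
        """Placeholder smooth map x_{k+1} = f(x_k, u_k); a stand-in for a
        contact-smoothed differentiable simulator."""
        return x + 0.1 * np.array([x[1], u[0] - 0.5 * np.tanh(5.0 * x[0])])

    def linearize(f, x, u, eps=1e-5):
        """Finite-difference Jacobians A = df/dx and B = df/du."""
        n, m = len(x), len(u)
        A, B, f0 = np.zeros((n, n)), np.zeros((n, m)), f(x, u)
        for i in range(n):
            dx = np.zeros(n); dx[i] = eps
            A[:, i] = (f(x + dx, u) - f0) / eps
        for j in range(m):
            du = np.zeros(m); du[j] = eps
            B[:, j] = (f(x, u + du) - f0) / eps
        return A, B

    A, B = linearize(smoothed_dynamics, np.zeros(2), np.zeros(1))
    Q, R = np.eye(2), 0.1 * np.eye(1)
    P = solve_discrete_are(A, B, Q, R)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # feedback law u = -K (x - x_ref)
    print("LQR gain:", K)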
|
|
ThAT22 |
411 |
Learning for Manipulation and Navigation |
Regular Session |
Chair: Sintov, Avishai | Tel-Aviv University |
Co-Chair: Kingston, Zachary | Purdue University |
|
08:30-08:35, Paper ThAT22.1 | |
Interaction-Driven Updates: 3D Scene Graph Maintenance During Robot Task Execution |
|
Li, Qingfeng | Beihang University |
Zhang, Xinlei | BUAA |
Chen, Chen | Hangzhou Innovation Institute of Beihang University |
Niu, Jianwei | Beihang University |
Zhao, Haochen | BUAA |
Keywords: Semantic Scene Understanding, Cognitive Control Architectures, Embodied Cognitive Science
Abstract: Robots powered by large language models (LLMs) demonstrate significant research and application potential by effectively interpreting scene information to respond to human commands. However, when robots rely on static scene information during task execution, they face difficulties in adapting to changes in the environment, posing a major challenge for dynamic scene perception. To address the above issues, we propose an innovative interaction-driven approach to enhance robots' ability to perceive dynamic scene information. This approach consists of two contributions: the observation point selection module and the dynamic scene maintenance module. Specifically, the robot first uses the 3D scene graph (3DSG) containing assets and objects to perceive static scene information through the LLM planner. Next, the best observation point for each asset is obtained through the observation point selection module. Then, with the help of the best observation point, the dynamic scene maintenance module interacts with the asset-related objects to dynamically update all the object node information related to the asset node. This approach enables robots to maintain dynamic scene information, enhancing their adaptability in unpredictable environments and improving task reliability. We evaluated our method using the iTHOR and RoboTHOR datasets within the AI2-THOR simulator and in real-world scenarios. Experimental results demonstrate that our method effectively and accurately maintains robots' perception of dynamic scene information.
|
|
08:35-08:40, Paper ThAT22.2 | |
ME-PATS: Mutually Enhancing Search-Based Planner and Learning-Based Agent for Tractor-Trailer Systems |
|
Fan, Ke | Tsinghua University |
Ren, Zhizhou | Tsinghua University |
Guo, Ruihan | Helixon |
Zhang, Jinpeng | Tsinghua University |
Huang, Zhuo | Tsinghua University |
Zhou, Yuan | Tsinghua University |
Zhang, Zufeng | Tsinghua University |
Keywords: Motion and Path Planning, AI-Based Methods, Integrated Planning and Learning
Abstract: Planning a kinodynamically feasible path for a tractor-trailer vehicle is challenging for both search-based and learning-based methods due to the vehicle’s unique kinematics and complex obstacles. These factors increase the likelihood of infeasible paths and exacerbate long-horizon issues. We introduce ME-PATS: a framework that mutually enhances the search-based planner and the learning-based agent for tractor-trailer systems. The search-based planner provides successful trajectories to help the learning-based agent update its policy, while the agent improves the planner’s efficiency through direct path simulation. Additionally, we propose two approaches to apply our framework to more challenging tasks: designing obstacle-aware networks to enhance the learning-based agent’s capabilities, and combining the planner’s paths with the trained agent’s simulated paths through multi-segment integration. Full details and results are available on our project website at https://github.com/FrankSinatral/TTsystems.
|
|
08:40-08:45, Paper ThAT22.3 | |
Jailbreaking LLM-Controlled Robots |
|
Robey, Alexander | University of Pennsylvania |
Ravichandran, Zachary | University of Pennsylvania |
Kumar, Vijay | University of Pennsylvania |
Hassani, Hamed | University of Pennsylvania |
Pappas, George J. | University of Pennsylvania |
Keywords: AI-Enabled Robotics, Robot Safety, Machine Learning for Robot Control
Abstract: The recent introduction of large language models (LLMs) has revolutionized the field of robotics by enabling contextual reasoning and intuitive human-robot interaction in domains as varied as manipulation, locomotion, and self-driving vehicles. When viewed as a stand-alone technology, LLMs are known to be vulnerable to jailbreaking attacks, wherein malicious prompters elicit harmful text by bypassing LLM safety guardrails. To assess the risks of deploying LLMs in robotics, in this paper, we introduce RoboPAIR, the first algorithm designed to jailbreak LLM-controlled robots. Unlike existing, textual attacks on LLM chatbots, RoboPAIR elicits harmful physical actions from LLM-controlled robots, a phenomenon we experimentally demonstrate in three scenarios: (i) a white-box setting, wherein the attacker has full access to the NVIDIA Dolphins self-driving LLM, (ii) a gray-box setting, wherein the attacker has partial access to a Clearpath Robotics Jackal UGV robot equipped with a GPT-4o planner, and (iii) a black-box setting, wherein the attacker has only query access to the GPT-3.5-integrated Unitree Robotics Go2 robot dog. In each scenario and across three new datasets of harmful robotic actions, we demonstrate that RoboPAIR, as well as several static baselines, finds jailbreaks quickly and effectively, often achieving 100% attack success rates. Our results reveal, for the first time, that the risks of jailbroken LLMs extend far beyond text generation, given the distinct possibility that jailbroken robots could cause physical damage in the real world. Indeed, our results on the Unitree Go2 represent the first successful jailbreak of a deployed commercial robotic system. Addressing this emerging vulnerability is critical for ensuring the safe deployment of LLMs in robotics. Additional media is available at: https://robopair.org.
|
|
08:45-08:50, Paper ThAT22.4 | |
CaStL: Constraints As Specifications through LLM Translation for Long-Horizon Task and Motion Planning |
|
Guo, Weihang | Rice University |
Kingston, Zachary | Purdue University |
Kavraki, Lydia | Rice University |
Keywords: AI-Enabled Robotics, Task and Motion Planning
Abstract: Large Language Models (LLMs) have demonstrated remarkable ability in long-horizon Task and Motion Planning (TAMP) by translating clear and straightforward natural language problems into formal specifications such as the Planning Domain Definition Language (PDDL). However, real-world problems are often ambiguous and involve many complex constraints. In this paper, we introduce Constraints as Specifications through LLMs (CaStL), a framework that identifies constraints such as goal conditions, action ordering, and action blocking from natural language in multiple stages. CaStL translates these constraints into PDDL and Python scripts, which are solved using a custom PDDL solver. Tested across three PDDL domains, CaStL significantly improves constraint handling and planning success rates from natural language specifications in complex scenarios.
|
|
08:50-08:55, Paper ThAT22.5 | |
Skills Made to Order: Efficient Acquisition of Robot Cooking Skills Guided by Multiple Forms of Internet Data |
|
Verghese, Mrinal | Carnegie Mellon University |
Atkeson, Christopher | CMU |
Keywords: AI-Enabled Robotics, Learning from Demonstration, Big Data in Robotics and Automation
Abstract: This study explores the utility of various internet data sources to select among a set of template robot behaviors to perform skills. Learning contact-rich skills involving tool use from internet data sources has typically been challenging due to the lack of physical information such as contact existence, location, areas, and force in this data. Prior works have generally used internet data and foundation models trained on this data to generate low-level robot behavior. We hypothesize that these data and models may be better suited to selecting among a set of basic robot behaviors to perform these contact-rich skills. We explore three methods of template selection: querying large language models, comparing video of robot execution to retrieved human video using features from a pretrained video encoder common in prior work, and performing the same comparison using features from an optic flow encoder trained on internet data. Our results show that LLMs are surprisingly capable template selectors despite their lack of visual information, optical flow encoding significantly outperforms video encoders trained with an order of magnitude more data, and important synergies exist between various forms of internet data for template selection. By exploiting these synergies, we create a template selector using multiple forms of internet data that achieves a 79% success rate on a set of 16 different cooking skills involving tool-use.
|
|
08:55-09:00, Paper ThAT22.6 | |
LEMMo-Plan: LLM-Enhanced Learning from Multi-Modal Demonstration for Planning Sequential Contact-Rich Manipulation Tasks |
|
Chen, Kejia | Technical University of Munich |
Shen, Zheng | TU Munich |
Zhang, Yue | Technical University of Munich |
Chen, Lingyun | Technical University of Munich |
Wu, Fan | Technical University of Munich |
Bing, Zhenshan | Technical University of Munich |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: AI-Enabled Robotics, Dexterous Manipulation, Compliant Assembly
Abstract: Large Language Models (LLMs) have gained popularity in task planning for long-horizon manipulation tasks. To enhance the validity of LLM-generated plans, visual demonstrations and online videos have been widely employed to guide the planning process. However, for manipulation tasks involving subtle movements but rich contact interactions, visual perception alone may be insufficient for the LLM to fully interpret the demonstration. Additionally, visual data provides limited information on force-related parameters and conditions, which are crucial for effective execution on real robots. In this paper, we introduce an in-context learning framework that incorporates tactile and force-torque information from human demonstrations to enhance the LLM's ability to generate plans for new task scenarios. We propose a bootstrapped reasoning pipeline that sequentially integrates each modality into a comprehensive task plan. This task plan is then used as a reference for planning in new task configurations. Real-world experiments on two different sequential manipulation tasks demonstrate the effectiveness of our framework in improving LLMs' understanding of multi-modal demonstrations and enhancing the overall planning performance.
|
|
ThAT23 |
412 |
Diffusion Models |
Regular Session |
Chair: Romeres, Diego | Mitsubishi Electric Research Laboratories |
Co-Chair: Gombolay, Matthew | Georgia Institute of Technology |
|
08:30-08:35, Paper ThAT23.1 | |
LTLDoG: Satisfying Temporally-Extended Symbolic Constraints for Safe Diffusion-Based Planning |
|
Feng, Zeyu | National University of Singapore |
Luan, Hao | National University of Singapore |
Goyal, Pranav | University of Michigan - Ann Arbor |
Soh, Harold | National University of Singapore |
Keywords: Imitation Learning, Machine Learning for Robot Control, Safety in HRI
Abstract: Operating effectively in complex environments while complying with specified constraints is crucial for the safe and successful deployment of robots that interact with and operate around people. In this work, we focus on generating long-horizon trajectories that adhere to novel static and temporally-extended constraints/instructions at test time. We propose a data-driven diffusion-based framework, LTLDoG, that modifies the inference steps of the reverse process given an instruction specified using finite linear temporal logic (LTLf). LTLDoG leverages a satisfaction value function on LTLf and guides the sampling steps using its gradient field. This value function can also be trained to generalize to new instructions not observed during training, enabling flexible test-time adaptability. Experiments in robot navigation and manipulation illustrate that the method is able to generate trajectories that satisfy formulae that specify obstacle avoidance and visitation sequences.
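The core mechanism, steering the reverse diffusion step with the gradient of a satisfaction value, can be sketched generically. The toy value function and denoiser below are invented for illustration; the actual LTLf satisfaction network and noise schedule belong to the paper:

    import numpy as np

    def value_gradient(traj, waypoint):
        """Gradient of a toy 'satisfaction' value that rewards visiting a waypoint;
        it stands in for the gradient field of a learned LTLf value network."""
        grad = np.zeros_like(traj)
        closest = np.argmin(np.linalg.norm(traj - waypoint, axis=1))
        grad[closest] = waypoint - traj[closest]       # pull the nearest point toward it
        return grad

    def guided_reverse_step(traj, denoiser, waypoint, sigma, rng, scale=1.0):
        """One reverse diffusion step with classifier-guidance-style correction."""
        mean = denoiser(traj)                          # plain reverse-process mean
        mean = mean + scale * value_gradient(mean, waypoint)
        return mean + sigma * rng.normal(size=traj.shape)

    rng = np.random.default_rng(0)
    traj = rng.normal(size=(16, 2))                    # noisy 2-D trajectory sample
    denoiser = lambda t: 0.9 * t                       # placeholder denoiser
    for sigma in (0.5, 0.2, 0.05, 0.0):
        traj = guided_reverse_step(traj, denoiser, np.array([2.0, 2.0]), sigma, rng)
    print(traj)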
|
|
08:35-08:40, Paper ThAT23.2 | |
DARE: Diffusion Policy for Autonomous Robot Exploration |
|
Cao, Yuhong | National University of Singapore |
Lew, Jeric Jieyi | National University of Singapore |
Liang, Jingsong | National University of Singapore |
Cheng, Jin | ETH Zurich |
Sartoretti, Guillaume Adrien | National University of Singapore (NUS) |
Keywords: View Planning for SLAM, Deep Learning Methods, Motion and Path Planning
Abstract: Autonomous robot exploration requires a robot to efficiently explore and map unknown environments. Compared to conventional methods that can only optimize paths based on the current robot belief, learning-based methods show the potential to achieve improved performance by drawing on past experiences to reason about unknown areas. In this paper, we propose DARE, a novel generative approach that leverages diffusion models trained on expert demonstrations, which can explicitly generate an exploration path through one-time inference. We build DARE upon an attention-based encoder and a diffusion model, and introduce ground truth optimal demonstrations for training to learn better patterns for exploration. The trained planner can reason about the partial belief to recognize the potential structure in unknown areas and consider these areas during path planning. Our experiments demonstrate that DARE achieves on-par performance with both conventional and learning-based state-of-the-art exploration planners, as well as good generalizability in both simulations and real-life scenarios.
|
|
08:40-08:45, Paper ThAT23.3 | |
NaviDiffusor: Cost-Guided Diffusion Model for Visual Navigation |
|
Zeng, Yiming | Sun Yat-Sen University |
Ren, Hao | Sun Yat-Sen University |
Wang, Shuhang | Sun Yet-Sen University |
Huang, Junlong | Sun Yat-Sen University |
Cheng, Hui | Sun Yat-Sen University |
Keywords: Vision-Based Navigation, Integrated Planning and Learning, Imitation Learning
Abstract: Visual navigation, a fundamental challenge in mobile robotics, demands versatile policies to handle diverse environments. Classical methods leverage geometric solutions to minimize specific costs, offering adaptability to new scenarios but remaining prone to system errors due to their multi-modular design and reliance on hand-crafted rules. Learning-based methods, while achieving high planning success rates, face difficulties in generalizing to unseen environments beyond the training data and often require extensive training. To address these limitations, we propose a hybrid approach that combines the strengths of learning-based methods and classical approaches for RGB-only visual navigation. Our method first trains a conditional diffusion model on diverse path-RGB observation pairs. During inference, it integrates the gradients of differentiable scene-specific and task-level costs, guiding the diffusion model to generate valid paths that meet the constraints. This approach alleviates the need for retraining, offering a plug-and-play solution. Extensive experiments in both indoor and outdoor settings, across simulated and real-world scenarios, demonstrate the zero-shot transfer capability of our approach, achieving higher success rates and fewer collisions compared to baseline methods. Code will be released at https://github.com/SYSU-RoboticsLab/NaviD.
|
|
08:45-08:50, Paper ThAT23.4 | |
NavigateDiff: Visual Predictors Are Zero-Shot Navigation Assistants |
|
Qin, Yiran | CUHKsz |
Sun, Ao | The Chinese University of Hong Kong, Shenzhen |
Hong, Yuze | The Chinese University of Hong Kong,Shenzhen |
Wang, Benyou | The Chinese University of Hong Kong, Shenzhen |
Zhang, Ruimao | The Chinese University of Hong Kong (Shenzhen) |
Keywords: Vision-Based Navigation, Deep Learning for Visual Perception, Visual Learning
Abstract: Navigating unfamiliar environments presents significant challenges for household robots, requiring the ability to recognize and reason about novel decoration and layout. Existing reinforcement learning methods cannot be directly transferred to new environments, as they typically rely on extensive mapping and exploration, which is time-consuming and inefficient. To address these challenges, we transfer the logical knowledge and the generalization ability of pre-trained foundation models to zero-shot navigation. By integrating a large vision-language model with a diffusion network, our approach, named NavigateDiff, constructs a visual predictor that continuously predicts the agent's potential observations in the next step, which can assist robots in generating robust actions. Furthermore, to accommodate the temporal nature of navigation, we introduce temporal historical information to ensure that the predicted image is aligned with the navigation scene. We then carefully design an information fusion framework that embeds the predicted future frames as guidance into the goal-reaching policy to solve downstream image navigation tasks. This approach enhances navigation control and generalization across both simulated and real-world environments. Through extensive experimentation, we demonstrate the robustness and versatility of our method, showcasing its potential to improve the efficiency and effectiveness of robotic navigation in diverse settings. Project Page: https://21styouth.github.io/NavigateDiff/.
|
|
08:50-08:55, Paper ThAT23.5 | |
FDPP: Fine-Tune Diffusion Policy with Human Preference |
|
Chen, Yuxin | University of California, Berkeley |
Jha, Devesh | Mitsubishi Electric Research Laboratories |
Tomizuka, Masayoshi | University of California |
Romeres, Diego | Mitsubishi Electric Research Laboratories |
Keywords: Imitation Learning, Reinforcement Learning, Sensorimotor Learning
Abstract: Imitation learning from human demonstrations enables robots to perform complex manipulation tasks and has recently witnessed huge success. However, these techniques often struggle to adapt behavior to new preferences or changes in the environment. To address these limitations, we propose Fine-tuning Diffusion Policy with Human Preference (FDPP). FDPP learns a reward function through preference-based learning. This reward is then used to fine-tune the pre-trained policy with reinforcement learning (RL), resulting in alignment of pre-trained policy with new human preferences while still solving the original task. Our experiments across various robotic tasks and preferences demonstrate that FDPP effectively customizes policy behavior without compromising performance. Additionally, we show that incorporating Kullback–Leibler (KL) regularization during fine-tuning prevents over-fitting and helps maintain the competencies of the initial policy.
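Preference-based reward learning of this kind typically fits a Bradley-Terry model over trajectory returns. Below is a tiny numpy version with a linear reward (all data, features, and dimensions are invented for illustration; the paper's reward is a learned model):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_preference_reward(pairs, dim, lr=0.1, iters=500):
        """Fit r(s) = w . phi(s) from pairwise preferences via the Bradley-Terry
        model: P(traj_a preferred over traj_b) = sigmoid(R_a - R_b)."""
        w = np.zeros(dim)
        for _ in range(iters):
            grad = np.zeros(dim)
            for feats_a, feats_b in pairs:            # preferred trajectory listed first
                diff = feats_a.sum(axis=0) - feats_b.sum(axis=0)
                grad += (1.0 - sigmoid(w @ diff)) * diff
            w += lr * grad / len(pairs)
        return w

    rng = np.random.default_rng(1)
    true_w = np.array([1.0, -2.0])
    pairs = []
    for _ in range(50):
        a, b = rng.normal(size=(5, 2)), rng.normal(size=(5, 2))
        if (a.sum(axis=0) @ true_w) < (b.sum(axis=0) @ true_w):
            a, b = b, a                               # keep the preferred trajectory first
        pairs.append((a, b))
    print("learned reward weights:", fit_preference_reward(pairs, dim=2))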
|
|
08:55-09:00, Paper ThAT23.6 | |
Learning Diverse Robot Striking Motions with Diffusion Models and Kinematically Constrained Gradient Guidance |
|
Lee, Kin Man | Georgia Institute of Technology |
Ye, Sean | Zoox |
Xiao, Qingyu | Georgia Institute of Technology |
Wu, Zixuan | Georgia Institute of Technology |
Zaidi, Zulfiqar | Georgia Institute of Technology |
D'Ambrosio, David | Google |
Sanketi, Pannag | Google |
Gombolay, Matthew | Georgia Institute of Technology |
Keywords: Imitation Learning, Learning from Demonstration, Constrained Motion Planning
Abstract: Advances in robot learning have enabled robots to generate skills for a variety of tasks. Yet, robot learning is typically sample inefficient, struggles to learn from data sources exhibiting varied behaviors, and does not naturally incorporate constraints. These properties are critical for fast, agile tasks such as playing table tennis. Modern techniques for learning from demonstration improve sample efficiency and scale to diverse data, but are rarely evaluated on agile tasks. In the case of reinforcement learning, achieving good performance requires training on high-fidelity simulators. To overcome these limitations, we develop a novel diffusion modeling approach that is offline, constraint-guided, and expressive of diverse agile behaviors. The key to our approach is a kinematic constraint gradient guidance (KCGG) technique that computes gradients through both the forward kinematics of the robot arm and the diffusion model to direct the sampling process. KCGG minimizes the cost of violating constraints while simultaneously keeping the sampled trajectory in-distribution of the training data. We demonstrate the effectiveness of our approach for time-critical robotic tasks by evaluating KCGG in two challenging domains: simulated air hockey and real table tennis. In simulated air hockey, we achieved a 25.4% increase in block rate, while in table tennis, we achieved a 17.3% increase in success rate compared to imitation learning baselines.
|
|
ThLB1R |
Hall A1/A2 |
Late Breaking Results 5 |
Poster Session |
|
09:30-09:55, Paper ThLB1R.1 | |
Probabilistically-Safe Bipedal Navigation Over Uncertain Terrain Via Conformal Prediction and Contraction Analysis |
|
Muenprasitivej, Kasidit | Georgia Institute of Technology |
Zhao, Ye | Georgia Institute of Technology |
Chou, Glen | Georgia Institute of Technology |
Keywords: Humanoid and Bipedal Locomotion, Planning under Uncertainty, Optimization and Optimal Control
Abstract: This work presents a high-level navigation framework for bipedal robots that accounts for terrain uncertainty in footstep planning and motion control. Given a terrain map estimated via a Gaussian Process (GP) model with a nonstationary kernel, we leverage Conformal Prediction (CP) to construct tighter coverage-guarantee intervals containing the true terrain elevations. We also formulate a soft CP constraint to ensure safe foot-height changes between successive footfalls with a probabilistic guarantee. Additionally, we model the CP coverage intervals as a bounded disturbance, which is incorporated into a linear inverted pendulum plus flywheel (LIP) model. Given the LIP model with bounded disturbances arising from uncertain terrain, we formulate a flywheel torque control law using Control Contraction Metrics (CCMs). We then design a Robust Control Invariant (RCI) tube around the desired Center of Mass (CoM) phase-space trajectory, defining a region in which the system states can be stabilized by the flywheel torque control law. This ensures the maintenance of a constant CoM height (or constant surface slope) despite terrain uncertainty. The overall framework results in an uncertainty-informed Model Predictive Controller (MPC) that provides probabilistic safety guarantees, enabling robust locomotion across uncertain environments. This approach enhances the robot’s ability to navigate complex, unstructured terrains while maintaining stability in real-world deployment.
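Split conformal prediction, the ingredient used above to tighten the GP terrain intervals, is compact enough to sketch on toy 1-D data (the sine "terrain" and the stand-in predictor are invented; this is not the authors' nonstationary-kernel pipeline):

    import numpy as np

    def split_conformal_interval(predict, x_cal, y_cal, x_test, alpha=0.1):
        """Distribution-free intervals with roughly (1 - alpha) coverage: widen the
        point prediction by the conformal quantile of calibration residuals."""
        residuals = np.abs(y_cal - predict(x_cal))
        n = len(residuals)
        level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
        q = np.quantile(residuals, level)
        pred = predict(x_test)
        return pred - q, pred + q

    rng = np.random.default_rng(2)
    x_cal = rng.uniform(0, 5, size=200)
    y_cal = np.sin(x_cal) + 0.1 * rng.normal(size=200)   # noisy "terrain elevations"
    predict = np.sin                                     # stand-in for the GP mean
    lo, hi = split_conformal_interval(predict, x_cal, y_cal, np.array([1.0, 2.5]))
    print(lo, hi)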
|
|
09:30-09:55, Paper ThLB1R.2 | |
Robotic Tissue Manipulation in Endoscopic Submucosal Dissection: Late Breaking Results |
|
Zhang, Tao | Arizona State University |
Ghiyasi, Morteza | Arizona State University |
Gangrade, Navya | Arizona State University |
Arora, Deepit | Arizona State University |
Jue, Terry | Mayo Clinic |
Marvi, Hamidreza | Arizona State University |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Surgical Robotics: Planning
Abstract: To address the steep learning curve associated with ESD, we reengineered the tissue clip by introducing a magnetic mechanism to improve submucosal exposure and facilitate precise dissection. Furthermore, we proposed a robotic manipulation method that automates the motion of an external magnet mounted on a robotic arm to enhance operational efficiency. Based on simulation trials using ROS Gazebo (ICRA 2025 accepted, manuscript # 4388), which demonstrated promising results, we further evaluated the proposed method in a 3D-printed stomach and ex vivo setup using a clinical gastroscope (Olympus EVIS EXERA III GIF-1TH190). By labeling and training over 21,000 images captured from the gastroscope using a deep learning-based YOLO (You Only Look Once) method, we successfully deployed the magnetic robotic tissue manipulation system, integrating the robotic arm, Olympus gastroscope, mock-up and ex vivo tissue. In 42 trials using a 3D-printed stomach, the magnetic robotic manipulation system was able to orient the internal magnetic clip—clamped onto the tissue—to the desired angle (±35 degrees) in 55.36 ± 7.64 seconds (mean ± standard error, n = 42). Additionally, ex vivo trials using porcine stomach tissue recorded a performance time of 37.19 ± 3.78 seconds over 36 trials. These results lay the groundwork for future validation in in vivo scenarios.
|
|
09:30-09:55, Paper ThLB1R.3 | |
Shared Mental Models Improve Performance in Human-Robot Teams During Unforeseen Events |
|
George, Zariq | University of Michigan |
Tilbury, Dawn | University of Michigan |
Robert, Lionel | University of Michigan |
Keywords: Human-Robot Teaming, Human-Robot Collaboration
Abstract: As robots increasingly integrate into human environments, there is growing optimism about expanding and diversifying human-robot team (HRT) missions. However, existing approaches, which primarily rely on teleoperation and contingency handling, encounter limitations such as cognitive overload and scalability challenges. A user study was conducted to investigate how Shared Mental Models (SMMs) influence HRTs' problem-solving abilities when faced with novel challenges. The results revealed that a General SMM (GSMM) significantly improved task performance compared to a Specific SMM (SSMM), although the SMM type did not significantly impact overall team adaptability. Additionally, the ability to handle uncertain situations was crucial for performance.
|
|
09:30-09:55, Paper ThLB1R.4 | |
Streaming Flow Policy: Simplifying Diffusion/flow Policies by Treating Robot Trajectories As Flow Trajectories |
|
Jiang, Sunshine | Massachusetts Institute of Technology |
Fang, Xiaolin | MIT |
Roy, Nicholas | Massachusetts Institute of Technology |
Lozano-Perez, Tomas | MIT |
Kaelbling, Leslie | MIT |
Ancha, Siddharth | Massachusetts Institute of Technology |
Keywords: Imitation Learning, Machine Learning for Robot Control, Deep Learning in Grasping and Manipulation
Abstract: Recent advances in diffusion/flow policies have enabled imitation learning of complex, multi-modal action trajectories for robotic control. However, they are slow because they sample a trajectory of trajectories—a diffusion/flow trajectory of robot trajectories. They discard intermediate robot trajectories, and must wait for the sampling process to complete before any actions can be executed on the robot. In this work, we propose a novel framework that simplifies diffusion/flow policies by treating robot trajectories as flow trajectories. Instead of starting from pure noise, our algorithm starts from the current robot configuration, and incrementally integrates a velocity field learned via flow matching to produce a sequence of robot configurations that constitute a single trajectory. This enables computed actions to be streamed to the robot on-the-fly during the flow sampling process, for significantly faster and reactive policies. It is well-suited for receding horizon control where we can adaptively generate only as many actions as are executed on the robot, and no more. Despite streaming, our method retains the ability to model multi-modal behavior. We show that training flows that stabilize around demonstration trajectories reduces distribution shift and improves imitation learning performance. Streaming flow policy outperforms prior diffusion/flow policies on imitation learning benchmarks while enabling faster policy execution and tighter sensorimotor control loops.
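The streaming idea, integrating a learned velocity field forward from the current robot configuration and executing configurations as soon as they are produced, can be sketched with a placeholder field (the learned flow-matching model, the goal, and all constants here are assumptions of the sketch):

    import numpy as np

    def stream_actions(q0, velocity_field, dt=0.05, steps=40):
        """Euler-integrate a velocity field from the current configuration and yield
        each intermediate configuration immediately, instead of waiting for the full
        sample to finish."""
        q, t = q0.copy(), 0.0
        for _ in range(steps):
            q = q + dt * velocity_field(q, t)
            t += dt
            yield q.copy()

    goal = np.array([0.6, -0.2, 0.4])
    velocity_field = lambda q, t: 2.0 * (goal - q)       # placeholder learned field
    for q in stream_actions(np.zeros(3), velocity_field):
        pass                                             # each q would be streamed to the robot here
    print("final configuration:", q)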
|
|
09:30-09:55, Paper ThLB1R.5 | |
Battery-Free Computer Vision on Insect-Scale Microrobots |
|
Arroyos, Vicente | University of Washington |
Ibrahim, Michael | University of Washington |
Azuh Mensah, Emmanuel | University of Washington |
Johnson, Kyle | University of Washington Paul G. Allen School for Computer Scien |
Fuller, Sawyer | University of Washington |
Iyer, Vikram | University of Washington |
Keywords: Machine Learning for Robot Control, Visual Tracking, Micro/Nano Robots
Abstract: The goal of this project is to enable the use of efficient on-device deep learning models for battery-free mobility-based sensing robots. We target the deployment of these models on MilliMobile, a one-square-centimeter microrobotic platform with only 512 KB of RAM and 1 MB of flash memory (millimobile.cs.washington.edu), making it challenging to run traditional models onboard. We address this challenge by integrating intermittent computing and motion with image-based sensing, ensuring that the system performs high-power actions, such as locomotion, inference, and communication, only when specific conditions are satisfied. Event-based vision and intermittent computing approaches will also be leveraged to optimize the MCU's time in ultra-low-power modes. We train on insect images from the Insect Detect - insect classification dataset v2 and achieve almost 70% accuracy after 500 epochs while consuming milliwatts of power when running the system on the nRF52840 microcontroller.
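A toy sketch of the condition-gated intermittent operation described above; all function names, voltage thresholds, and the confidence cutoff are hypothetical placeholders rather than the MilliMobile firmware:

```python
# Toy sketch of condition-gated intermittent operation: high-power actions
# (inference, locomotion, radio) run only when the harvested-energy buffer and
# sensing conditions permit. All names and thresholds are illustrative.
def step(cap_voltage_v, motion_detected, classify, move, transmit,
         v_inference=2.8, v_locomotion=3.1, v_radio=3.0):
    if motion_detected and cap_voltage_v >= v_inference:
        label, confidence = classify()           # run tiny on-device model
        if confidence > 0.7 and cap_voltage_v >= v_radio:
            transmit(label)                       # report detection
        if cap_voltage_v >= v_locomotion:
            move()                                # reposition toward the target
    # otherwise stay in an ultra-low-power sleep mode and keep harvesting

# Example call with stub callables standing in for hardware drivers.
step(cap_voltage_v=3.2, motion_detected=True,
     classify=lambda: ("hoverfly", 0.82), move=lambda: print("move"),
     transmit=lambda label: print("tx:", label))
```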
|
|
09:30-09:55, Paper ThLB1R.6 | |
Implicit Behavioral Cues for Enhanced Pedestrian Comfort in Robot Social Navigation |
|
Lian, Yi | Georgia Institute of Technology |
Kim, Joanne Taery | Georgia Institute of Technology |
Ha, Sehoon | Georgia Institute of Technology |
Keywords: Social HRI, Human-Aware Motion Planning, Service Robotics
Abstract: Robots operating in public spaces must navigate alongside humans in ways that are not only safe but also intuitive and socially acceptable. In particular, how a robot communicates its navigation intent can significantly impact pedestrian comfort and trust. In this work, we explore how different robot behaviors - ranging from no cue to implicit (speed and trajectory adjustments) and explicit (verbal) signals - affect pedestrian perception during hallway encounters. We conducted a pilot user study with 12 participants, where a robot approached the participant from the front or side using one of five cue strategies. Subjective ratings (comfort, trust, clarity, predictability, and proxemics) and objective measures (hesitation, passing time) were collected. Results indicate that trajectory cues led to the highest perceived comfort and clarity, while sudden stops and no cues caused more confusion and hesitation. These findings highlight the potential of motion-based implicit cues to improve the legibility of robot navigation in shared human environments.
|
|
09:30-09:55, Paper ThLB1R.7 | |
Spectral Bayesian Inference and Neural Estimation of Acoustic Wave Propagation |
|
Huang, Yongchao | University of Aberdeen |
Keywords: Probabilistic Inference, Probability and Statistical Methods, AI-Based Methods
Abstract: We present a novel framework integrating physics and machine learning to estimate frequency-domain acoustic wave propagation coefficients. Using acoustic waveforms captured at speaker-receiver pairs, we estimate attenuation and wavenumber via: (1) Bayesian inference for uncertainty-aware learning from small and noisy data, (2) a neural-physical model trained with forward-backward physical losses, and (3) non-linear least squares as baseline. With inferred propagation coefficients, room impulse responses (RIRs) are derived, enabling robot relocalisation with uncertainty quantification.
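As an illustration of the non-linear least-squares baseline mentioned above, the sketch below fits attenuation and wavenumber from complex frequency-domain measurements under an assumed plane-wave model; the model form and the synthetic data are assumptions, not the paper's setup:

```python
# Hedged sketch of a non-linear least-squares fit of attenuation alpha and
# wavenumber k at one frequency from complex transfer-function measurements at
# several speaker-receiver distances, assuming H(d) = A * exp(-(alpha + i k) d).
import numpy as np
from scipy.optimize import least_squares

d = np.array([0.5, 1.0, 1.5, 2.0])                 # propagation distances [m]
alpha_true, k_true, A_true = 0.4, 18.0, 1.0
H_meas = A_true * np.exp(-(alpha_true + 1j * k_true) * d)
H_meas = H_meas + 0.01 * (np.random.randn(d.size) + 1j * np.random.randn(d.size))

def residuals(p):
    A, alpha, k = p
    H_model = A * np.exp(-(alpha + 1j * k) * d)
    r = H_meas - H_model
    return np.concatenate([r.real, r.imag])        # real-valued residual vector

sol = least_squares(residuals, x0=[0.8, 0.1, 15.0])
A_hat, alpha_hat, k_hat = sol.x
print(f"alpha ~ {alpha_hat:.3f} Np/m, k ~ {k_hat:.3f} rad/m")
```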
|
|
09:30-09:55, Paper ThLB1R.8 | |
Analyzing Human Perceptions of a MEDEVAC Robot in a Simulated Evacuation Scenario |
|
Jordan, Tyson | University of Georgia |
Pandey, Pranav Kumar | University of Georgia |
Parasuraman, Ramviyas | University of Georgia |
Doshi, Prashant | University of Georgia |
Goodie, Adam | University of Georgia |
Keywords: Search and Rescue Robots, Design and Human Factors, Human-Robot Teaming
Abstract: The use of autonomous systems in medical evacuation (MEDEVAC) scenarios is promising, but existing implementations overlook key insights from human-robot interaction (HRI) research. Studies on human-machine teams demonstrate that human perceptions of a machine teammate are critical in governing the machine's performance. Consequently, it is essential to identify the factors that contribute to positive human perceptions in human-machine teams. Here, we present a mixed factorial design to assess human perceptions of a MEDEVAC robot in a simulated evacuation scenario. Participants were assigned to the role of casualty (CAS) or bystander (BYS) and subjected to three within-subjects conditions based on the MEDEVAC robot's operating mode: autonomous-slow (AS), autonomous-fast (AF), and teleoperation (TO). During each trial, a MEDEVAC robot navigated an 11-meter path, acquiring a casualty and transporting them to an ambulance exchange point while avoiding an idle bystander. Following each trial, subjects completed a questionnaire measuring their emotional states, perceived safety, and social compatibility with the robot. Results indicate a consistent main effect of operating mode on reported emotional states and perceived safety. Pairwise analyses suggest that the employment of the AF operating mode negatively impacted perceptions along these dimensions. There were no persistent differences between CAS and BYS responses.
|
|
09:30-09:55, Paper ThLB1R.9 | |
FRESHR-GSI: A Generalized Safety Model and Evaluation Framework for Mobile Robots in Multi-Human Environments |
|
Pandey, Pranav Kumar | University of Georgia |
Parasuraman, Ramviyas | University of Georgia |
Doshi, Prashant | University of Georgia |
Keywords: Safety in HRI, Human-Robot Collaboration, Human-Robot Teaming
Abstract: Human safety is critical in applications involving close human-robot interactions (HRI) and is a key aspect of physical compatibility between humans and robots. While measures of human safety in HRI exist, these mainly target industrial settings involving robotic manipulators. Less attention has been paid to settings where mobile robots and humans share the space. This paper introduces a new robot-centered directional framework of human safety. It is particularly useful for evaluating mobile robots as they operate in environments populated by multiple humans. The framework integrates several key metrics, such as each human's relative distance, speed, and orientation. The core novelty lies in the framework's flexibility to accommodate different application requirements while allowing for both the robot-centered and external observer points of view. We instantiate the framework by using RGB-D based vision integrated with a deep learning-based human detection pipeline to yield a proxemics-guided generalized safety index (GSI) that instantaneously assesses human safety. We extensively validate GSI's capability of producing appropriate and fine-grained safety measures in real-world experimental scenarios and demonstrate its superior efficacy against extant safety models.
|
|
09:30-09:55, Paper ThLB1R.10 | |
Variable Stiffness Quasi-Direct Drive Cable-Actuated Tensegrity Robot with Visuotactile Contact Sensor |
|
Mi, Jonathan | University of Michigan, Ann Arbor |
Tong, Wenzhe | University of Michigan, Ann Arbor |
Ma, Yilin | University of Michigan, Ann Arbor |
Huang, Xiaonan | University of Michigan |
Keywords: Soft Robot Materials and Design, Soft Sensors and Actuators, Compliant Joints and Mechanisms
Abstract: Tensegrity robots excel in tasks requiring extreme levels of deformability and robustness. However, there are challenges in state estimation and payload versatility due to their high number of degrees of freedom and unconventional shape. This work introduces a modular three-bar tensegrity robot featuring a customizable payload design. The modular exoskeleton supports scalable configurations and efficient packaging of sensing and computation. Our tensegrity robot employs a novel Quasi-Direct Drive (QDD) cable actuator with low-stretch polymer cables to achieve accurate proprioception without needing external force or torque sensors. The design allows for on-the-fly stiffness tuning for better environment and payload adaptability. Experimental data demonstrate high-accuracy cable length estimation (<1% error relative to bar length) and variable stiffness control of the cable actuator up to 7 times the minimum stiffness for self-support. To augment proprioception with environmental feedback, we develop a scalable, open-source visuotactile sensor embedded within each endcap of the tensegrity structure. The presented tensegrity robot is a platform for future advancements in autonomous operation and open-source module design.
|
|
09:30-09:55, Paper ThLB1R.11 | |
OceanSim: A GPU-Accelerated Underwater Robot Perception Simulation Framework |
|
Song, Jingyu | University of Michigan |
Ma, Haoyu | University of Michigan |
Bagoren, Onur | University of Michigan |
Venkatramanan Sethuraman, Advaith | University of Michigan |
Zhang, Yiting | University of Michigan, Ann Arbor |
Skinner, Katherine | University of Michigan |
Keywords: Marine Robotics, Simulation and Animation, Field Robots
Abstract: Underwater simulators offer support for building robust underwater perception solutions. Significant work has recently been done to develop new simulators and to advance the performance of existing underwater simulators. Still, there remains room for improvement on physics-based underwater sensor modeling and rendering efficiency. In this paper, we propose OceanSim, a high-fidelity GPU-accelerated underwater simulator to address this research gap. We propose advanced physics-based rendering techniques to reduce the sim-to-real gap for underwater image simulation. We develop OceanSim to fully leverage the computing advantages of GPUs and achieve real-time imaging sonar rendering and fast synthetic data generation. We evaluate the capabilities and realism of OceanSim using real-world data to provide qualitative and quantitative results.
|
|
09:30-09:55, Paper ThLB1R.12 | |
Bayesian Intent Inference Via Egocentric Vision and Gesture Recognition |
|
Timilsina, Prabin | Florida State University |
Higgins, Taylor | Florida State University |
Keywords: Intention Recognition, Physical Human-Robot Interaction, Physically Assistive Devices
Abstract: Many robotic assistive devices require users to perform explicit commanding actions, such as pressing buttons or tilting their bodies, to initiate assistance. However, it would be more natural if the device could infer intent implicitly. Prior work has explored intent prediction using different sensor modalities, such as electromyography (EMG), electroencephalography (EEG), and inertial measurement units (IMUs). Some vision-based approaches rely on external cameras to analyze human action from a third-person perspective. While these methods can provide useful insights, they either focus only on body signals without considering environmental context or require external cameras that are impractical for wearable robotics. In contrast, first-person vision offers a direct perspective on how users interact with their surroundings, yet it remains underexplored in intent inference research. We propose a hybrid approach that utilizes a Bayesian network to probabilistically fuse first-person visual context with body motion data to understand human intent. Unlike some prior work that uses computer vision only for perception, our approach reasons about the contextual affordances of objects in the environment and integrates them with human motion to improve both prediction accuracy and explainability. Our results show the feasibility of this approach in controlled experiments, and future research could expand on this work to improve robustness and generalizability in real-world environments.
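A minimal, illustrative sketch of the probabilistic fusion step (not the authors' Bayesian network): object-affordance context from the egocentric view and a body-motion cue are combined via Bayes' rule over a hypothetical set of intents:

```python
# Minimal sketch of fusing first-person visual context with motion evidence via
# Bayes' rule. Intents, affordance priors, and likelihoods are made-up placeholders.
def infer_intent(prior, p_obj_given_intent, p_motion_given_intent):
    # posterior proportional to prior * P(visible object | intent) * P(motion cue | intent)
    post = {i: prior[i] * p_obj_given_intent[i] * p_motion_given_intent[i] for i in prior}
    z = sum(post.values())
    return {i: p / z for i, p in post.items()}

prior = {"sit_down": 0.5, "walk_to_door": 0.5}
p_obj = {"sit_down": 0.8, "walk_to_door": 0.2}      # a chair dominates the egocentric view
p_motion = {"sit_down": 0.6, "walk_to_door": 0.4}   # gait deceleration observed
print(infer_intent(prior, p_obj, p_motion))          # sit_down becomes the most likely intent
```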
|
|
09:30-09:55, Paper ThLB1R.13 | |
Geometric and Dynamic Modeling of McKibben Muscles for Soft Robotic Control Applications |
|
Ochieze, Chukwuemeka George | University of Virginia |
Keywords: Rehabilitation Robotics, Modeling, Control, and Learning for Soft Robots, Soft Robot Applications
Abstract: Soft robotic systems offer remarkable adaptability and safety in dynamic environments, but their compliant structures introduce complex, nonlinear behaviors that challenge traditional modeling and control techniques. This paper proposes a unified modeling and control framework for pneumatic artificial muscles (PAMs), with emphasis on McKibben actuators in both rectilinear and curvilinear configurations. Recognizing the limitations of linear modeling for soft robotics, the framework is extended to include nonlinearities of soft body dynamics by incorporating nonlinear stiffness and damping terms, enabling accurate representation of hyper-elastic and viscoelastic behavior.
|
|
09:30-09:55, Paper ThLB1R.14 | |
Enhanced Diagnostic Imaging Via Robotic Arc Ultrasound Scanning and 3D Reconstruction |
|
Koo, Kyoungmo | University of Michigan |
Peng, Xiaorui | University of Michigan |
Ma, Guangshen | Duke University |
Draelos, Mark | University of Michigan |
Wang, Xueding | University of Michigan |
Keywords: Medical Robots and Systems, Hardware-Software Integration in Robotics
Abstract: This study presents a robotic ultrasound imaging system designed to enhance diagnostic image quality through arc-based scanning trajectories and 3D reconstruction. Building on prior work in robotic linear ultrasound for musculoskeletal imaging, we introduce an arc scanning approach that aligns more perpendicularly to curved anatomical surfaces—such as finger joints—thereby improving bone and soft tissue visualization. Our integrated system combines a GE Vivid E95 ultrasound scanner with a Universal Robots UR3 robotic arm to perform automated B-mode, Color Flow Mapping (CFM), and photoacoustic scans. The scanning procedure consists of initial alignment, trajectory execution, motion compensation, and volume reconstruction, with interpolation based on proximity-weighted pixels from the nearest planes. We conducted phantom, volunteer, and patient studies to evaluate performance. Quantitative results from phantom scans demonstrated that the arc scan offers sharper boundary contrast. In both volunteer and patient cases, arc scans consistently produced more continuous and symmetric bone surfaces with uniform soft tissue appearance in B-mode imaging. However, CFM scans revealed fewer motion artifacts and clearer vascular signals in linear trajectories, suggesting arc-induced mechanical wave artifacts. These findings highlight the complementary strengths of each trajectory, motivating future work on hybrid scanning that integrates both linear and arc paths for optimized diagnostics.
|
|
09:30-09:55, Paper ThLB1R.15 | |
Optimizing Topological Mapping for Robotics: Integrating Delaunay-Cech Filtration and Discrete Gradient Vector Fields |
|
Sahiner, Simge | University at Albany |
Ekenna, Chinwe | University at Albany |
Keywords: Formal Methods in Robotics and Automation, Motion and Path Planning, Planning under Uncertainty
Abstract: In this research, we explore the integration of topological data analysis techniques, specifically simplicial complexes and filtrations, with Discrete Morse Theory to enhance robotic motion planning and environmental mapping. While Vietoris-Rips complexes offer computational efficiency and Čech complexes capture topological features more accurately, combining these tools within a Discrete Morse Theory framework allows for simplification and identification of critical points in the configuration space without changing the topology of the structure. A central motivation for this work is the search for a simplicial complex that balances the topological accuracy of Čech complexes with the lower computational demands of Vietoris-Rips complexes. While creating such a complex is a challenging task, this trade-off drives the development of more efficient and accurate representations of space for robotics. Additionally, this research addresses the difficulty of identifying local minima and maxima in higher-dimensional settings, extending the utility of Discrete Morse Theory. Building on these ideas, we explore hybrid approaches involving Delaunay-Čech filtrations and Discrete Gradient Vector Fields to create more adaptive topological representations.
|
|
09:30-09:55, Paper ThLB1R.16 | |
Tensegrity Robot Proprioceptive State Estimation with Geometric Constraints |
|
Tong, Wenzhe | University of Michigan, Ann Arbor |
Lin, Tzu-Yuan | University of Michigan |
Mi, Jonathan | University of Michigan, Ann Arbor |
Jiang, Yicheng | University of Michigan-Ann Arbor |
Ghaffari, Maani | University of Michigan |
Huang, Xiaonan | University of Michigan |
Keywords: Modeling, Control, and Learning for Soft Robots, Localization, Sensor Fusion
Abstract: This paper presents a novel proprioceptive state estimator for tensegrity robots, addressing the challenges posed by their unique structural configurations and dynamic behaviors. The tensegrity robot, constructed from a synergistic assembly of rigid rods and elastic cables, is designed for high-impact tolerance and shape morphing. To accurately capture both the global pose and internal shape of the robot in real time, our approach fuses high-frequency inertial measurements from onboard IMUs with precise cable length data from motor encoders. An optimization-based shape reconstruction algorithm—enforced by rigorous geometric and chirality constraints—is employed to estimate the endcap positions in the body frame, overcoming the complexities inherent in its continuously deforming configuration. To further enhance the estimator's robustness, the forward kinematics from shape reconstruction is integrated into a contact-aided Invariant Extended Kalman Filter (InEKF) framework. This filter leverages detected endcap-ground contact events to correct state estimates via forward kinematics, ensuring improved accuracy even in scenarios with camera occlusions or degraded environmental conditions. Extensive simulation and real-world experiments validate the proposed framework, achieving an average drift of approximately 4.2% while maintaining computational efficiency suitable for onboard autonomous operations. Overall, this work advances state estimation methodologies for tensegrity robots.
|
|
09:30-09:55, Paper ThLB1R.17 | |
SkillWrapper: Autonomously Learning Interpretable Skill Abstractions with Foundation Models |
|
Yang, Ziyi | Brown University |
Sundara Raman, Shreyas | Brown University |
Hedegaard, Benned | Brown University |
Fu, Haotian | Brown University |
Zhao, Linfeng | Northeastern University |
Tellex, Stefanie | Brown |
Konidaris, George | Brown University |
Paulius, David | Brown University |
Shah, Naman | Arizona State University |
Keywords: Representation Learning, Task and Motion Planning, Deep Learning Methods
Abstract: We envision a future where robots are equipped "out of the box" with portable skills. However, to effectively deploy and compose these skills for robotic tasks, users must know the conditions under which they can be successfully executed and the consequences of their execution; in task planning, these are known as an action's preconditions and effects. We present SkillWrapper: an approach that automatically learns human-interpretable abstractions of skills, while guaranteeing complete and sound representations for planning, enabling long-horizon skill composition and zero-shot generalization. Our approach exploits foundation models to propose tasks from which it then learns semantically meaningful, grounded representations of preconditions and effects, given images of what the robot perceives before and after executing a skill.
|
|
09:30-09:55, Paper ThLB1R.18 | |
Batch Learning of Koopman Operators from Streaming Data of Off-Road Vehicle with Terrain Dynamics |
|
Loya, Kartik | Clemson University |
Tallapragada, Phanindra | Clemson University |
Keywords: Model Learning for Control, Wheeled Robots, Dynamics
Abstract: Autonomous off-road vehicles generate rich, high-frequency sensor data that capture complex and nonlinear interactions with unstructured terrain. While terrain dynamics are often modeled using physics-based approaches like Bekker models, such methods are computationally intensive and unsuitable for real-time use. Moreover, streaming data may contain rare or regime-specific events critical for effective modeling and control. This work addresses the challenge of updating Koopman operator-based models in real time without overwhelming memory and computational resources. Rather than storing all incoming data or discarding potentially useful samples, we propose a batch update method that selectively incorporates only informative datasets. Novelty in dynamics is detected using the Grassmannian distance between subspaces, enabling efficient identification of meaningful regime shifts. Our approach significantly reduces data volume and computational load while preserving model accuracy. Additionally, it includes basis function learning to optimize the Koopman representation. The framework is validated on simulated systems, demonstrating effective online learning, reduced model complexity, and improved prediction and control performance in off-road environments.
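One ingredient of the approach, detecting novel dynamics via the Grassmannian distance between data subspaces, can be sketched as below; the subspace dimension, toy data, and decision threshold are illustrative assumptions rather than the paper's settings:

```python
# Sketch of flagging a novel dynamics regime with a Grassmannian distance
# between the dominant subspaces spanned by two data batches, as a gate on
# whether the batch should update the Koopman model.
import numpy as np

def dominant_subspace(X, r):
    # Right singular vectors of the (samples x features) matrix span the
    # dominant r-dimensional feature subspace.
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return Vt[:r].T                                   # (features, r), orthonormal columns

def grassmann_distance(X, Y, r=3):
    U, V = dominant_subspace(X, r), dominant_subspace(Y, r)
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    angles = np.arccos(np.clip(s, -1.0, 1.0))         # principal angles between subspaces
    return np.linalg.norm(angles)

rng = np.random.default_rng(0)
batch_old = rng.normal(size=(200, 6))
batch_new = batch_old @ np.diag([1, 1, 1, 0.1, 0.1, 0.1]) + 0.5 * rng.normal(size=(200, 6))
if grassmann_distance(batch_old, batch_new) > 0.5:    # hypothetical novelty threshold
    print("novel regime detected: include this batch in the Koopman update")
```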
|
|
09:30-09:55, Paper ThLB1R.19 | |
Data-Driven Prediction Model of Soft-Tissue Surface Deformation During Robotic Palpation |
|
Qin, Tianhao | University of Michigan |
Ma, Guangshen | Duke University |
Draelos, Mark | University of Michigan |
Keywords: Medical Robots and Systems, Machine Learning for Robot Control, Computer Vision for Medical Robotics
Abstract: Biological shape deformation resulting from robot palpation is a challenging problem in surgical robotics due to the complex contact modeling of tool-tissue interaction. Existing methods have mainly focused on the development of model-based frameworks (e.g., finite element analysis) and the use of force sensing to build a shape prediction model. However, these models are either computationally expensive or sensitive to changes in tool configuration and tissue properties. To overcome this problem, we present an entirely data-driven method for soft-tissue surface prediction during robot palpation. We implement a multi-layer perceptron model to learn the complex relationship between the tool's configuration (position and orientation) and the resulting surface deformation. Given a robot configuration and the local surface geometry around the contact center, the model predicts 3D displacement vectors of the surface prior to deformation. We conducted simulation experiments to evaluate the model performance for palpation angles between -40° and +40° and tool penetration depths from 0.3 mm to 1.9 mm. The results show an average root mean square error of approximately 0.01 mm and an average maximum error of less than 0.03 mm. This demonstrates the feasibility of using a generic multi-layer perceptron model to learn tool-tissue physics and shows the potential for fast surface inference and deformation prediction during robot palpation.
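A generic sketch of the kind of multi-layer perceptron regression described above, trained here on synthetic tool-configuration/displacement pairs rather than the paper's data; the input parameterization and the toy ground-truth model are assumptions:

```python
# Generic sketch: an MLP maps a tool configuration (contact position, palpation
# angle, penetration depth) to a 3D surface displacement vector. Synthetic data only.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
N = 2000
tool = rng.uniform([-1, -1, -40, 0.3], [1, 1, 40, 1.9], size=(N, 4))  # x, y, angle [deg], depth [mm]
# Toy ground truth: displacement magnitude grows with depth, direction tilts with angle.
disp = np.stack([
    -0.02 * tool[:, 3] * np.sin(np.deg2rad(tool[:, 2])),
    np.zeros(N),
    -0.02 * tool[:, 3] * np.cos(np.deg2rad(tool[:, 2])),
], axis=1) + 0.001 * rng.normal(size=(N, 3))

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(tool[:1500], disp[:1500])
rmse = np.sqrt(np.mean((model.predict(tool[1500:]) - disp[1500:]) ** 2))
print(f"held-out RMSE ~ {rmse:.4f} mm")
```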
|
|
09:30-09:55, Paper ThLB1R.20 | |
An AI-Based Robot System for Automation Process of Slaughtering Ducks |
|
Ko, KwangEun | Korea Institute of Industrial Technology |
Yang, Gi-Hun | KITECH |
Kang, Jaehyeon | Korea Institute of Industrial Technology |
Choo, Sungwon | KITECH |
Nam, Kyung-Tae | Kitech |
Han, Sang Kuy | Korea Institute of Industrial Technology |
Keywords: Computer Vision for Automation, Agricultural Automation, Robotics and Automation in Agriculture and Forestry
Abstract: The bloodletting step in the duck slaughter process requires repetitive and dangerous labor. To automate this step, we developed a machine vision-based bloodletting robot system. The system includes AI-based target recognition technology to automatically detect the bloodletting area and a robotic slaughtering system. A Mask R-CNN is used for object-background segmentation. Following segmentation, an integrated pipeline finds the target bloodletting point based on the cervical vertebrae estimated from the neck area of each duck. A training dataset of 1,789 RGB images was collected from the duck slaughter process, and the deep learning model for duck neck-background segmentation was trained on it. In addition, classifying ducks as fainted or non-fainted is required to run the bloodletting process without system interruption. A lightweight CNN architecture using the ONNX library was implemented for deployment on edge devices in the workplace. Test results show high accuracy in automatically detecting the target bloodletting point. The slaughtering robot system consists of a three-axis orthogonal robot, and the target bloodletting point is controlled in real time using position information transmitted from the AI vision camera. The slaughtering task time using the robot system was 2.7 s per duck. For application in slaughterhouses, five robots equipped with AI vision cameras will work as a set.
|
|
09:30-09:55, Paper ThLB1R.21 | |
E-ARC: Experience-Based Subproblem Planning for Multi-Robot Motion Planning |
|
Solis Vidana, Juan Irving | University of Illinois Urbana-Champaign |
Motes, James | University of Illinois Urbana-Champaign |
Morales, Marco | University of Illinois Urbana-Champaign & Instituto Tecnológico |
Amato, Nancy | University of Illinois Urbana-Champaign |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Motion and Path Planning
Abstract: Multi-robot systems enhance efficiency and productivity across various applications, from manufacturing to surveillance. While single-robot motion planning has improved by using databases of prior solutions, extending this approach to multi-robot motion planning (MRMP) presents challenges due to the increased complexity and diversity of tasks and configurations. Recent discrete methods have attempted to address this by focusing on relevant lower-dimensional subproblems, but they are inadequate for complex scenarios like those involving manipulator robots. To overcome this, we propose a novel approach that constructs and utilizes databases of solutions for smaller sub-problems. By focusing on interactions between fewer robots, our method reduces the required database size compared to one that captures the full MRMP problem, enabling efficient handling of more complex scenarios. We validate our approach with experiments on mobile and manipulator robots, showing significant improvements in scalability and efficiency. Our method improves over its previous version with up to 40% faster planning and over existing experience-based methods with a 342× faster database construction and 2× faster planning. Our contributions include a rapidly constructed database for low-dimensional MRMP problems, a framework for applying these solutions to larger problems, and experimental validation with up to 32 mobile and 16 manipulator robots.
|
|
09:30-09:55, Paper ThLB1R.22 | |
Understanding Spatial Relationships with Spherical Image Data for Manipulating Intelligent Robotic Wheelchairs |
|
Sarathchandra, H.A.H.Y. | Shibaura Institute of Technology |
Shimbo, Y | Shibaura Institute of Technology |
Senevirathna, Nilupul Nuwan | Shibaura Institute of Technology |
Premachandra, Chinthaka | Shibaura Institute of Technology |
Keywords: Omnidirectional Vision, Visual Learning, Vision-Based Navigation
Abstract: Mobility assistance is one of the major supports for elderly and differently abled communities. Since wheelchairs are primarily used in human-centered areas, there is a significant risk of accidents caused by the wheelchair itself and by other pedestrians in the vicinity. To address this serious issue, this work proposes utilizing spherical camera data for environmental perception. Incorporating spherical data, however, introduces additional challenges, and omnidirectional cameras are currently only rarely used for navigation or scene understanding. Nevertheless, the equirectangular transformation of spherical data can help in understanding the surrounding environment, and these images offer better insight into the direction of a pedestrian's motion and their distance from the wheelchair. The study proposes an architecture to extract spatial relationships between the wheelchair and nearby pedestrians by identifying visual features and tracking the subjects' movements relative to the wheelchair. This method can reliably detect when a pedestrian is approaching the wheelchair or passing nearby without entering a danger zone, and the proposed architecture achieved over 95% accuracy at an average processing rate of 15 FPS. This demonstrates the potential of spherical image data, which can be further exploited to enable intelligent decisions by electric wheelchairs and to reduce the cognitive load on the user, making their lives more comfortable.
|
|
09:30-09:55, Paper ThLB1R.23 | |
RAFFM: Robot Assisted Feeding for Finger-Foods with Multimodal Capabilities |
|
Ramchandra, Mahanthesh | Cleveland State University, Cleveland, Ohio, Center for Human-Ma |
Miller, Jacob | Cleveland State University, Cleveland, Ohio, Center for Human-Ma |
Foley, Claire | Cleveland State University, Cleveland, Ohio, Center for Human-Ma |
Burkhart, Ian | North American Spinal Cord Injury Consortium |
Kubec, Gina | Cleveland State University, Cleveland, Ohio, Center for Human-Ma |
Schearer, Eric | The MetroHealth System |
Zingale, Nicholas | Cleveland State University |
Keywords: Multi-Modal Perception for HRI, Physically Assistive Devices, Object Detection, Segmentation and Categorization
Abstract: We present a robotic feeding system for finger foods for individuals with spinal cord injuries. Our system integrates a zero-shot object detection model, uses a Vision Language Model to recognize different food items, and leverages large language models to personalize feeding preferences. The platform is equipped with multimodal capabilities to process visual, auditory, and proprioceptive inputs in real time, respond with natural, human-like responses, and execute appropriate low-level robot actions. The system intelligently categorizes food items as single-bite or multi-bite to determine optimal grasp points. The robot also offers drink assistance using a straw-delivery approach. For safe and comfortable bite transfer, our solution implements dynamic positioning that adapts to the user's height and facial orientation, using visual servoing with lip detection to ensure food is delivered only when the user's mouth is open. Additionally, we provide a calibration method as an alternative to visual servoing, where the robot is set to admittance mode and manually moved to the user's desired bite transfer pose. User studies with seven participants, including one with a spinal cord injury, demonstrate excellent usability with SUS scores of 70 and NASA-TLX scores of 17, significantly outperforming the baseline average of 37±11.
|
|
09:30-09:55, Paper ThLB1R.24 | |
Human-Centered User Interface Design for Eye Imaging Medical Robot |
|
Staudinger, Samantha | University of Michigan |
Zhao, Genggeng | University of Michigan, Ann Arbor |
Pan, Haochi | University of Michigan |
Draelos, Mark | University of Michigan |
Keywords: Medical Robots and Systems
Abstract: We present the design and implementation of a user interface (UI) for a robotic optical coherence tomography (OCT) system, developed to enhance usability and accessibility in ophthalmic imaging. The system integrates a robotic arm that positions an actively tracking OCT scan head over an extended workspace, minimizing the need for mechanical head stabilization and manual alignment. The UI architecture supports both operator-assisted and autonomous imaging modes, offering real-time feedback, intuitive interaction, and motion-adaptive visualization to facilitate effective human-robot collaboration. Future work includes a formal user study to evaluate the UI’s performance in terms of usability, task efficiency, and adaptability.
|
|
09:30-09:55, Paper ThLB1R.25 | |
Equivariant Neural Inertial Odometry |
|
Kim, Chankyo | University of Michigan |
Lin, Tzu-Yuan | University of Michigan |
Zhu, Minghan | University of Michigan |
Liu, Ben | Southern University of Science and Technology |
Ghaffari, Maani | University of Michigan |
Keywords: Localization, Computational Geometry, AI-Based Methods
Abstract: Inertial odometry is essential for accurate localization in autonomous systems, particularly in environments where visual data is unreliable. However, noisy IMU measurements hinder accurate gravity compensation, which is critical for precise odometry. This paper presents a fully SO(3)-equivariant neural network framework for IMU-only odometry, designed to improve generalization across varying IMU mount orientations and mitigate challenges in gravity compensation. By representing IMU data in the Lie algebra and leveraging the symmetry properties of the rotation Lie group, the proposed method enforces 3D rotational symmetry within the feature space, leading to accurate state estimation in general unseen trajectories. Experimental results on diverse human motion datasets further demonstrate that our approach is more robust to gravitational perturbation on the IMU measurement, effectively reducing drift accumulation over the sequences. It remains robust under arbitrary IMU installation orientations, outperforming existing methods that rely on data augmentation or frame canonicalization. These improvements are achieved with a more compact network architecture, reducing computational complexity while maintaining accuracy. The results of this work enable reliable inertial position and orientation tracking in mobile robotics applications that have dynamic yaw direction motion with limited sensing capabilities.
|
|
09:30-09:55, Paper ThLB1R.26 | |
Fully Integrated Sensor Suite for Delicate Manipulation |
|
Shang, Siqi | University of Texas at Austin |
Seo, Mingyo | The University of Texas at Austin |
Zhu, Yuke | The University of Texas at Austin |
Chin, Lillian | UT Austin |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Grasping
Abstract: Parallel-jaw grippers can perform a wide range of complex, everyday tasks. However, they often rely on open-loop control without direct grasp feedback, leading to failures in delicate manipulation and fragile object handling. Tactile sensing can address this, but existing methods typically involve high integration complexity or compromise gripper compliance. There is a clear need for developing sensorized compliant grippers without sacrificing simplicity. We address this by embedding 3D-printed air channels within Fin-Ray fingers. These channels act as force and slip sensors, providing precise, real-time feedback while remaining simple to fabricate and integrate. Our system estimates grip force with an RMSE under 0.2 N and incorporates an analytical slip detector using second-order spectral features of tactile vibrations. We evaluate the system on an autonomous grasp-and-lift task using 31 objects spanning fragile, slippery, and ordinary categories. Across 310 trials, our slip detector achieved 0.93 accuracy and 1.0 precision. Our system achieved 91.9% overall success and 98.6% on fragile objects. In contrast, an on-off strategy caused breakage of 84.6% of fragile objects, while a naive initial 0.25 N grip force yielded only 1.25% success on ordinary objects. These results demonstrate that our design combines accurate, real-time tactile feedback with the inherent compliance of the Fin-Ray structure, enabling reliable manipulation of delicate and diverse objects.
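An illustrative sketch of "second-order spectral features" of a tactile vibration signal (spectral centroid and bandwidth); the signal model, sampling rate, and slip threshold below are assumptions, not the authors' detector:

```python
# Sketch: spectral centroid and bandwidth of an air-channel pressure signal,
# used to flag the high-frequency vibration burst typical of slip.
import numpy as np

def spectral_features(x, fs):
    X = np.abs(np.fft.rfft(x - np.mean(x)))
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)
    p = X / (X.sum() + 1e-12)                                # normalised magnitude spectrum
    centroid = np.sum(f * p)                                 # first spectral moment
    bandwidth = np.sqrt(np.sum(((f - centroid) ** 2) * p))   # second spectral moment
    return centroid, bandwidth

fs = 1000.0
t = np.arange(0, 0.25, 1 / fs)
steady = 0.02 * np.random.randn(t.size)                      # stable-grasp sensor noise
slipping = steady + 0.2 * np.sin(2 * np.pi * 180 * t)        # high-frequency slip burst
for name, sig in [("steady", steady), ("slipping", slipping)]:
    c, b = spectral_features(sig, fs)
    print(f"{name}: centroid={c:.1f} Hz, bandwidth={b:.1f} Hz, slip={c > 100}")
```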
|
|
09:30-09:55, Paper ThLB1R.27 | |
Key Capabilities of Autonomous Mobile Platforms for Maintenance and Monitoring in Manufacturing Environments |
|
Sun, Yung-Ching | University of Michigan, Ann Arbor |
Staudinger, Samantha | University of Michigan |
Chapin, Hanna | Nestlé Purina PetCare North America |
Carter, Alyssa | Nestlé Purina PetCare North America |
Barton, Kira | University of Michigan at Ann Arbor |
Tilbury, Dawn | University of Michigan |
Keywords: Manufacturing, Maintenance and Supply Chains, Mobile Manipulation, Object Detection, Segmentation and Categorization
Abstract: Quadruped robots are increasingly being deployed in manufacturing environments for autonomous inspection and manipulation tasks due to their unique mobility advantages, such as navigating tight spaces and stairs, which make them strong candidates for manufacturing plants. This paper investigates the potential of quadruped robots equipped with a 6-degree-of-freedom arm to perform complex tasks, including pick-and-place operations and identifying target objects in a cluttered environment. We conducted experiments focused on a quadruped robot's ability to autonomously pick up, carry, and place a bucket in a structured workflow. We also investigated the use of computer vision and AprilTags for object detection and tracking. The quadruped demonstrated robust performance in both tasks, although challenges arise when handling heavier loads or finding target objects in cluttered environments. These findings clarify the capabilities and limitations of autonomous quadruped robots with manipulators in manufacturing settings.
|
|
09:30-09:55, Paper ThLB1R.28 | |
Dual-Arm Teleoperated Robotic Microsurgery System with Real-Time Volumetric OCT Image Feedback |
|
Liu, Jiawei | University of Michigan |
Ma, Guangshen | Duke University |
Zhou, Genggeng | Stanford University |
Pan, Haochi | University of Michigan |
Lam, Colin | University of Michigan |
Jin, Catherine | University of Michigan |
Valikodath, Nita | University of Michigan |
Draelos, Mark | University of Michigan |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Telerobotics and Teleoperation
Abstract: Microsurgical procedures demand exceptional precision and dexterity while posing significant challenges related to depth perception, fatigue, and hand tremor. Although intraoperative perception systems in surgical robotics typically provide real-time image feedback, they frequently lack volumetric or depth information. Moreover, teleoperated surgical robotic systems often require extensive training because their control mechanisms may not align with a surgeon's intuitive practices, and standardizing these controls is further complicated by varying user preferences. To address these issues, we propose a dual-arm teleoperated robotic system that provides high-fidelity intraoperative volumetric imaging for micro-scale tissue manipulation. The system integrates an optical coherence tomography (OCT) sensor for real-time 3D visualization and is operated through haptic input devices for precise manipulation. Surgeons can choose between multiple teleoperation frameworks to match their individual preferences and intuitions. We evaluate system performance through a precision positioning task and a vessel-following task in a retinal model, demonstrating average positioning errors of approximately 230 μm and 85 μm, respectively. Finally, we demonstrate the fully integrated system in an eggshell membrane peeling task that simulates retinal membrane peeling.
|
|
09:30-09:55, Paper ThLB1R.29 | |
Training Robot Swarms for Adaptive Foraging in Environments with Obstacles |
|
Biteng, Pigar | The University of Texas Rio Grande Valley |
Zaman, Tameem Uz | The University of Texas Rio Grande Valley |
Lu, Qi | The University of Texas Rio Grande Valley |
Keywords: Swarm Robotics, Multi-Robot Systems, Bioinspired Robot Learning
Abstract: We apply NeuroEvolution of Augmenting Topologies (NEAT) to train adaptive and efficient swarm robotic foraging behavior in unknown environments with random obstacles. By rewarding effective actions and penalizing inefficient ones, the system promotes coordination and obstacle avoidance, minimizing redundant exploration and outperforming traditional stochastic foraging algorithms. The optimization focuses on cumulative reward fitness, evaluated through simulations with three types of resource distributions, where swarm performance is analyzed in terms of foraging time efficiency and resource retrieval rates. Comparative experiments conducted across two swarm sizes demonstrate that the penalty-reward strategy enhances resource retrieval rates while reducing energy expenditure and improving obstacle avoidance, showing a significant improvement in foraging success and scalability over traditional foraging algorithms. In future work, we will leverage Federated Learning (FL) to build an efficient, distributed, scalable, and secure robot swarm tailored for foraging tasks.
|
|
09:30-09:55, Paper ThLB1R.30 | |
Design Concept of Electromagnetic Tracking Method Based on a Robotic Arm for Pose Estimation |
|
Hao-Kuei, Lu | National Taiwan University |
Hsu, Kuo-En | National Taiwan University |
Po-Ju, Huang | National Taiwan University |
Lin, Chun-Yeon | National Taiwan University |
Keywords: Localization
Abstract: Because electromagnetic tracking does not require a line of sight, it can be used inside the body or behind obstructions, which makes electromagnetic tracking systems attractive for minimally invasive procedures. One critical issue is that electromagnetic tracking does not provide uniform accuracy throughout the tracking volume. This study proposes integrating an electromagnetic tracking system, consisting of a magnetic field generator with five excitation coils and a sensing coil, with a robotic arm. The magnetic field generator is installed on the robotic arm's end effector, and the pose of the tracking device can be obtained by calibrating the electromagnetic tracking system with respect to the robotic arm. After calibration, the pose of the sensing coil in robotic arm coordinates can be obtained and used as a closed-loop signal to control the robotic arm and place the excitation coils in a range suitable for sensing. The proposed method, which integrates an electromagnetic tracking system with a robotic arm, can extend the range of electromagnetic tracking while keeping the tracking within its accurate region.
|
|
ThBT1 |
302 |
Planning and Simulation |
Regular Session |
Chair: Yoshida, Kazuya | Tohoku University |
Co-Chair: Dantam, Neil | Colorado School of Mines |
|
09:55-10:00, Paper ThBT1.1 | |
Guarantees on Robot System Performance Using Stochastic Simulation Rollouts |
|
Vincent, Joseph | Stanford University |
Feldman, Aaron | Stanford University |
Schwager, Mac | Stanford University |
Keywords: Probability and Statistical Methods, Optimization and Optimal Control, Motion and Path Planning, Risk-Sensitive Control
Abstract: We provide finite-sample performance guarantees for control policies executed on stochastic robotic systems. Given an open- or closed-loop policy and a finite set of trajectory rollouts under the policy, we bound the expected value, value-at-risk, and conditional-value-at-risk of the trajectory cost, and the probability of failure in a sparse cost setting. The bounds hold, with user-specified probability, for any policy synthesis technique and can be seen as a post-design safety certification. Generating the bounds only requires sampling simulation rollouts, without assumptions on the distribution or complexity of the underlying stochastic system. We adapt these bounds to also give a constraint satisfaction test to verify safety of the robot system. We provide a thorough analysis of the bound sensitivity to sim-to-real distribution shifts and provide results for constructing robust bounds that can tolerate some specified amount of distribution shift. Furthermore, we extend our method to apply when selecting the best policy from a set of candidates, requiring a multi-hypothesis correction. We show the statistical validity of our bounds in the Ant, Half-cheetah, and Swimmer MuJoCo environments and demonstrate our constraint satisfaction test with the Ant. Finally, using the 20 degree-of-freedom MuJoCo Shadow Hand, we show the necessity of the multi-hypothesis correction.
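A generic finite-sample sketch in the spirit of the abstract, using a plain Hoeffding bound on the expected trajectory cost from i.i.d. rollouts; the bounds derived in the paper (including value-at-risk and CVaR variants) may be constructed differently and be tighter:

```python
# Generic sketch: with probability >= 1 - delta, the expected trajectory cost is
# at most the empirical mean plus a Hoeffding term, assuming costs lie in [0, c_max]
# and rollouts are i.i.d.
import numpy as np

def hoeffding_upper_bound(costs, c_max, delta=0.05):
    costs = np.asarray(costs, dtype=float)
    n = costs.size
    return costs.mean() + c_max * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

rollout_costs = np.random.default_rng(0).uniform(0.0, 1.0, size=500)  # stand-in rollout costs
bound = hoeffding_upper_bound(rollout_costs, c_max=1.0)
print(f"95%-confidence upper bound on expected cost: {bound:.3f}")
```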
|
|
10:00-10:05, Paper ThBT1.2 | |
In-Pipe Navigation Development Environment and a Smooth Path Planning Method on Pipeline Surface |
|
Liu, Hao | Independent |
Li, Xiang | The Lab for High Technology, Tsinghua University |
Zhang, Xiang | Qylab |
Liu, Gang | Tsinghua University |
Lu, Mingquan | Tsinghua University |
Keywords: Motion and Path Planning, Climbing Robots
Abstract: Autonomous in-pipe inspection robots can automatically navigate through complex pipeline networks and detect potential risks from corrosion and defects, demonstrating great potential for replacing costly manual inspections. However, to the best of our knowledge, there is no publicly available simulation environment in which researchers can validate their in-pipe navigation algorithms, and navigation algorithms on the constrained 3D pipe surface, a critical software component, are rarely discussed. First, this paper proposes an open-source In-Pipe Navigation Development Environment. It contains various pipeline models, a magnetic wheel climbing robot model realized by an adhesion plugin, and baseline algorithms for navigation tasks. Second, a novel and effective path planning method is introduced. Instead of planning directly on the surface structure, the proposed method plans along the pipeline axis and maps the result into a local path on the pipe surface using the Frenet-Serret formulas, thereby generating smooth, feasible, and efficient paths. Finally, we conduct both qualitative and quantitative experiments in the proposed simulation environment and in real-world environments. The results show the usability of the development environment as well as the robustness and efficiency of the proposed planning method.
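A rough sketch of the axis-to-surface mapping idea: plan along the pipeline axis, attach a moving frame at each axis sample (a simplified stand-in for the Frenet-Serret frame), and offset by the pipe radius; the axis curve, radius, and frame construction are illustrative assumptions:

```python
# Sketch: map a path planned along the pipeline axis onto the pipe wall using a
# moving frame at each axis sample. The reference-vector frame below assumes the
# tangent never becomes vertical; a full Frenet-Serret or parallel-transport frame
# would handle that case.
import numpy as np

def moving_frames(axis_pts):
    """Discrete tangent/normal/binormal-like frames along a sampled 3D axis curve."""
    t = np.gradient(axis_pts, axis=0)
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    ref = np.array([0.0, 0.0, 1.0])
    n = np.cross(t, ref)                      # pick a normal perpendicular to the tangent
    n /= np.linalg.norm(n, axis=1, keepdims=True)
    b = np.cross(t, n)
    return t, n, b

def surface_path(axis_pts, radius, theta):
    """Map each axis sample to a point on the pipe wall at angle theta around the axis."""
    _, n, b = moving_frames(axis_pts)
    return axis_pts + radius * (np.cos(theta) * n + np.sin(theta) * b)

s = np.linspace(0, 2 * np.pi, 100)
axis = np.stack([np.cos(s), np.sin(s), 0.2 * s], axis=1)   # toy curved pipeline axis
wall_path = surface_path(axis, radius=0.15, theta=np.pi / 2)
print(wall_path[:3])
```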
|
|
10:05-10:10, Paper ThBT1.3 | |
Extended Friction Models for the Physics Simulation of Servo Actuators |
|
Duclusaud, Marc | LaBRI - University of Bordeaux |
Passault, Grégoire | LaBRI |
Padois, Vincent | Inria Bordeaux |
Ly, Olivier | LaBRI - Bordeaux University |
Keywords: Simulation and Animation, Calibration and Identification
Abstract: Accurate physical simulation is crucial for the development and validation of control algorithms in robotic systems. Recent works in Reinforcement Learning (RL) notably take advantage of extensive simulations to produce efficient robot control. State-of-the-art servo actuator models generally fail to capture the complex friction dynamics of these systems. This limits the transferability of simulated behaviors to real-world applications. In this work, we present extended friction models that allow servo actuator dynamics to be simulated more accurately. We propose a comprehensive analysis of various friction models, present a method for identifying model parameters using recorded trajectories from a pendulum test bench, and demonstrate how these models can be integrated into physics engines. The proposed friction models are validated on four distinct servo actuators and tested on 2R manipulators, showing significant improvements in accuracy over the standard Coulomb-Viscous model. Our results highlight the importance of considering advanced friction effects in the simulation of servo actuators to enhance the realism and reliability of robotic simulations.
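For context, the sketch below contrasts the standard Coulomb-viscous friction torque with one common extension that adds a Stribeck (velocity-dependent stiction) term; parameter values are illustrative, and the paper's extended models may take a different form:

```python
# Sketch of friction torque models for a servo joint: Coulomb + viscous, and a
# common Stribeck-type extension. Parameters are illustrative only.
import numpy as np

def coulomb_viscous(qdot, tau_c=0.08, b=0.01):
    return tau_c * np.sign(qdot) + b * qdot

def stribeck_extended(qdot, tau_c=0.08, tau_s=0.15, v_s=0.05, b=0.01):
    stribeck = (tau_s - tau_c) * np.exp(-(np.abs(qdot) / v_s) ** 2)  # extra torque near zero velocity
    return (tau_c + stribeck) * np.sign(qdot) + b * qdot

for v in [0.01, 0.1, 1.0]:
    print(f"qdot={v:4.2f} rad/s  CV={coulomb_viscous(v):.4f} N·m  "
          f"Stribeck={stribeck_extended(v):.4f} N·m")
```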
|
|
10:10-10:15, Paper ThBT1.4 | |
Hierarchically Accelerated Coverage Path Planning for Redundant Manipulators |
|
Wang, Yeping | University of Wisconsin-Madison |
Gleicher, Michael | University of Wisconsin - Madison |
Keywords: Motion and Path Planning, Industrial Robots
Abstract: Many robotic applications, such as sanding, polishing, wiping and sensor scanning, require a manipulator to dexterously cover a surface using its end-effector. In this paper, we provide an efficient and effective coverage path planning approach that leverages a manipulator's redundancy and task tolerances to minimize costs in joint space. We formulate the problem as a Generalized Traveling Salesman Problem and hierarchically streamline the graph size. Our strategy is to identify guide paths that roughly cover the surface and accelerate the computation by solving a sequence of smaller problems. We demonstrate the effectiveness of our method through a simulation experiment and an illustrative demonstration using a physical robot.
|
|
10:15-10:20, Paper ThBT1.5 | |
Decentralized Safe and Scalable Multi-Agent Control under Limited Actuation |
|
Zinage, Vrushabh | University of Texas at Austin |
Jha, Abhishek | Delhi Technological University |
Chandra, Rohan | University of Virginia |
Bakolas, Efstathios | The University of Texas at Austin |
Keywords: Integrated Planning and Control, Multi-Robot Systems
Abstract: To deploy safe and agile robots in cluttered environments, there is a need to develop fully decentralized controllers that guarantee safety, respect actuation limits, prevent deadlocks, and scale to thousands of agents. Current approaches fall short of meeting all these goals: optimization-based methods ensure safety but lack scalability, while learning-based methods scale but do not guarantee safety. We propose a novel algorithm to achieve safe and scalable control for multiple agents under limited actuation. Specifically, our approach includes: (i) learning a decentralized neural Integral Control Barrier function (neural ICBF) for scalable, input-constrained control, (ii) embedding a lightweight decentralized Model Predictive Control-based Integral Control Barrier Function (MPC-ICBF) into the neural network policy to ensure safety while maintaining scalability, and (iii) introducing a novel method to minimize deadlocks based on gradient-based optimization techniques from machine learning to address local minima in deadlocks. Our numerical simulations show that this approach outperforms state-of-the-art multi-agent control algorithms in terms of safety, input constraint satisfaction, and minimizing deadlocks. Additionally, we demonstrate strong generalization across scenarios with varying agent counts, scaling up to 1000 agents.
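A toy sketch of the underlying control-barrier-function filtering idea for a single pairwise constraint and single-integrator dynamics, solved in closed form; the paper's neural Integral CBF with actuation limits and deadlock handling is considerably richer than this:

```python
# Toy CBF safety filter: minimally modify the nominal input so that the barrier
# h(x) = ||x - x_other||^2 - r_safe^2 stays nonnegative, for xdot = u.
import numpy as np

def cbf_filter(x, x_other, u_nom, r_safe=0.5, alpha=1.0, u_max=1.0):
    d = x - x_other
    h = d @ d - r_safe ** 2            # barrier: positive when agents are separated
    a = 2.0 * d                        # gradient of h for single-integrator dynamics
    # constraint: a @ u >= -alpha * h
    if a @ u_nom >= -alpha * h:
        u = u_nom
    else:                              # project the nominal input onto the constraint boundary
        u = u_nom + (-alpha * h - a @ u_nom) / (a @ a) * a
    return np.clip(u, -u_max, u_max)   # crude actuation limit (clipping may relax safety)

u = cbf_filter(x=np.array([0.0, 0.0]), x_other=np.array([0.6, 0.0]),
               u_nom=np.array([1.0, 0.0]))
print(u)                               # the agent brakes instead of driving into its neighbor
```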
|
|
10:20-10:25, Paper ThBT1.6 | |
Multi-Agent Collective Construction of General Modular Structures |
|
Kostitsyna, Irina | KBR at NASA Ames Research Center |
Cheung, Kenneth C. | National Aeronautics and Space Administration (NASA) |
Gloyd, James | KBR Inc |
Keywords: Motion and Path Planning, Parallel Robots, Robotics and Automation in Construction
Abstract: We present an algorithmic framework for a multi-robot modular assembly system. Motivated by the prospects of in-space assembly, we focus on the NASA Automated Reconfigurable Mission Adaptive Digital Assembly Systems (ARMADAS) framework, in which multiple types of robots work together in a team to build large structures. Unlike with other multi-robot construction systems, the geometry of structures that ARMADAS robots can build is not limited to the class of histogram shapes. To address the intractability of path planning for a robot system with the exponentially growing number of dimensions, we present a decoupled planning approach, where the assembly and path planning is performed iteratively by one robot team at a time. We present a number of data structures which help us avoid collisions and deadlocks in the resulting robot schedule.
|
|
10:25-10:30, Paper ThBT1.7 | |
BPMP-Tracker: A Versatile Aerial Target Tracker Using Bernstein Polynomial Motion Primitives |
|
Lee, Yunwoo | Seoul National University |
Park, Jungwon | Seoul National University |
Jeon, Boseong | Seoul National University |
Jung, Seungwoo | Seoul National University |
Kim, H. Jin | Seoul National University |
Keywords: Visual Servoing, Reactive and Sensor-Based Planning, Motion and Path Planning
Abstract: This letter presents a versatile trajectory planning pipeline for aerial tracking. The proposed tracker is capable of handling various chasing settings such as complex unstructured environments, crowded dynamic obstacles and multiple-target following. Among the entire pipeline, we focus on developing a predictor for future target motion and a chasing trajectory planner. For rapid computation, we employ the sample-check-select strategy: modules sample a set of candidate movements, check multiple constraints, and then select the best trajectory. Also, we leverage the properties of Bernstein polynomials for quick calculations. The prediction module predicts the trajectories of the targets, which do not overlap with static and dynamic obstacles. Then the trajectory planner outputs a trajectory, ensuring various conditions such as occlusion and collision avoidance, the visibility of all targets within a camera image and dynamical limits. We fully test the proposed tracker in simulations and hardware experiments under challenging scenarios, including dual-target following, environments with dozens of dynamic obstacles and complex indoor and outdoor spaces.
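A small sketch of evaluating a Bernstein-polynomial (Bezier) motion primitive with De Casteljau's algorithm; the control points are arbitrary illustrative values:

```python
# Sketch: evaluate a Bezier (Bernstein-basis) motion primitive with De Casteljau's algorithm.
import numpy as np

def de_casteljau(ctrl_pts, t):
    """Evaluate a Bezier curve defined by (n+1, dim) control points at t in [0, 1]."""
    pts = np.array(ctrl_pts, dtype=float)
    while len(pts) > 1:
        pts = (1.0 - t) * pts[:-1] + t * pts[1:]   # repeated linear interpolation
    return pts[0]

ctrl = [[0, 0, 1], [1, 0.5, 1.2], [2, 0.5, 1.5], [3, 0, 1.5]]   # toy cubic primitive
samples = [de_casteljau(ctrl, t) for t in np.linspace(0, 1, 5)]
print(np.round(samples, 3))
```

A useful property of such primitives is that the curve stays inside the convex hull of its control points, which keeps conservative collision and visibility checks cheap during the sample-check-select loop.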
|
|
ThBT2 |
301 |
SLAM 6 |
Regular Session |
Chair: Leonard, John | MIT |
Co-Chair: Schmid, Lukas M. | Massachusetts Institute of Technology (MIT) |
|
09:55-10:00, Paper ThBT2.1 | |
PIN-SLAM: LiDAR SLAM Using a Point-Based Implicit Neural Representation for Achieving Global Map Consistency |
|
Pan, Yue | University of Bonn |
Zhong, Xingguang | University of Bonn |
Wiesmann, Louis | University of Bonn |
Posewsky, Thorbjörn | University of Bonn |
Behley, Jens | University of Bonn |
Stachniss, Cyrill | University of Bonn |
Keywords: SLAM, Mapping, Localization, Deep Learning
Abstract: Accurate and robust localization and mapping are essential components for most autonomous robots. In this paper, we propose a SLAM system for building globally consistent maps, called PIN-SLAM, that is based on an elastic and compact point-based implicit neural map representation. Taking range measurements as input, our approach alternates between incremental learning of the local implicit signed distance field and the pose estimation given the current local map using a correspondence-free, point-to-implicit model registration. Our implicit map is based on sparse optimizable neural points, which are inherently elastic and deformable with the global pose adjustment when closing a loop. Loops are also detected using the neural point features. Extensive experiments validate that PIN-SLAM is robust to various environments and versatile to different range sensors such as LiDAR and RGB-D cameras. PIN-SLAM achieves pose estimation accuracy better or on par with the state-of-the-art LiDAR odometry or SLAM systems and outperforms the recent neural implicit SLAM approaches while maintaining a more consistent, and highly compact implicit map that can be reconstructed as accurate and complete meshes. Finally, thanks to the voxel hashing for efficient neural points indexing and the fast implicit map-based registration without closest point association, PIN-SLAM can run at the sensor frame rate on a moderate GPU.
|
|
10:00-10:05, Paper ThBT2.2 | |
Data-Driven Batch Localization and SLAM Using Koopman Linearization |
|
Guo, Zi Cong | University of Toronto |
Dümbgen, Frederike | ENS, PSL University |
Forbes, James Richard | McGill University |
Barfoot, Timothy | University of Toronto |
Keywords: Localization, SLAM, Koopman, Model Learning for Control
Abstract: We present a framework for model-free batch localization and SLAM. We use lifting functions to map a control-affine system into a high-dimensional space, where both the process model and the measurement model are rendered bilinear. During training, we solve a least-squares problem using groundtruth data to compute the high-dimensional model matrices associated with the lifted system purely from data. At inference time, we solve for the unknown robot trajectory and landmarks through an optimization problem, where constraints are introduced to keep the solution on the manifold of the lifting functions. The problem is efficiently solved using a sequential quadratic program (SQP), where the complexity of an SQP iteration scales linearly with the number of timesteps. Our algorithms, called Reduced Constrained Koopman Linearization Localization (RCKL-Loc) and Reduced Constrained Koopman Linearization SLAM (RCKL-SLAM), are validated experimentally in simulation and on two datasets: one with an indoor mobile robot equipped with a laser rangefinder that measures range to cylindrical landmarks, and one on a golf cart equipped with RFID range sensors. We compare RCKL-
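An illustrative sketch of the data-driven "lift, then fit by least squares" step on a toy pendulum; the lifting functions and synthetic data are assumptions, and the paper's bilinear control-affine formulation and constrained inference are richer than this:

```python
# Sketch: lift states with fixed functions and fit a linear one-step predictor in
# the lifted space by least squares (an EDMD-style model learned purely from data).
import numpy as np

def lift(x):
    th, om = x
    return np.array([1.0, th, om, np.sin(th), np.cos(th), om * np.sin(th)])

# Generate a short pendulum trajectory as training data (Euler integration).
dt, g_over_l = 0.01, 9.81
xs = [np.array([0.5, 0.0])]
for _ in range(500):
    th, om = xs[-1]
    xs.append(np.array([th + dt * om, om - dt * g_over_l * np.sin(th)]))

Z = np.array([lift(x) for x in xs])
A, *_ = np.linalg.lstsq(Z[:-1], Z[1:], rcond=None)   # lifted one-step predictor: z_next ~ z @ A

z_pred = Z[0]
for _ in range(100):
    z_pred = z_pred @ A                               # roll the lifted linear model forward
print("predicted theta after 100 steps:", z_pred[1], " true:", xs[100][0])
```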
|
|
10:05-10:10, Paper ThBT2.3 | |
Certifiably Correct Range-Aided SLAM |
|
Papalia, Alan | Massachusetts Institute of Technology |
Fishberg, Andrew | MIT |
O'Neill, Brendan | WHOI/MIT |
How, Jonathan | Massachusetts Institute of Technology |
Rosen, David | Northeastern University |
Leonard, John | MIT |
Keywords: SLAM, Range Sensing, Optimization and Optimal Control, Certifiable Perception
Abstract: We present the first algorithm to efficiently compute certifiably optimal solutions to range-aided simultaneous localization and mapping (RA-SLAM) problems. Robotic navigation systems increasingly incorporate point-to-point ranging sensors, leading to state estimation problems in the form of RA-SLAM. However, the RA-SLAM problem is significantly more difficult to solve than traditional pose-graph SLAM: ranging sensor models introduce non-convexity and single range measurements do not uniquely determine the transform between the involved sensors. As a result, RA-SLAM inference is sensitive to initial estimates yet lacks reliable initialization techniques. Our approach, certifiably correct RA-SLAM (CORA), leverages a novel quadratically constrained quadratic programming (QCQP) formulation of RA-SLAM to relax the RA-SLAM problem to a semidefinite program (SDP). CORA solves the SDP efficiently using the Riemannian Staircase methodology; the SDP solution provides both (i) a lower bound on the RA-SLAM problem's optimal value, and (ii) an approximate solution of the RA-SLAM problem, which can be subsequently refined using local optimization. CORA applies to problems with arbitrary pose-pose, pose-landmark, and ranging measurements and, due to using convex relaxation, is insensitive to initialization. We evaluate CORA on several real-world problems. In contrast to state-of-the-art approaches, CORA is able to obtain high-quality solutions on all problems despite being initialized with random values. Additionally, we study the tightness of the SDP relaxation with respect to important problem parameters: the number of (i) robots, (ii) landmarks, and (iii) range measurements. These experiments demonstrate that the SDP relaxation is often tight and reveal relationships between graph connectivity and the tightness of the SDP relaxation.
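As background, the standard (Shor) semidefinite relaxation of a homogeneous QCQP, which is the general pattern such certifiable relaxations follow (the paper's exact constraint structure differs):

\min_{x}\; x^{\top} Q\, x \;\;\text{s.t.}\;\; x^{\top} A_i\, x = b_i
\qquad\longrightarrow\qquad
\min_{X \succeq 0}\; \operatorname{tr}(Q X) \;\;\text{s.t.}\;\; \operatorname{tr}(A_i X) = b_i,

obtained by substituting X = x x^{\top} and dropping the nonconvex constraint \operatorname{rank}(X) = 1. The SDP optimum lower-bounds the QCQP optimum; when the relaxation is tight (the SDP solution has rank one), the recovered estimate is certified globally optimal.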
|
|
10:10-10:15, Paper ThBT2.4 | |
DiTer++: Diverse Terrain and Multi-Modal Dataset for Multi-Robot SLAM in Multi-Session Environments |
|
Kim, Juwon | Dept. Electr. and Comput. Eng., Inha University, South Korea |
Kim, Hogyun | Inha University |
Jeong, Seokhwan | Inha University |
Shin, Young-Sik | KIMM |
Cho, Younggun | Inha University |
Keywords: Data Sets for SLAM, Localization, Mapping
Abstract: We encounter large-scale environments where both structured and unstructured spaces coexist, such as on campuses. In such environments, lighting conditions and dynamic objects change constantly. To tackle the challenges of large-scale mapping under such conditions, we introduce DiTer++, a diverse terrain and multi-modal dataset designed for multi-robot SLAM in multi-session environments. In our dataset's scenarios, Agent-A and Agent-B scan the area designated for efficient large-scale mapping during the day and at night, respectively. We also utilize legged robots for terrain-agnostic traversal. To generate the ground truth for each robot, we first build a survey-grade prior map. Then, we remove the dynamic objects and outliers from the prior map and extract the trajectory through scan-to-map matching. Our dataset and supplementary materials are available at https://github.com/sparolab/DiTer-plusplus/.
|
|
10:15-10:20, Paper ThBT2.5 | |
CELLmap: Enhancing LiDAR SLAM through Elastic and Lightweight Spherical Map Representation |
|
Duan, Yifan | University of Science and Technology of China |
Zhang, Xinran | University of Science and Technology of China |
Li, Yao | University of Science and Technology of China |
You, Guoliang | University of Science and Technology of China |
Chu, Xiaomeng | University of Science and Technology of China |
Ji, Jianmin | University of Science and Technology of China |
Zhang, Yanyong | University of Science and Technology of China |
Keywords: Mapping, SLAM
Abstract: SLAM is a fundamental capability of unmanned systems, with LiDAR-based SLAM gaining widespread adoption due to its high precision. Current SLAM systems can achieve centimeter-level accuracy within a short period. However, several challenges remain when dealing with large-scale mapping tasks, including significant storage requirements and difficulty in reusing the constructed maps. To address this, we first design an elastic and lightweight map representation called CELLmap, composed of several CELLs, each representing the local map at the corresponding location. Then, we design a general backend including a CELL-based bidirectional registration module and a loop closure detection module to improve global map consistency. Our experiments demonstrate that CELLmap can represent the precise geometric structure of large-scale maps of the KITTI dataset using only about 60 MB. Additionally, our general backend achieves up to a 26.88% improvement over various LiDAR odometry methods.
|
|
10:20-10:25, Paper ThBT2.6 | |
A Benchmark Dataset for Collaborative SLAM in Service Environments |
|
Park, Harin | UNIST |
Lee, Inha | Ulsan National Institute of Science & Technology |
Kim, Minje | Ulsan National Institute of Science & Technology |
Park, Hyungyu | Ulsan National Institute of Science and Technology |
Joo, Kyungdon | UNIST |
Keywords: Data Sets for SLAM, Multi-Robot SLAM, Data Sets for Robotic Vision
Abstract: We introduce a new multi-modal collaborative SLAM (C-SLAM) dataset for multiple service robots in various indoor service environments, called C-SLAM dataset in Service Environments (CSE). We use the NVIDIA Isaac Sim to generate data in various indoor service environments with the challenges that may occur in real-world service environments. By using the simulator, we can provide accurate and precisely time-synchronized sensor data, such as stereo RGB, stereo depth, IMU, and ground truth poses. We configure three common indoor service environments (Hospital, Office, and Warehouse), each of which includes various dynamic objects that perform motions suitable to each environment. In addition, we drive the three robots to mimic the actions of real service robots. Through these factors, we generate a realistic C-SLAM dataset for multiple service robots. We demonstrate our CSE dataset by evaluating diverse state-of-the-art single-robot SLAM and multi-robot SLAM methods. Our dataset will be available at https://github.com/vision3d-lab/CSE_Dataset.
|
|
10:25-10:30, Paper ThBT2.7 | |
A Consistent Parallel Estimation Framework for Visual-Inertial SLAM |
|
Huai, Zheng | University of Delaware |
Huang, Guoquan (Paul) | University of Delaware |
Keywords: SLAM, Visual-Based Navigation, Sensor Fusion, Estimation Consistency
Abstract: In this article, we revisit the optimal fusion of visual and inertial information from a monocular camera and an inertial measurement unit, and propose a novel parallel visual-inertial simultaneous localization and mapping (SLAM) estimation framework that favors multithreaded computation on a single CPU. We start by modeling the SLAM problem with a Bayesian batch estimator and then split it into two submodules, localization and mapping, which operate at different scales and processing rates and can thus run concurrently. Estimation consistency is taken into account in decoupling the two submodules, so that when loop closure occurs the localization accuracy can seamlessly benefit from the mapping result via online global optimization, which distinguishes our solution from others. To this end, we design the corresponding front-end and back-end to consistently solve localization and mapping in parallel; in particular, hybrid robocentric and world-centric formulations are used to model the respective problems. We also demonstrate the effectiveness of the proposed method using both synthetic data generated for Monte-Carlo simulations and diverse real datasets acquired in highly dynamic, long-term, and large-scale SLAM scenarios. Simulation results validate the significantly improved consistency and accuracy achieved by applying our method. Experimental results show better (or at least competitive) performance against a state-of-the-art method, while our method is capable of processing a huge number of measurements when building large-scale maps without blocking the high-accuracy real-time localization outputs.
|
|
ThBT3 |
303 |
Pose Estimation |
Regular Session |
Chair: Caverly, Ryan James | University of Minnesota |
Co-Chair: Anderson, Monica | The University of Alabama |
|
09:55-10:00, Paper ThBT3.1 | |
Depth-Based Efficient PnP: A Rapid and Accurate Method for Camera Pose Estimation |
|
Xie, Xinyue | Dalian University of Technology |
Zou, Deyue | Dalian University of Technology |
Keywords: SLAM, Vision-Based Navigation
Abstract: This paper presents DEPnP (Depth-based Efficient PnP), a novel approach to the Perspective-n-Point (PnP) problem, which estimates the pose of a calibrated camera from the 2D projections of known 3D points onto the camera image plane and is crucial in vision-based navigation and SLAM (Simultaneous Localization and Mapping) for robotics and automation. The method employs eight variables to control the depth of the control points and the orientation of the camera, formulating camera pose estimation as an optimization task. By optimizing these variables using mean-subtracted rotation equations, rapid and accurate camera pose estimation is achieved. Notably, the careful selection of variables and objective function simplifies the computation of the Jacobian matrix, ensuring computational efficiency. DEPnP demonstrates robustness against noise and inlier disturbances, consistently delivering accurate camera pose estimation. Experimental evaluations validate the effectiveness and accuracy of DEPnP, positioning it as a competitive solution for real-time applications requiring precise camera pose estimation in robotics and automation.
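A minimal sketch of camera pose estimation as nonlinear least squares over reprojection error, using a plain axis-angle-plus-translation parameterization rather than DEPnP's eight depth and orientation variables (all data below are synthetic):

import numpy as np
from scipy.optimize import least_squares

def rodrigues(rvec):
    # Axis-angle vector -> rotation matrix (Rodrigues formula).
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

def reprojection_residuals(params, pts3d, pts2d, K):
    R, t = rodrigues(params[:3]), params[3:]
    cam = (R @ pts3d.T).T + t                 # 3D points in the camera frame
    proj = (K @ cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]         # perspective division
    return (proj - pts2d).ravel()

# Synthetic scene with a known pose.
rng = np.random.default_rng(2)
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pts3d = rng.uniform(-1, 1, size=(20, 3)) + np.array([0, 0, 5.0])
R_true, t_true = rodrigues(np.array([0.1, -0.2, 0.05])), np.array([0.2, -0.1, 0.3])
cam = (R_true @ pts3d.T).T + t_true
pts2d = (K @ cam.T).T
pts2d = pts2d[:, :2] / pts2d[:, 2:3]

sol = least_squares(reprojection_residuals, np.zeros(6), args=(pts3d, pts2d, K))
print("recovered translation:", sol.x[3:])    # approx. [0.2, -0.1, 0.3]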
|
|
10:00-10:05, Paper ThBT3.2 | |
Kalman-Filter-Based Pose Estimation of Cable-Driven Parallel Robots Using Cable-Length Measurements with Colored Noise |
|
Nguyen, Vinh | University of Minnesota |
Caverly, Ryan James | University of Minnesota |
Keywords: Parallel Robots, Kinematics, Sensor Fusion
Abstract: This paper introduces a cable-length-based extended Kalman filter (L-EKF) framework to estimate the end-effector pose of a cable-driven parallel robot (CDPR). The L-EKF fuses end-effector accelerometer and rate gyroscope measurements with cable-length measurements. The main contribution compared to prior CDPR pose estimation EKF methods is that the L-EKF framework does not require an iterative forward kinematics algorithm to be solved at each time step, reducing the computation time of the EKF. Moreover, the L-EKF is amenable to the inclusion of colored measurement noise, which provides a more realistic quantification of the kinematic uncertainty present in the cable-length measurements. Experimental results demonstrate that the L-EKF is computationally more efficient than previous forward-kinematics-based EKF methods and that the colored noise model provides a moderate improvement in pose estimation.
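One common way to make an EKF amenable to colored measurement noise, shown here only as a generic construction (the paper's model may differ), is to describe the noise as a first-order Gauss-Markov process and augment the state:

\eta_{k+1} = \phi\,\eta_k + w_k,\quad w_k \sim \mathcal{N}(0, Q_{\eta}), \qquad
x_k^{a} = \begin{bmatrix} x_k \\ \eta_k \end{bmatrix}, \qquad
\ell_k = h(x_k) + \eta_k + v_k,

where \ell_k is the cable-length measurement, h(\cdot) the cable-length kinematics, and the EKF then runs on the augmented state x_k^{a} with the white noises w_k and v_k.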
|
|
10:05-10:10, Paper ThBT3.3 | |
A Unified End-To-End Network for Category-Level and Instance-Level Object Pose Estimation from RGB Images |
|
Ren, Jiale | Peking University |
Liu, Hong | Peking University |
Liu, Jinfu | Peking University |
Jiang, Peifeng | Peking University |
Keywords: Deep Learning for Visual Perception, Visual Learning, Deep Learning Methods
Abstract: Accurately estimating the 6-DoF pose of objects is a fundamental challenge in computer vision and robotics. While category-level pose estimation based on RGBD data has achieved good performance in recent years, estimating poses solely from RGB images remains a significant challenge. Existing RGB-based category-level methods primarily focus on recovering object point clouds from RGB images, and pose prediction is not performed end-to-end by a network. This paper presents a Category-level and Instance-level Pose Estimation Network (CIPE), which models pose estimation as a set prediction problem and enables direct pose regression from RGB images. To further enhance the network's ability to learn object poses, first, a novel learnable rotation representation that redefines rotation learning within Euclidean space is introduced to facilitate rotation regression. Additionally, we propose a prior-query fusion strategy that utilizes a pre-trained point cloud feature extraction network to integrate categorical object features with bounding boxes, thereby improving the incorporation of category information. Experimental results demonstrate that CIPE significantly outperforms existing RGB-based methods on both category-level and instance-level datasets. The code is available at https://github.com/jialeren/CIPE.
|
|
10:10-10:15, Paper ThBT3.4 | |
MonoLDP: LED Assisted Indoor Mobile Bot Monocular Depth Prediction and Pose Estimation System |
|
Liang, Chenxin | Tsinghua University |
Wang, Jingyang | Shenzhen International Graduate School, Tsinghua University |
Li, Shoujie | Tsinghua Shenzhen International Graduate School |
Sou, Kit-Wa | Tsinghua University |
Luo, Xinyu | Tsinghua University |
Ding, Wenbo | Tsinghua University |
Keywords: Computer Vision for Automation, Visual Learning, RGB-D Perception
Abstract: Multi-robot clusters are increasingly deployed in indoor environments, where effective communication and 3D perception are critical for coordinated operations. Monocular cameras, known for their lightweight design, cost-effectiveness, and versatility, present a promising solution for these tasks. However, relying solely on monocular cameras for comprehensive perception and communication presents significant challenges. To address this, we introduce MonoLDP, a novel system that leverages monocular cameras for depth estimation, mutual pose estimation, and visible light communication in indoor environments, providing an integrated framework to overcome these limitations. MonoLDP features a two-stage network: (1) a depth estimation module that infers depth from monocular images, and (2) a depth-guided 3D object recognition network for agent-relative localization and pose estimation. We created a custom dataset to validate the accuracy of MonoLDP. On our indoor dataset, MonoLDP outperforms the baseline by 43.39% in 3D detection and 42.39% in bird’s-eye view detection, with an average localization error of 0.104m and an orientation error of 1.66 degrees. Moreover, the depth estimation network demonstrates excellent performance on the NYU v2 dataset. Additionally, the system achieves a communication rate of 1.2 Kbps with a bit error rate below 10^(-2) at a distance of up to 4 meters using LED arrays. Our code will be released at https://github.com/RavenLiang1005/MonoLDP.git.
|
|
10:15-10:20, Paper ThBT3.5 | |
LCSPose: Efficient, Accurate and Scalable Markerless 6-DoF Pose Estimation of a Quay Crane Spreader Based on LiDAR and Camera |
|
Zhou, Yichen | Nanyang Technological University |
Zhang, Jun | Nanyang Technological University |
Peng, Guohao | Nanyang Technological University |
Yun, Yanpu | Nanyang Technological University |
Liu, Yiyao | Nanyang Technological University |
Wang, Yuanzhe | Shandong University |
Wang, Danwei | Nanyang Technological University |
Keywords: Field Robots, Industrial Robots, Perception for Grasping and Manipulation
Abstract: Accurate Six Degrees of Freedom (6-DoF) pose estimation of Ship-To-Shore (STS) quay crane spreaders is crucial for ensuring safe and efficient container handling in port automation. However, existing pose estimation techniques face significant challenges, as camera-based systems either rely on markers, which are prone to damage, or struggle with depth estimation inaccuracies. Additionally, 3D sensor-based approaches, particularly point cloud registration (PCR), face challenges such as initial pose errors, high-latency inference, and difficulties in object identification based purely on geometric features. To address these limitations, we propose LCSPose, a LiDAR-camera fusion-based 6-DoF pose estimation method that is marker-free, accurate, efficient, and scalable. Our approach integrates three key modules: (1) a semantic-geometric segmentation module for spreader segmentation and outlier removal, (2) a spatial consistency template sampling module based on Spatial Consistency Score (SC-Score) for reliable template selection across varying distances, and (3) a multi-view coarse-to-fine pose refinement module which incorporates multi-view PCA alignment for robust initial posture prior estimation and iterative pose refinement strategy for long-range registration. Our method demonstrates a 60% improvement in registration recall over state-of-the-art (SOTA) PCR methods, achieving up to 6 cm in translation error and 0.19 degrees in rotation error, while maintaining real-time processing at 20Hz.
|
|
10:20-10:25, Paper ThBT3.6 | |
ZeroBP: Learning Position-Aware Correspondence for Zero-Shot 6D Pose Estimation in Bin-Picking |
|
Chen, Jianqiu | Harbin Institute of Technology |
Zhou, Zikun | Pengcheng Laboratory |
Li, Xin | Pengcheng Laboratory |
Bao, Tianpeng | Guangzhou Medical University |
Zheng, Ye | JD Logistics |
He, Zhenyu | Harbin Institute of Technology |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, RGB-D Perception
Abstract: Bin-picking is a practical and challenging robotic manipulation task, where accurate 6D pose estimation plays a pivotal role. The workpieces in bin-picking are typically textureless and randomly stacked in a bin, which poses a significant challenge to 6D pose estimation. Existing solutions are typically learning-based methods, which require object-specific training. Their efficiency of practical deployment for novel workpieces is highly limited by data collection and model retraining. Zero-shot 6D pose estimation is a potential approach to address the issue of deployment efficiency. Nevertheless, existing zero-shot 6D pose estimation methods are designed to leverage feature matching to establish point-to-point correspondences for pose estimation, which is less effective for workpieces with textureless appearances and ambiguous local regions. In this paper, we propose ZeroBP, a zero-shot pose estimation framework designed specifically for the bin-picking task. ZeroBP learns Position-Aware Correspondence (PAC) between the scene instance and its CAD model, leveraging both local features and global positions to resolve the mismatch issue caused by ambiguous regions with similar shapes and appearances. Extensive experiments on the ROBI dataset demonstrate that ZeroBP outperforms state-of-the-art zero-shot pose estimation methods, achieving an improvement of 9.1% in average recall of correct poses.
|
|
10:25-10:30, Paper ThBT3.7 | |
Virtual Frame Rotation: A Novel Two-Stage Pose Estimation Scheme of Permanent Magnet Marker for Medical Applications |
|
Park, Jiho | Gwangju Institute of Science and Technology |
Lim, Buyong | GIST |
Yoon, Jungwon | Gwangju Institute of Science and Technology |
Keywords: Medical Robots and Systems, Micro/Nano Robots, Localization
Abstract: Permanent magnetic marker (PMM) has the potential to broaden the scope of medical robots by facilitating the localization of target points even in environments where vision-based methods cannot operate. However, conventional approaches rely on the accuracy of the modeling equations and are not adaptable to changes in the magnet's properties, which can occur due to factors like non-uniformity in the marker material or temperature fluctuations within the PMM. These constraints make it challenging to apply the PMM across diverse medical techniques. In this work, we introduce a novel two-stage PMM localization scheme, called Virtual Frame Rotation (VFR), designed to address this issue. VFR employs an approach that virtually rotates the observation frame of the hall sensors' output vector and checks the symmetry of the magnetic field in the rotated frame. This approach allows for robust pose estimation of the condition with variance in magnetic properties, as verified by comparing its localization performance with the conventional approach in the simulation and the real-world environments with temperature variance conditions. Based on these characteristics, VFR can expand the scope of medical applications that involve changes in the properties of magnetic markers, such as the in-body localization of magnetic macro particles for hyperthermia treatment.
|
|
ThBT4 |
304 |
Bioinspiration and Biomimetics 1 |
Regular Session |
Chair: Mazzolai, Barbara | Istituto Italiano Di Tecnologia |
Co-Chair: Ramezani, Alireza | Northeastern University |
|
09:55-10:00, Paper ThBT4.1 | |
Back-Stepping Experience Replay with Application to Model-Free Reinforcement Learning for a Soft Snake Robot |
|
Qi, Xinda | Michigan State University |
Chen, Dong | Mississippi State University |
Li, Zhaojian | Michigan State University |
Tan, Xiaobo | Michigan State University |
Keywords: Modeling, Control, and Learning for Soft Robots, Reinforcement Learning, Biologically-Inspired Robots
Abstract: In this paper, we propose a novel technique, Back-stepping Experience Replay (BER), that is compatible with arbitrary off-policy reinforcement learning (RL) algorithms. BER aims to enhance learning efficiency in systems with approximate reversibility, reducing the need for complex reward shaping. The method constructs reversed trajectories using back-stepping transitions to reach random or fixed targets. Interpretable as a bi-directional approach, BER addresses inaccuracies in back-stepping transitions through a purification of the replay experience during learning. Given the intricate nature of soft robots and their complex interactions with environments, we present an application of BER in a model-free RL approach for the locomotion and navigation of a soft snake robot, which is capable of serpentine motion enabled by anisotropic friction between the body and ground. In addition, a dynamic simulator is developed to assess the effectiveness and efficiency of the BER algorithm, in which the robot demonstrates successful learning (reaching a 100% success rate) and adeptly reaches random targets, achieving an average speed 48% faster than that of the best baseline approach.
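A minimal sketch of how back-stepping transitions could populate an off-policy replay buffer, assuming an approximately reversible system and a user-supplied reverse-action model; BER's purification of inaccurate back-stepping transitions is not reproduced here:

from collections import deque
import random

def backstep_transitions(trajectory, reverse_action):
    # trajectory: list of (state, action, reward, next_state) tuples recorded forward in time.
    reversed_batch = []
    for s, a, r, s_next in reversed(trajectory):
        # Replay the step backwards: start from s_next and step toward s with an
        # approximate reverse action (only meaningful for near-reversible dynamics).
        reversed_batch.append((s_next, reverse_action(a), r, s))
    return reversed_batch

buffer = deque(maxlen=100_000)
trajectory = [((0.0,), +1.0, 0.0, (1.0,)), ((1.0,), +1.0, 1.0, (2.0,))]
buffer.extend(trajectory)
buffer.extend(backstep_transitions(trajectory, reverse_action=lambda a: -a))
print(random.sample(list(buffer), k=2))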
|
|
10:00-10:05, Paper ThBT4.2 | |
Continuous Convolution for Automated Measurement of Sperm Flagella |
|
Jin, Yufei | The Chinese University of Hong Kong (Shenzhen) |
Yang, Han | The Chinese University of Hong Kong, Shenzhen |
Chen, Wenyuan | University of Toronto |
Wang, Xinrui | The Chinese University of Hong Kong (Shenzhen) |
Sun, Yu | University of Toronto |
Zhang, Zhuoran | The Chinese University of Hong Kong, Shenzhen |
Keywords: Automation at Micro-Nano Scales, Deep Learning Methods, Computer Vision for Automation
Abstract: Quantifying sperm flagellar beating behavior (e.g., beating amplitude, frequency, and wavelength) plays a crucial role in biological research, clinical diagnostics, and the design of sperm-inspired microrobots. However, existing computational methods struggle to accurately and efficiently analyze the highly dynamic, complex, and fine structures of sperm flagella, especially when portions of the flagellum become invisible due to three-dimensional out-of-focus beating. This paper proposes an automated high-throughput tool for quantitative analysis of sperm flagellar beating. The core innovation is continuous convolution (CConv), which adaptively captures the irregular, time-varying patterns of sperm flagella while ensuring continuity in segmentation outputs, even in the presence of locally invisible regions caused by out-of-focus motion. CConv can be integrated into various neural network architectures as a plug-and-play module. Extensive experiments demonstrate that integrating CConv consistently improves the accuracy and continuity of flagella segmentation across different networks. Furthermore, utilizing a curvature-based approach, we quantified key flagellar beating parameters, including length, amplitude, frequency, and wavelength. Applying the high-throughput tool on 1200 sperm revealed that sperm from fertile donors had significantly higher flagellar beating frequency than sperm from infertile patients. The proposed automated tool unlocks high-throughput, quantitative analysis of sperm flagellar beating, showing the potential for applications in reproductive biology and engineering research.
|
|
10:05-10:10, Paper ThBT4.3 | |
Adaptive Concertina Locomotion of a Robotic Snake through Narrow Uncertain Channels |
|
Koley, Jit | Indian Institute of Technology Bombay |
Sharma, Devashish | Hindustan Institute of Technology and Science, Chennai |
Chakraborty, Debraj | Indian Institute of Technology Bombay |
K. Pillai, Harish | Indian Institute of Technology Bombay |
Keywords: Redundant Robots, Search and Rescue Robots, Actuation and Joint Mechanisms
Abstract: The problem of mimicking the concertina locomotion mode of biological snakes through narrow channels of uncertain width, using a multi-link planar serpenoid robot, is considered. A novel algorithm for generating a reference trajectory that accurately reproduces this natural gait pattern is proposed and analysed for straight channels. A modification of this algorithm leverages feedback from the joints’ current and angular velocities to dynamically adjust the robot’s movements within channels of unknown and varying widths. Experiments through rugged artificial channels of varying width show remarkable ability of the programmed snake robot to negotiate such terrains with agility and reasonable speed.
|
|
10:10-10:15, Paper ThBT4.4 | |
Bio-Inspired Distributed Neural Locomotion Controller (D-NLC) for Robust Locomotion and Emergent Behaviors |
|
Zhang, Zhikai | Carnegie Mellon University |
Guo, Siqi | Carnegie Mellon University |
Kou, Henry | Carnegie Mellon University |
Shikhare, Ishayu | Carnegie Mellon University |
Choset, Howie | Carnegie Mellon University |
Li, Lu | Carnegie Mellon University |
Keywords: Biologically-Inspired Robots, Cellular and Modular Robots, Neurorobotics
Abstract: With relatively fewer neurons than more complex life forms, insects are still capable of producing astonishing locomotive behaviors, such as traversing diverse habitats and making rapid gait adaptations after extreme injury or autotomy. Biologists attribute this to a chain of segmental neuron clusters (ganglia) within insect nervous systems, which act as distributed, self-organizing sensorimotor control units. Inspired by the neural structure of the Carausius morosus, the common stick insect, this research introduces the Distributed Neural Locomotion Controller (D-NLC), a modular control framework utilizing local proprioceptive feedback to modulate joint-level Central Pattern Generator (CPG) signals to produce emergent locomotive behaviors. We implemented this framework using a modular legged robot with distributed joint-level embedded computing units and assessed its performance and behavior under various experimental settings. Based on real-world experiments, we observe an overall 31.3% average increase in curvilinear motion performance under external (terrain) and internal (amputation) disturbances compared to a centralized predefined gait controller. This difference is statistically significant (P<<0.05) for larger perturbations but not for single-leg amputations. Experiments with perturbation-induced leg stance duration and leg-phase-difference analysis further validated our hypothesis regarding D-NLC's role in the robust perceptive locomotion and self-emergent gait adaptation against complex unforeseen perturbations. This proposed control framework does not require any numerical optimization or weight training processes, which are time-consuming and computationally expensive. To the best of our knowledge, this framework is the first bio-inspired neural controller deployed on a distributed embedded system.
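A minimal sketch of a joint-level phase-oscillator CPG whose frequency is modulated by a local proprioceptive load signal, a common building block in distributed insect-inspired controllers; the coupling structure and parameter values here are invented, not those of D-NLC:

import numpy as np

def cpg_step(phases, coupling, load, dt=0.01, omega=2 * np.pi):
    # phases: (n,) oscillator phases; coupling: (n, n) weights for anti-phase coordination;
    # load: (n,) proprioceptive feedback in [0, 1] that slows down loaded legs.
    dphi = omega * (1.0 - 0.5 * load)
    for i in range(len(phases)):
        dphi[i] += np.sum(coupling[i] * np.sin(phases - phases[i] - np.pi))
    return (phases + dt * dphi) % (2 * np.pi)

phases = np.array([0.0, np.pi])                      # two legs, anti-phase target
coupling = np.array([[0.0, 1.0], [1.0, 0.0]])
for _ in range(500):
    phases = cpg_step(phases, coupling, load=np.array([0.2, 0.0]))
print("leg phase difference:", (phases[1] - phases[0]) % (2 * np.pi))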
|
|
10:15-10:20, Paper ThBT4.5 | |
Reduced-Order Model-Based Gait Generation for Snake Robot Locomotion Using NMPC |
|
Salagame, Adarsh | Northeastern University |
Sihite, Eric | California Institute of Technology |
Ramezani, Milad | CSIRO |
Ramezani, Alireza | Northeastern University |
Keywords: Biologically-Inspired Robots, Optimization and Optimal Control, Motion Control
Abstract: This paper presents an optimization-based motion planning methodology for snake robots operating in constrained environments. By using a reduced-order model, the proposed approach simplifies the planning process, enabling the optimizer to autonomously generate gaits while constraining the robot’s footprint within tight spaces. The method is validated through high-fidelity simulations that accurately model contact dynamics and the robot’s motion. Key locomotion strategies are identified and further demonstrated through hardware experiments, including successful navigation through narrow corridors.
|
|
10:20-10:25, Paper ThBT4.6 | |
AquaMILR+: Design of an Untethered Limbless Robot for Complex Aquatic Terrain Navigation |
|
Fernandez, Matthew | Georgia Institute of Technology |
Wang, Tianyu | Georgia Institute of Technology |
Tunnicliffe, Galen | Georgia Institute of Technology |
Dortilus, Donoven | Georgia Institute of Technology |
Gunnarson, Peter | California Institute of Technology |
Dabiri, John | California Institute of Technology |
Goldman, Daniel | Georgia Institute of Technology |
Keywords: Biologically-Inspired Robots, Redundant Robots, Search and Rescue Robots
Abstract: This paper presents AquaMILR+, an untethered limbless robot designed for agile navigation in complex aquatic environments. The robot features a bilateral actuation mechanism that models musculoskeletal actuation in many anguilliform swimming organisms which propagates a moving wave from head to tail allowing open fluid undulatory swimming. This actuation mechanism employs mechanical intelligence, enhancing the robot's maneuverability when interacting with obstacles. AquaMILR+ also includes a compact depth control system inspired by the swim bladder and lung structures of eels and sea snakes. The mechanism, driven by a syringe and telescoping leadscrew, enables depth and pitch control -- capabilities that are difficult for most anguilliform swimming robots to achieve. Additional structures, such as fins and a tail, further improve stability and propulsion efficiency. Our tests in both open water and indoor 2D and 3D heterogeneous aquatic environments highlight AquaMILR+'s capabilities and suggest a promising system for complex underwater tasks such as search and rescue and deep-sea exploration.
|
|
10:25-10:30, Paper ThBT4.7 | |
Traversing between Two Planes Using Obstacle-Aided Locomotion of a Snake Robot |
|
Yoshida, Yuto | The University of Electro-Communications |
Chin, Ching Wen | The University of Electro-Communications |
Tanaka, Motoyasu | The Univ. of Electro-Communications |
|
|
ThBT5 |
305 |
Model Predictive Control for Legged Robots 1 |
Regular Session |
Chair: Lee, Jinoh | German Aerospace Center (DLR) |
Co-Chair: Zhao, Ye | Georgia Institute of Technology |
|
09:55-10:00, Paper ThBT5.1 | |
Adapting Gait Frequency for Posture-Regulating Humanoid Push-Recovery Via Hierarchical Model Predictive Control |
|
Li, Junheng | University of Southern California |
Le, Zhanhao | University of Southern California |
Ma, Junchao | University of Southern California |
Nguyen, Quan | University of Southern California |
Keywords: Humanoid and Bipedal Locomotion, Optimization and Optimal Control, Whole-Body Motion Planning and Control
Abstract: Current humanoid push-recovery strategies often use whole-body motion, yet they tend to overlook posture regulation. For instance, in manipulation tasks, the upper body may need to stay upright and have minimal recovery displacement. This paper introduces a novel approach to enhancing humanoid push-recovery performance under unknown disturbances and regulating body posture by tailoring the recovery stepping strategy. We propose a hierarchical-MPC-based scheme that analyzes and detects instability in the prediction window and quickly recovers through adapting gait frequency. Our approach integrates a high-level nonlinear MPC, a posture-aware gait frequency adaptation planner, and a low-level convex locomotion MPC. The planners predict the center of mass (CoM) state trajectories that can be assessed for precursors of potential instability and posture deviation. In simulation, we demonstrate improved maximum recoverable impulse by 131% on average compared with baseline approaches. In hardware experiments, a 125 ms advancement in recovery stepping timing/reflex has been observed with the proposed approach. We also demonstrate improved push-recovery performance and minimized body attitude change under 0.2 rad.
|
|
10:00-10:05, Paper ThBT5.2 | |
Robots with Attitude: Singularity-Free Quaternion-Based Model-Predictive Control for Agile Legged Robots |
|
Zhang, Zixin | Northwestern University |
Zhang, John | Carnegie Mellon University |
Yang, Shuo | Carnegie Mellon University |
Manchester, Zachary | Carnegie Mellon University |
Keywords: Legged Robots, Optimization and Optimal Control, Body Balancing
Abstract: We present a model-predictive control (MPC) framework for legged robots that avoids the singularities associated with common three-parameter attitude representations like Euler angles during large-angle rotations. Our method parameterizes the robot's attitude with singularity-free unit quaternions and makes modifications to the iterative linear-quadratic regulator (iLQR) algorithm to deal with the resulting geometry. The derivation of our algorithm requires only elementary calculus and linear algebra, deliberately avoiding the abstraction and notation of Lie groups. We demonstrate the performance and computational efficiency of quaternion MPC in several experiments on quadruped and humanoid robots.
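A common singularity-free construction of this kind, given only as a generic sketch (the paper derives its own quaternion-aware iLQR modifications), optimizes a three-parameter increment applied multiplicatively to a reference quaternion:

q = q_{\mathrm{ref}} \otimes \delta q(\phi), \qquad
\delta q(\phi) = \frac{1}{\sqrt{4 + \|\phi\|^{2}}}\begin{bmatrix} 2 \\ \phi \end{bmatrix}, \qquad \phi \in \mathbb{R}^{3},

so that gradients, feedback gains, and updates are computed for the local increment \phi while the attitude itself always remains a unit quaternion; no global three-parameter chart, and hence no gimbal-lock singularity, is ever used.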
|
|
10:05-10:10, Paper ThBT5.3 | |
Online Nonlinear MPC for Multimodal Locomotion |
|
Taliani, Saverio | Italian Institute of Technology |
Nava, Gabriele | Istituto Italiano Di Tecnologia |
L'Erario, Giuseppe | Istituto Italiano Di Tecnologia |
Elobaid, Mohamed | Fondazione Istituto Italiano Di Tecnologia |
Romualdi, Giulio | Istituto Italiano Di Tecnologia |
Pucci, Daniele | Italian Institute of Technology |
Keywords: Humanoid Robot Systems, Aerial Systems: Mechanics and Control, Control Architectures and Programming
Abstract: Aerial humanoid robots can enhance the efficiency and safety of rescue operations in disaster scenarios. The control of such complex machines presents many challenges, for instance, controlling the different locomotion strategies and stabilizing the transition maneuvers. In this article, we present an online nonlinear Model Predictive Controller and the associated prediction model to stabilize walking and flying trajectories. The controller uses a reduced model to generate feasible base link references, thrust profiles, and contact forces while dealing with different locomotion strategies and transition maneuvers. The control algorithm is tested in a simulated environment using our aerial humanoid robot iRonCub under the effect of external disturbances. The proposed control strategy is shown to effectively stabilize the desired trajectories while keeping the problem tractable online.
|
|
10:10-10:15, Paper ThBT5.4 | |
Terrain-Aware Model Predictive Control of Heterogeneous Bipedal and Aerial Robot Coordination for Search and Rescue Tasks |
|
Shamsah, Abdulaziz | Georgia Institute of Technology |
Jiang, Jesse | Georgia Institute of Technology |
Yoon, Ziwon | Georgia Institute of Technology |
Coogan, Samuel | Georgia Tech |
Zhao, Ye | Georgia Institute of Technology |
Keywords: Humanoid and Bipedal Locomotion, Multi-Robot Systems, Search and Rescue Robots
Abstract: Humanoid robots offer significant advantages for search and rescue tasks, thanks to their capability to traverse rough terrains and perform transportation tasks. In this study, we present a task and motion planning framework for search and rescue operations using a heterogeneous robot team composed of humanoids and aerial robots. We propose a terrain-aware Model Predictive Controller (MPC) that incorporates terrain elevation gradients learned using Gaussian processes (GP). This terrain-aware MPC generates safe navigation paths for the bipedal robots to traverse rough terrain while minimizing terrain slopes, and it directs the quadrotors to perform aerial search and mapping tasks. The rescue subjects' locations are estimated by a target belief GP, which is updated online during the map exploration. A high-level planner for task allocation is designed by encoding the navigation tasks using syntactically cosafe Linear Temporal Logic (scLTL), and a consensus-based algorithm is designed for task assignment of individual robots. We evaluate the efficacy of our planning framework in simulation in an uncertain environment with various terrains and random rescue subject placements.
|
|
10:15-10:20, Paper ThBT5.5 | |
Koopman Operator Based Linear Model Predictive Control for Quadruped Trotting |
|
Yang, Chun-Ming | University of Illinois at Chicago |
Bhounsule, Pranav | University of Illinois at Chicago |
Keywords: Legged Robots, Model Learning for Control, Force Control
Abstract: Online optimal control of quadruped robots would enable them to adapt to varying inputs and changing conditions in real time. A common way of achieving this is linear model predictive control (LMPC), where a quadratic programming (QP) problem is formulated over a finite horizon with a quadratic cost and linear constraints obtained by linearizing the equations of motion and solved on the fly. However, the model linearization may lead to model inaccuracies. In this paper, we use the Koopman operator to create a linear model of the quadrupedal system in high dimensional space which preserves the nonlinearity of the equations of motion. Then using LMPC, we demonstrate high fidelity tracking and disturbance rejection on a quadrupedal robot. This is the first work that uses the Koopman operator theory for LMPC of quadrupedal locomotion.
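A minimal sketch of linear MPC over a lifted state, assuming cvxpy is available and using a hand-specified toy lifted model rather than one identified with the Koopman operator as in the paper (all matrices, weights, and limits are invented):

import numpy as np
import cvxpy as cp

nz, nu, T = 4, 2, 10                                    # lifted-state dim, input dim, horizon
A = np.eye(nz) + 0.05 * np.diag(np.ones(nz - 1), k=1)   # toy lifted dynamics
B = 0.05 * np.vstack([np.zeros((nz - nu, nu)), np.eye(nu)])
z0 = np.array([1.0, 0.0, 0.0, 0.0])

z = cp.Variable((nz, T + 1))
u = cp.Variable((nu, T))
cost, constraints = 0, [z[:, 0] == z0]
for t in range(T):
    cost += cp.sum_squares(z[:, t + 1]) + 0.1 * cp.sum_squares(u[:, t])   # drive lifted state to zero
    constraints += [z[:, t + 1] == A @ z[:, t] + B @ u[:, t],
                    cp.norm(u[:, t], "inf") <= 5.0]
cp.Problem(cp.Minimize(cost), constraints).solve()
print("first control to apply:", u.value[:, 0])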
|
|
10:20-10:25, Paper ThBT5.6 | |
Kinodynamic Model Predictive Control for Energy Efficient Locomotion of Legged Robots with Parallel Elasticity |
|
Zhuang, Yulun | University of Michigan |
Wang, Yichen | University of Michigan |
Ding, Yanran | University of Michigan |
Keywords: Legged Robots, Optimization and Optimal Control, Compliant Joints and Mechanisms
Abstract: In this paper, we introduce a kinodynamic model predictive control (MPC) framework that exploits unidirectional parallel springs (UPS) to improve the energy efficiency of dynamic legged robots. The proposed method employs a hierarchical control structure, where the solution of MPC with simplified dynamic models is used to warm-start the kinodynamic MPC, which accounts for nonlinear centroidal dynamics and kinematic constraints. The proposed approach enables energy efficient dynamic hopping on legged robots by using UPS to reduce peak motor torques and energy consumption during stance phases. Simulation results demonstrated a 38.8% reduction in the cost of transport (CoT) for a monoped robot equipped with UPS during high-speed hopping. Additionally, preliminary hardware experiments show a 14.8% reduction in energy consumption.
|
|
10:25-10:30, Paper ThBT5.7 | |
Dynamic Bipedal MPC with Foot-Level Obstacle Avoidance and Adjustable Step Timing |
|
Wang, Tianze | Florida State University |
Hubicki, Christian | Florida State University |
Keywords: Legged Robots, Humanoid and Bipedal Locomotion, Whole-Body Motion Planning and Control
Abstract: Collision-free planning is essential for bipedal robots operating within unstructured environments. This paper presents a real-time Model Predictive Control (MPC) framework that addresses both body and foot avoidance for dynamic bipedal robots. Our contribution is two-fold: we introduce (1) a novel formulation for adjusting step timing to facilitate faster body avoidance and (2) a novel 3D foot-avoidance formulation that implicitly selects swing trajectories and footholds that either step over or navigate around obstacles with awareness of Center of Mass (COM) dynamics. We achieve body avoidance by applying a half-space relaxation of the safe region, but introduce a switching heuristic based on tracking error to detect the need to change foot-timing schedules. To enable foot avoidance and viable landing footholds on all sides of foot-level obstacles, we decompose the non-convex safe region on the ground into several convex polygons and use Mixed-Integer Quadratic Programming to determine the optimal candidate. We found that introducing a soft minimum-travel-distance constraint is effective in preventing the MPC from being trapped in local minima that can stall half-space relaxation methods behind obstacles. We demonstrate the proposed algorithms in multibody simulations of the bipedal robot platforms Cassie and Digit, as well as in hardware experiments on Digit.
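The convex-decomposition step can be encoded with a standard big-M construction, sketched here generically (the paper's exact constraints may differ). For safe polygons \{p : A_k p \le b_k\}, a foothold p and binary selectors z_k satisfy

A_k\, p \;\le\; b_k + M\,(1 - z_k)\,\mathbf{1} \quad \forall k, \qquad \sum_k z_k = 1, \qquad z_k \in \{0, 1\},

for a sufficiently large constant M, so exactly one polygon's constraints are enforced while the rest of the MPC remains quadratic, yielding a mixed-integer quadratic program.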
|
|
ThBT6 |
307 |
Perception for Manipulation 1 |
Regular Session |
Chair: Calli, Berk | Worcester Polytechnic Institute |
Co-Chair: Liu, Tengyu | Beijing Institute for General Artificial Intelligence |
|
09:55-10:00, Paper ThBT6.1 | |
Enhancing Robotic Perception with Low-Cost Fast Active Vision Achieving Sub-Millimeter Accurate Marker-Based Pose Estimation |
|
Knobbe, Dennis | Technical University of Munich |
Standke, Johann Jakob Wilhelm | Technical University of Munich |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Visual Servoing, Computer Vision for Automation, Performance Evaluation and Benchmarking
Abstract: Robust perception of the environment is a critical challenge for robots, especially those that use mobile platforms or humanoid forms to perform manipulation tasks. Active vision, leveraging strategic camera movements and adaptive imaging parameters, holds great potential for addressing critical challenges such as achieving high accuracy in precise manipulation, ensuring low latency for rapid responsiveness, and overcoming occlusions and illumination variations in dynamic environments. This paper introduces a novel, cost-effective, and easily deployable active vision system designed to enhance visual perception for robotic applications. Integrated with a novel hybrid software setup, the system utilizes ArUco markers to achieve high-accuracy, low-latency performance, boasting sub-millimeter and sub-degree accuracy at 200 Hz with a latency of less than 15 ms. Additionally, a new measurement and evaluation procedure is presented, offering benchmarking for marker-based object detection systems that for the first time includes rotation measurements as well. The benchmarking results for the proposed system indicate that achieving the desired performance levels necessitates specialized active vision measurement strategies. For instance, to ensure high positional accuracy, the system needs precise object centering, while high rotational accuracy requires accounting for lateral or rotational offsets.
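A minimal sketch of the marker-based pose estimation pipeline this class of system relies on, assuming OpenCV's ArUco module (API names vary slightly across OpenCV versions) and placeholder intrinsics and marker size:

import cv2
import numpy as np

K = np.array([[800.0, 0.0, 640.0], [0.0, 800.0, 360.0], [0.0, 0.0, 1.0]])  # placeholder intrinsics
dist = np.zeros(5)
marker_len = 0.05                                    # marker side length in meters (placeholder)
obj_pts = 0.5 * marker_len * np.array(
    [[-1, 1, 0], [1, 1, 0], [1, -1, 0], [-1, -1, 0]], dtype=np.float32)

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())  # OpenCV >= 4.7 API

def estimate_marker_poses(frame):
    # Detect markers and solve a planar PnP for each detected square.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = detector.detectMarkers(gray)
    poses = {}
    if ids is not None:
        for marker_id, c in zip(ids.flatten(), corners):
            ok, rvec, tvec = cv2.solvePnP(obj_pts, c.reshape(4, 2).astype(np.float32),
                                          K, dist, flags=cv2.SOLVEPNP_IPPE_SQUARE)
            if ok:
                poses[int(marker_id)] = (rvec, tvec)
    return poses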
|
|
10:00-10:05, Paper ThBT6.2 | |
PhysPart: Physically Plausible Part Completion for Interactable Objects |
|
Luo, Rundong | Cornell University |
Geng, Haoran | University of California, Berkeley |
Deng, Congyue | Stanford |
Li, Puhao | Tsinghua University |
Wang, Zan | Beijing Institute of Technology |
Jia, Baoxiong | Beijing Institute for General Artificial Intelligence |
Guibas, Leonidas | Stanford University |
Huang, Siyuan | Beijing Institute for General Artificial Intelligence |
Keywords: Perception for Grasping and Manipulation, Manipulation Planning
Abstract: Interactable objects are ubiquitous in our daily lives. Recent advances in 3D generative models make it possible to automate the modeling of these objects, benefiting a range of applications from 3D printing to the creation of robot simulation environments. However, while significant progress has been made in modeling 3D shapes and appearances, modeling object physics, particularly for interactable objects, remains challenging due to the physical constraints imposed by inter-part motions. In this paper, we tackle the problem of physically plausible part completion for interactable objects, aiming to generate 3D parts that not only fit precisely into the object but also allow smooth part motions. To this end, we propose a diffusion-based part generation model that utilizes geometric conditioning through classifier-free guidance and formulates physical constraints as a set of stability and mobility losses to guide the sampling process. Additionally, we demonstrate the generation of dependent parts, paving the way toward sequential part generation for objects with complex part-whole hierarchies. Experimentally, we introduce a new metric for measuring physical plausibility based on motion success rates. Our model outperforms existing baselines over shape and physical metrics, especially those that do not adequately model physical constraints. We also demonstrate our applications in 3D printing, robot manipulation, and sequential part generation, showing our strength in realistic tasks with the demand for high physical plausibility.
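A minimal sketch of the classifier-free-guidance update used to condition diffusion sampling on geometry; the denoiser below is a stand-in, and the paper's stability and mobility guidance losses are not reproduced:

import numpy as np

def cfg_noise_estimate(denoiser, x_t, t, cond, guidance_scale=4.0):
    # Blend unconditional and geometry-conditioned noise predictions.
    eps_uncond = denoiser(x_t, t, cond=None)
    eps_cond = denoiser(x_t, t, cond=cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Tiny stand-in denoiser so the snippet runs end to end (not a real network).
toy_denoiser = lambda x, t, cond: 0.1 * x if cond is None else 0.2 * x
x_t = np.random.default_rng(3).normal(size=(1, 2048, 3))   # noisy point cloud
print(cfg_noise_estimate(toy_denoiser, x_t, t=10, cond="part_bbox").shape)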
|
|
10:05-10:10, Paper ThBT6.3 | |
Generalizable Zero-Shot Object Pose Estimation for Bin-Picking |
|
Zhang, Zijiang | Kyushu Institute of Technology |
Huimin, Lu | Southeast University |
Jintong, Cai | Southeast University |
Kamiya, Tohru | Kyushu Institute of Technology |
Serikawa, Seiichi | Kyushu Institute of Technology |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation
Abstract: Unordered grasping in industrial robotic manipulation requires precise six-degree-of-freedom (6D) pose estimation. However, existing methods often struggle with unknown objects and require retraining, limiting their practicality. Traditional 3D point-pair feature methods, while training-free, perform poorly with textured symmetric objects. We propose a generalizable approach for zero-shot 6D pose estimation without retraining. Our method consists of two steps: generating CAD-based templates through real-time rendering for coarse pose estimation, and refining poses using semantic point-pair features aligned with the camera viewpoint. We conducted experiments on seven core datasets from the Benchmark for 6D Object Pose Estimation (BOP) challenge, and the results are publicly available on the BOP website. Integration into a robotic grasping system further highlights its high precision and fast execution, making it ideal for applications such as bin-picking.
|
|
10:10-10:15, Paper ThBT6.4 | |
Visuo-Tactile Object Pose Estimation for a Multi-Finger Robot Hand with Low-Resolution In-Hand Tactile Sensing |
|
Mack, Lukas | University of Augsburg |
Grüninger, Felix | Max Planck Institute for Intelligent Systems |
Richardson, Benjamin A. | Max Planck Institute for Intelligent Systems |
Lendway, Regine | University of Tuebingen |
Kuchenbecker, Katherine J. | Max Planck Institute for Intelligent Systems |
Stueckler, Joerg | University of Augsburg |
Keywords: Perception for Grasping and Manipulation, Sensor Fusion, Force and Tactile Sensing
Abstract: Accurate 3D pose estimation of grasped objects is an important prerequisite for robots to perform assembly or in-hand manipulation tasks, but object occlusion by the robot's own hand greatly increases the difficulty of this perceptual task. Here, we propose that combining visual information and proprioception with binary, low-resolution tactile contact measurements from across the interior surface of an articulated robotic hand can mitigate this issue. The visuo-tactile object-pose-estimation problem is formulated probabilistically in a factor graph. The pose of the object is optimized to align with the three kinds of measurements using a robust cost function to reduce the influence of visual or tactile outlier readings. The advantages of the proposed approach are first demonstrated in simulation: a custom 15-DoF robot hand with one binary tactile sensor per link grasps 17 YCB objects while observed by an RGB-D camera. This low-resolution in-hand tactile sensing significantly improves object-pose estimates under high occlusion and also high visual noise. We also show these benefits through grasping tests with a preliminary real version of our tactile hand, obtaining reasonable visuo-tactile estimates of object pose at approximately 13.3 Hz on average.
|
|
10:15-10:20, Paper ThBT6.5 | |
Proactive Tactile Exploration for Object-Agnostic Shape Reconstruction from Minimal Visual Priors |
|
Oikonomou, Paris | National Technical University of Athens (NTUA) |
Retsinas, George | National Technical University of Athens |
Maragos, Petros | National Technical University of Athens |
Tzafestas, Costas S. | ICCS - Inst of Communication and Computer Systems |
Keywords: Perception for Grasping and Manipulation
Abstract: The perception of an object’s surface is important for robotic applications enabling robust object manipulation. The level of accuracy in such a representation affects the outcome of the action planning, especially during tasks that require physical contact, e.g. grasping. In this paper, we propose a novel iterative method for 3D shape reconstruction consisting of two steps. At first, a mesh is fitted on data points acquired from the object’s surface, based on a single primitive template. Subsequently, the mesh is properly adjusted to adequately represent local deformities. Moreover, a novel proactive tactile exploration strategy aims at minimizing the total uncertainty with the least number of contacts, while reducing the risk of contact failure in case the estimated surface differs significantly from the real one. The performance of the methodology is evaluated both in 3D simulation and on a real setup.
|
|
10:20-10:25, Paper ThBT6.6 | |
Multi-Layer Feature Exchange Transformer for Multi-View 6D Object Pose Estimation in Robot Bin Picking |
|
Khalil, Momen | Technical University of Munich |
Dietrich, Vincent | Siemens Corporate Technology |
Ilic, Slobodan | Technische Universitat Munchen |
Keywords: Perception for Grasping and Manipulation, Computer Vision for Automation, Deep Learning for Visual Perception
Abstract: Accurate 6D object pose estimation is crucial in industrial automation, particularly in robotic bin picking, where objects are often textureless, reflective, and arranged in cluttered environments. Multi-view pose estimation methods offer significant advantages over single-view methods by providing more comprehensive information, effectively handling occlusions and lack of features, and resolving depth ambiguities. However, current multi-view methods often rely on late-stage information fusion, limiting their ability to fully exploit complementary multi-view data. This paper presents a novel approach to enhance multi-view 6D pose estimation by introducing a Feature Exchange Transformer (FET) for early-stage feature fusion. This approach leverages self-attention and epipolar cross-attention mechanisms to enable multi-layer feature aggregation across views. Additionally, we introduce a coarse-to-fine strategy for an efficient feature exchange at multiple network layers. Our method, implemented on top of EpiSurfEmb, enhances the utilization of multi-view information, leading to significant improvements in pose estimation accuracy and robustness, especially in challenging bin-picking scenarios. We evaluate our approach on the ROBI dataset, demonstrating that it outperforms both the baseline EpiSurfEmb and other state-of-the-art multi-view pose estimation methods
|
|
ThBT7 |
309 |
Assistive Human-Robot Interaction |
Regular Session |
Chair: Yanco, Holly | UMass Lowell |
Co-Chair: Haring, Kerstin Sophie | University of Denver |
|
09:55-10:00, Paper ThBT7.1 | |
DRAGON: A Dialogue-Based Robot for Assistive Navigation with Visual Language Grounding |
|
Liu, Shuijing | The University of Texas at Austin |
Hasan, Aamir | University of Illinois Urbana-Champaign |
Hong, Kaiwen | University of Illinois at Urbana Champaign |
Wang, Runxuan | University of Illinois at Urbana Champaign |
Chang, Peixin | University of Illinois at Urbana Champaign |
Mizrachi, Zachary | University of Illinois at Urbana Champaign |
Lin, Justin | University of Illinois at Urbana-Champaign |
McPherson, D. Livingston | University of Illinois |
Rogers, Wendy | University of Illinois Urbana-Champaign |
Driggs-Campbell, Katherine | University of Illinois at Urbana-Champaign |
Keywords: Human-Centered Robotics, Natural Dialog for HRI, AI-Enabled Robotics
Abstract: Persons with visual impairments (PwVI) have difficulties understanding and navigating spaces around them. Current wayfinding technologies either focus solely on navigation or provide limited communication about the environment. Motivated by recent advances in visual-language grounding and semantic navigation, we propose DRAGON, a guiding robot powered by a dialogue system and the ability to associate the environment with natural language. By understanding the commands from the user, DRAGON is able to guide the user to the desired landmarks on the map, describe the environment, and answer questions from visual observations. Through effective utilization of dialogue, the robot can ground the user's free-form descriptions to landmarks in the environment, and give the user semantic information through spoken language. We conduct a user study with blindfolded participants in an everyday indoor environment. Our results demonstrate that DRAGON is able to communicate with the user smoothly, provide a good guiding experience, and connect users with their surrounding environment in an intuitive manner. Videos are available at https://sites.google.com/view/dragon-wayfinding/home.
|
|
10:00-10:05, Paper ThBT7.2 | |
Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually Impaired |
|
Han, ByungOk | ETRI |
Yun, Woo-han | Electronics and Telecommunications Research Institute (ETRI) |
Seo, BeomSu | ETRI |
Kim, Jaehong | ETRI |
Keywords: Multi-Modal Perception for HRI, Data Sets for Robot Learning, Natural Dialog for HRI
Abstract: Guide dog robots offer promising solutions to enhance mobility and safety for visually impaired individuals, addressing the limitations of traditional guide dogs, particularly in perceptual intelligence and communication. With the emergence of Vision-Language Models (VLMs), robots are now capable of generating natural language descriptions of their surroundings, aiding in safer decision-making. However, existing VLMs often struggle to accurately interpret and convey spatial relationships, which is crucial for navigation in complex environments such as street crossings. We introduce the Space-Aware Instruction Tuning (SAIT) dataset and the Space-Aware Benchmark (SA-Bench) to address the limitations of current VLMs in understanding physical environments. Our automated data generation pipeline focuses on the virtual path to the destination in 3D space and the surroundings, enhancing environmental comprehension and enabling VLMs to provide more accurate guidance to visually impaired individuals. We also propose an evaluation protocol to assess VLM effectiveness in delivering walking guidance. Comparative experiments demonstrate that our space-aware instruction-tuned model outperforms state-of-the-art algorithms. We have fully open-sourced the SAIT dataset and SA-Bench, along with the related code, at https://github.com/byungokhan/Space-awareVLM.
|
|
10:05-10:10, Paper ThBT7.3 | |
FitnessAgent: A Unified Agent Framework for Open-Set and Personalized Fitness Evaluation |
|
Tang, Zhenhui | Shanghai Jiaotong University |
jiahao Li, Ljh | Dalian University of Technology |
Guo, Ping | Intel |
Tian, Bowen | University of Electronic Science and Technology of China |
Xing, Qingjun | Beijing Sport University |
Xing, XuYang | Nanjing University of Science and Technology |
Wang, Peng | Intel |
Keywords: Multi-Modal Perception for HRI, Computer Vision for Automation, Data Sets for Robotic Vision
Abstract: Robotic systems face challenges in performing open-set and personalized fitness evaluations, especially when adapting to new exercises and individual user needs. This paper introduces FitnessAgent, a unified agent framework designed to address these challenges. Unlike traditional systems that rely on pre-trained neural networks or fixed rule-based criteria, FitnessAgent can assess any exercise without prior training, adapting evaluation metrics based on expert knowledge and user-specific requirements. The system breaks down fitness evaluation tasks into combinations of metrics, each calculated using measurable operators such as angles, distances, and positions. By leveraging a set of primitive, exercise-agnostic operators, a large language model (LLM)-based planner dynamically selects and combines these operators for each task. The open-set capability of FitnessAgent is validated through experiments on both the widely-used Functional Movement Screen dataset and a newly collected isometric pose dataset. Results highlight the system's flexibility in handling new movements and its ability to adapt to personalized evaluation criteria without the need for code or algorithm modifications. FitnessAgent offers a scalable and personalized solution for fitness evaluation, making it well-suited for robotic applications that require adaptability to diverse user needs.
|
|
10:10-10:15, Paper ThBT7.4 | |
A Reinforcement Learning-Based Social Robot for Personalized Learning in Children with Autism |
|
Askari, Farzaneh | McGill University |
Abdollahi, Hojjat | University of Denver |
Haring, Kerstin Sophie | University of Denver |
Mahoor, Mohammad | University of Denver |
Keywords: Human-Robot Collaboration, Reinforcement Learning, Robot Companions
Abstract: This work hypothesizes that a social robot that uses reinforcement learning can effectively adapt to individual differences in teaching imitation skills (e.g., facial expressions) to children with autism spectrum disorder. We developed an active learning method based on reinforcement learning to personalize human-robot interaction sessions based on each child's imitation performance and preference. We evaluated this method with five children with autism spectrum disorder, and the results demonstrated varying responses to different methods of presenting facial expressions to teach imitation skills. We found that the robot consistently promoted increased shared attention, including visual contact and physical proximity during imitation tasks. This suggests that adaptive human-robot interactions can cater to the unique needs of children with autism, offering a promising avenue for personalized intervention. Additionally, we discuss observed qualitative insights from our study and considerations for robot behavior mitigation strategies to sustain engagement.
|
|
10:15-10:20, Paper ThBT7.5 | |
Comparison of User Interface Paradigms for Assistive Robotic Manipulators |
|
Sinclaire, Amelia | University of Massachusetts Lowell |
Wilkinson, Alexander | University of Massachusetts Lowell |
Kim, Boyoung | George Mason University Korea |
Yanco, Holly | UMass Lowell |
Keywords: Design and Human Factors, Virtual Reality and Interfaces
Abstract: This paper presents the results of a within-subjects user study with 27 participants over the age of 60, comparing the use of two different user interfaces for an assistive robot scooter. The graphical user interface (GUI) shows a representation of the environment on a 10-inch touchscreen. The tangible user interface (TUI) consists of a joystick, a box of buttons, and a projector -- designed to keep the user's attention in the real world. Trends suggest that the TUI could help mitigate difficulty caused by highly cluttered environments, as well as differences in individual spatial reasoning ability, but additional studies are needed.
|
|
10:20-10:25, Paper ThBT7.6 | |
VQA-Driven Event Maps for Assistive Navigation for People with Low Vision in Urban Environments |
|
Morales, Joseph | Massachusetts Institute of Technology |
Gebregziabher, Bruk | Biel Glasses |
Cabañeros, Alex | Biel Glasses |
Sanchez-Riera, Jordi | IRI, CSIC-UPC |
Keywords: Multi-Modal Perception for HRI, Semantic Scene Understanding, Human Performance Augmentation
Abstract: We introduce a novel framework for assistive urban navigation for individuals with low vision. Utilizing a smart glasses platform developed by Biel Glasses, which provides a continuous stream of stereo images and GPS fixes, we generate an Event Map based on key semantic elements extracted by carefully prompted visual question-answering (VQA) models. For individuals with blurry or reduced fields of vision (low vision), traversing city streets poses a variety of challenges; they may struggle to perceive construction work, potholes, crowded sidewalks, and other ambiguous obstacles obstructing their paths. Some tasks, such as distinguishing traffic light signals, are nigh impossible without assistance from a companion or city infrastructure aimed towards accessibility. Although the majority of these problems may be solved with individually tailored traditional computer vision algorithms, developing and running a suite of these algorithms is challenging and resource demanding. Therefore, our proposed solution capitalizes on a single underlying implementation that need only be extended by adding queries. We validate our approach using a custom dataset of over 1,300 annotated images from various locations around Barcelona, reporting performance across different urban navigation tasks. We demonstrate the performance of the end-to-end system on a run of data collected by the Biel Glasses platform.
|
|
ThBT8 |
311 |
Aerial Robots 4 |
Regular Session |
Chair: Aloimonos, Yiannis | University of Maryland |
Co-Chair: Foong, Shaohui | Singapore University of Technology and Design |
|
09:55-10:00, Paper ThBT8.1 | |
A Robust High-Strength Multi-Surface Rapid UV-Curable Payload Installation System for Generic Multirotors Via Impact Delivery |
|
Lim, Ryan Jon Hui | Singapore University of Technology & Design |
Tan, Jeck Chuang | Singapore University of Technology and Design |
Ng, Matthew | Singapore University of Technology and Design |
Low, Hong Yee | Singapore University of Technology and Design |
Foong, Shaohui | Singapore University of Technology and Design |
Keywords: Aerial Systems: Applications, Field Robots
Abstract: This letter details the design and development of a novel 3D-printed, lightweight and rapid-curing automated payload installation system for aerial robots, using a 3D printed resin-filled adhesive carrier tile (ACT). Its structure is designed to fracture and disperse ultraviolet (UV) curable resin on impact, delivered with a lightweight spring-driven impactor that rams the tile against a target surface. The dispersed resin is then cured with UV light. Shear-testing experiments with 40×40 mm ACTs across common building materials, surface conditions and roughness demonstrate loading exceeding 900 N after only 10 seconds of curing, showcasing the strength, robustness and speed of the proposed system. Automated payload installation experiments show potential for applications requiring strong and permanent bonds to wall structures, such as sensor payloads or tether points within urban environments. To the authors’ knowledge, this is the first work employing wet UV adhesives for payload installation via multirotors.
|
|
10:00-10:05, Paper ThBT8.2 | |
Multi-View Stereo with Geometric Encoding for Dense Scene Reconstruction |
|
Yang, Guidong | The Chinese University of Hong Kong |
Cao, Rui | The Chinese University of Hong Kong |
Wen, Junjie | The Chinese University of Hong Kong |
Zhao, Benyun | The Chinese University of Hong Kong |
Li, Qingxiang | The Chinese University of Hong Kong |
Huang, Yijun | The Chinese University of Hong Kong |
Lei, Lei | City University of Hong Kong |
Chen, Xi | The Chinese University of Hong Kong |
Lam, Alan Hiu-Fung | The Chinese University of Hong Kong |
Liu, Yunhui | Chinese University of Hong Kong |
Chen, Ben M. | Chinese University of Hong Kong |
Keywords: Aerial Systems: Applications
Abstract: Multi-view stereo (MVS) implicitly encodes photometric and geometric cues into the cost volume for multi-view correspondence matching, but transfers insufficient geometric cues, which are essential to depth estimation and reconstruction. This paper proposes GE-MVS, a novel multi-view stereo network with geometric encoding for more accurate and complete depth estimation and point cloud reconstruction. First, the cross-view adaptive cost volume aggregation module is proposed to strengthen multi-view geometric cues encoding during cost volume construction. Then, the depth consistency optimization is performed in the 3D point space during learning by invoking ground-truth depth cues from adjacent views. Finally, the surface normal geometries are explicitly encoded to refine the sampled depth hypotheses to be consistent in the local neighbor regions. Extensive experiments on the standard MVS benchmarks including DTU, Tanks and Temples, and BlendedMVS demonstrate the state-of-the-art depth estimation and point cloud reconstruction performance of GE-MVS. The GE-MVS is further deployed in real-world experiments for UAV-based large-scale reconstruction, where our method outperforms the prevalent industrial reconstruction solutions in terms of reconstruction efficiency and efficacy. Our project page is: https://cuhk-usr-group.github.io/GE-MVS/
|
|
10:05-10:10, Paper ThBT8.3 | |
MicroASV: An Affordable 3D-Printed Centimeter-Scale Autonomous Surface Vehicle |
|
Macauley, Kevin | University of Wisconsin-Madison |
Chen, Zhiheng | Cornell University |
Wang, Wei | University of Wisconsin-Madison |
Keywords: Marine Robotics, Swarm Robotics, Field Robots
Abstract: This paper introduces the design, fabrication, and autonomous control of MicroASV, a low-cost, centimeter-scale autonomous surface vehicle (ASV). MicroASV has a square footprint with a side length of 85 mm. Its propulsion system consists of four custom water jets arranged in a “Diamond”-shaped actuator configuration, powered by magnetically coupled brushless motors. This setup allows for complete 2D mobility, enabling forward and backward motion, lateral translation, and in-place rotation. The MicroASV is built using commercially available motors and 3D-printed components, creating a modular, appendage-free structure that is simple to assemble. An onboard camera and inertial measurement unit (IMU) are integrated to enable real-time localization, with position and heading controllers developed to provide autonomous feedback control. Preliminary experiments validate the platform’s effectiveness in motion, sensing, and control, establishing MicroASV as a valuable tool for studying centimeter-scale ASV control, both individually and in collective swarm operations.
|
|
10:10-10:15, Paper ThBT8.4 | |
Airflow Source Seeking on Small Quadrotors Using a Single Flow Sensor |
|
Thomas, Lenworth | Carnegie Mellon University |
Bridges, Tjaden | Carnegie Mellon University |
Bergbreiter, Sarah | Carnegie Mellon University |
Keywords: Aerial Systems: Applications, Environment Monitoring and Management, Reactive and Sensor-Based Planning
Abstract: As environmental disasters become more frequent and severe, seeking the source of pollutants or harmful particulates using plume tracking becomes even more important. Plume tracking on small quadrotors would allow these systems to operate around humans and fly in more confined spaces, but can be challenging due to poor sensitivity and long response times from gas sensors that fit on small drones. In this work, we present an approach to complement chemical plume tracking with airflow source-seeking behavior using a custom flow sensor that can sense both airflow magnitude and direction on small quadrotors (<100 g). We use this sensor to implement a modified version of the 'Cast and Surge' algorithm that takes advantage of flow direction sensing to find and navigate towards flow sources. A series of characterization experiments verified that the system can detect airflow while in flight and reorient the quadrotor toward the airflow. Several trials with random starting locations and orientations were used to show that our source-seeking algorithm can reliably find a flow source. This work aims to provide a foundation for future platforms that can use flow sensors in concert with other sensors to enable richer plume tracking data collection and source-seeking.
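For illustration, a minimal Python sketch of a cast-and-surge style heading policy in the spirit of the approach above; the flow-sensor interface, magnitude threshold, and sweep timing are assumptions and do not reproduce the authors' implementation.

```python
import numpy as np

def cast_and_surge_heading(flow_mag, flow_dir, t, cast_period=2.0, mag_threshold=0.5):
    """Heading command for a simplified cast-and-surge search policy.

    flow_mag  : measured airflow magnitude (units and threshold are assumed)
    flow_dir  : direction the flow comes from, in the world frame [rad]
    t         : elapsed time since the search started [s]
    Returns (heading_command [rad], mode string).
    """
    if flow_mag > mag_threshold:
        # Flow detected: surge upwind, i.e. fly into the measured flow direction.
        return flow_dir, "surge"
    # No usable flow: cast crosswind, alternating sweeps every cast_period seconds.
    sweep = np.pi / 2 if int(t // cast_period) % 2 == 0 else -np.pi / 2
    return sweep, "cast"

print(cast_and_surge_heading(0.1, 0.0, t=1.0))        # casting sweep while no flow
print(cast_and_surge_heading(1.2, np.pi / 2, t=3.0))  # surge toward the flow source
```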
|
|
10:15-10:20, Paper ThBT8.5 | |
Air-FAR: Fast and Adaptable Routing for Aerial Navigation in Large-Scale Complex Unknown Environments |
|
He, Botao | University of Maryland |
Chen, Guofei | Carnegie Mellon University |
Fermuller, Cornelia | University of Maryland |
Aloimonos, Yiannis | University of Maryland |
Zhang, Ji | Carnegie Mellon University |
Keywords: Field Robots, Task and Motion Planning, Aerial Systems: Perception and Autonomy
Abstract: This paper presents a novel method for real-time 3D navigation in large-scale, complex environments using a hierarchical 3D visibility graph (V-graph). The proposed algorithm addresses the computational challenges of V-graph construction and shortest path search on the graph simultaneously. By introducing hierarchical 3D V-graph construction with heuristic visibility update, the 3D V-graph is constructed in O(K·n² log n) time, which guarantees real-time performance. The proposed iterative divide-and-conquer path search method can achieve near-optimal path solutions within the constraints of real-time operations. The algorithm ensures efficient 3D V-graph construction and path search. Extensive tests in simulated and real-world environments validated that our algorithm reduces the travel time by 42%, achieves up to 24.8% higher trajectory efficiency, and runs faster than most benchmarks by orders of magnitude in complex environments. The code and developed simulator have been open-sourced to facilitate future research.
|
|
10:20-10:25, Paper ThBT8.6 | |
Multi-Agent Visual-Inertial Localization for Integrated Aerial Systems with Loose Fusion of Odometry and Kinematics |
|
Lai, Ganghua | Beijing Institute of Technology |
Shi, Chuanbeibei | University of Bristol |
Wang, Kaidi | Beijing Institute of Technology |
Yu, Yushu | Beijing Institute of Technology |
Dong, Yiqun | Nanyang Technological University |
Franchi, Antonio | University of Twente / Sapienza University of Rome |
Keywords: Aerial Systems: Applications, Localization, Multi-Robot SLAM
Abstract: Reliably and efficiently estimating the relative pose and global localization of robots in a common reference for Integrated Aerial Platforms (IAPs) is a challenging problem. Unlike unmanned aerial vehicle (UAV) swarms, where each agent is able to move freely, IAPs connect UAV agents with mechanical joints, such as spherical joints, and form a rigid central platform, limiting the degree of freedom (DOF) of agents. Traditional methods, which rely on forming loop closures, object detection, or range sensors, suffer from degeneration or inefficiency due to the restricted relative motion between agents. In this paper, we present a centralized multi-agent localization system that fuses the internal kinematic constraints of IAPs and odometry measurements, using only visual-inertial sensor suites for agent ego-motion estimation and an additional 9-DOF Inertial Measurement Unit (IMU) attached to the central platform for posture estimation. A general formulation for kinematic constraints is derived without requiring knowledge about detailed kinematic parameters. A sliding-window optimization-based state estimator is constructed to estimate the relative transformation between agents. Our proposed approach is validated on our collected dataset. The results show that the proposed method reduces the global localization drift by 27.15% and relative localization error by 53.4% in the translation part and 36.99% in the rotation part compared to the baseline.
|
|
10:25-10:30, Paper ThBT8.7 | |
Multi Map Visual Localization for Unmanned Aerial Vehicles |
|
Lømo, Tobias | University of Oslo |
Maffei, Renan | Federal University of Rio Grande Do Sul |
Kolberg, Mariana | UFRGS |
Torresen, Jim | University of Oslo |
Keywords: Aerial Systems: Perception and Autonomy, Localization, Vision-Based Navigation
Abstract: Localization has long been an essential area of research within robotics. The popularity of using Unmanned Aerial Vehicles (UAVs) to solve different tasks has increased and is expected to continue. Developing a robust system to complement the Global Navigation Satellite Systems (GNSS) used today has been an active research topic, and visual localization using cameras and satellite images is a popular choice. One of the challenges with using satellite images is that different images over the same area can impact the system’s performance. This article proposes a novel approach called Multi Map Visual Localization (MMVL), a method that uses multiple satellite images simultaneously, combining them using a weighted average of probability maps. The proposal uses a convolutional neural network (CNN) with a caching strategy together with Monte Carlo Localization (MCL). MMVL achieves excellent robustness compared to other approaches and manages to estimate the correct location on all test flights. At the same time, using multiple satellite images does not significantly impact accuracy and computation time.
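As an illustration of the combination step described in the abstract, a minimal sketch of fusing per-map localization likelihoods by a weighted average; the weights, map sizes, and normalization are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def fuse_probability_maps(prob_maps, weights):
    """Combine per-satellite-image localization probability maps.

    prob_maps : list of HxW arrays, one likelihood grid per satellite image
    weights   : per-map confidence weights (assumed given, e.g. from image quality)
    Returns a single HxW fused probability map normalized to sum to 1.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fused = np.zeros_like(prob_maps[0], dtype=float)
    for w, p in zip(weights, prob_maps):
        fused += w * p
    return fused / fused.sum()

# Example with two 3x3 toy maps that disagree on the peak location.
m1 = np.array([[0.1, 0.1, 0.1], [0.1, 0.9, 0.1], [0.1, 0.1, 0.1]])
m2 = np.array([[0.9, 0.1, 0.1], [0.1, 0.1, 0.1], [0.1, 0.1, 0.1]])
print(fuse_probability_maps([m1, m2], weights=[0.7, 0.3]).round(3))
```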
|
|
ThBT9 |
312 |
Task and Motion Planning 1 |
Regular Session |
Chair: Morales, Marco | University of Illinois Urbana-Champaign & Instituto Tecnológico Autónomo De México |
Co-Chair: Beetz, Michael | University of Bremen |
|
09:55-10:00, Paper ThBT9.1 | |
Task and Motion Planning for Execution in the Real |
|
Pan, Tianyang | Rice University |
Shome, Rahul | The Australian National University |
Kavraki, Lydia | Rice University |
Keywords: Task and Motion Planning, Motion and Path Planning, Manipulation Planning, Task Planning
Abstract: Task and motion planning represents a powerful set of hybrid planning methods that combine reasoning over discrete task domains and continuous motion generation. Traditional reasoning necessitates task domain models and enough information to ground actions to motion planning queries. Gaps in this knowledge often arise from sources like occlusion or imprecise modeling. This work generates task and motion plans that include actions that cannot be fully grounded at planning time. During execution, such an action is handled by a provided human-designed or learned closed-loop behavior. Execution combines offline planned motions and online behaviors until the task goal is reached. Failures of behaviors are fed back as constraints to find new plans. Forty real-robot trials and motivating demonstrations are performed to evaluate the proposed framework and compare against state-of-the-art. Results show faster execution time, fewer actions, and higher success in problems where diverse gaps arise. The experiment data is shared for researchers to simulate these settings. The work shows promise in expanding the applicable class of realistic partially grounded problems that robots can address.
|
|
10:00-10:05, Paper ThBT9.2 | |
Automated Planning Domain Inference for Task and Motion Planning |
|
Huang, Jinbang | University of Toronto |
Tao, Allen | University of Toronto |
Marco, Rozilyn | University of Toronto |
Bogdanovic, Miroslav | University of Toronto |
Kelly, Jonathan | University of Toronto |
Shkurti, Florian | University of Toronto |
Keywords: Task and Motion Planning, Integrated Planning and Learning
Abstract: Task and motion planning (TAMP) frameworks address long and complex planning problems by integrating high-level task planners with low-level motion planners. However, existing TAMP methods rely heavily on the manual design of planning domains that specify the preconditions and postconditions of all high-level actions. This paper proposes a method to automate planning domain inference from a handful of test-time trajectory demonstrations, reducing the reliance on human design. Our approach incorporates a deep learning-based estimator that predicts the appropriate components of a domain for a new task and a search algorithm that refines this prediction, reducing the size and ensuring the utility of the inferred domain. Our method can generate new domains from minimal demonstrations at test time, enabling robots to handle complex tasks more efficiently. We demonstrate that our approach outperforms behavior cloning baselines, which directly imitate planner behavior, in terms of planning performance and generalization across a variety of tasks. Additionally, our method reduces computational costs and data amount requirements at test time for inferring new planning domains.
|
|
10:05-10:10, Paper ThBT9.3 | |
Shadow Program Inversion with Differentiable Planning: A Framework for Unified Robot Program Parameter and Trajectory Optimization |
|
Alt, Benjamin | ArtiMinds Robotics |
Kienle, Claudius | ArtiMinds Robotics GmbH |
Katic, Darko | HFT STUTTGART |
Jäkel, Rainer | Karlsruhe Institute of Technology |
Beetz, Michael | University of Bremen |
Keywords: Motion and Path Planning, Task and Motion Planning, Integrated Planning and Learning
Abstract: This paper presents SPI-DP, a novel first-order optimizer capable of optimizing robot programs with respect to both high-level task objectives and motion-level constraints. To that end, we introduce DGPMP2-ND, a differentiable collision-free motion planner for serial N-DoF kinematics, and integrate it into an iterative, gradient-based optimization approach for generic, parameterized robot program representations. SPI-DP allows first-order optimization of planned trajectories and program parameters with respect to objectives such as cycle time or smoothness subject to e.g. collision constraints, while enabling humans to understand, modify or even certify the optimized programs. We provide a comprehensive evaluation on two practical household and industrial applications.
|
|
10:10-10:15, Paper ThBT9.4 | |
AlignBot: Aligning VLM-Powered Customized Task Planning with User Reminders through Fine-Tuning for Household Robots |
|
Zhaxizhuoma, Zhaxizhuoma | Shanghai Artificial Intelligence Laboratory |
Chen, Pengan | The University of Hong Kong |
Wu, Ziniu | University of Bristol |
Sun, Jiawei | Shanghai Artificial Intelligence Laboratory |
Wang, Dong | Shanghai Artificial Intelligence Laboratory |
Zhou, Peng | Great Bay University |
Cao, Nieqing | Binghamton University |
Ding, Yan | SUNY Binghamton |
Zhao, Bin | Northwestern Polytechnical University |
Li, Xuelong | Northwestern Polytechnical University |
Keywords: Task and Motion Planning, Human-Centered Robotics, Learning from Experience
Abstract: This paper presents AlignBot, a novel framework designed to optimize VLM-powered customized task planning for household robots by effectively aligning with user reminders. In domestic settings, aligning task planning with user reminders poses significant challenges due to the limited quantity, diversity, and multimodal nature of the reminders. To address these challenges, AlignBot employs a fine-tuned LLaVA-7B model, functioning as an adapter for GPT-4o. This adapter model internalizes diverse forms of user reminders, such as personalized preferences, corrective guidance, and contextual assistance, into structured prompts that guide GPT-4o in generating customized task plans. Additionally, AlignBot integrates a dynamic retrieval mechanism that selects task-relevant historical successes as prompts for GPT-4o, further enhancing task planning accuracy. To validate the effectiveness of AlignBot, experiments are conducted in real-world household environments, which are constructed within the laboratory to replicate typical household settings. A multimodal dataset with over 1,500 entries derived from volunteer reminders is used for training and evaluation. The results demonstrate that AlignBot significantly improves customized task planning, outperforming existing LLM- and VLM-powered planners by interpreting and aligning with user reminders, achieving an 86.8% success rate compared to the vanilla GPT-4o baseline at 21.6%, reflecting a 65-percentage-point improvement and over four times greater effectiveness. Supplementary materials are available at: https://yding25.com/AlignBot/
|
|
10:15-10:20, Paper ThBT9.5 | |
Curiosity-Driven Imagination: Discovering Plan Operators and Learning Associated Policies for Open-World Adaptation |
|
Lorang, Pierrick | AIT Austrian Institute of Technology GmbH - Tufts University |
Lu, Hong | Tufts University |
Scheutz, Matthias | Tufts University |
Keywords: Integrated Planning and Learning, Task and Motion Planning, Learning from Experience
Abstract: Adapting quickly to dynamic, uncertain environments, often called "open worlds", remains a major challenge in robotics. Traditional Task and Motion Planning (TAMP) approaches struggle to cope with unforeseen changes, are data-inefficient when adapting, and do not leverage world models during learning. We address this issue with a hybrid planning and learning system that integrates two models: a low-level neural network-based model that learns stochastic transitions and drives exploration via an Intrinsic Curiosity Module (ICM), and a high-level symbolic planning model that captures abstract transitions using operators, enabling the agent to plan in an "imaginary" space and generate reward machines. Our evaluation in a robotic manipulation domain with sequential novelty injections demonstrates that our approach converges faster and outperforms state-of-the-art hybrid methods.
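To make the intrinsic-curiosity idea referenced above concrete, a minimal numpy sketch where the intrinsic reward is the prediction error of a learned forward model; the linear model, learning rate, and state/action dimensions are placeholder assumptions, not the authors' ICM architecture.

```python
import numpy as np

class TinyForwardModel:
    """Linear forward model s' ~ W @ [s, a]; its prediction error drives curiosity."""

    def __init__(self, state_dim, action_dim, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(state_dim, state_dim + action_dim))
        self.lr = lr

    def intrinsic_reward(self, s, a, s_next):
        x = np.concatenate([s, a])
        err = s_next - self.W @ x
        # Curiosity bonus: squared prediction error (novel transitions score high).
        reward = 0.5 * float(err @ err)
        # One SGD step on the forward model, so repeated transitions become boring.
        self.W += self.lr * np.outer(err, x)
        return reward

model = TinyForwardModel(state_dim=3, action_dim=2)
s, a, s_next = np.zeros(3), np.ones(2), np.array([0.5, -0.2, 0.1])
for _ in range(3):
    print(round(model.intrinsic_reward(s, a, s_next), 4))  # decreases as the model learns
```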
|
|
10:20-10:25, Paper ThBT9.6 | |
Optimization-Based Task and Motion Planning under Signal Temporal Logic Specifications Using Logic Network Flow |
|
Lin, Xuan | UCLA |
Ren, Jiming | Georgia Institute of Technology |
Coogan, Samuel | Georgia Tech |
Zhao, Ye | Georgia Institute of Technology |
Keywords: Task and Motion Planning, Path Planning for Multiple Mobile Robots or Agents, Formal Methods in Robotics and Automation
Abstract: This paper proposes an optimization-based task and motion planning framework, named "Logic Network Flow", to integrate signal temporal logic (STL) specifications into efficient mixed-binary linear programs. In this framework, temporal predicates are encoded as polyhedron constraints on each edge of the network flow, instead of as constraints between the nodes as in the traditional Logic Tree formulation. Synthesized with Dynamic Network Flows, Logic Network Flows render a tighter convex relaxation compared to Logic Trees derived from these STL specifications. Our formulation is evaluated on several multi-robot motion planning case studies. Empirical results demonstrate that our formulation outperforms the Logic Tree formulation in terms of computation time for several planning problems. As the problem size scales up, our method still discovers better lower and upper bounds by exploring fewer nodes during branch-and-bound search.
|
|
10:25-10:30, Paper ThBT9.7 | |
Integrating Active Sensing and Rearrangement Planning for Efficient Object Retrieval from Unknown, Confined, Cluttered Environments |
|
Kim, Junyoung | Purdue University |
Ren, Hanwen | Purdue University |
Qureshi, Ahmed H. | Purdue University |
Keywords: Task and Motion Planning, Reactive and Sensor-Based Planning, Task Planning
Abstract: Retrieving target objects from unknown, confined spaces remains a challenging task that requires integrated, task-driven active sensing and rearrangement planning. Previous approaches have independently addressed active sensing and rearrangement planning, limiting their practicality in real-world scenarios. This paper presents a new, integrated heuristic-based active sensing and Monte-Carlo Tree Search (MCTS)-based retrieval planning approach. These components provide feedback to one another to actively sense critical, unobserved areas suitable for the retrieval planner to plan a sequence for relocating path-blocking obstacles and a collision-free trajectory for retrieving the target object. We demonstrate the effectiveness of our approach using a robot arm equipped with an in-hand camera in both simulated and real-world confined, cluttered scenarios. Our framework is compared against various state-of-the-art methods. The results indicate that our proposed approach outperforms baseline methods by a significant margin in terms of the success rate, the object rearrangement planning time consumption and the number of planning trials before successfully retrieving the target.
|
|
ThBT10 |
313 |
Multi-Robot Systems 4 |
Regular Session |
Chair: Keren, Sarah | Technion - Israel Institute of Technology |
Co-Chair: Zhao, Lin | National University of Singapore |
|
09:55-10:00, Paper ThBT10.1 | |
A Cooperative Bearing-Rate Approach for Observability-Enhanced Target Motion Estimation |
|
Zheng, Canlun | Westlake University |
Guo, Hanqing | Westlake University |
Zhao, Shiyu | Westlake University |
Keywords: Sensor Networks, Localization
Abstract: Vision-based target motion estimation is a fundamental problem in many robotic tasks. The existing methods have the limitation of low observability and, hence, face challenges in tracking highly maneuverable targets. Motivated by the aerial target pursuit task where a target may maneuver in 3D space, this paper studies how to further enhance observability by incorporating the bearing rate information that has not been well explored in the literature. The main contribution of this paper is to propose a new cooperative estimator called STT-R (Spatial-Temporal Triangulation with bearing Rate), which is designed under the framework of distributed recursive least squares. The resulting observability enhancement is further verified by numerical simulation and real-world experiments. It is shown that the proposed STT-R algorithm can generate more accurate estimates and effectively reduce the lag in velocity estimation, enabling tracking of more maneuverable targets.
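For readers unfamiliar with the recursive least squares framework the estimator above builds on, a minimal generic RLS sketch; the measurement model, dimensions, and noise values are toy assumptions and do not reproduce the STT-R equations.

```python
import numpy as np

class RecursiveLeastSquares:
    """Generic RLS estimator of x minimizing the sum of ||y_k - H_k x||^2."""

    def __init__(self, dim, p0=1e3):
        self.x = np.zeros(dim)        # estimate (e.g. target position/velocity)
        self.P = np.eye(dim) * p0     # estimate covariance

    def update(self, H, y, R):
        # Standard recursive least squares update for a measurement y = H x + noise.
        S = H @ self.P @ H.T + R
        K = self.P @ H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (y - H @ self.x)
        self.P = (np.eye(len(self.x)) - K @ H) @ self.P
        return self.x

# Toy example: noisy linear measurements of a constant 2D state from varying geometry.
rls = RecursiveLeastSquares(dim=2)
true_x = np.array([3.0, -1.0])
rng = np.random.default_rng(1)
for _ in range(50):
    H = rng.normal(size=(1, 2))
    y = H @ true_x + rng.normal(scale=0.05, size=1)
    est = rls.update(H, y, R=np.eye(1) * 0.05**2)
print(est.round(2))   # converges toward [3.0, -1.0]
```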
|
|
10:00-10:05, Paper ThBT10.2 | |
Overlapping Free: Anchorless UWB-Assisted Relative Pose Estimation for Multi-Robot Systems |
|
Yun, Yanpu | Nanyang Technological University |
Peng, Guohao | Nanyang Technological University |
Zhou, Yichen | Nanyang Technological University |
Zhang, Jun | Nanyang Technological University |
Liu, Yiyao | Nanyang Technological University |
Mao, Kaimin | Nanyang Technological University |
Wang, Danwei | Nanyang Technological University |
Keywords: Multi-Robot Systems
Abstract: Accurate Relative Pose Estimation (RPE) is critical for effective collaboration of multi-robot systems. Traditional methods using cameras or LiDARs heavily rely on overlapping Fields of View (FoV) between robots, which is highly demanding in practical applications and may hinder collaboration efficiency. To address this issue, we propose Anchorless UWB-Assisted Relative Pose Estimation (AURPE), a novel approach that leverages ultra-wideband (UWB) technology in an anchorless setup to achieve multi-robot RPE without requiring overlapping FoVs or external infrastructure. AURPE first estimates the initial relative poses between robots using inter-robot UWB ranging combined with a Bayesian framework and constrained optimization. During robot operation, AURPE continuously refines the relative poses by integrating UWB measurements with LiDAR-inertial odometry (LIO) and employs a consensus voting mechanism to identify the most reliable pose estimates. Additionally, a pose graph-based back-end optimization is incorporated to enhance the accuracy of both initial and real-time relative pose. Extensive simulations and real-world experiments demonstrate that AURPE achieves accurate RPE even in non-overlapping scenarios where traditional methods fail. Compared to state-of-the-art point cloud registration methods, AURPE shows superior performance in both accuracy and robustness, highlighting its potential to significantly enhance cooperative tasks in multi-robot systems operating in complex environments.
|
|
10:05-10:10, Paper ThBT10.3 | |
Maintaining Strong R-Robustness in Reconfigurable Multi-Robot Networks Using Control Barrier Functions |
|
Lee, Haejoon | University of Michigan |
Panagou, Dimitra | University of Michigan, Ann Arbor |
Keywords: Networked Robots, Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems
Abstract: In leader-follower consensus, strong r-robustness of the communication graph provides a sufficient condition for followers to achieve consensus in the presence of misbehaving agents. Previous studies have assumed that robots can form and/or switch between predetermined network topologies with known robustness properties. However, robots with distance-based communication models may not be able to achieve these topologies while moving through spatially constrained environments, such as narrow corridors, to complete their objectives. This paper introduces a Control Barrier Function (CBF) that ensures robots maintain strong r-robustness of their communication graph above a certain threshold without maintaining any fixed topologies. Our CBF directly addresses robustness, allowing robots to have flexible reconfigurable network structure while navigating to achieve their objectives. The efficacy of our method is tested through various simulation and hardware experiments.
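A minimal sketch of a generic control barrier function safety filter of the kind the abstract builds on; single-integrator dynamics, the barrier gradient, and the class-K gain are illustrative assumptions, and the paper's robustness-specific barrier is not reproduced here.

```python
import numpy as np

def cbf_filter(u_nom, grad_h, h, alpha=1.0):
    """Minimum-deviation safety filter for single-integrator dynamics x_dot = u.

    Enforces the standard CBF condition grad_h . u + alpha * h >= 0; in the paper,
    h would encode the strong r-robustness margin of the communication graph.
    """
    slack = grad_h @ u_nom + alpha * h
    if slack >= 0.0:
        return u_nom                                   # nominal input already safe
    # Closed-form solution of the min ||u - u_nom||^2 QP with one active constraint.
    return u_nom - (slack / (grad_h @ grad_h)) * grad_h

# Example: the nominal motion would violate the barrier condition; the filter corrects it.
u_nom = np.array([1.0, 0.0])
grad_h = np.array([-1.0, 0.5])   # assumed gradient of the robustness barrier
print(cbf_filter(u_nom, grad_h, h=0.2))   # corrected input satisfies the constraint
```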
|
|
10:10-10:15, Paper ThBT10.4 | |
Online Waypoint Recognition of Controlled Agents in Uncertain Environments |
|
Guo, Jia | Cornell University |
Surve, Sushrut | Cornell University |
He, Zilong | Cornell University |
Ferrari, Silvia | Cornell University |
Keren, Sarah | Technion - Israel Institute of Technology |
Keywords: Cooperating Robots, Integrated Planning and Control, Autonomous Agents
Abstract: For multi-robot teams with limited communication, the ability to rapidly recognize the intention of a teammate via its exhibited behavior is key to achieving effective collaboration. While current research on plan and goal recognition provides powerful tools, most of these rely on a high-level abstraction of the environment and of its dynamics. We propose online waypoint recognition (OWR) that incorporates knowledge about the dynamic models into the analysis of the observed agent behavior. Our algorithm takes the form of a Kalman filter and performs recognition of the agent's intended waypoint at high frequency. The approach is robust to uncertainties in dynamics and observations. Moreover, it does not require the agent to reach the next waypoint to perform recognition, which saves valuable time. Our empirical evaluation shows the ability of our proposed algorithm to expedite recognition of both simulated and real-world mobile robots.
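As a rough illustration of waypoint recognition from observed motion, a Bayesian hypothesis-scoring sketch; the goal-directed motion model, speed, and noise level are assumptions, and this is not the paper's Kalman-filter formulation.

```python
import numpy as np

def update_waypoint_belief(belief, pos, vel, waypoints, speed=1.0, sigma=0.3):
    """Recursive Bayesian scoring of candidate waypoints from observed motion.

    belief    : prior probability over candidate waypoints
    pos, vel  : observed agent position and velocity (e.g. from a state estimator)
    waypoints : candidate waypoint positions, shape (K, 2)
    Assumes a simple goal-directed motion model (move at `speed` toward the waypoint).
    """
    likelihoods = np.empty(len(waypoints))
    for k, w in enumerate(waypoints):
        direction = (w - pos) / (np.linalg.norm(w - pos) + 1e-9)
        err = np.linalg.norm(vel - speed * direction)
        likelihoods[k] = np.exp(-0.5 * (err / sigma) ** 2)
    posterior = belief * likelihoods
    return posterior / posterior.sum()

waypoints = np.array([[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]])
belief = np.ones(3) / 3
belief = update_waypoint_belief(belief, np.zeros(2), np.array([0.9, 0.1]), waypoints)
print(belief.round(3))   # probability mass shifts toward the waypoint at (5, 0)
```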
|
|
10:15-10:20, Paper ThBT10.5 | |
MARF: Cooperative Multi-Agent Path Finding with Reinforcement Learning and Frenet Lattice in Dynamic Environments |
|
Hu, Tianyang | Zhejiang University |
Zhang, Zhen | Zhejiang University |
Zhu, Chengrui | Zhejiang University |
Xu, Gang | Zhejiang University |
Wu, Yuchen | Zhejiang University |
Wu, Huifeng | Hangzhou Dianzi University |
Liu, Yong | Zhejiang University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning, Reinforcement Learning
Abstract: Multi-agent path finding (MAPF) in dynamic and complex environments is a highly challenging task. Recent research has often focused on the scalability of the number of robots or the complexity of the environment. Usually, they disregard the robots' physical models or use a differential drive robot. However, this approach fails to adequately capture the kinematic and dynamic constraints of real-world vehicles, particularly those equipped with Ackermann steering in warehousing applications. This paper presents a novel MAPF algorithm that combines reinforcement learning (RL) with a lattice planner. RL provides strong generalization capabilities while maintaining computational efficiency. By incorporating lattice planner trajectories into the action space of the RL framework, agents are capable of generating smooth and feasible paths that respect the kinematic and dynamic constraints. In addition, we adopt a decentralized training and execution framework, where a network of shared value functions enables efficient cooperation among agents during decision-making. Simulation results and real-world experiments in different scenarios demonstrate that our method achieves superior performance in terms of success rate, average speed, extra distance of trajectory, and computing time.
|
|
10:20-10:25, Paper ThBT10.6 | |
Robust Self-Reconfiguration for Fault-Tolerant Control of Modular Aerial Robot Systems |
|
Huang, Rui | National University of Singapore |
Tang, Siyu | National University of Singapore |
Cai, Zhiqian | National University of Singapore |
Zhao, Lin | National University of Singapore |
Keywords: Cellular and Modular Robots, Failure Detection and Recovery, Aerial Systems: Applications
Abstract: Modular Aerial Robotic Systems (MARS) consist of multiple drone units assembled into a single, integrated rigid flying platform. With inherent redundancy, MARS can self-reconfigure into different configurations to mitigate rotor or unit failures and maintain stable flight. However, existing works on MARS self-reconfiguration often overlook the practical controllability of intermediate structures formed during the reassembly process, which limits their applicability. In this paper, we address this gap by considering the control-constrained dynamic model of MARS and proposing a robust and efficient self-reconstruction algorithm that maximizes the controllability margin at each intermediate stage. Specifically, we develop algorithms to compute optimal, controllable disassembly and assembly sequences, enabling robust self-reconfiguration. Finally, we validate our method in several challenging fault-tolerant self-reconfiguration scenarios, demonstrating significant improvements in both controllability and trajectory tracking while reducing the number of assembly steps. The videos and source code of this work are available at https://github.com/RuiHuangNUS/MARS-Reconfig/
|
|
10:25-10:30, Paper ThBT10.7 | |
Where Are You? Unscented Particle Filter for Single Range Relative Pose Estimation in Unobservable Motion Using UWB and VIO |
|
Durodié, Yuri | Vrije Universiteit Brussel |
Convens, Bryan | Vrije Universiteit Brussel |
Liu, Gaoyuan | Vrije Universiteit Brussel |
Decoster, Thomas | Vrije Universiteit Brussel |
Munteanu, Adrian | Vrije Universiteit Brussel |
Vanderborght, Bram | Vrije Universiteit Brussel |
Keywords: Multi-Robot Systems, Localization, Sensor Fusion
Abstract: Real-time relative pose (RP) estimation is a cornerstone for effective multi-agent collaboration. When conventional global positioning infrastructure such as GPS is unavailable, the use of Ultra-Wideband (UWB) technology on each agent provides a practical means to measure inter-agent range, eliminating the need for external hardware installations, due to UWB’s precise range measurements and robust communication capabilities. However, when only a single UWB device per agent is used, the relative pose between the agents can be unobservable, resulting in a complex solution space with multiple possible RPs. In this paper, a novel method is proposed based on an Unscented Particle Filter (UPF) that fuses single UWB ranges with visual-inertial odometry (VIO). The proposed decentralized method solves the multi-modal solution in 3D (4-DoF) for the RP when it is unobservable. Moreover, a pseudo-state is introduced to correct for rotational drift of the agents. Through simulations and experiments involving two robots, the proposed solution was shown to be competitive, but less computationally expensive. Additionally, the proposed solution provides all possible relative poses from the first measurement. The code and link to the video are available https://github.com/y2d2/UPF_RPE.
|
|
ThBT11 |
314 |
Robot Vision 1 |
Regular Session |
Chair: Malis, Ezio | Inria |
Co-Chair: Culbertson, Preston | Cornell University |
|
09:55-10:00, Paper ThBT11.1 | |
Asynchronous Blob Tracker for Event Cameras |
|
Wang, Ziwei | Australian National University |
Molloy, Timothy L. | Australian National University |
van Goor, Pieter | University of Twente |
Mahony, Robert | Australian National University |
Keywords: Computer Vision for Automation, Aerial Systems: Perception and Autonomy, Visual Tracking, Event Cameras
Abstract: Event-based cameras are popular for tracking fast-moving objects due to their high temporal resolution, low latency, and high dynamic range. In this paper, we propose a novel algorithm for tracking event blobs using raw events asynchronously in real time. We introduce the concept of an event blob as a spatio-temporal likelihood of event occurrence where the conditional spatial likelihood is blob-like. Many real-world objects such as car headlights or any quickly moving foreground objects generate event blob data. The proposed algorithm uses a nearest neighbour classifier with a dynamic threshold criteria for data association coupled with an extended Kalman filter to track the event blob state. Our algorithm achieves highly accurate blob tracking, velocity estimation, and shape estimation even under challenging lighting conditions and high-speed motions (> 11000 pixels/s). The microsecond time resolution achieved means that the filter output can be used to derive secondary information such as time-to-contact or range estimation, which will enable applications to real-world problems such as collision avoidance in autonomous driving.
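A minimal sketch of the nearest-neighbour, dynamically gated data association step described above; the gate width, update rates, and the exponential mean/variance update standing in for the per-blob extended Kalman filter are all assumptions.

```python
import numpy as np

class Blob:
    def __init__(self, xy):
        self.mean = np.asarray(xy, dtype=float)   # current blob centre [px]
        self.var = 4.0                             # isotropic spatial variance, assumed init

def associate_event(blobs, event_xy, gate_sigmas=3.0, alpha=0.05):
    """Assign one incoming event to the nearest blob if inside a variance-scaled gate.

    A full implementation would run an extended Kalman filter per blob; here a
    simple exponential update of mean and variance stands in for it.
    """
    event_xy = np.asarray(event_xy, dtype=float)
    if blobs:
        dists = [np.linalg.norm(event_xy - b.mean) for b in blobs]
        i = int(np.argmin(dists))
        # Dynamic threshold: gate radius grows with the blob's spatial spread.
        if dists[i] < gate_sigmas * np.sqrt(blobs[i].var):
            b = blobs[i]
            b.var = (1 - alpha) * b.var + alpha * dists[i] ** 2
            b.mean = (1 - alpha) * b.mean + alpha * event_xy
            return i
    blobs.append(Blob(event_xy))   # no match: start a new blob track
    return len(blobs) - 1

blobs = []
for e in [(10, 10), (11, 9), (60, 40), (12, 10)]:
    print(associate_event(blobs, e))   # 0, 0, 1, 0
```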
|
|
10:00-10:05, Paper ThBT11.2 | |
Deep Height Decoupling for Precise Vision-Based 3D Occupancy Prediction |
|
Wu, Yuan | Nanjing University of Science and Technology |
Yan, Zhiqiang | Nanjing University of Science and Tenchnology |
Wang, Zhengxue | Nanjing University of Science and Technology |
Li, Xiang | Nankai University |
Hui, Le | Nanjing University of Science and Technology |
Yang, Jian | Nanjing University of Science & Technology |
Keywords: Computer Vision for Transportation, Semantic Scene Understanding
Abstract: The task of vision-based 3D occupancy prediction aims to reconstruct 3D geometry and estimate its semantic classes from 2D color images, where the 2D-to-3D view transformation is an indispensable step. Most previous methods conduct forward projection, such as BEVPooling and VoxelPooling, both of which map the 2D image features into 3D grids. However, a grid representing features within a certain height range usually introduces many confusing features that belong to other height ranges. To address this challenge, we present Deep Height Decoupling (DHD), a novel framework that incorporates an explicit height prior to filter out the confusing features. Specifically, DHD first predicts height maps via explicit supervision. Based on the height distribution statistics, DHD designs Mask Guided Height Sampling (MGHS) to adaptively decouple the height map into multiple binary masks. MGHS projects the 2D image features into multiple subspaces, where each grid contains features within reasonable height ranges. Finally, a Synergistic Feature Aggregation (SFA) module is deployed to enhance the feature representation through channel and spatial affinities, enabling further occupancy refinement. On the popular Occ3D-nuScenes benchmark, our method achieves state-of-the-art performance even with minimal input frames. Source code is released at https://github.com/yanzq95/DHD.
|
|
10:05-10:10, Paper ThBT11.3 | |
RE0: Recognize Everything with 3D Zero-Shot Instance Segmentation |
|
Yan, Xiaohan | Tongji University |
Jiang, Zijian | Tongji University |
Shuai, Yinghao | Tongji University |
Wang, Nan | Tongji University |
Song, Xiaowei | Tongji University |
Ji, Wenbo | Tongji University |
Wu, Ge | Nankai University |
He, Jinyu | Xiamen University |
Wei, Gang | Tongji University |
Wang, Zhicheng | Tongji University |
Keywords: Object Detection, Segmentation and Categorization, Semantic Scene Understanding, Embodied Cognitive Science
Abstract: Recognizing objects in the 3D world is a significant challenge for robotics. Due to the lack of high-quality 3D data, directly training a general-purpose segmentation model in 3D is almost infeasible. Meanwhile, vision foundation models (VFM) have revolutionized the 2D computer vision field with outstanding performance, making the use of VFM to assist 3D perception a promising direction. However, most existing VFM-assisted methods do not effectively address the 2D-3D inconsistency problem or adequately provide corresponding semantic information for 3D instance objects. To address these two issues, this paper introduces a novel framework for 3D zero-shot instance segmentation called RE0. For the given 3D point clouds and multi-view RGB-D images with poses, we leverage the 3D geometric information, projection relationships, and CLIP semantic features. Specifically, we utilize CropFormer to extract mask information from multi-view posed images, combined with projection relationships to assign point-level labels to each point in the point cloud, and achieve instance-level consistency through inter-frame information interaction. Then, we employ projection relationships again to assign CLIP semantic features to the point cloud and achieve aggregation of small-scale point clouds. Notably, RE0 does not require any additional training and requires only one inference of CropFormer and one inference of CLIP. Experiments on ScanNet200 and ScanNet++ show that our method achieves higher quality segmentation than the previous zero-shot methods. Our codes and demos are available at https://recognizeeverything.github.io/, with only one RTX 3090 GPU required.
|
|
10:10-10:15, Paper ThBT11.4 | |
PTQ4RIS: Post-Training Quantization for Referring Image Segmentation |
|
Jiang, Xiaoyan | Shanghai University of Engineering Science |
Yang, Hang | Shanghai University of Engineering Science |
Zhu, Kaiying | SenseTime |
Qiu, Xihe | Shanghai University of Engineering Science |
Zhao, Shibo | Carnegie Mellon University |
Zhou, Sifan | Southeast University |
Keywords: Robotics in Under-Resourced Settings, Object Detection, Segmentation and Categorization, Semantic Scene Understanding
Abstract: Referring Image Segmentation (RIS) aims to segment the object referred to by a given sentence in an image by understanding both visual and linguistic information. However, existing RIS methods tend to explore top-performance models, disregarding considerations for practical applications on resource-limited edge devices. This oversight poses a significant challenge for on-device RIS inference. To this end, we propose an effective and efficient post-training quantization framework termed PTQ4RIS. Specifically, we first conduct an in-depth analysis of the root causes of performance degradation in RIS model quantization and propose dual-region quantization (DRQ) and reorder-based outlier-retained quantization (RORQ) to address the quantization difficulties in visual and text encoders. Extensive experiments on three benchmarks with different bit settings (from 8 to 4 bits) demonstrate its superior performance. Importantly, ours is the first PTQ method specifically designed for the RIS task, highlighting the feasibility of PTQ in RIS applications.
|
|
10:15-10:20, Paper ThBT11.5 | |
LeAP: Consistent Multi-Domain 3D Labeling Using Foundation Models |
|
Gebraad, Simon | Delft University of Technology |
Palffy, Andras | Delft University of Technology |
Caesar, Holger | TU Delft |
Keywords: Data Sets for Robotic Vision, Sensor Fusion, Deep Learning for Visual Perception
Abstract: Availability of datasets is a strong driver for research on 3D semantic understanding, and whilst obtaining unlabeled 3D data is straightforward, manually annotating this data with semantic labels is time-consuming and costly. As a result, labeled 3D datasets have largely been confined to the popular automotive domain due to the abundance of labeled data. Recently, Vision Foundation Models (VFMs) enable open-set semantic segmentation, potentially aiding automatic labeling. However, VFMs for 3D data have been limited to adaptations of 2D models, which can introduce inconsistencies to 3D labels. This work introduces Label Any Pointcloud (LeAP), leveraging 2D VFMs to automatically label multi-frame 3D data with any set of classes in any kind of application whilst ensuring label consistency. Using a Bayesian update, point labels are combined into voxels to improve spatio-temporal consistency. A novel 3D Consistency Network (3D-CN) further enhances geometric consistency. Through various experiments, we show that our method can generate high-quality 3D semantic labels across diverse fields without any manual labeling. Further, models adapted to new domains using our labels show a significant mIoU increase in semantic segmentation tasks.
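A minimal sketch of the Bayesian point-to-voxel label fusion step mentioned above; the naive-Bayes style log-likelihood accumulation and the class count are assumed stand-ins for the paper's exact update rule.

```python
import numpy as np

def bayesian_voxel_update(log_probs, point_label_probs):
    """Fuse per-point class probabilities into a voxel's running label belief.

    log_probs         : current per-voxel log-probabilities over C classes
    point_label_probs : (N, C) predicted class probabilities of the points in this voxel
    Returns the updated log-belief and the current most likely class index.
    """
    log_probs = log_probs + np.log(point_label_probs + 1e-9).sum(axis=0)
    log_probs -= log_probs.max()             # numerical stability before normalizing
    probs = np.exp(log_probs)
    probs /= probs.sum()
    return np.log(probs), int(np.argmax(probs))

# Example with 3 classes: two frames vote for class 1, one noisy frame disagrees.
belief = np.log(np.ones(3) / 3)
frames = [np.array([[0.1, 0.8, 0.1]]), np.array([[0.2, 0.7, 0.1]]),
          np.array([[0.6, 0.3, 0.1]])]
for f in frames:
    belief, label = bayesian_voxel_update(belief, f)
print(label)   # 1: the consistent vote wins despite the noisy frame
```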
|
|
10:20-10:25, Paper ThBT11.6 | |
PlaceFormer: Transformer-Based Visual Place Recognition Using Multi-Scale Patch Selection and Fusion |
|
Kannan, Shyam Sundar | Purdue University |
Min, Byung-Cheol | Purdue University |
Keywords: Localization, Visual Learning, Deep Learning for Visual Perception
Abstract: Visual place recognition is a challenging task in the field of computer vision, and autonomous robotics and vehicles, which aims to identify a location or a place from visual inputs. Contemporary methods in visual place recognition employ convolutional neural networks and utilize every region within the image for the place recognition task. However, the presence of dynamic and distracting elements in the image can impact the effectiveness of the place recognition process. Therefore, it is meaningful to focus on the task-relevant regions of the image for improved recognition. In this paper, we present PlaceFormer, a novel transformer-based approach for visual place recognition. PlaceFormer uses patch tokens from the transformer to create global image descriptors, which are then used for image retrieval. To re-rank the retrieved images, PlaceFormer merges the patch tokens from the transformer to form multi-scale patches. Utilizing the transformer's self-attention mechanism, it selects patches that correspond to task-relevant areas in an image. These selected patches undergo geometric verification, generating similarity scores across different patch sizes. Subsequently, the spatial scores from each patch size are fused to produce a final similarity score. This score is then used to re-rank the images initially retrieved using global image descriptors. Extensive experiments on benchmark datasets demonstrate that PlaceFormer outperforms several state-of-the-art methods in terms of accuracy and computational efficiency, requiring less time and memory.
|
|
10:25-10:30, Paper ThBT11.7 | |
Motion-Aware Optical Camera Communication with Event Cameras |
|
Su, Hang | ShanghaiTech University |
Gao, Ling | ShanghaiTech University |
Liu, Tao | ShanghaiTech University |
Kneip, Laurent | ShanghaiTech University |
Keywords: Localization, Visual Tracking, Automation Technologies for Smart Cities
Abstract: As the ubiquity of smart mobile devices continues to rise, Optical Camera Communication systems have gained more attention as a solution for efficient and private data streaming. This system utilizes optical cameras to receive data from digital screens via visible light. Despite their promise, most of them are hindered by dynamic factors such as screen refreshing and rapid camera motion. CMOS cameras, often serving as the receivers, suffer from limited frame rates and motion-induced image blur, which degrade overall performance. To address these challenges, this paper unveils a novel system that utilizes event cameras. We introduce a dynamic visual marker and design event-based tracking algorithms to achieve fast localization and data streaming. Remarkably, the event camera's unique capabilities mitigate issues related to screen refresh rates and camera motion, enabling a high throughput of up to 114 Kbps in static conditions, and a 1 cm localization accuracy with 1% bit error rate under various camera motions. We plan on open-sourcing the code upon acceptance.
|
|
ThBT12 |
315 |
Applications in the Wild |
Regular Session |
Chair: Kelasidi, Eleni | NTNU |
Co-Chair: Hutter, Marco | ETH Zurich |
|
09:55-10:00, Paper ThBT12.1 | |
Hybrid State Estimation and Mode Identification of an Amphibious Robot |
|
Amundsen, Herman Bjørn | NTNU |
Randeni, Supun | Massachusetts Institute of Technology |
Bingham, Russell | Pliant Energy Systems Inc |
Civit, Carles | Pliant Energy Systems Inc |
Filardo, Benjamin Pietro | Pliant Energy Systems Inc |
Føre, Martin | NTNU |
Kelasidi, Eleni | NTNU |
Benjamin, Michael | Massachusetts Institute of Technology |
Keywords: Discrete Event Dynamic Automation Systems, Localization, Biologically-Inspired Robots
Abstract: C-Ray is an amphibious robot that is capable of swimming in water and crawling on land using its undulating fins, enabling operations in a wide range of environments. The robot can be modeled as a hybrid dynamical system whose dynamics and propulsion change when the robot transitions between water and land. Most importantly, the direction of wave travel in the robot's fins is reversed between its swimming and crawling locomotion styles. To operate autonomously, C-Ray requires both accurate identification of when transitions between water and land occur and robust state estimation in littoral environments where the transition dynamics are highly discontinuous and transient. This paper presents a hybrid observer for estimating continuous states and identifying state-driven mode switches for C-Ray, enabling autonomous water/land-transitions. The proposed observer is a combination of the multiplicative extended Kalman filter (MEKF) and the salted Kalman filter, a newly proposed Kalman filter for mapping state uncertainty during hybrid transitions. We also propose an altitude and sea floor geometry observer and incorporate this directly into the MEKF. The performance is evaluated in simulations.
|
|
10:00-10:05, Paper ThBT12.2 | |
LiDARDustX: A LiDAR Dataset for Dusty Unstructured Road Environments |
|
Wei, Chenfeng | Wuxi Intelligent Control Research Institute, HNU |
Wu, Qi | Wuxi Intelligent Control Research Institute, Hunan University |
Zuo, Si | Hunan University |
Xu, Jiahua | Wuxi Intelligent Control Research Institute, Hunan University |
Zhao, Boyang | Tsinghua University |
Yang, Zeyu | Hunan University |
Xie, Guotao | Hunan University |
Wang, Shenhong | Xi'an Jiaotong-Liverpool University |
Keywords: Data Sets for Robotic Vision, Object Detection, Segmentation and Categorization, Computer Vision for Transportation
Abstract: Autonomous driving datasets are essential for validating the progress of intelligent vehicle algorithms, which include localization, perception, and prediction. However, existing datasets are predominantly focused on structured urban environments, which limits the exploration of unstructured and specialized scenarios, particularly those characterized by significant dust levels. This paper introduces the LiDARDustX dataset, which is specifically designed for perception tasks under high-dust conditions, such as those encountered in mining areas. The LiDARDustX dataset consists of 30,000 LiDAR frames captured by six different LiDAR sensors, each accompanied by 3D bounding box annotations and point cloud semantic segmentation. Notably, over 80% of the dataset comprises dust-affected scenes. By utilizing this dataset, we have established a benchmark for evaluating the performance of state-of-the-art 3D detection and segmentation algorithms. Additionally, we have analyzed the impact of dust on perception accuracy and delved into the causes of these effects. The data and further information can be accessed at: https://github.com/vincentweikey/LiDARDustX.
|
|
10:05-10:10, Paper ThBT12.3 | |
How about Them Apples: 3D Pose and Cluster Estimation of Apple Fruitlets in a Commercial Orchard |
|
Qureshi, Ans | University of Auckland |
Smith, David Anthony James | University of Auckland |
Gee, Trevor | The University of Auckland |
Ahn, Ho Seok | The University of Auckland, Auckland |
McGuinness, Benjamin John | University of Waikato |
Downes, Catherine | University of Waikato |
Jangali, Rahul | The University of Waikato |
Black, Kale | Black Box Technologies LTD |
Lim, Shen Hin | University of Waikato |
Duke, Mike | Waikato University |
MacDonald, Bruce | University of Auckland |
Williams, Henry | University of Auckland |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, Field Robots
Abstract: Aotearoa’s apple industry struggles to maintain the skilled workforce required for fruitlet thinning each year. Skilled labourers play a pivotal role in managing crop loads by precisely thinning fruitlets to achieve the desired spacing for high-quality apple growth. This complex task requires accurate mapping of the fruitlets along each branch. This paper presents a novel vision system capable of mapping the orientation and clustering information of apple fruitlets as a human expert does. The vision system has been validated against data collected from a real-world commercial apple orchard. The results show an improved counting accuracy of 83.97% over prior implementations, an orientation estimate accuracy of 88.1%, and a clustering accuracy of 94.3%. Future work will utilise this information to determine which fruitlets to remove and then robotically thin them from the canopy.
|
|
10:10-10:15, Paper ThBT12.4 | |
Active Semantic Mapping with Mobile Manipulator in Horticultural Environments |
|
Cuaran, Jose | University of Illinois at Urbana-Champaign |
Singh Ahluwalia, Kulbir | University of Illinois at Urbana-Champaign |
Koe, Kendall | University of Illinois Urbana Champaign |
Uppalapati, Naveen Kumar | University of Illinois at Urbana-Champaign |
Chowdhary, Girish | University of Illinois at Urbana Champaign |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, Mapping
Abstract: Semantic maps are fundamental for robotics tasks such as navigation and manipulation. They also enable yield prediction and phenotyping in agricultural settings. In this paper, we introduce an efficient and scalable approach for active semantic mapping in horticultural environments, employing a mobile robot manipulator equipped with an RGB-D camera. Our method leverages probabilistic semantic maps to detect semantic targets, generate candidate viewpoints, and compute the corresponding information gain. We present an efficient ray-casting strategy and a novel information utility function that accounts for both semantics and occlusions. The proposed approach reduces total runtime by 8% compared to previous baselines. Furthermore, our information metric surpasses other metrics in reducing multiclass entropy and improving surface coverage, particularly in the presence of segmentation noise. Real-world experiments validate our method's effectiveness but also reveal challenges such as depth sensor noise and varying environmental conditions, requiring further research.
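As a rough illustration of the kind of utility described above (not the authors' exact metric), the sketch below ranks candidate viewpoints by the summed multiclass entropy of voxels visible along cast rays, terminating each ray at the first likely-occupied voxel to model occlusion. The sparse dictionary map, the assumption of four semantic classes, and the step size are illustrative choices.

import numpy as np

def voxel_entropy(p):
    # Shannon entropy of a per-voxel class distribution (p sums to 1).
    p = np.clip(p, 1e-9, 1.0)
    return float(-np.sum(p * np.log(p)))

def viewpoint_information_gain(class_probs, occupancy, origin, directions,
                               max_range=2.0, step=0.05):
    # class_probs and occupancy are sparse dict maps keyed by integer voxel index.
    # Sum the entropy of voxels visible along each cast ray; a ray stops at the
    # first voxel whose occupancy probability exceeds 0.5 (occlusion handling).
    origin = np.asarray(origin, dtype=float)
    gain, visited = 0.0, set()
    uniform = np.ones(4) / 4.0          # prior for never-observed voxels (4 classes assumed)
    for d in directions:
        d = np.asarray(d, dtype=float)
        d = d / np.linalg.norm(d)
        for t in np.arange(step, max_range, step):
            idx = tuple(np.floor((origin + t * d) / step).astype(int))
            if idx in visited:
                continue
            visited.add(idx)
            gain += voxel_entropy(class_probs.get(idx, uniform))
            if occupancy.get(idx, 0.0) > 0.5:
                break                   # surface hit: everything behind it is occluded
    return gain

# The best next view is simply the candidate with the largest gain, e.g.:
# best = max(candidates, key=lambda v: viewpoint_information_gain(probs, occ, v.origin, v.rays))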
|
|
10:15-10:20, Paper ThBT12.5 | |
Surface Roughness Estimation for Terrain Perception |
|
Ye, Minxiang | Zhejiang Lab |
Zhang, Yifei | Beihang University |
Gu, Jason | Dalhousie University |
Xiang, Senwei | Hangzhou International Innovation Institute, Beihang University, |
Kong, Lingyu | Zhejiang Lab |
Xie, Anhuan | Zhejiang University |
Keywords: Deep Learning for Visual Perception, Vision-Based Navigation, Legged Robots
Abstract: Ground terrain perception has become the primary visual task for the robust navigation of intelligent systems in unstructured outdoor environments. However, complex terrain poses a significant challenge to vision-based perception. This work introduces a novel estimation task using RGB images to facilitate low-cost terrain perception in extracting surface roughness information. The proposed task presents both semantic-aware and edge-aware roughness descriptors at the pixel level instead of a single value for a given image. To promote the research on the proposed novel terrain roughness estimation task, we introduce a multimodal synthetic dataset for terrain perception in outdoor scenes, containing multiple terrain categories, diverse viewpoints, different lighting and weather conditions, as well as semantic and roughness annotations. Additionally, inspired by computer graphics, we introduce TRENet, a roughness estimation architecture to model the intrinsic correlation of depth-normal-roughness. We also perform ablation studies on the effect of each component and diverse types of inputs. Extensive evaluations and comparisons demonstrate that our method can effectively predict pixel-wise terrain surface roughness with high accuracy.
|
|
10:20-10:25, Paper ThBT12.6 | |
Automatic Identification of Individual African Leopards in Unlabeled Camera Trap Images (I) |
|
Guo, Cheng | Colorado State University |
Miguel, Agnieszka | Seattle University |
Maciejewski, Anthony A. | Colorado State University |
Keywords: Computer Vision for Automation
Abstract: This article describes an algorithm to solve a real-world animal identification problem, i.e., determining the unknown number K of individual animals in a dataset of N unlabeled camera-trap images of African leopards, provided by Panthera. To determine the leopards’ IDs, we propose an effective automated algorithm that consists of segmenting leopard bodies from images, scoring similarity between image pairs, and clustering followed by verification. To perform clustering, we employ a modified ternary search that uses a novel adaptive k-medoids++ clustering algorithm. The best clustering is determined using an expanded definition of the silhouette score. A new post-clustering verification procedure is used to further improve the quality of a clustering. The algorithm was evaluated using the Panthera dataset, which consists of 677 individual leopards taken from 1555 images, and resulted in a clustering with an adjusted mutual information score of 0.958, compared to 0.864 using a baseline k-medoids++ clustering algorithm.
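A minimal sketch of the clustering stage as described: ternary search over the number of clusters K, with each candidate K scored by the silhouette of a k-medoids clustering on a precomputed pairwise-distance matrix. It uses plain k-medoids with random initialization rather than the paper's adaptive k-medoids++, and the standard silhouette score rather than the expanded definition.

import numpy as np
from sklearn.metrics import silhouette_score

def k_medoids(D, k, iters=50, seed=0):
    # Plain alternating k-medoids on a precomputed distance matrix D (n x n).
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(D[:, medoids], axis=1)

def best_k_by_silhouette(D, k_lo=2, k_hi=30):
    # Ternary search over K, scoring each clustering with the silhouette score
    # (assumes the score is roughly unimodal in K, as the ternary search requires).
    cache = {}
    def s(k):
        if k not in cache:
            cache[k] = silhouette_score(D, k_medoids(D, k), metric="precomputed")
        return cache[k]
    while k_hi - k_lo > 2:
        m1 = k_lo + (k_hi - k_lo) // 3
        m2 = k_hi - (k_hi - k_lo) // 3
        if s(m1) < s(m2):
            k_lo = m1 + 1
        else:
            k_hi = m2 - 1
    return max(range(k_lo, k_hi + 1), key=s)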
|
|
10:25-10:30, Paper ThBT12.7 | |
RoadRunner M&M: Learning Multi-Range Multi-Resolution Traversability Maps for Autonomous Off-Road Navigation |
|
Patel, Manthan | ETH Zurich |
Frey, Jonas | ETH Zurich |
Atha, Deegan | Jet Propulsion Laboratory |
Spieler, Patrick | JPL |
Hutter, Marco | ETH Zurich |
Khattak, Shehryar | NASA Jet Propulsion Laboratory |
Keywords: Field Robots, Deep Learning for Visual Perception, Mapping
Abstract: Autonomous robot navigation in off-road environments requires a comprehensive understanding of the terrain geometry and traversability. The degraded perceptual conditions and sparse geometric information at longer ranges make the problem challenging, especially when driving at high speeds. Furthermore, the sensing-to-mapping latency and the look-ahead map range can limit the maximum speed of the vehicle. Building on top of the recent work RoadRunner, in this work, we address the challenge of long-range (±100m) traversability estimation. Our RoadRunner (M&M) is an end-to-end learning-based framework that directly predicts the traversability and elevation maps at multiple ranges (±50m, ±100m) and resolutions (0.2m, 0.8m), taking as input multiple images and a LiDAR voxel map. Our method is trained in a self-supervised manner by leveraging the dense supervision signal generated by fusing predictions from an existing traversability estimation stack (X-Racer) in hindsight and satellite Digital Elevation Maps. RoadRunner M&M achieves a significant improvement of up to 50% for elevation mapping and 30% for traversability estimation over RoadRunner, and is able to predict in 30% more regions compared to X-Racer while achieving real-time performance. Experiments on various out-of-distribution datasets also demonstrate that our data-driven approach starts to generalize to novel unstructured environments. We integrate our proposed framework in closed-loop with the path planner to demonstrate autonomous high-speed off-road robotic navigation in challenging real-world environments. Project page: https://leggedrobotics.github.io/roadrunner_mm
|
|
ThBT13 |
316 |
Perception Systems |
Regular Session |
Chair: Zhu, Pingping | Marshall University |
Co-Chair: Hays, James | Georgia Institute of Technology, Argo AI |
|
09:55-10:00, Paper ThBT13.1 | |
RipGAN: A GAN-Based Rip Current Data Augmentation Method |
|
Qian, Shenyang | UNSW Sydney |
Harley, Mitchell Dean | UNSW Sydney |
Razzak, Imran | MBZUAI |
Song, Yang | University of New South Wales |
Keywords: Computer Vision for Automation, Deep Learning Methods, Data Sets for Robotic Vision
Abstract: Rip currents are a major hazard on beaches worldwide, and their strong, offshore-directed currents can place even experienced beachgoers at risk of drowning. While it is intuitive to consider developing an automated rip current detection system to assist lifeguards in protecting beachgoers, rip current detection is in its infancy due to the lack of high-quality, large-scale annotated rip current datasets. Moreover, collecting and annotating rip current images requires expert knowledge, which makes building such datasets even more difficult. This paper therefore proposes a GAN-based rip current data augmentation method, RipGAN, to improve the performance of rip current detectors by increasing the amount of representative training data. To create new training images, RipGAN has two branches. One is a texture generator that enriches the pattern and texture details of waves, making the image more realistic. The other is a rip generator based on FFFM-Unet, where the FFFM (Fast Fourier Fusion Module) uses Fast Fourier convolution to fuse the features from the low and the high layers, further refining the generated image. Furthermore, we trained YOLOv8, YOLOv10, DINO and RT-DETR as rip current detectors to demonstrate the effectiveness of RipGAN. The detectors' mAP 50:95 improved by 2.67% on the test set and AP 50 by 4.93% on real-scene videos, outperforming other data augmentation methods. In addition, extensive ablation studies have been conducted to further evaluate each component of RipGAN.
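A hedged sketch of a Fourier-domain fusion block in the spirit of the FFFM described above (the actual module's layout is not specified in the abstract): low- and high-level feature maps are merged channel-wise, and a pointwise convolution is applied to their 2D FFT, giving a global receptive field before a residual fusion. The module and parameter names are assumptions.

import torch
import torch.nn as nn

class SpectralFuse(nn.Module):
    # Fuses a low-level and a high-level feature map with a Fourier-domain 1x1 conv.
    def __init__(self, channels):
        super().__init__()
        self.pre = nn.Conv2d(2 * channels, channels, kernel_size=1)       # concat -> C
        self.spectral = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
        self.post = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, low_feat, high_feat):
        # Align resolutions, then merge the two branches channel-wise.
        high_feat = nn.functional.interpolate(high_feat, size=low_feat.shape[-2:],
                                              mode="bilinear", align_corners=False)
        x = self.pre(torch.cat([low_feat, high_feat], dim=1))
        b, c, h, w = x.shape
        # Pointwise conv applied in the frequency domain: global receptive field.
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = torch.cat([spec.real, spec.imag], dim=1)
        spec = self.spectral(spec)
        real, imag = spec.chunk(2, dim=1)
        y = torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")
        return self.post(y) + x       # residual fusion

# Example: fused = SpectralFuse(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 16, 16))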
|
|
10:00-10:05, Paper ThBT13.2 | |
Points, Images and Texts: Boosting Point Cloud Completion with Multi-Modal Features |
|
Xia, ChengKai | Tongji University |
Lu, Fan | Tongji University |
Li, Bin | Tongji University |
Yu, Guo | Tongji University |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Chen, Guang | Tongji University |
Keywords: Computer Vision for Automation, Visual Learning, Semantic Scene Understanding
Abstract: Point cloud completion is crucial for reconstructing accurate shapes in many 3D visual applications. Recent approaches incorporate images into the completion pipeline, introducing geometric clues and global constraints. However, their fusion processes often fail to reconstruct detailed parts and maintain global consistency simultaneously. Except for images, text is another important clue for recognizing the target’s characteristics. Thus, in this work, we propose to combine multiple modalities including points, images and texts for point cloud completion. Specifically, inspired by recently pre-trained large language models, we generate the description texts for images by Visual Question Answering (VQA) models and introduce Visual-Textual Embedding (VTE) models to extract joint features of image-text pairs. Furthermore, we describe the edge geometric patterns by multi-scale edge convolution to guide the refinement of shapes in local areas. Then we adopt cross attention mechanism to effectively fuse multi-modal features and refine the coarse shape. Extensive experiments on the ShapeNet-ViPC benchmark demonstrate our method’s superior performance over previous uni-modal and cross-modal methods.
|
|
10:05-10:10, Paper ThBT13.3 | |
3DWG: 3D Weakly Supervised Visual Grounding Via Category and Instance-Level Alignment |
|
Li, Xiaoqi | Peking University |
Liu, Jiaming | Peking University |
Han, Nuowei | Beijing University of Posts and Telecommunications |
Heng, Liang | Peking University |
Guo, Yandong | OPPO Research Institute |
Dong, Hao | Peking University |
Liu, Yang | Peking University |
Keywords: Deep Learning for Visual Perception, RGB-D Perception, Visual Learning
Abstract: The 3D weakly-supervised visual grounding task aims to localize oriented 3D boxes in point clouds based on natural language descriptions without requiring annotations to guide model learning. This setting presents two primary challenges: category-level ambiguity and instance-level complexity. Category-level ambiguity arises from representing objects of fine-grained categories in a highly sparse point cloud format, making category distinction challenging. Instance-level complexity stems from multiple instances of the same category coexisting in a scene, leading to distractions during grounding. To address these challenges, we propose a novel weakly-supervised grounding approach that explicitly differentiates between categories and instances. In the category-level branch, we utilize extensive category knowledge from a pre-trained external detector to align object proposal features with sentence-level category features, thereby enhancing category awareness. In the instance-level branch, we utilize spatial relationship descriptions from language queries to refine object proposal features, ensuring clear differentiation among objects. These designs enable our model to accurately identify target-category objects while distinguishing instances within the same category. Compared to previous methods, our approach achieves state-of-the-art performance on three widely used benchmarks: Nr3D, Sr3D, and ScanRef.
|
|
10:10-10:15, Paper ThBT13.4 | |
MPI-Mamba : Cross Propagation Mamba for Multipath Interference Correction |
|
An, Kang | ShenZhen University |
Jiang, ZhaoXiang | Guangdong Laboratory of Artificial Intelligence and Digital Econ |
Tian, Jindong | Guangdong Laboratory of Artificial Intelligence and Digital Econ |
Keywords: RGB-D Perception, Deep Learning for Visual Perception, Computer Vision for Automation
Abstract: Owing to their compact structure, high stability, and low cost, Indirect Time-of-Flight (IToF) cameras have gained increasing attention in the fields of robotics and automation. However, in real-world scenarios, IToF cameras are affected by multipath interference, which severely degrades imaging quality. Existing learning-based methods for multipath interference correction are all based on CNN architectures and rely on synthetic datasets, leading to poor generalization in real-world scenarios. We propose an efficient and accurate real-data collection scheme and explore the application of Transformer and Mamba architectures to multipath interference correction tasks. Additionally, we introduce a cross-propagation network that integrates Mamba and CNN modules, reducing system complexity to linear levels while achieving superior multipath interference correction compared to state-of-the-art methods.
|
|
10:15-10:20, Paper ThBT13.5 | |
SurgPLAN++: Universal Surgical Phase Localization Network for Online and Offline Inference |
|
Chen, Zhen | Centre for Artificial Intelligence and Robotics (CAIR), Hong Kon |
Luo, Xingjian | Centre for Artificial Intelligence and Robotics (CAIR) Hong Kong |
Wu, Jinlin | Institute of Automation, Chinese Academy of Sciences |
Bai, Long | The Chinese University of Hong Kong |
Lei, Zhen | Institute of Automation, Chinese Academy of Sciences |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Ourselin, Sebastien | University College London |
Liu, Hongbin | Hong Kong Institute of Science & Innovation, Chinese Academy Of |
Keywords: Recognition, Visual Learning, Deep Learning for Visual Perception
Abstract: Surgical phase recognition is critical for assisting surgeons in understanding surgical videos. Existing studies focused more on online surgical phase recognition, by leveraging preceding frames to predict the current frame. Despite great progress, they formulated the task as a series of frame-wise classification, which resulted in a lack of global context of the entire procedure and incoherent predictions. Moreover, besides online analysis, accurate offline surgical phase recognition is also in significant clinical need for retrospective analysis, and existing online algorithms do not fully analyze the entire video, thereby limiting accuracy in offline analysis. To overcome these challenges and enhance both online and offline inference capabilities, we propose a universal Surgical Phase Localization Network, named SurgPLAN++, with the principle of temporal detection. To ensure a global understanding of the surgical procedure, we devise a phase localization strategy for SurgPLAN++ to predict phase segments across the entire video through phase proposals. For online analysis, to generate high-quality phase proposals, SurgPLAN++ incorporates a data augmentation strategy to extend the streaming video into a pseudo-complete video through mirroring, center-duplication, and down-sampling. For offline analysis, SurgPLAN++ capitalizes on its global phase prediction framework to continuously refine preceding predictions during each online inference step, thereby significantly improving the accuracy of phase recognition. We perform extensive experiments to validate the effectiveness, and our SurgPLAN++ achieves remarkable performance in both online and offline modes, which outperforms state-of-the-art methods. The source code is available at https://github.com/franciszchen/SurgPLAN-Plus.
|
|
10:20-10:25, Paper ThBT13.6 | |
Real-Time LiDAR Point Cloud Compression and Transmission for Resource-Constrained Robots |
|
Cao, Yuhao | Harbin Institute of Technology Shenzhen |
Wang, Yu | University of Science and Technology of China |
Chen, Haoyao | Harbin Institute of Technology, Shenzhen |
Keywords: Robotics in Under-Resourced Settings, Field Robots
Abstract: LiDARs are widely used in autonomous robots due to their ability to provide accurate environment structural information. However, the large size of point clouds poses challenges in terms of data storage and transmission. In this paper, we propose a novel point cloud compression and transmission framework for resource-constrained robotic applications, called RCPCC. We iteratively fit the surface of point clouds with similar range values and eliminate redundancy through their spatial relationships. Then, we use Shape-Adaptive DCT (SA-DCT) to transform the unfit points and reduce the data volume by quantizing the transformed coefficients. We design an adaptive bitrate control strategy with QoE as the optimization goal to control the quality of the transmitted point cloud. Experiments show that our framework achieves compression rates of 40x to 80x while maintaining high accuracy for downstream applications. Our method significantly outperforms other baselines in terms of accuracy when the compression rate exceeds 70x. Furthermore, in situations of reduced communication bandwidth, our adaptive bitrate control strategy demonstrates significant QoE improvements.
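To make the quantization and rate-control ideas concrete, here is a simplified stand-in (a plain block DCT instead of SA-DCT, and a bit-count proxy instead of the QoE objective) that transforms a block of range values, quantizes the coefficients, and coarsens the quantization step until a bandwidth budget is met.

import numpy as np
from scipy.fft import dctn, idctn

def compress_block(range_block, q_step):
    # Transform a block of LiDAR range values, quantize, and reconstruct.
    # Returns the quantized integer coefficients and the mean reconstruction error.
    coeffs = dctn(range_block, norm="ortho")
    q = np.round(coeffs / q_step).astype(np.int32)        # quantization
    recon = idctn(q * q_step, norm="ortho")
    return q, float(np.mean(np.abs(recon - range_block)))

def pick_q_step(range_block, bit_budget, steps=(0.01, 0.02, 0.05, 0.1, 0.2, 0.5)):
    # Crude rate control: coarsen quantization until the (proxy) bit cost fits the budget.
    for q_step in steps:
        q, err = compress_block(range_block, q_step)
        bits = 16 * np.count_nonzero(q)                    # proxy: 16 bits per nonzero coeff
        if bits <= bit_budget:
            return q_step, bits, err
    return steps[-1], bits, err

# Example on a synthetic 16x16 range patch (values in metres are illustrative):
# q_step, bits, err = pick_q_step(np.random.rand(16, 16) * 50.0, bit_budget=1024)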
|
|
ThBT14 |
402 |
Language Guided Manipulation |
Regular Session |
Chair: Walter, Matthew | Toyota Technological Institute at Chicago |
Co-Chair: Chen, Haonan | University of Illinois at Urbana-Champaign |
|
09:55-10:00, Paper ThBT14.1 | |
A Shared Autonomy System for Precise and Efficient Remote Underwater Manipulation |
|
Phung, Amy | MIT-WHOI Joint Program |
Billings, Gideon | University of Sydney, Australian Center for Field Robotics |
Daniele, Andrea F | Toyota Technological Institute at Chicago |
Walter, Matthew | Toyota Technological Institute at Chicago |
Camilli, Richard | Woods Hole Oceanographic Institution |
Keywords: Cognitive Human-Robot Interaction, Perception for Grasping and Manipulation, Virtual Reality and Interfaces, Shared Autonomy and Field Robotics
Abstract: Conventional underwater intervention operations using robotic vehicles require expert teleoperators and limit interaction with remote scientists. We present the SHared Autonomy for Remote Collaboration (SHARC) framework that enables novice operators to cooperatively conduct underwater sampling and manipulation tasks. With SHARC, operators can plan and complete manipulation tasks using natural language or hand gestures through a virtual reality (SHARC-VR) interface. The interface provides remote operators with a contextual 3D scene understanding that is updated according to bandwidth availability. Evaluation of the SHARC framework through controlled lab experiments demonstrates that SHARC-VR enables novice operators to complete manipulation tasks in framerate-limited conditions (i.e., 0.1–0.5 frames per second) faster than expert pilots using a conventional topside controller. For both novice and expert users, the SHARC-VR interface also increases the task completion rate and improves sampling precision. The SHARC framework is readily extensible to other hardware architectures, including terrestrial and space systems.
|
|
10:00-10:05, Paper ThBT14.2 | |
E2Map: Experience-And-Emotion Map for Self-Reflective Robot Navigation with Language Models |
|
Kim, Chan | Seoul National University |
Kim, Keonwoo | Seoul National University |
Oh, Mintaek | Seoul National University |
Baek, Hanbi | Seoul National University |
Lee, Jiyang | Seoul National University |
Jung, Donghwi | Seoul National University |
Woo, Soojin | Seoul National University |
Woo, Younkyung | Carnegie Mellon University |
Tucker, John | Stanford University |
Firoozi, Roya | Stanford University |
Seo, Seung-Woo | Seoul National University |
Schwager, Mac | Stanford University |
Kim, Seong-Woo | Seoul National University |
Keywords: AI-Enabled Robotics, Learning from Experience, Emotional Robotics
Abstract: Large language models (LLMs) have shown significant potential in guiding embodied agents to execute language instructions across a range of tasks, including robotic manipulation and navigation. However, existing methods are primarily designed for static environments and do not leverage the agent's own experiences to refine its initial plans. Given that real-world environments are inherently stochastic, initial plans based solely on LLMs' general knowledge may fail to achieve their objectives, unlike in static scenarios. To address this limitation, this study introduces the Experience-and-Emotion Map (E2Map), which integrates not only LLM knowledge but also the agent's real-world experiences, drawing inspiration from human emotional responses. The proposed methodology enables one-shot behavior adjustments by updating the E2Map based on the agent's experiences. Our evaluation in stochastic navigation environments, including both simulations and real-world scenarios, demonstrates that the proposed method significantly enhances performance in stochastic environments compared to existing LLM-based approaches.
|
|
10:05-10:10, Paper ThBT14.3 | |
Improving Zero-Shot ObjectNav with Generative Communication |
|
Dorbala, Vishnu Sashank | University of Maryland, College Park |
Sharma, Vishnu D. | Nokia Bell Labs |
Tokekar, Pratap | University of Maryland |
Manocha, Dinesh | University of Maryland |
Keywords: Agent-Based Systems, Domestic Robotics, AI-Enabled Robotics
Abstract: We propose a new method for improving Zero-Shot ObjectNav that aims to utilize potentially available environmental percepts for navigational assistance. Our approach takes into account that the ground agent may have a limited and sometimes obstructed view. Our formulation encourages Generative Communication (GC) between an assistive overhead agent with a global view containing the target object and the ground agent with an obfuscated view, both equipped with Vision-Language Models (VLMs) for vision-to-language translation. In this assisted setup, the embodied agents communicate environmental information before the ground agent executes actions towards a target. Despite the overhead agent having a global view with the target, we note a drop in performance (-13% in OSR and -13% in SPL) of a fully cooperative assistance scheme over an unassisted baseline. In contrast, a selective assistance scheme where the ground agent retains its independent exploratory behaviour shows a 10% OSR and 7.65% SPL improvement. To explain navigation performance, we analyze the GC for unique traits, quantifying the presence of hallucination and cooperation. Specifically, we identify the novel linguistic trait of preemptive hallucination in our embodied setting, where the overhead agent assumes that the ground agent has executed an action in the dialogue when it is yet to move, and note its strong correlation with navigation performance. We conduct real-world experiments and present some qualitative examples where we mitigate hallucinations via prompt finetuning to improve ObjectNav performance.
|
|
10:10-10:15, Paper ThBT14.4 | |
Commonsense Reasoning for Legged Robot Adaptation with Vision-Language Models |
|
Chen, Annie | Stanford University |
Lessing, Alec | Stanford |
Tang, Andy | Stanford University |
Chada, Govind | Stanford University |
Smith, Laura | UC Berkeley |
Levine, Sergey | UC Berkeley |
Finn, Chelsea | Stanford University |
Keywords: AI-Based Methods, Autonomous Agents, Legged Robots
Abstract: Legged robots are physically capable of navigating a diverse variety of environments and overcoming a wide range of obstructions. For example, in a search and rescue mission, a legged robot could climb over debris, crawl through gaps, and navigate out of dead ends. However, the robot’s controller needs to respond intelligently to such varied obstacles, and this requires handling unexpected and unusual scenarios successfully. This presents an open challenge to current learning methods, which often struggle with generalization to the long tail of unexpected situations without heavy human supervision. To address this issue, we investigate how to leverage the broad knowledge about the structure of the world and commonsense reasoning capabilities of vision-language models (VLMs) to aid legged robots in handling difficult, ambiguous situations. We propose a system, VLM-Predictive Control (VLM-PC), combining two key components that we find to be crucial for eliciting on-the-fly, adaptive behavior selection with VLMs: (1) in-context adaptation over previous robot interactions and (2) planning multiple skills into the future and replanning. We evaluate VLM-PC on several challenging real-world obstacle courses, involving dead ends and climbing and crawling, on a Go1 quadruped robot. Our experiments show that by reasoning over the history of interactions and future plans, VLMs enable the robot to autonomously perceive, navigate, and act in a wide range of complex scenarios that would otherwise require environment-specific engineering or human guidance.
|
|
10:15-10:20, Paper ThBT14.5 | |
Language-Guided Object-Centric Diffusion Policy for Generalizable and Collision-Aware Manipulation |
|
Li, Hang | Technical University of Munich |
Feng, Qian | Technical University of Munich |
Zheng, Zhi | TUM |
Feng, Jianxiang | Technical University of Munich (TUM) |
Chen, Zhaopeng | University of Hamburg |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: Imitation Learning, Manipulation Planning, Learning from Demonstration
Abstract: Learning from demonstrations faces challenges in generalizing beyond the training data and often lacks collision awareness. This paper introduces Lan-o3dp, a language-guided object-centric diffusion policy framework that can adapt to unseen situations such as cluttered scenes, shifting camera views, and ambiguous similar objects, while offering training-free collision avoidance and achieving a high success rate with few demonstrations. We train a diffusion model conditioned on 3D point clouds of task-relevant objects to predict the robot's end-effector trajectories, enabling it to complete the tasks. During inference, we incorporate cost optimization into the denoising steps to guide the generated trajectory to be collision-free. We leverage open-set segmentation to obtain the 3D point clouds of related objects and use a large language model to identify the target objects and possible obstacles by interpreting the user's natural language instructions. To effectively guide the conditional diffusion model using a time-independent cost function, we propose a novel guided generation mechanism based on the estimated clean trajectories. In simulation, we show that the diffusion policy based on the object-centric 3D representation achieves a much higher success rate (68.7%) compared to baselines with simple 2D (39.3%) and 3D scene (43.6%) representations, across 21 challenging RLBench tasks with only 40 demonstrations. In real-world experiments, we extensively evaluated the generalization in various unseen situations and validated the effectiveness of the proposed zero-shot cost-guided collision avoidance.
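A simplified, assumption-laden sketch of the cost-guided denoising idea: at each step the clean trajectory is estimated from the noisy sample, a differentiable collision cost is evaluated on that estimate, and the sample is nudged down the cost gradient. The noise model, cost function, and schedule variable are placeholders, and a full sampler would re-noise the result for the next diffusion step.

import torch

def guided_denoise_step(noise_model, x_t, t, alpha_bar_t, collision_cost, guide_scale=1.0):
    # One cost-guided denoising update (DDPM-style notation).
    # noise_model(x_t, t) predicts the noise; collision_cost maps a trajectory to a scalar.
    a = torch.as_tensor(alpha_bar_t, dtype=x_t.dtype)
    x_t = x_t.detach().requires_grad_(True)
    eps = noise_model(x_t, t)
    x0_hat = (x_t - torch.sqrt(1.0 - a) * eps) / torch.sqrt(a)   # estimated clean trajectory
    grad = torch.autograd.grad(collision_cost(x0_hat).sum(), x_t)[0]
    return (x0_hat - guide_scale * grad).detach()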
|
|
10:20-10:25, Paper ThBT14.6 | |
This&That: Language-Gesture Controlled Video Generation for Robot Planning |
|
Wang, Boyang | University of Michigan |
Sridhar, Nikhil | University of Michigan |
Feng, Chao | University of Michigan - Ann Arbor |
Van der Merwe, Mark | University of Michigan |
Fishman, Adam | OpenAI |
Fazeli, Nima | University of Michigan |
Park, Jeong Joon | University of Michigan, Ann Arbor |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Embodied Cognitive Science
Abstract: Clear, interpretable instructions are invaluable for complex tasks, helping to clarify goals and anticipate necessary steps. In this work, we propose a robot learning framework for communicating, planning, and executing a wide range of tasks, dubbed This&That. This&That solves general tasks by leveraging video generative models, which, through training on internet-scale data, contain rich physical and semantic context. Through this work, we tackle three fundamental challenges in video-based planning: 1) unambiguous task communication with simple human instructions, 2) controllable video generation that respects user intent, and 3) translating visual plans into robot actions. This&That adds gesture conditioning alongside language to generate video predictions, as a succinct and unambiguous alternative to existing language-only methods, especially in complex and uncertain environments. These video predictions are then fed into a behavior cloning architecture dubbed Diffusion Video to Action (DiVA), which outperforms prior state-of-the-art behavior cloning and video-based planning methods by substantial margins.
|
|
ThBT15 |
403 |
Robot Safety |
Regular Session |
Chair: Vela, Patricio | Georgia Institute of Technology |
Co-Chair: Koga, Shumon | Kobe University |
|
09:55-10:00, Paper ThBT15.1 | |
Quantifying the Risk of Unmapped Associations for Mobile Robot Localization Safety |
|
Chen, Yihe | Illinois Insitute of Technology |
Pervan, Boris | Illinois Institute of Technology |
Spenko, Matthew | Illinois Institute of Technology |
Keywords: Robot Safety, Localization, Integrity Risk, Probability and Statistical Methods
Abstract: Integrity risk is a measure of localization safety that accounts for the presence of undetected sensor faults. The metric has been used for decades in aviation and has recently been applied to terrestrial robots operating in life-critical missions. For ground vehicles, integrity risk can be quantified for systems using lidar measurements, where two specific fault types have been identified: miss-association and unmapped association. While miss-association faults, which occur when a correctly extracted feature is associated to the wrong landmark, have been well studied, the probability of an unmapped association fault, where an incorrectly extracted feature is associated to a landmark, is not well understood. Namely, previous research has never quantified this value and instead relies on an assumed value that has not been properly justified. This work is the first to provide a methodology that estimates the risk of unmapped association for each mapped landmark; the paper demonstrates the effect of this probability for both the chi-squared and fixed-lag smoothing methods for integrity monitoring. Data collected in downtown Chicago, IL, USA was used to test the proposed approach.
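For context, the chi-squared monitor referenced above builds on a standard residual-based gate like the one below: a measurement innovation is accepted only if its normalized squared magnitude stays under a chi-squared threshold set by an allowed false-alarm probability. This is generic textbook machinery, not the paper's full integrity-risk derivation.

import numpy as np
from scipy.stats import chi2

def passes_chi_squared_gate(innovation, innovation_cov, p_false_alarm=1e-3):
    # Chi-squared fault-detection gate on a measurement innovation.
    # Returns (accept, test_statistic, threshold).
    nu = np.asarray(innovation, dtype=float)
    S = np.asarray(innovation_cov, dtype=float)
    stat = float(nu @ np.linalg.solve(S, nu))             # squared Mahalanobis distance
    thresh = chi2.ppf(1.0 - p_false_alarm, df=nu.size)
    return stat <= thresh, stat, thresh

# Example: a 2-D lidar feature innovation checked against a mapped landmark
# (innovation and covariance values are illustrative).
accept, stat, thr = passes_chi_squared_gate(np.array([0.12, -0.05]),
                                            np.diag([0.02, 0.02]))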
|
|
10:00-10:05, Paper ThBT15.2 | |
Control Strategies for Pursuit-Evasion under Occlusion Using Visibility and Safety Barrier Functions |
|
Zhou, Minnan | University of California, San Diego |
Shaikh, Mustafa | University of California, San Diego |
Chaubey, Vatsalya | University of California, San Diego |
Haggerty, Patrick | General Dynamics Mission Systems |
Koga, Shumon | Honda Research and Development |
Panagou, Dimitra | University of Michigan, Ann Arbor |
Atanasov, Nikolay | University of California, San Diego |
Keywords: Sensor-based Control, Vision-Based Navigation, Robot Safety
Abstract: This paper develops a control strategy for pursuit-evasion problems in environments with occlusions. We address the challenge of a mobile pursuer keeping a mobile evader within its field of view (FoV) despite line-of-sight obstructions. The signed distance function (SDF) of the FoV is used to formulate visibility as a control barrier function (CBF) constraint on the pursuer's control inputs. Similarly, obstacle avoidance is formulated as a CBF constraint based on the SDF of the obstacle set. While the visibility and safety CBFs are Lipschitz continuous, they are not differentiable everywhere, necessitating the use of generalized gradients. To achieve non-myopic pursuit, we generate reference control trajectories leading to evader visibility using a sampling-based kinodynamic planner. The pursuer then tracks this reference via convex optimization under the CBF constraints. We validate our approach in CARLA simulations and real-world robot experiments, demonstrating successful visibility maintenance using only onboard sensing, even under severe occlusions and dynamic evader movements.
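A minimal sketch of the control step implied by the abstract, specialized to single-integrator dynamics: a quadratic program tracks a reference input while enforcing one barrier constraint built from the field-of-view SDF and one from the obstacle SDF. The barrier values, gradients, and class-K gain alpha are assumed to be supplied by the perception stack; the paper's generalized gradients and kinodynamic reference generation are not reproduced.

import numpy as np
import cvxpy as cp

def cbf_qp(u_ref, h_vis, grad_h_vis, h_safe, grad_h_safe, alpha=1.0, u_max=1.5):
    # One CBF-filtered control step for single-integrator dynamics x_dot = u.
    # h_* are barrier values (signed distances), grad_h_* their gradients at the state.
    u = cp.Variable(2)
    constraints = [
        grad_h_vis @ u >= -alpha * h_vis,     # keep the evader inside the field of view
        grad_h_safe @ u >= -alpha * h_safe,   # stay outside the obstacle set
        cp.norm(u, 2) <= u_max,
    ]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(u - u_ref)), constraints)
    prob.solve()
    return u.value

# Example call with hypothetical barrier evaluations from SDF lookups.
u = cbf_qp(u_ref=np.array([1.0, 0.0]),
           h_vis=0.4, grad_h_vis=np.array([0.0, 1.0]),
           h_safe=0.8, grad_h_safe=np.array([-1.0, 0.0]))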
|
|
10:05-10:10, Paper ThBT15.3 | |
Dynamic Gap: Safe Gap-Based Navigation in Dynamic Environments |
|
Asselmeier, Maxwell | Georgia Institute of Technology |
Ahuja, Dhruv | Georgia Institute of Technology |
Zaro, Abdel | University of California, Berkeley |
Abuaish, Ahmad | Georgia Institute of Technology |
Zhao, Ye | Georgia Institute of Technology |
Vela, Patricio | Georgia Institute of Technology |
Keywords: Vision-Based Navigation, Motion and Path Planning, Collision Avoidance
Abstract: This paper extends the family of gap-based local planners to unknown dynamic environments through generating provably collision-free properties for hierarchical navigation systems. Existing perception-informed local planners that operate in dynamic environments rely on emergent or empirical robustness for collision avoidance as opposed to performing formal analysis of dynamic obstacles. In addition to this, the obstacle tracking that is performed in these existent planners is often achieved with respect to a global inertial frame, subjecting such tracking estimates to transformation errors from odometry drift. The proposed local planner, dynamic gap, shifts the tracking paradigm to modeling how the free space, represented as gaps, evolves over time. Gap crossing and closing conditions are developed to aid in determining the feasibility of passage through gaps, and a breadth of simulation benchmarking is performed against other navigation planners in the literature where the proposed dynamic gap planner achieves the highest success rate out of all planners tested in all environments.
|
|
10:10-10:15, Paper ThBT15.4 | |
Conformalized Reachable Sets for Obstacle Avoidance with Spheres |
|
Kwon, Yong Seok | University of Michigan |
Michaux, Jonathan | University of Michigan |
Isaacson, Seth | University of Michigan |
Zhang, Bohao | University of Michigan |
Ejakov, Matthew | University of Michigan |
Skinner, Katherine | University of Michigan |
Vasudevan, Ram | University of Michigan |
Keywords: Robot Safety, Planning under Uncertainty, Constrained Motion Planning
Abstract: Safe motion planning algorithms are necessary for deploying autonomous robots in unstructured environments. Motion plans must be safe to ensure that the robot does not harm humans or damage any nearby objects. Generating these motion plans in real time is also important to ensure that the robot can adapt to sudden changes in its environment. Many trajectory optimization methods introduce heuristics that balance safety and real-time performance, potentially increasing the risk of the robot colliding with its environment. This paper addresses this challenge by proposing Conformalized Reachable Sets for Obstacle Avoidance With Spheres (CROWS). CROWS is a novel real-time, receding-horizon trajectory planner that generates probabilistically safe motion plans. Offline, CROWS learns a novel neural network-based representation of a sphere-based reachable set that overapproximates the swept volume of the robot's motion. CROWS then uses conformal prediction to compute a confidence bound that provides a probabilistic safety guarantee on the learned reachable set. At runtime, CROWS performs trajectory optimization to select a trajectory that is probabilistically guaranteed to be collision-free. We demonstrate that CROWS outperforms a variety of state-of-the-art methods in solving challenging motion planning tasks in cluttered environments while remaining collision-free. Code, data, and video demonstrations can be found at https://roahmlab.github.io/crows/.
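The conformal step can be illustrated with a standard split-conformal quantile: on a calibration set, measure how far the true swept-volume distances exceed the predicted sphere radii, and inflate all radii by the (1 - alpha) conformal quantile of those scores. The calibration arrays and nominal radius referenced below are hypothetical.

import numpy as np

def conformal_radius_inflation(pred_radii_cal, true_dists_cal, alpha=0.05):
    # Split-conformal bound: scores measure how much the predicted sphere radius
    # under-approximates the true swept-volume distance on held-out calibration data.
    scores = np.asarray(true_dists_cal, dtype=float) - np.asarray(pred_radii_cal, dtype=float)
    n = scores.size
    k = int(np.ceil((n + 1) * (1.0 - alpha)))          # conformal quantile index
    k = min(k, n)                                      # clamp when alpha is very small
    return float(np.sort(scores)[k - 1])

# Usage: inflate every predicted radius by the bound before collision checking, so the
# true reachable set is covered with probability at least 1 - alpha (marginally), e.g.:
# safe_radii = predicted_radii + max(conformal_radius_inflation(r_cal, d_cal), 0.0)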
|
|
10:15-10:20, Paper ThBT15.5 | |
System-Level Safety Monitoring and Recovery for Perception Failures in Autonomous Vehicles |
|
Chakraborty, Kaustav | University of Southern California |
Feng, Zeyuan | Stanford University |
Veer, Sushant | NVIDIA |
Sharma, Apoorva | NVIDIA |
Ivanovic, Boris | NVIDIA |
Pavone, Marco | Stanford University |
Bansal, Somil | Stanford University |
Keywords: Intelligent Transportation Systems, Failure Detection and Recovery, Autonomous Vehicle Navigation
Abstract: The safety-critical nature of autonomous vehicle (AV) operation necessitates the development of task-relevant algorithms that can reason about safety at the system level and not just at the component level. To reason about the impact of a perception failure on the entire system performance, such task-relevant algorithms must contend with various challenges: the complexity of AV stacks, high uncertainty in the operating environments, and the need for real-time performance. To overcome these challenges, in this work, we introduce a Q-network called SPARQ (abbreviation for Safety evaluation for Perception And Recovery Q-network) that evaluates the safety of a plan generated by a planning algorithm, accounting for perception failures that the planning process may have overlooked. This Q-network can be queried during system runtime to assess whether a proposed plan is safe for execution or poses potential safety risks. If a violation is detected, the network can then recommend a corrective plan while accounting for the perceptual failure. We validate our algorithm using the NuPlan-Vegas dataset, demonstrating its ability to handle cases where a perception failure compromises a proposed plan while the corrective plan remains safe. We observe an overall accuracy and recall of 90% while sustaining a frequency of 42 Hz on the unseen testing dataset. We compare our performance to a popular reachability-based baseline and analyze some interesting properties of our approach in improving the safety of an AV pipeline.
|
|
10:20-10:25, Paper ThBT15.6 | |
Safety Filtering While Training: Improving the Performance and Sample Efficiency of Reinforcement Learning Agents |
|
Pizarro Bejarano, Federico | University of Toronto |
Brunke, Lukas | University of Toronto |
Schoellig, Angela P. | TU Munich |
Keywords: Robot Safety, Reinforcement Learning, Machine Learning for Robot Control
Abstract: Reinforcement learning (RL) controllers are flexible and performant but rarely guarantee safety. Safety filters impart hard safety guarantees to RL controllers while maintaining flexibility. However, safety filters can cause undesired behaviours due to the separation between the controller and the safety filter, often degrading performance and robustness. In this paper, we analyze several modifications to incorporating the safety filter in training RL controllers rather than solely applying it during evaluation. The modifications allow the RL controller to learn to account for the safety filter. This paper presents a comprehensive analysis of training RL with safety filters, featuring simulated and real-world experiments with a Crazyflie 2.0 drone. We examine how various training modifications and hyperparameters impact performance, sample efficiency, safety, and chattering. Our findings serve as a guide for practitioners and researchers focused on safety filters and safe RL.
|
|
ThBT16 |
404 |
Soft Robotics 1 |
Regular Session |
Chair: Dorsey, Kristen | Northeastern University |
Co-Chair: Caldwell, Darwin G. | Istituto Italiano Di Tecnologia |
|
09:55-10:00, Paper ThBT16.1 | |
Pneumatic Logic Systems for Selectively Operating Distributed Pneumatic Elements |
|
Ferrin Pozuelo, Rafael | National Institute of Advanced Industrial Science and Technology |
Tomita, Kohji | National Institute of Advanced Industrial Science AndTechnology |
Kamimura, Akiya | National Institute of Advanced Industrial Science and Technology |
Keywords: Soft Robot Materials and Design, Soft Robot Applications, Hydraulic/Pneumatic Actuators
Abstract: Microfluidic and pneumatic logic systems are valuable for applications such as lab-on-a-chip devices, soft robotics, and factory automation. These systems are particularly advantageous when metal or electronic components are impractical or when there are constraints on the control system volume or weight. This paper introduces a novel individual membrane valve that functions as a set-reset latch and can reduce the number of valves required for some pneumatic or microfluidic logic systems. An application of pneumatic logic systems in soft robotics is the access to multiple tethered pneumatic elements through a reduced number of pneumatic lines. To this end, this paper proposes two pneumatic logic systems capable of selecting among multiple distributed sets of pneumatic elements and operating the elements of the set simultaneously and independently through the different pneumatic lines. The selection is achieved via a sequence of pressure pulses applied on the same lines used afterwards for operation. Two prototypes of these pneumatic logic systems were built and successfully demonstrated, consisting primarily of set-reset membrane valves and powered by binary high/low pressure sources. The first prototype features a hierarchical network with four lines and five sets of three pneumatic elements each; the second prototype features a non-hierarchical network with five lines and twelve sets of four pneumatic elements each.
|
|
10:00-10:05, Paper ThBT16.2 | |
Helical Structured Soft Growing Robot for Hazardous Gas Suction in Inaccessible Environments |
|
Lee, Sanghun | Korea Advanced Institute of Science and Technology |
Kim, Nam Gyun | Korea Advanced Institute of Science and Technology |
Seo, Dongoh | Korea Advanced Institute of Science and Technology |
Park, Shinwoo | KAIST |
Ryu, Jee-Hwan | Korea Advanced Institute of Science and Technology |
Keywords: Soft Robot Applications, Soft Robot Materials and Design, Modeling, Control, and Learning for Soft Robots
Abstract: Immediate removal of hazardous gases is critical for ensuring safety. Traditional methods, such as portable ventilation equipment, are difficult to use when hazardous gases are released in inaccessible environments. In this paper, we propose a novel mechanism that integrates an inflatable helical structure into a soft growing robot. The proposed mechanism is capable of performing suction through its inner channel after navigating complex environments, while maintaining the inherent advantages of the soft growing robot as it grows. The mechanism operates in two phases: a growing phase, in which the robot extends by eversion, and a suction phase, in which suction is performed through the inner channel of the robot. Experiments and demonstrations were conducted to evaluate the performance of the proposed mechanism. The experimental results confirmed the ability to maintain the passageway shape of the inner channel during suction operations and provided a design guideline. The demonstration validated that the mechanism can effectively navigate inaccessible environments and perform suction to remove hazardous gases.
|
|
10:05-10:10, Paper ThBT16.3 | |
Shape-Programming Robotic Reflectors for Wireless Networks |
|
Liu, Yawen | Carnegie Mellon University |
Prabhakara, Akarsh | Carnegie Mellon University |
Zhu, Jiangyifei | Carnegie Mellon University |
Qiao, Shenyi | Carnegie Mellon University |
Kumar, Swarun | Carnegie Mellon University |
Keywords: Soft Robot Applications, Automation Technologies for Smart Cities, Sensor Networks
Abstract: With the increasing use of wireless technologies in robotics for communication, sensing, and localization, the potential benefits of how robotics can complement and enhance wireless systems remain underexplored. This paper explores a novel application of existing inflatable robots for wireless communication systems: forming a shape-programming, reflective waveguide that enhances the received signal quality for wireless devices. Our primary target is enhancing Low-Power Wide-Area Networks (LPWANs), where 10-year battery-powered client devices (e.g., energy meters or smart home sensors) connect to cellular-like powered base stations to deliver data. Devices in these networks often experience significant seasonal variability in battery life; even simple obstructions between the device and base station (e.g., due to construction) can shave off years of battery life. We propose MetaMorph, a programmable robotic reflector attached to base stations that improves signal quality from client devices by enhancing the received signal energy with controlled reflections. We investigate the design of the reflector, and our experiments show the ability to improve signal quality for LPWAN (LoRa) communication systems, demonstrating both signal-quality and battery-life benefits. To the best of our knowledge, MetaMorph is the first work to explore how flexible robotics can serve as virtuous reflectors for wireless communication systems.
|
|
10:10-10:15, Paper ThBT16.4 | |
MORF: Magnetic Origami Reprogramming and Folding System for Repeatably Reconfigurable Structures with Fold Angle Control |
|
Unger, Gabriel | University of Pennsylvania |
Shenoy, Sridhar | University of Pennsylvania |
Li, Tianyu | University of Pennsylvania |
Figueroa, Nadia | University of Pennsylvania |
Sung, Cynthia | University of Pennsylvania |
Keywords: Soft Robot Materials and Design, Soft Robot Applications
Abstract: We present the Magnetic Origami Reprogramming and Folding System (MORF), a magnetically reprogrammable system capable of precise shape control, repeated transformations, and adaptive functionality for robotic applications. Unlike current self-folding systems, which often lack re-programmability or lose rigidity after folding, MORF generates stiff structures over multiple folding cycles without degradation in performance. The ability to reconfigure and maintain structural stability is crucial for tasks such as reconfigurable tooling. The system utilizes a thermoplastic layer sandwiched within a thin magnetically responsive laminate sheet, enabling structures to self-fold in response to a combination of external magnetic field and heating. We demonstrate that the resulting folded structures can bear loads over 40 times their own weight and can undergo up to 50 cycles of repeated transformations without losing structural integrity. We showcase these strengths in a reconfigurable tool for unscrewing and screwing bolts and screws of various sizes, allowing the tool to adapt its shape to different bolt sizes while withstanding the mechanical stresses involved. This capability highlights the system’s potential for task-varying, load-bearing applications in robotics, where both versatility and durability are essential.
|
|
10:15-10:20, Paper ThBT16.5 | |
Tunable Leg Stiffness in a Monopedal Hopper for Energy-Efficient Vertical Hopping across Varying Ground Profiles |
|
Chen, Rongqian | George Washington University |
Kwon, Jun | University of Pennsylvania |
Wu, Kefan | University of Connecticut |
Chen, Wei-Hsi | University of Pennsylvania |
Keywords: Soft Robot Applications, Legged Robots, Mechanism Design
Abstract: We present the design and implementation of HASTA (Hopper with Adjustable Stiffness for Terrain Adaption), a vertical hopping robot with real-time tunable leg stiffness, aimed at optimizing energy efficiency across various ground profiles (a pair of ground stiffness and damping conditions). By adjusting leg stiffness, we aim to maximize apex hopping height, a key metric for energy-efficient vertical hopping. We hypothesize that softer legs perform better on soft, damped ground by minimizing penetration and energy loss, while stiffer legs excel on hard, less damped ground by reducing limb deformation and energy dissipation. Through experimental tests and simulations, we find the best leg stiffness within our selection for each combination of ground stiffness and damping, enabling the robot to achieve maximum steady-state hopping height with a constant energy input. These results support our hypothesis that tunable stiffness improves energy-efficient locomotion in controlled experimental conditions. In addition, the simulation provides insights that could aid in future development of controllers for selecting leg stiffness.
|
|
10:20-10:25, Paper ThBT16.6 | |
Online Learning Based Shape Control for a Soft Manipulator Based on Spatial Features Feedback |
|
Shen, Yi | Huazhong University of Science and Technology |
Zhang, Jinghao | Huazhong University of Science and Technology |
Yuan, Ye | Huazhong University of Science and Technology |
Zhang, Fumin | Hong Kong University of Science and Technology |
Ding, Han | Huazhong University of Science and Technology |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Applications
Abstract: Although soft manipulators are endowed with compliance and flexibility, most control strategies focus on end-effector control and lack shape control ability. This letter aims to design a shape controller for the soft manipulator. Firstly, we establish a modified forward kinematics model (FKM) based on the long-short-term-memory (LSTM) neural network to describe the mapping between actuation inputs and spatial features. The spatial features consist of the backbone curve and contour features. The backbone curve is represented by the piecewise Bézier curve under geometrically continuous constraint. The contour features are extracted from the camera-generated point cloud. Besides, an adaptive online learning based shape controller (OLSC) is designed by online back-propagating shape error. The stability of OLSC is proved based on the Lyapunov theorem. Finally, the random excitation model validation experiment demonstrates the prediction accuracy of the proposed modified FKM, and the shape control experiments in air and water validate the effectiveness of the proposed OLSC.
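A small sketch of the backbone-curve representation mentioned above: cubic Bezier segments chained with a G1 (tangent-direction) continuity constraint, which is one common way to enforce geometric continuity between segments. The control points and tension parameter are illustrative, not taken from the paper.

import numpy as np

def bezier_point(ctrl, t):
    # Evaluate a cubic Bezier segment (4 x 3 control points) at parameter t in [0, 1].
    b = np.array([(1 - t) ** 3, 3 * t * (1 - t) ** 2, 3 * t ** 2 * (1 - t), t ** 3])
    return b @ ctrl

def g1_continue(prev_ctrl, p2, p3, tension=1.0):
    # Build the next segment so the joint is G1-continuous: its first interior control
    # point lies on the line through the previous segment's last two control points.
    p0 = prev_ctrl[3]
    tangent = prev_ctrl[3] - prev_ctrl[2]
    p1 = p0 + tension * tangent
    return np.vstack([p0, p1, p2, p3])

# Two-segment backbone curve, sampled densely (control points are illustrative).
seg1 = np.array([[0, 0, 0], [0.1, 0, 0.05], [0.2, 0.05, 0.1], [0.3, 0.1, 0.1]], float)
seg2 = g1_continue(seg1, p2=np.array([0.5, 0.2, 0.1]), p3=np.array([0.6, 0.3, 0.05]))
curve = np.array([bezier_point(s, t) for s in (seg1, seg2) for t in np.linspace(0, 1, 50)])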
|
|
10:25-10:30, Paper ThBT16.7 | |
Augmenting Compliance with Motion Generation through Imitation Learning Using Drop-Stitch Reinforced Inflatable Robot Arm with Rigid Joints |
|
Gubbala, Gangadhara Naga Sai | Waseda University |
Nagashima, Masato | Waseda University |
Mori, Hiroki | Waseda University |
Seong, Young Ah | The University of Tokyo |
Sato, Hiroki | The University of Tokyo |
Niiyama, Ryuma | Meiji University |
Suga, Yuki | Waseda University |
Ogata, Tetsuya | Waseda University |
Keywords: Modeling, Control, and Learning for Soft Robots, Deep Learning Methods, Soft Robot Materials and Design
Abstract: Safe physical human-robot collaboration can be possible with soft robots due to their inherent compliance and low inertia. Soft bodies provide passive compliance and adaptability due to their deformations, but these same characteristics also lead to difficulty in dynamic control and mathematical modeling. We focus on motion generation for a 3-DOF (Degree of freedom) inflatable robot arm, consisting of soft inflatable body links and rigid joints. This research explores the limitations of relying only on soft robot compliance for contact-based tasks. Our goal is to generate adaptive motion for contact-based tasks by exploiting the compliance of the soft links. We compare contact-based tasks for the inflatable robot with and without a learning model. This shows improved performance when soft robot compliance is augmented with imitation learning. The combination of soft robot compliance and the machine learning model's adaptability shows the potential for collaborative robots to interact with humans and their surroundings safely.
|
|
ThBT17 |
405 |
Planning, Scheduling and Coordination |
Regular Session |
Chair: Pecora, Federico | Amazon Robotics |
Co-Chair: Rastgoftar, Hossein | University of Arizona |
|
09:55-10:00, Paper ThBT17.1 | |
Safe Human-UAS Collaboration from High-Level Planning to Low-Level Tracking (I) |
|
Rastgoftar, Hossein | University of Arizona |
Keywords: Planning, Scheduling and Coordination, Intention Recognition, Aerial Systems: Applications
Abstract: This paper studies the problem of safe human-uncrewed aerial system (UAS) collaboration in a shared work environment. By considering the human and UAS as co-workers, we use Petri Nets to abstractly model the evolution of shared tasks assigned to human and UAS co-workers. In particular, the Petri Nets’ “places” represent work stations; therefore, the Petri Nets’ transitions can formally specify displacements between the work stations. The paper’s first objective is to incorporate uncertainty regarding the intentions of human co-workers into motion planning for the UAS, when the UAS closely interacts with human co-workers. To this end, the proposed Petri Nets model uses “conflict” constructs to represent situations in which the UAS deals with incomplete knowledge about human co-worker intention. The paper’s second objective is then to plan the motion of the UAS in a resilient and safe manner in the presence of non-cooperative human co-workers. In order to achieve this objective, UAS equipped with onboard perception and decision-making capabilities are able to, through real-time processing of in-situ observations, predict human intention, quantify human distraction, and apply a non-stationary Markov Decision Process (MDP) model to safely plan UAS motion in the presence of uncertainty. Given the current and next UAS waypoints, the paper applies Pontryagin's minimum principle to plan the desired trajectory of the UAS and uses the feedback linearization method for trajectory tracking control.
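Between two waypoints with fixed boundary velocities, applying Pontryagin's minimum principle to a double integrator with an integral-of-squared-control cost yields a linear optimal acceleration and hence a cubic position profile; the sketch below solves for that cubic. The segment time and boundary values are illustrative, and the paper's exact cost functional may differ.

import numpy as np

def min_effort_trajectory(x0, v0, xf, vf, T):
    # Minimum-control-effort (integral of u^2) trajectory of a double integrator between
    # fixed boundary position/velocity in fixed time T. Pontryagin's minimum principle
    # gives a linear optimal acceleration, i.e. a cubic position profile (Hermite form).
    x0, v0, xf, vf = map(np.atleast_1d, (x0, v0, xf, vf))
    a0, a1 = x0, v0
    a2 = (3 * (xf - x0) - (2 * v0 + vf) * T) / T ** 2
    a3 = (2 * (x0 - xf) + (v0 + vf) * T) / T ** 3
    def pos(t):
        return a0 + a1 * t + a2 * t ** 2 + a3 * t ** 3
    def acc(t):
        return 2 * a2 + 6 * a3 * t
    return pos, acc

# Example: rest-to-rest segment between two 3D waypoints over 4 seconds.
pos, acc = min_effort_trajectory([0, 0, 1], [0, 0, 0], [2, 1, 1.5], [0, 0, 0], T=4.0)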
|
|
10:00-10:05, Paper ThBT17.2 | |
Reliable and Efficient Multi-Agent Coordination Via Graph Neural Network Variational Autoencoders |
|
Meng, Yue | Massachusetts Institute of Technology |
Majcherczyk, Nathalie | Worcester Polytechnic Institute |
Liu, Wenliang | Amazon |
Kiesel, Scott | Amazon |
Fan, Chuchu | Massachusetts Institute of Technology |
Pecora, Federico | Amazon Robotics |
Keywords: Planning, Scheduling and Coordination, Multi-Robot Systems, Deep Learning Methods
Abstract: Multi-agent coordination is crucial for reliable multi-robot navigation in shared spaces such as automated warehouses. In regions of dense robot traffic, local coordination methods may fail to find a deadlock-free solution. In these scenarios, it is appropriate to let a central unit generate a global schedule that decides the passing order of robots. However, the runtime of such centralized coordination methods increases significantly with the problem scale. In this paper, we propose to leverage Graph Neural Network Variational Autoencoders (GNN-VAE) to solve the multi-agent coordination problem faster than through centralized optimization at scale. We formulate the coordination problem as a graph problem and collect ground truth data using a Mixed-Integer Linear Program (MILP) solver. During training, our learning framework encodes good quality solutions of the graph problem into a latent space. At inference time, solution samples are decoded from the sampled latent variables, and the lowest-cost sample is selected for coordination. By construction, our GNN-VAE framework returns solutions that always respect the constraints of the considered coordination problem. Numerical results show that our approach trained on small-scale problems can achieve high-quality solutions even for large-scale problems with 250 robots, being much faster than other baselines.
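The inference procedure described above reduces to a sample-decode-select loop; a skeletal version is shown below, with `decoder` and `cost_fn` standing in for the trained GNN decoder and the coordination cost, and the latent dimension and sample count chosen arbitrarily.

import torch

@torch.no_grad()
def sample_and_select(decoder, cost_fn, graph, latent_dim=16, n_samples=32):
    # Draw latent samples from the prior, decode each into a feasible coordination
    # solution for the given graph, and keep the cheapest one.
    best, best_cost = None, float("inf")
    z = torch.randn(n_samples, latent_dim)
    for zi in z:
        solution = decoder(graph, zi)          # e.g. a passing order per shared region
        c = cost_fn(graph, solution)
        if c < best_cost:
            best, best_cost = solution, c
    return best, best_cost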
|
|
10:05-10:10, Paper ThBT17.3 | |
Efficient Cross-Boundary Grasping in Stacked Clutter with Single-Visual Mapping Multi-Step |
|
Luo, Yudong | Dalian Maritime University |
Wang, Tong | Dalian Martime University |
Xie, Feiyu | Dalian Maritime University |
Zhao, Na | Dalian Maritime University |
Fu, Xianping | Dalian Maritime University |
Shen, Yantao | University of Nevada, Reno |
Keywords: Logistics, Factory Automation
Abstract: In logistics applications, the vision-based technology for grasping target objects in the air is relatively mature. However, when operating across the air and water, such as grasping marine products from the water, the visual information collected by the camera will be disturbed by ripples and bubbles on the water surface, resulting in low grasping efficiency. Therefore, we introduce a grasping strategy based on single-visual mapping for multi-step (SVMMS) operations, which is suitable for cross-medium operations involving stacked objects. Specifically, we design a multifunctional integrated network model based on Deep Q-learning, which extracts visual features from the scene to detect stacked objects and outputs their hierarchical relationships effectively. Moreover, we quantify the potential relationship between motion logic during action execution and changes in RGB-D information to help the robot achieve efficient and collision-free operations. Our approach also incorporates a time-series design with prioritized experience replay to optimize the action sequence globally. Additionally, we propose a novel sim2real method by combining domain randomization to address the difference in object sizes between the simulation and the real world. Extensive experiments in both simulation and physical environments show that SVMMS-Grasp significantly outperforms existing methods regarding task success rate, stability, and operational efficiency.
|
|
10:10-10:15, Paper ThBT17.4 | |
Efficient Second-Order Cone Programming for the Close Enough Traveling Salesman Problem |
|
Gutow, Geordan | Carnegie Mellon University |
Choset, Howie | Carnegie Mellon University |
Keywords: Planning, Scheduling and Coordination, Optimization and Optimal Control, Motion and Path Planning
Abstract: When agents must execute multiple tasks at spatially distinct locations, it is common to formulate and solve a Traveling Salesman Problem (TSP) to find the order of locations (targets) that requires the smallest travel cost. Approaching such task sequencing problems as a TSP is restrictive, as it requires that a unique location be specified for each task. In reality, a set of acceptable locations might be available. The Close Enough Traveling Salesman Problem (CETSP) is a generalization of the TSP in which the agent need only visit a spherical neighborhood surrounding each target, and can thus address this task sequencing problem when any location in a sphere is acceptable. Prior work has developed a branch-and-bound approach that finds globally optimal solutions to instances of the CETSP by solving a sequence of Second-Order Cone Programs (SOCPs). We demonstrate that it is possible to eliminate 2/3 of the variables and 1/2 of the constraints in these SOCPs, show how to reuse computation and memory allocation across multiple SOCPs in the sequence, and propose a strategy to warm-start the SOCPs using solutions obtained earlier in the sequence. Collectively, these three changes halve the time required to solve 210 random CETSP instances to optimality. We also obtained improved lower bounds on 73 instances from the literature, including solving one instance to optimality for the first time.
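For intuition, the SOCP subproblem at the heart of such approaches (for a fixed visiting sequence, pick one point inside each target's sphere so the tour is shortest) can be sketched generically as below. This is not the authors' reduced formulation; it assumes numpy and cvxpy are available, and the data are made up:

    # Generic "touring" SOCP for a fixed target sequence: choose a visit point
    # inside each sphere to minimize total tour length from and back to a depot.
    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(1)
    n = 6
    centers = rng.uniform(0.0, 10.0, size=(n, 3))   # sphere centers (targets)
    radii = np.full(n, 1.0)                         # "close enough" radii
    depot = np.zeros(3)

    P = cp.Variable((n, 3))                         # visit point for each target
    legs = [cp.norm(P[0] - depot)]
    legs += [cp.norm(P[i + 1] - P[i]) for i in range(n - 1)]
    legs += [cp.norm(depot - P[n - 1])]
    constraints = [cp.norm(P[i] - centers[i]) <= radii[i] for i in range(n)]

    prob = cp.Problem(cp.Minimize(sum(legs)), constraints)
    prob.solve(warm_start=True)   # warm-starting is the kind of reuse exploited across many related SOCPs
    print("tour length:", prob.value)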
|
|
10:15-10:20, Paper ThBT17.5 | |
Decoupled Training Neural Solver for Dynamic Traveling Salesman Problem |
|
Lin, Shaoheng | South China University of Technology |
Cui, Hanyun | South China University of Technology |
Yang, Wang | South China University of Technology |
Jia, Ya-Hui | South China University of Technology |
Keywords: Planning, Scheduling and Coordination, Planning under Uncertainty, Task Planning
Abstract: Deep reinforcement learning (DRL) methods have achieved remarkable success in solving static traveling salesman problems (TSP). However, the dynamic TSP (DTSP), in which new customers appear randomly over time, introduces additional complexities that challenge DRL methods: obtaining an optimized routing policy becomes difficult, leading to sub-optimal results and reduced training efficiency. To address these issues, we propose a decoupled training neural solver (DTNS) based on the encoder-decoder architecture, a novel approach that decouples the optimization of the encoder and decoder, enhancing the model's ability to handle dynamic changes. Our method first trains under a Fore-Reveal condition, in which the information of all customer nodes is known in advance, to obtain an optimized encoder and an initialization for the decoder, and then fine-tunes the decoder in dynamic scenarios where customers are revealed over time. This training paradigm results in a flexible and globally optimized routing policy. Experimental results demonstrate that DTNS efficiently adapts to new customer requests in dynamic scenarios, outperforming existing methods in dynamic routing environments.
|
|
10:20-10:25, Paper ThBT17.6 | |
Multi-Drone-Truck Collaborative Delivery with En Route Operations: A Hierarchical MARL-Based Approach |
|
Hu, Shun | Tongji University |
Li, Bing | Tongji University |
Zhang, Rongqing | Tongji University |
Keywords: Planning, Scheduling and Coordination, Distributed Robot Systems, Path Planning for Multiple Mobile Robots or Agents
Abstract: Multi-drone-truck collaborative delivery, in which unmanned trucks serve as mobile supply stations for drones, effectively combines the strengths of both vehicle types and has wide application prospects. However, most existing literature restricts drone launch and retrieval operations (LARO) to a stationary truck, and potential collisions between drone routes are mostly ignored. This prevents the capability of the drones from being fully exploited. We address these gaps and introduce a new variant of multi-drone-truck collaborative delivery. Scheduling the drones and the truck, however, involves a high-dimensional solution space and complex constraints, making centralized solving nearly impossible. To this end, we develop a hierarchical solution framework that decomposes the complete problem into two levels of subproblems. The upper solver centrally allocates tasks and schedules when drones launch, while the lower solver, based on multi-agent reinforcement learning (MARL), plans paths for each drone agent in a decentralized but cooperative manner. In addition, we validate the effectiveness of our method by benchmarking it against three state-of-the-art approaches, demonstrating its superiority in terms of both efficiency and collision avoidance.
|
|
10:25-10:30, Paper ThBT17.7 | |
Risk-Aware Energy-Constrained UAV-UGV Cooperative Routing Using Attention-Guided Reinforcement Learning |
|
Mondal, Mohammad Safwan | University of Illinois Chicago |
Ramasamy, Subramanian | University of Illinois at Chicago |
Rownak, Ragib | University of Illinois Chicago |
Russo, Luca | University of Illinois at Chicago |
Humann, James | DEVCOM Army Research Laboratory |
Dotterweich, James | Army Research Laboratory |
Bhounsule, Pranav | University of Illinois at Chicago |
Keywords: Planning, Scheduling and Coordination, Multi-Robot Systems, Autonomous Agents
Abstract: Maximizing the endurance of unmanned aerial vehicles (UAVs) in large-scale monitoring missions spanning wide areas requires addressing their limited battery capacity. Deploying unmanned ground vehicles (UGVs) as mobile recharging stations offers a practical solution, extending the UAVs' operational range. This introduces the challenge of optimizing UAV-UGV routes for efficient mission point coverage and seamless recharging coordination. In this paper, we present a risk-aware deep reinforcement learning (Ra-DRL) framework with a multi-head attention mechanism within an encoder-decoder transformer architecture to solve this cooperative routing problem for a UAV-UGV team. Our model minimizes mission time while accounting for the stochastic fuel consumption of the UAV, influenced by environmental factors such as wind velocity, ensuring adherence to a risk threshold to avoid mid-mission energy depletion. Extensive evaluations on various problem sizes show that our method significantly outperforms nearest-neighbor heuristics in both solution quality and risk management. We validate the Ra-DRL policy in a Gazebo-ROS SITL environment with a PX4-based custom UAV and a Clearpath Husky UGV. The results demonstrate the robustness and adaptability of our policy, making it highly effective for mission planning in dynamic, uncertain scenarios.
|
|
ThBT18 |
406 |
RADAR-Based Navigation |
Regular Session |
Chair: Khattak, Shehryar | NASA Jet Propulsion Laboratory |
Co-Chair: Heidingsfeld, Michael | CARIAD SE |
|
09:55-10:00, Paper ThBT18.1 | |
Ground-Aware Automotive Radar Odometry |
|
Casado Herraez, Daniel | University of Bonn & CARIAD SE |
Kaschner, Franz | Technical University of Munich |
Zeller, Matthias | CARIAD SE |
Muhle, Dominik | Technical University of Munich |
Behley, Jens | University of Bonn |
Heidingsfeld, Michael | CARIAD SE |
Cremers, Daniel | Technical University of Munich |
Stachniss, Cyrill | University of Bonn |
Keywords: SLAM, Localization, Autonomous Vehicle Navigation
Abstract: Odometry is crucial for the navigation of autonomous vehicles in unknown environments. While cameras and LiDARs are commonly used to estimate the ego-motion of a vehicle, these sensors face limitations under bad lighting and severe weather conditions. Automotive radars overcome these challenges, but radar point clouds are generally sparse and noisy, making it difficult to identify useful features within a radar scan. In this paper, we address the problem of ego-motion estimation using a single automotive radar sensor. We propose a simple, yet effective, heuristic-based method to extract the ground plane from single radar scans and perform ground plane matching between consecutive scans. Additionally, we perform a windowed factor-graph optimization of the poses together with the ground plane, improving the accuracy of the pose estimation. We put our work to the test using the 4DRadarDataset. Our findings illustrate the state-of-the-art performance of our odometry approach compared to existing alternatives that use radar point clouds.
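As a rough illustration of heuristic ground extraction from a single radar scan (the paper's own heuristic may differ), a RANSAC-style plane fit over the scan points could look like the following, assuming numpy is available and using made-up toy data:

    # Minimal RANSAC-style ground-plane fit on a point cloud: repeatedly fit a
    # plane to three random points and keep the plane with the most inliers.
    import numpy as np

    def fit_ground_plane(points, iters=200, inlier_tol=0.15, seed=2):
        """points: (N, 3) array; returns (normal, d) with n.x + d = 0."""
        rng = np.random.default_rng(seed)
        best_inliers, best_model = 0, None
        for _ in range(iters):
            sample = points[rng.choice(len(points), 3, replace=False)]
            n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
            if np.linalg.norm(n) < 1e-6:
                continue                      # degenerate (collinear) sample
            n = n / np.linalg.norm(n)
            d = -n @ sample[0]
            inliers = int((np.abs(points @ n + d) < inlier_tol).sum())
            if inliers > best_inliers:
                best_inliers, best_model = inliers, (n, d)
        return best_model

    # Toy scan: mostly near-ground returns plus some clutter above the sensor plane.
    rng = np.random.default_rng(3)
    ground = np.c_[rng.uniform(-20, 20, (300, 2)), rng.normal(0.0, 0.05, 300)]
    clutter = rng.uniform([-20, -20, 0.5], [20, 20, 3.0], (50, 3))
    normal, d = fit_ground_plane(np.vstack([ground, clutter]))
    print("ground normal:", np.round(normal, 2))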
|
|
10:00-10:05, Paper ThBT18.2 | |
CAO-RONet: A Robust 4D Radar Odometry with Exploring More Information from Low-Quality Points |
|
Li, Zhiheng | Northeastern University |
Cui, Yubo | Northeastern University |
Huang, Ningyuan | Northeastern University |
Pang, Chenglin | Northeastern University |
Fang, Zheng | Northeastern University |
Keywords: Localization, SLAM, Visual Learning
Abstract: Recently, 4D millimetre-wave radar has exhibited more stable perception ability than LiDAR and cameras under adverse conditions (e.g., rain and fog). However, low-quality radar points hinder its application, especially for the odometry task, which requires dense and accurate matching. To fully explore the potential of 4D radar, we introduce a learning-based odometry framework enabling robust ego-motion estimation from limited and uncertain geometric information. First, for sparse radar points, we propose a local completion to supplement missing structures and provide a denser guideline for aligning two frames. Then, a context-aware association with a hierarchical structure flexibly matches points of different scales aided by feature similarity, and improves local matching consistency through correlation balancing. Finally, we present a window-based optimizer that uses historical priors to establish a coupled state estimation and correct errors of inter-frame matching. The superiority of our algorithm is confirmed on the View-of-Delft dataset, achieving around a 50% performance improvement over previous approaches and delivering accuracy on par with LiDAR odometry. The code will be released at https://github.com/NEU-REAL/CAO-RONet.
|
|
10:05-10:10, Paper ThBT18.3 | |
Radar Teach and Repeat: Architecture and Initial Field Testing |
|
Qiao, Xinyuan | University of Toronto |
Krawciw, Alec | University of Toronto |
Lilge, Sven | University of Toronto |
Barfoot, Timothy | University of Toronto |
Keywords: Field Robots, Autonomous Vehicle Navigation, Localization
Abstract: Frequency-modulated continuous-wave (FMCW) scanning radar has emerged as an alternative to spinning LiDAR for state estimation on mobile robots. Radar's longer wavelength is less affected by small particulates, providing operational advantages in challenging environments such as dust, smoke, and fog. This paper presents Radar Teach and Repeat (RT&R): a full-stack radar system for long-term off-road robot autonomy. RT&R can drive routes reliably in off-road cluttered areas without any GPS. We benchmark the radar system's closed-loop path-tracking performance and compare it to its 3D LiDAR counterpart. 11.8 km of autonomous driving was completed without interventions using only radar and gyro for navigation. RT&R was evaluated on four different routes with progressively less structured scene geometry. RT&R achieved lateral path-tracking root mean squared errors (RMSE) of 5.6 cm, 7.5 cm, and 12.1 cm as the routes became more challenging. These RMSE values are less than half of the width of one tire (24 cm) on our robot testing platform. These same routes have worst-case errors of 21.7 cm, 24.0 cm, and 43.8 cm. We conclude that radar is a viable alternative to LiDAR for long-term autonomy in challenging off-road scenarios. The implementation of RT&R is open-source and available at: https://github.com/utiasASRL/vtr3.
|
|
10:10-10:15, Paper ThBT18.4 | |
Structure-Aware Radar-Camera Depth Estimation |
|
Zhang, Fuyi | Zhejiang University |
Yu, Zhu | Zhejiang University |
Li, ChunHao | Zhejiang University |
Zhang, Runmin | Zhejiang University |
Bai, Xiaokai | Zhejiang University |
Zhou, Zili | Zhejiang University |
Cao, Siyuan | Zhejiang University |
Wang, Fang | Hangzhou City University |
Shen, Hui-liang | Zhejiang University |
Keywords: Deep Learning for Visual Perception, RGB-D Perception, Visual Learning
Abstract: Radar has gained much attention in autonomous driving due to its accessibility and robustness. However, its standalone application for depth perception is constrained by issues of sparsity and noise. Radar-camera depth estimation offers a more promising complementary solution. Despite significant progress, current approaches fail to produce satisfactory dense depth maps, due to unsatisfactory processing of the sparse and noisy radar data. They constrain the regions of interest for radar points to rigid rectangular regions, which may introduce unexpected errors and confusion. To address these issues, we develop a structure-aware strategy for radar depth enhancement, which provides more targeted regions of interest by leveraging the structural priors of RGB images. Furthermore, we design a Multi-Scale Structure Guided Network to enhance radar features and preserve detailed structures, achieving accurate and structure-detailed dense metric depth estimation. Building on these, we propose a structure-aware radar-camera depth estimation framework, named SA-RCD. Extensive experiments demonstrate that our SA-RCD achieves state-of-the-art performance on the nuScenes dataset. Our code will be available at https://github.com/FreyZhangYeh/SA-RCD.
|
|
10:15-10:20, Paper ThBT18.5 | |
Doppler Former: Velocity Supervision of Raw Radar Data |
|
Zhao, Shuo | Megvii |
Sun, Wei | Fvidar |
Li, Huadong | MEGVII Technique |
Jiang, Zhaoying | Southeast University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Manufacturing
Abstract: Thanks to the high robustness of 4D millimeter-wave radar in various environments, it has been widely applied in the field of autonomous driving. Recent research has increasingly focused on utilizing raw radar data as a substitute for the sparse and noisy point cloud data. However, these approaches have not fully exploited the Doppler features present in the raw data. In this paper, we introduce the Doppler Former (DPF) module to efficiently extract velocity information from the target environment. DPF can be seamlessly integrated into most radar perception backbones and enhances their performance in downstream tasks. Additionally, we propose a new backbone, the Fully Complex Convolutional Network (FCCN), which is better suited to raw data. By incorporating the DPF module into FCCN, we achieved state-of-the-art (SOTA) performance on the RADIal dataset, with code available at https://github.com/coconut-zs/Fvidar-DopplerFormer.
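As a rough sketch of the kind of building block a fully complex convolutional backbone relies on (illustrative only, not the FCCN architecture; PyTorch is assumed available), a complex 2D convolution can be composed from two real convolutions:

    # Complex 2D convolution built from two real convolutions, applied to the
    # real/imaginary channels of a toy range-Doppler map. Illustrative sketch.
    import torch
    import torch.nn as nn

    class ComplexConv2d(nn.Module):
        def __init__(self, in_ch, out_ch, k=3, padding=1):
            super().__init__()
            self.conv_r = nn.Conv2d(in_ch, out_ch, k, padding=padding)
            self.conv_i = nn.Conv2d(in_ch, out_ch, k, padding=padding)

        def forward(self, x_r, x_i):
            # (a + ib) convolved with (Wr + iWi) = (Wr*a - Wi*b) + i(Wr*b + Wi*a)
            real = self.conv_r(x_r) - self.conv_i(x_i)
            imag = self.conv_r(x_i) + self.conv_i(x_r)
            return real, imag

    x_r, x_i = torch.randn(1, 1, 64, 64), torch.randn(1, 1, 64, 64)
    out_r, out_i = ComplexConv2d(1, 8)(x_r, x_i)
    print(out_r.shape, out_i.shape)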
|
|
10:20-10:25, Paper ThBT18.6 | |
Robust High-Speed State Estimation for Off-Road Navigation Using Radar Velocity Factors |
|
Nissov, Morten | NTNU |
Edlund, Jeffrey | Jet Propulsion Lab |
Spieler, Patrick | JPL |
Padgett, Curtis | JPL |
Alexis, Kostas | NTNU - Norwegian University of Science and Technology |
Khattak, Shehryar | NASA Jet Propulsion Laboratory |
Keywords: Field Robots, Sensor Fusion, Localization
Abstract: Enabling robot autonomy in complex environments for mission-critical applications requires robust state estimation, particularly under conditions where the exteroceptive sensors on which navigation depends can be degraded by environmental challenges, leading to mission failure. It is precisely in such challenging conditions that the potential of Frequency Modulated Continuous Wave (FMCW) radar sensors is highlighted: a complementary exteroceptive sensing modality with direct velocity-measuring capabilities. In this work we integrate radial speed measurements from an FMCW radar sensor, using a radial speed factor, to provide linear velocity updates in a sliding-window state estimator for fusion with LiDAR pose and IMU measurements. We demonstrate that this augmentation increases the robustness of the state estimator to challenging environmental conditions and to the negative effects they can have on vulnerable exteroceptive modalities. The proposed method is extensively evaluated in robotic field experiments conducted with an autonomous, full-scale, off-road vehicle operating at high speeds (~12 m/s) in complex desert environments. Furthermore, the robustness of the approach is demonstrated for cases of both simulated and real-world degradation of the LiDAR odometry performance, along with comparisons against state-of-the-art methods for radar-inertial odometry on public datasets.
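The core of a radial-speed factor is a simple residual between the measured Doppler speed of a return from a static point and the projection of the ego velocity onto that point's bearing. A minimal sketch follows; the sign convention and function name are assumptions for illustration, not the paper's implementation, and numpy is assumed available:

    # Residual between measured Doppler (radial) speed and the speed predicted
    # from the sensor's velocity, for a static world point seen in the radar frame.
    import numpy as np

    def radial_speed_residual(point_xyz, measured_radial_speed, ego_velocity):
        """point_xyz: target position in the radar frame (m);
        ego_velocity: sensor velocity in the same frame (m/s).
        For a static point, the expected Doppler speed is the negative projection
        of the ego velocity onto the unit bearing vector (sign convention assumed)."""
        bearing = point_xyz / np.linalg.norm(point_xyz)
        predicted = -bearing @ ego_velocity
        return measured_radial_speed - predicted

    # Driving forward at 10 m/s toward a point straight ahead gives ~-10 m/s
    # predicted radial speed under this convention; the residual should be small.
    print(radial_speed_residual(np.array([20.0, 0.0, 0.0]), -9.8,
                                np.array([10.0, 0.0, 0.0])))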
|
|
ThBT19 |
407 |
Active Sensing |
Regular Session |
Chair: Abraham, Ian | Yale University |
Co-Chair: Yau, Wei-Yun | I2R |
|
09:55-10:00, Paper ThBT19.1 | |
Graph-Based SLAM-Aware Exploration with Prior Topo-Metric Information |
|
Bai, Ruofei | Nanyang Technological University |
Guo, Hongliang | Agency for Science Technology and Research |
Yau, Wei-Yun | I2R |
Xie, Lihua | Nanyang Technological University |
Keywords: Planning under Uncertainty, SLAM, Autonomous Vehicle Navigation
Abstract: Autonomous exploration requires a robot to explore an unknown environment while constructing an accurate map using SLAM (Simultaneous Localization and Mapping) techniques. Without prior information, the exploration performance is usually conservative due to the limited planning horizon. This paper exploits a prior topo-metric graph of the environment to benefit both the exploration efficiency and the pose graph reliability in SLAM. Based on the relationship between pose graph reliability and graph topology, we formulate a SLAM-aware path planning problem over the prior graph, which finds a fast exploration path enhanced with the globally informative loop-closing actions to stabilize the SLAM pose graph. A greedy algorithm is proposed to solve the problem, in which we derive theoretical thresholds that significantly prune non-optimal loop-closing actions without affecting the potential informative ones. Furthermore, we incorporate the proposed planner into a hierarchical exploration framework, with flexible features including path replanning, and online prior graph update that adds additional information to the prior graph. Simulation and real-world experiments indicate that the proposed method can reliably achieve higher mapping accuracy than compared methods when exploring environments with rich topologies, while maintaining comparable exploration efficiency. Our method is open-sourced on GitHub.
|
|
10:00-10:05, Paper ThBT19.2 | |
Dynamic Multi-Objective Ergodic Path Planning Using Decomposition Methods |
|
Breitfeld, Abigail | Carnegie Mellon University |
Wettergreen, David | Carnegie Mellon University |
Keywords: Motion and Path Planning, Space Robotics and Automation, Field Robots
Abstract: Robots are often employed in hazardous or inaccessible environments, such as disaster sites, extraterrestrial terrains, agricultural fields, and ocean floors. Autonomous operation is crucial in these scenarios to reduce reliance on human operators and enable real-time decision-making. However, robots must balance multiple, often conflicting, objectives. These objectives are subject to change based on new data or evolving conditions. This paper presents a novel approach to dynamic multi-objective trajectory planning. The proposed method leverages the boundary intersection decomposition technique to adaptively plan trajectories that balance multiple evolving objectives. Our approach ensures efficient and effective exploration by continuously optimizing the trade-offs between changing objectives. We show that our method performs on average 34% better in terms of solution quality on the dynamic multi-objective trajectory planning problem as compared to prior work.
|
|
10:05-10:10, Paper ThBT19.3 | |
Rapid Autonomous Exploration of Large-Scale Environments for Ground Robots Based on Region Partitioning |
|
Wen, Zhi | Xidian University |
Liu, Xiaotao | Xidian University |
Lu, GaoJie | Xidian University |
Liu, Jing | Xidian University |
Keywords: Motion and Path Planning, Vision-Based Navigation, Wheeled Robots
Abstract: Autonomous exploration in large environments often leads to inefficient long backtracking, as distant targets are prioritized over closer ones. To address this issue, we propose a hierarchical planning method based on region partitioning. The space is dynamically partitioned at a coarse resolution, and as exploration progresses, regions with sufficiently large known areas are further subdivided to locate unknown areas more precisely. A utility function considering unknown area size, travel distance, and sequence similarity is used, and a simulated annealing algorithm generates a subregion sequence for global guidance. Within each subregion, a linear acceleration model helps select target points. This method reduces computational load and minimizes long-distance backtracking, enabling more efficient high-frequency planning. Extensive simulations and real-world tests show that our method significantly improves exploration efficiency compared to existing vision-based techniques.
|
|
10:10-10:15, Paper ThBT19.4 | |
MapEx: Indoor Structure Exploration with Probabilistic Information Gain from Global Map Predictions |
|
Ho, Cherie | Carnegie Mellon University |
Kim, Seungchan | Carnegie Mellon University |
Moon, Brady | Carnegie Mellon University |
Parandekar, Aditya | Birla Institute of Technology and Science, Pilani - Goa Campus |
Harutyunyan, Narek | Brown University |
Wang, Chen | University at Buffalo |
Sycara, Katia | Carnegie Mellon University |
Best, Graeme | University of Technology Sydney |
Scherer, Sebastian | Carnegie Mellon University |
Keywords: Planning under Uncertainty, Integrated Planning and Learning
Abstract: Exploration is a critical challenge in robotics, centered on understanding unknown environments. In this work, we focus on structured indoor environments, which often exhibit predictable, repeating patterns. Conventional frontier-based exploration approaches have difficulty leveraging this predictability, relying on simple heuristics such as 'closest first' for exploration. More recent deep learning-based methods predict unknown regions of the map for information gain computation, but these approaches are often sensitive to the predicted map quality or fail to account for sensor coverage. To overcome these issues, our key insight is to jointly reason over what the robot can observe and its uncertainty to calculate probabilistic information gain. We introduce MapEx, a new exploration framework that uses predicted maps to form a probabilistic sensor model for information gain estimation. MapEx generates multiple predicted maps based on observed information, and takes into consideration both the computed variances of the predicted maps and the estimated visible area to estimate the information gain of a given viewpoint. Experiments on the real-world KTH dataset showed an average improvement of 12.4% over a representative map-prediction-based exploration method and 25.4% over the nearest-frontier approach. Website: mapex-explorer.github.io
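A minimal sketch of this kind of ensemble-based gain (score a viewpoint by the prediction variance inside its estimated visible area) could look as follows. It is illustrative, not MapEx's actual information-gain formulation, and assumes numpy is available:

    # Score a candidate viewpoint by how much the ensemble of predicted occupancy
    # maps disagrees within the area the viewpoint is expected to see.
    import numpy as np

    def viewpoint_score(predicted_maps, visible_mask):
        """predicted_maps: (K, H, W) occupancy probabilities from K map predictions;
        visible_mask: (H, W) boolean mask of cells estimated visible from the viewpoint."""
        per_cell_var = predicted_maps.var(axis=0)         # disagreement across predictions
        return float(per_cell_var[visible_mask].sum())    # higher = more to be gained by looking

    rng = np.random.default_rng(4)
    maps = rng.uniform(0.0, 1.0, (8, 100, 100))           # toy ensemble of predicted maps
    mask = np.zeros((100, 100), dtype=bool)
    mask[40:60, 40:80] = True                             # toy visible region for one viewpoint
    print("viewpoint score:", viewpoint_score(maps, mask))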
|
|
10:15-10:20, Paper ThBT19.5 | |
Ergodic Trajectory Optimization on Generalized Domains Using Maximum Mean Discrepancy |
|
Hughes, Christian | Yale University |
Warren, Houston | University of Sydney |
Lee, Darrick | Univ. of Edinburgh |
Ramos, Fabio | University of Sydney, NVIDIA |
Abraham, Ian | Yale University |
Keywords: Motion and Path Planning, Integrated Planning and Control
Abstract: We present a novel formulation of ergodic trajectory optimization that can be specified over general domains using kernel maximum mean discrepancy. Ergodic trajectory optimization is an effective approach that generates coverage paths for problems such as robotic inspection, information gathering, and search and rescue. These optimization schemes compel the robot to spend time in a region proportional to the expected utility of visiting that region. Current methods for ergodic trajectory optimization rely on domain-specific knowledge, e.g., a defined utility map and well-defined spatial basis functions, to produce ergodic trajectories. Here, we present a generalization of ergodic trajectory optimization based on maximum mean discrepancy that requires only samples from the search domain. We demonstrate the ability of our approach to produce coverage trajectories on a variety of problem domains, including robotic inspection of objects with differential kinematic constraints and on Lie groups, without having access to domain-specific knowledge. Furthermore, we show favorable computational scaling compared to existing state-of-the-art methods for ergodic trajectory optimization, with a trade-off between domain-specific knowledge and computational scaling, thus extending the versatility of ergodic coverage to a wider range of application domains.
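The quantity such a formulation drives down is the kernel MMD between points along the trajectory and samples drawn from the target (utility) distribution. A minimal numpy sketch with an RBF kernel, illustrative only and not the paper's optimizer:

    # Squared kernel MMD between trajectory points and target-distribution samples.
    import numpy as np

    def rbf(a, b, sigma=0.5):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def mmd2(traj_pts, target_pts, sigma=0.5):
        kxx = rbf(traj_pts, traj_pts, sigma).mean()
        kyy = rbf(target_pts, target_pts, sigma).mean()
        kxy = rbf(traj_pts, target_pts, sigma).mean()
        return kxx + kyy - 2.0 * kxy

    rng = np.random.default_rng(5)
    trajectory = rng.uniform(0.0, 1.0, (200, 2))           # points visited by the robot
    target = rng.normal([0.5, 0.5], 0.1, (500, 2))         # samples from the utility distribution
    print("MMD^2:", mmd2(trajectory, target))               # smaller = more "ergodic" coverage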
|
|
10:20-10:25, Paper ThBT19.6 | |
Ergodic Exploration Over Meshable Surfaces |
|
Dong, Dayi, E | University of California Berkeley |
Xu, Albert | Carnegie Mellon University |
Gutow, Geordan | Carnegie Mellon University |
Choset, Howie | Carnegie Mellon University |
Abraham, Ian | Yale University |
Keywords: Motion and Path Planning, Search and Rescue Robots, Computational Geometry
Abstract: Robotic search and rescue, exploration, and inspection require trajectory planning across a variety of domains. A popular approach to trajectory planning for these types of missions is ergodic search, which biases a trajectory to spend time in parts of the exploration domain that are believed to contain more information. Most prior work on ergodic search has been limited to searching simple surfaces, like a 2D Euclidean plane or a sphere, as it relies on projecting functions defined on the exploration domain onto analytically obtained Fourier basis functions. In this paper, we extend ergodic search to any surface that can be approximated by a triangle mesh. The basis functions are approximated through finite element methods on a triangle mesh of the domain. We formally prove that this approximation converges to the continuous case as the mesh approximation converges to the true domain. We demonstrate that on domains where analytical basis functions are available (plane, sphere), the proposed method obtains equivalent results, while on other domains (torus, bunny, wind turbine) the approach is versatile enough to still search effectively. Lastly, we also compare with an existing ergodic search technique that can handle complex domains and show that our method results in higher-quality exploration.
|
|
10:25-10:30, Paper ThBT19.7 | |
FALCON: Fast Autonomous Aerial Exploration Using Coverage Path Guidance |
|
Zhang, Yichen | The Hong Kong University of Science and Technology |
Chen, Xinyi | The Hong Kong University of Science and Technology |
Feng, Chen | Hong Kong University of Science and Technology |
Zhou, Boyu | Southern University of Science and Technology |
Shen, Shaojie | Hong Kong University of Science and Technology |
Keywords: Aerial Systems: Perception and Autonomy, Aerial Systems: Applications, Motion and Path Planning, Autonomous Exploration
Abstract: This paper introduces FALCON, a novel Fast Autonomous expLoration framework using COverage path guidaNce, which aims at setting a new performance benchmark in the field of autonomous aerial exploration. FALCON effectively harnesses the full potential of online generated coverage paths in enhancing exploration efficiency. The framework begins with an incremental connectivity-aware space decomposition and connectivity graph construction. Subsequently, a hierarchical planner generates a coverage path spanning the entire unexplored space, serving as a global guidance. Then, a local planner optimizes the frontier visitation order, consciously incorporating the intention of the global guidance. For fair and comprehensive benchmark experiments, we introduce a lightweight exploration planner evaluation environment that allows for comparing exploration planners across a variety of testing scenarios using an identical quadrotor simulator. Extensive benchmark experiments and ablation studies demonstrate the significant performance of FALCON. Real-world experiments conducted fully onboard further validate FALCON’s practical capability in complex and challenging environments.
|
|
ThBT20 |
408 |
Agricultural Automation 2 |
Regular Session |
Chair: Chowdhary, Girish | University of Illinois at Urbana Champaign |
Co-Chair: Cappelleri, David | Purdue University |
|
09:55-10:00, Paper ThBT20.1 | |
Improving Robotic Fruit Harvesting within Cluttered Environments through 3D Shape Completion |
|
Magistri, Federico | University of Bonn |
Pan, Yue | University of Bonn |
Bartels, Jake | Queensland University of Technology (QUT) |
Behley, Jens | University of Bonn |
Stachniss, Cyrill | University of Bonn |
Lehnert, Christopher | Queensland University of Technology |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, Perception for Grasping and Manipulation
Abstract: The world population is increasing and will, by 2050, nearly double its demand for food, feed, fuel, and fiber. Besides environmental challenges, labor shortages also pose crucial challenges to the agricultural production system. Automation of manual tasks in crop production can potentially increase efficiency but also lead to a change in agricultural practices toward more effective usage of available land. In this paper, we address the problem of robotic fruit harvesting in challenging real-world scenarios such as vertical farms, where robotic sensing and acting need to cope with a cluttered environment. Robotic fruit harvesting is typically done by directly detecting a grasp point in the sensor reading, which can lie on the fruit itself or on its peduncle depending on crop harvesting requirements. However, grasp point detection is not always possible, as the ideal grasp point may be hidden behind leaves or other fruits. Our approach exploits shape completion techniques, allowing us to estimate the complete 3D shape of a target fruit together with its pose even under strong occlusions. In this way, we can estimate a grasp point even when the fruit is only partially visible. We evaluate our approach on a real robotic manipulator operating in a vertical farm growing different fruit species and employing different harvesting tools. Our experiments show that, on average, our proposed pipeline increases the success rate by 18.5 percentage points, in terms of end-effector positioning, compared to the most competitive baseline reported in this work, which does not rely on shape completion.
|
|
10:00-10:05, Paper ThBT20.2 | |
P-AgSLAM: In-Row and Under-Canopy SLAM for Agricultural Monitoring in Cornfields |
|
Kim, Kitae | Purdue University |
Deb, Aarya | Purdue University |
Cappelleri, David | Purdue University |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, SLAM
Abstract: In this paper, we present an in-row and under-canopy Simultaneous Localization and Mapping (SLAM) framework called the Purdue AgSLAM, or P-AgSLAM, which is designed for robot pose estimation and agricultural monitoring in cornfields. Our SLAM approach is primarily based on a 3D light detection and ranging (LiDAR) sensor and is designed for the extraction of unique morphological features of cornfields, which have significantly different characteristics from structured indoor and outdoor urban environments. The performance of the proposed approach has been validated with experiments in simulation and in real cornfield environments. P-AgSLAM outperforms existing state-of-the-art LiDAR-based state estimators in robot pose estimation and mapping.
|
|
10:05-10:10, Paper ThBT20.3 | |
Robotic Mushroom Harvesting with Real2Sim2Real and Model Predictive Path Integral (MPPI) Based Planning |
|
Vasios, Konstantinos | University of Essex |
Porichis, Antonis | University of Essex |
Mohan, Vishwanathan | University of Essex |
Chatzakos, Panagiotis | University of Essex AI Innovation Centre |
Keywords: Agricultural Automation, Manipulation Planning, Dexterous Manipulation
Abstract: We present a strategy for robotic button mushroom (Agaricus bisporus) harvesting that involves a Real2Sim2Real pipeline with dynamic scene reconstruction and a Model Predictive Path Integral (MPPI) control and planning architecture for generating optimal uprooting motion primitives based on a physics engine simulation framework. Given the complex, nonlinear, anisotropic material properties of the mushrooms, combined with the multiple failure modes involved, we design a simulation framework around the PyBullet rigid-body physics engine by utilizing first-order approximations of the equivalent continuum mechanics models. By exploiting the computational efficiency of this simulation framework, we directly apply the MPPI control framework to generate offline optimal mushroom uprooting motion primitives, defining a set of cost objectives for an optimal and within-constraint harvesting plan. We show that with this planning strategy, the "root-bending" action emerges autonomously for the single mushroom case as an optimal uprooting maneuver, which corresponds well to empirical knowledge obtained from expert pickers. A video demonstration of the proposed architecture can be found at https://youtu.be/k38ePBsBego.
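For readers unfamiliar with MPPI, its basic update can be sketched on a toy one-dimensional model in a few lines: sample control perturbations, roll them out, weight each rollout by its exponentiated negative cost, and average. This is illustrative only (not the authors' PyBullet simulator or cost terms), with numpy assumed available:

    # Minimal MPPI loop on a toy 1-D model: push-velocity sequence u is refined by
    # importance-weighted averaging of sampled perturbations.
    import numpy as np

    rng = np.random.default_rng(6)
    H, K, lam, sigma = 20, 128, 1.0, 0.2      # horizon, samples, temperature, noise std
    u = np.zeros(H)                           # nominal control (push-velocity) sequence
    target = 1.0                              # desired displacement

    def rollout_cost(controls):
        x = 0.0
        for v in controls:
            x += 0.05 * v                     # toy dynamics: displacement per step
        return (x - target) ** 2 + 1e-3 * np.sum(controls ** 2)

    for _ in range(50):                       # MPPI iterations
        noise = rng.normal(0.0, sigma, (K, H))
        costs = np.array([rollout_cost(u + noise[k]) for k in range(K)])
        weights = np.exp(-(costs - costs.min()) / lam)
        weights /= weights.sum()
        u = u + weights @ noise               # importance-weighted perturbation average

    print("final cost:", rollout_cost(u))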
|
|
10:10-10:15, Paper ThBT20.4 | |
Collision-Aware Traversability Analysis for Autonomous Vehicles in the Context of Agricultural Robotics |
|
Philippe, Florian | Université De Haute-Alsace |
Laconte, Johann | French National Research Institute for Agriculture, Food and The |
Lapray, Pierre-Jean | Université De Haute-Alsace |
Spisser, Matthias | Technology & Strategy Engineering SAS |
Lauffenburger, Jean-Philippe | Université De Haute-Alsace |
Keywords: Agricultural Automation, Sensor Fusion, Collision Avoidance
Abstract: In this paper, we introduce a novel method for safe navigation in agricultural robotics. As global environmental challenges intensify, robotics offers a powerful solution to reduce chemical usage while meeting the increasing demands for food production. However, significant challenges remain in ensuring the autonomy and resilience of robots operating in unstructured agricultural environments. Obstacles such as crops and tall grass, which are deformable, must be identified as safely traversable, compared to rigid obstacles. To address this, we propose a new traversability analysis method based on a 3D spectral map reconstructed using a LIDAR and a multispectral camera. This approach enables the robot to distinguish between safe and unsafe collisions with deformable obstacles. We perform a comprehensive evaluation of multispectral metrics for vegetation detection and incorporate these metrics into an augmented environmental map. Utilizing this map, we compute a physics-based traversability metric that accounts for the robot’s weight and size, ensuring safe navigation over deformable obstacles.
|
|
10:15-10:20, Paper ThBT20.5 | |
Enhanced View Planning for Robotic Harvesting: Tackling Occlusions with Imitation Learning |
|
Li, Lun | University of Groningen |
Kasaei, Hamidreza | University of Groningen |
Keywords: Agricultural Automation, Robotics and Automation in Agriculture and Forestry, Imitation Learning
Abstract: In agricultural automation, inherent occlusion presents a major challenge for robotic harvesting. We propose an imitation learning-based viewpoint planning approach to actively adjust the camera viewpoint and capture unobstructed images of the target crop. Traditional viewpoint planners and existing learning-based methods, which depend on manually designed evaluation metrics or reward functions, often struggle to generalize to complex, unseen scenarios. Our method employs the Action Chunking with Transformers (ACT) algorithm to learn effective camera motion policies from expert demonstrations. This enables continuous six-degree-of-freedom (6-DoF) viewpoint adjustments that are smoother, more precise, and better at revealing occluded targets. Extensive experiments in both simulated and real-world environments, featuring agricultural scenarios and a 6-DoF collaborative robot arm equipped with an RGB-D camera, demonstrate our method's superior success rate and efficiency, especially in complex occlusion conditions, as well as its ability to generalize across different crops without reprogramming. This study advances robotic harvesting by providing a practical "learn from demonstration" (LfD) solution to occlusion challenges, ultimately enhancing autonomous harvesting performance and productivity.
|
|
10:20-10:25, Paper ThBT20.6 | |
Precision Harvesting in Cluttered Environments: Integrating End Effector Design with Dual Camera Perception |
|
Koe, Kendall | University of Illinois Urbana Champaign |
Shah, Poojan Kalpeshbhai | University of Illinois |
Walt, Benjamin | University of Illinois Urbana-Champaign |
Westphal, Jordan | University of Illinois at Urbana-Champaign |
Marri, Samhita | University of Illinois at Urbana Champaign |
Kamtikar, Shivani Kiran | University of Illinois at Urbana-Champaign |
Nam, James Seungbum | University of Illinois at Urbana-Champaign |
Uppalapati, Naveen Kumar | University of Illinois at Urbana-Champaign |
Chowdhary, Girish | University of Illinois at Urbana Champaign |
Krishnan, Girish | University of Illinois Urbana Champaign |
Keywords: Agricultural Automation, Robotics and Automation in Agriculture and Forestry, Field Robots
Abstract: Due to labor shortages in specialty crop industries, a need for robotic automation to increase agricultural efficiency and productivity has arisen. Previous manipulation systems harvest well in uncluttered and structured environments. High tunnel environments are more compact and cluttered in nature, requiring a rethinking of the large form factor systems and grippers. We propose a novel co-designed framework incorporating a global detection camera and a local eye-in-hand camera that demonstrates precise localization of small fruits via closed-loop visual feedback and reliable error handling. Field experiments in high tunnels show that our system can reach 85.0% of cherry tomato fruit in 10.98s on average.
|
|
10:25-10:30, Paper ThBT20.7 | |
S^2BEV: Lightweight, Robust, and Precise SLAM-Oriented Segmentation Bird Eye’s View Mapping Approach |
|
Sun, Yefeng | Shanghai Jiao Tong University |
Gong, Liang | Shanghai Jiao Tong University |
Dai, Jialing | University of Chinese Academy of Sciences |
Bishu, Gao | Shanghai Jiao Tong University |
Cai, Jinghan | Shanghai Jiao Tong University |
Lin, Gengjie | Shanghai Jiao Tong University |
Moutarde, Fabien | MINES ParisTech - PSL University |
Lu, Junguo | Shanghai Jiaotong University |
Liu, Chengliang | Shanghai Jiao Tong University |
Keywords: Agricultural Automation, Robotics and Automation in Agriculture and Forestry, Mapping
Abstract: As modern agriculture progresses, the swift deployment of accurate maps becomes essential for the autonomous navigation and operation of orchard robots. Traditional mapping techniques often fall short in addressing the challenges posed by orchards, which are characterized by unstructured, dynamically changing environments with complex spatial and temporal dynamics due to seasonal changes and continuous operations. This paper proposes a new approach to orchard map construction that merges topological maps with semantic SLAM and leverages semantic segmentation to separate topologically invariant structure from volatile orchard scenery during mapping. This integration enables the creation, optimization, and rapid deployment of maps that are not only lightweight and robust but also precise. To evaluate the effectiveness of our method, we performed navigation tests in orchard environments using the newly developed maps. The experimental outcomes demonstrated a significant reduction in CPU usage, with maximum and average reductions of 7.6% and 4.5%, respectively. This approach not only enhances navigation efficiency but also facilitates quicker map deployment, effectively freeing computational resources for other critical tasks.
|
|
ThBT21 |
410 |
Manipulation Planning and Control 2 |
Regular Session |
Chair: Hu, Ai-Ping | Georgia Tech Research Institute |
Co-Chair: Misimi, Ekrem | SINTEF Ocean |
|
09:55-10:00, Paper ThBT21.1 | |
Non-Prehensile Object Transport by Nonholonomic Robots Connected by Linear Deformable Elements |
|
Zhi, Hui | The Hong Kong Polytechnic University |
Zhang, Bin | The Hong Kong Polytechnic University |
Qi, Jiaming | Centre for Transformative Garment Production, HongKong |
Romero Velazquez, Jose Guadalupe | ITAM |
Shao, Xiaodong | Beihang University |
Yang, Chenguang | University of Liverpool |
Navarro-Alarcon, David | The Hong Kong Polytechnic University |
Keywords: Motion Control, Constrained Motion Planning, Soft Robot Applications
Abstract: This paper presents a new method to automatically transport objects with mobile robots via non-prehensile actions. Our proposed approach utilizes a pair of nonholonomic robots connected by a deformable tube to efficiently manipulate objects of irregular shapes toward target locations. To autonomously perform this task, we develop a local integrated planning and control strategy that solves the problem in two steps (viz. enveloping and transport) based on the model predictive control (MPC) framework. The deformable underactuated system is simplified by a linear kinematic model. The enveloping problem is formulated as the minimization of multiple criteria that represent the enclosing error of the object by the variable morphology system. The transport problem is tackled by formulating the non-prehensile dragging action as an inequality constraint specified by the body frame of the deformable system. Reactive obstacle avoidance is ensured by a maximum margin-based term that utilizes the system's geometry and the feedback proximity to the environment. To validate the performance of the proposed methodology, we report a detailed experimental study with vision-guided robotic prototypes conducting multiple autonomous object transport tasks.
|
|
10:00-10:05, Paper ThBT21.2 | |
Implicit Physics-Aware Policy for Dynamic Manipulation of Rigid Objects Via Soft Body Tools |
|
Wang, Zixing | Purdue University |
Qureshi, Ahmed H. | Purdue University |
Keywords: Deep Learning in Grasping and Manipulation, Learning from Demonstration, Sensorimotor Learning
Abstract: Recent advancements in robot tool use have unlocked their usage for novel tasks, yet the predominant focus is on rigid-body tools, while the investigation of soft-body tools and their dynamic interaction with rigid bodies remains unexplored. This paper takes a pioneering step towards dynamic one-shot soft tool use for manipulating rigid objects, a challenging problem posed by complex interactions and unobservable physical properties. To address these problems, we propose the Implicit Physics-aware (IPA) policy, designed to facilitate effective soft tool use across various environmental configurations. The IPA policy conducts system identification to implicitly identify physics information and predict goal-conditioned, one-shot actions accordingly. We validate our approach through a challenging task, i.e., transporting rigid objects using soft tools such as ropes to distant target positions in a single attempt under unknown environment physics parameters. Our experimental results indicate the effectiveness of our method in efficiently identifying physical properties, accurately predicting actions, and smoothly generalizing to real-world environments. The related video is available at: https://youtu.be/4hPrUDTc4Rg?si=WUZrT2vjLMt8qRWA
|
|
10:05-10:10, Paper ThBT21.3 | |
General-Purpose Clothes Manipulation with Semantic Keypoints |
|
Deng, Yuhong | National University of Singapore |
Hsu, David | National University of Singapore |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Representation Learning
Abstract: Clothes manipulation is a critical capability for household robots; yet, existing methods are often confined to specific tasks, such as folding or flattening, due to the complex high-dimensional geometry of deformable fabric. This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP) for general-purpose clothes manipulation, which enables the robot to perform diverse manipulation tasks over different types of clothes. The key idea of CLASP is semantic keypoints---e.g., "right shoulder", "left sleeve", etc.---a sparse spatial-semantic representation that is salient for both perception and action. Semantic keypoints of clothes can be effectively extracted from depth images and are sufficient to represent a broad range of clothes manipulation policies. CLASP leverages semantic keypoints to bridge LLM-powered task planning and low-level action execution in a two-level hierarchy. Extensive simulation experiments show that CLASP outperforms baseline methods across diverse clothes types in both seen and unseen tasks. Further, experiments with a Kinova dual-arm system on four distinct tasks---folding, flattening, hanging, and placing---confirm CLASP's performance on a real robot.
|
|
10:10-10:15, Paper ThBT21.4 | |
Robust Optical Transceiver Manipulation in Cluttered Cable Environments Using 3D Scene Understanding and Planning |
|
Sarantopoulos, Iason | Microsoft Research |
Liu, Chenyu | Peking University |
Weng, Bohong | University of Science and Technology of China |
Xu, Sicheng | Microsoft Research Asia |
Zhang, Yizhong | Microsoft |
Yang, Jiaolong | Microsoft Research |
Tong, Xin | Microsoft |
Otto, Fabian | Microsoft Research |
Sweeney, David | Microsoft Research |
Chatzieleftheriou, Andromachi | Microsoft |
Rowstron, Antony | Microsoft Research |
Keywords: Perception for Grasping and Manipulation, Object Detection, Segmentation and Categorization, Constrained Motion Planning
Abstract: Robotic manipulation in cluttered environments presents significant challenges, particularly when the clutter includes thin, deformable objects like cables, which complicate perception and decision-making processes. In the context of datacenters, the automation of networking tasks often involves the manipulation of optical transceivers within densely packed cable configurations. Such environments are characterized by an abundance of delicate, overlapping, and intersecting cables, leading to frequent occlusions. This paper introduces an innovative system designed for the manipulation of optical transceivers in environments cluttered by cables. Our integrated approach combines advanced 3D scene understanding with a heuristic-based pushing policy to effectively manipulate optical transceivers amidst clutter. The system's perception component utilizes image segmentation and 3D reconstruction to accurately model the transceivers and surrounding cables. Meanwhile, the planning aspect employs a search algorithm with task-specific heuristics, to navigate the gripper, displace obstructing cables, and safely achieve a precise pre-grasp position in front of the target transceiver. We have conducted extensive evaluations of our methodology in both simulated and real-world settings, demonstrating its high success rates, robustness, and proficiency in addressing the unique challenges posed by cable-occluded environments within datacenters.
|
|
10:15-10:20, Paper ThBT21.5 | |
ReloPush: Multi-Object Rearrangement in Confined Spaces with a Nonholonomic Mobile Robot Pusher |
|
Ahn, Jeeho | University of Michigan |
Mavrogiannis, Christoforos | University of Michigan |
Keywords: Mobile Manipulation, Task and Motion Planning, Manipulation Planning
Abstract: We focus on the problem of rearranging a set of objects within a confined space with a nonholonomically constrained mobile robot pusher. This problem is relevant to many real-world domains, including warehouse automation and construction. These domains give rise to instances involving a combination of geometric, kinematic, and physics constraints, which make planning particularly challenging. Prior work often makes simplifying assumptions, such as the use of holonomic mobile robots or dexterous manipulators capable of unconstrained overhand reaching. Our key insight is that we can empower even a constrained mobile pusher to tackle complex rearrangement tasks by enabling it to modify the environment in its favor in a constraint-aware fashion. To this end, we describe a Push-Traversability graph, whose vertices represent poses from which the pusher can push objects and whose edges represent optimal, kinematically feasible, and stable push-rearrangements of objects. Based on this graph, we develop ReloPush, a planning framework that leverages Dubins curves and standard graph search techniques to generate an efficient sequence of object rearrangements to be executed by the pusher. We evaluate ReloPush across a series of challenging scenarios, involving the rearrangement of densely cluttered workspaces with up to eight objects by a 1/10th-scale mobile robot pusher. ReloPush exhibits orders-of-magnitude faster runtimes and significantly more robust execution in the real world, evidenced by lower execution times and fewer losses of object contact, compared to two baselines lacking our proposed graph structure.
|
|
10:20-10:25, Paper ThBT21.6 | |
Non-Prehensile Shape Manipulation of Elastoplastic Objects with Reinforcement Learning |
|
Herland, Sverre | Norwegian University of Science and Technology |
Misimi, Ekrem | SINTEF Ocean |
Keywords: Deep Learning in Grasping and Manipulation, Reinforcement Learning
Abstract: We present a novel framework for non-prehensile shape manipulation of deformable objects using Deep Reinforcement Learning. Unlike previous approaches that rely on grasping, our method employs a sequence of gentle pushing actions to deform objects into target shapes. We introduce a continuous parametrization of pushing actions that allows for precise control over pushing trajectories, enabling more flexible and efficient manipulation. The framework is applicable to a wide range of objects by representing them as sampled boundary coordinates, removing the need for predefined object partitions. Trained entirely in simulation, our controller demonstrates zero-shot transfer to real-world scenarios without additional training. Extensive evaluations show that our approach not only matches but substantially exceeds the performance of previous methods, while being more gentle and efficient. We demonstrate successful manipulation across various deformable objects and materials, including food items like salmon and pork loin. This work represents a significant advancement in robotic manipulation of deformable objects, with potential applications in food processing, manufacturing, and beyond.
|
|
10:25-10:30, Paper ThBT21.7 | |
ORLA*: Mobile Manipulator-Based Object Rearrangement with Lazy A* |
|
Gao, Kai | Rutgers University |
Zhaxizhuoma, Zhaxizhuoma | Shanghai Artificial Intelligence Laboratory |
Ding, Yan | SUNY Binghamton |
Zhang, Shiqi | SUNY Binghamton |
Yu, Jingjin | Rutgers University |
Keywords: Mobile Manipulation, Task Planning, Manipulation Planning
Abstract: Effectively performing object rearrangement is an essential skill for mobile manipulators, e.g., setting up a dinner table. A key challenge in such problems is deciding an appropriate ordering to effectively untangle object-object dependencies while considering the necessary motions for realizing manipulation tasks (e.g., pick and place). Computing time-optimal multi-object rearrangement solutions for mobile manipulators remains a largely untapped research direction. In this work, we propose ORLA*, which leverages delayed/lazy evaluation in searching for a high-quality object pick-n-place sequence that considers both end-effector and mobile robot base travel. ORLA* readily handles multi-layered rearrangement tasks powered by learning-based stability predictions. Employing an optimal solver for finding temporary locations for displacing objects, ORLA* can achieve global optimality. Through extensive simulation and ablation study, we confirm the effectiveness of ORLA* delivering quality solutions for challenging rearrangement instances. Supplementary materials are available at: https://gaokai15.github.io/ORLA-Star/
|
|
ThBT22 |
411 |
Imitation Learning for Manipulation 1 |
Regular Session |
Chair: Hoffman, Judy | Georgia Tech |
Co-Chair: Ravichandar, Harish | Georgia Institute of Technology |
|
09:55-10:00, Paper ThBT22.1 | |
Learning Prehensile Dexterity by Imitating and Emulating State-Only Observations |
|
Han, Yunhai | Georgia Institute of Technology |
Chen, Zhenyang | Georgia Institute of Technology |
Williams, Kyle | Sandia National Labs |
Ravichandar, Harish | Georgia Institute of Technology |
Keywords: Imitation Learning, Dexterous Manipulation
Abstract: When humans acquire physical skills (e.g., tennis) from experts, we tend to first learn by merely observing the expert. But this is often insufficient. We then engage in practice, where we try to emulate the expert and ensure that our actions produce similar effects on our environment. Inspired by this observation, we introduce Combining IMitation and Emulation for Motion Refinement (CIMER) -- a two-stage framework to learn dexterous prehensile manipulation skills from state-only observations. CIMER's first stage involves imitation: simultaneously encoding the complex interdependent motions of the robot hand and the object in a structured dynamical system. This results in a reactive motion generation policy that provides a reasonable motion prior, but lacks the ability to reason about contact effects due to the lack of action labels. The second stage involves emulation: learning a motion refinement policy via reinforcement learning that adjusts the robot hand's motion prior such that the desired object motion is reenacted. CIMER is both task-agnostic (no task-specific reward design or shaping) and intervention-free (no additional teleoperated or labeled demonstrations). Detailed experiments with prehensile dexterity reveal that i) imitation alone is insufficient, but adding emulation drastically improves performance, ii) CIMER outperforms existing methods in terms of sample efficiency and the ability to generate realistic and stable motions, and iii) CIMER can either zero-shot generalize or learn to adapt to novel objects from the YCB dataset, even outperforming expert policies trained with action labels in most cases. Source code and videos are available at https://sites.google.com/view/cimer-2024/.
|
|
10:00-10:05, Paper ThBT22.2 | |
EgoMimic: Scaling Imitation Learning Via Egocentric Video |
|
Kareer, Simar | Georgia Tech |
Patel, Dhruv | Georgia Institute of Technology |
Punamiya, Ryan | Georgia Institute of Technology |
Mathur, Pranay | Georgia Institute of Technology |
Cheng, Shuo | Gatech |
Wang, Chen | Stanford University |
Hoffman, Judy | Georgia Tech |
Xu, Danfei | Georgia Institute of Technology |
Keywords: Big Data in Robotics and Automation, Imitation Learning, Transfer Learning
Abstract: The scale and diversity of demonstration data required for imitation learning is a significant challenge. We present EgoMimic, a full-stack framework that scales manipulation through egocentric-view human demonstrations. EgoMimic achieves this through: (1) an ergonomic human data collection system using the Project Aria glasses, (2) a low-cost bimanual manipulator that minimizes the kinematic gap to human data, (3) cross-domain data alignment techniques, and (4) an imitation learning architecture that co-trains on hand and robot data. Compared to prior works that only extract high-level intent from human videos, our approach treats human and robot data equally as embodied demonstration data and learns a unified policy from both data sources. EgoMimic achieves significant improvement on a diverse set of long-horizon, single-arm and bimanual manipulation tasks over state-of-the-art imitation learning methods and enables generalization to entirely new scenes. Finally, we show a favorable scaling trend for EgoMimic, where adding 1 hour of additional hand data is significantly more valuable than 1 hour of additional robot data. Videos and additional information can be found at https://egomimic.github.io/.
|
|
10:05-10:10, Paper ThBT22.3 | |
Neural Dynamics Augmented Diffusion Policy |
|
Wu, Ruihai | Peking University |
Chen, Haozhe | University of Illinois Urbana-Champaign |
Zhang, Mingtong | UIUC |
Lu, Haoran | Peking University |
Li, Yitong | Tsinghua University |
Li, Yunzhu | Columbia University |
Keywords: Imitation Learning, Model Learning for Control, Machine Learning for Robot Control
Abstract: Imitation learning has been proven effective in mimicking demonstrations across various robotic manipulation tasks. However, to develop robust policies, current imitation methods, such as diffusion policy, require training on extensive demonstrations, making data collection labor-intensive. In contrast, model-based planning with dynamics models can effectively cover a sufficient range of configurations using only off-policy data. Yet, without the guidance of expert demonstrations, many tasks are difficult and time-consuming to plan using the dynamics models. Therefore, we take the best of both model learning and imitation learning, and propose neural dynamics augmented imitation learning that covers a large range of scene configurations with few-shot demonstrations. This method trains a robust diffusion policy in a local support region using few-shot demonstrations and rearranges objects outside this region into it using offline-trained neural dynamics models. Extensive experiments across various tasks in both simulations and real-world scenarios, including granular manipulation, contact-rich tasks, and multi-object interaction tasks, have demonstrated that, trained with only 1 to 30 demonstrations, our proposed method can robustly cover a significantly larger area than the policy trained purely from the demonstrations. Our project page is available at: https://dynamics-dp.github.io/.
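A minimal sketch of the two-mode control flow described above, assuming placeholder models throughout (a callable diffusion policy, a learned one-step dynamics function, and a Euclidean notion of the demonstration support region): inside the support region the diffusion policy acts; outside it, a random-shooting planner over the dynamics model pushes the configuration back toward the region.

```python
import numpy as np

def in_support(state, demo_states, radius=0.1):
    # Inside the local region covered by the few-shot demonstrations?
    return np.min(np.linalg.norm(demo_states - state, axis=1)) < radius

def plan_with_dynamics(state, target, dynamics, horizon=10, samples=256):
    # Random-shooting planner over an offline-trained dynamics model.
    best_cost, best_actions = np.inf, None
    for _ in range(samples):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, 2))
        s = state.copy()
        for a in actions:
            s = dynamics(s, a)
        cost = np.linalg.norm(s - target)
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions

def act(state, demo_states, dynamics, diffusion_policy):
    if in_support(state, demo_states):
        return diffusion_policy(state)      # imitation inside the support region
    target = demo_states[np.argmin(np.linalg.norm(demo_states - state, axis=1))]
    return plan_with_dynamics(state, target, dynamics)[0]

# Toy usage with placeholder models.
demos = np.random.uniform(-0.1, 0.1, size=(20, 2))
dyn = lambda s, a: s + 0.05 * a
dp = lambda s: np.zeros(2)
action = act(np.array([0.8, -0.6]), demos, dyn, dp)
```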
|
|
10:10-10:15, Paper ThBT22.4 | |
CAGE: Causal Attention Enables Data-Efficient Generalizable Robotic Manipulation |
|
Xia, Shangning | Shanghai Jiao Tong University |
Fang, Hongjie | Shanghai Jiao Tong University |
Lu, Cewu | ShangHai Jiao Tong University |
Fang, Hao-Shu | Massachusetts Institute of Technology |
Keywords: Imitation Learning, Learning from Demonstration
Abstract: Generalization in robotic manipulation remains a critical challenge, particularly when scaling to new environments with limited demonstrations. This paper introduces CAGE, a novel robotic manipulation policy designed to overcome these generalization barriers by integrating a pre-trained visual representation with a causal attention mechanism. CAGE utilizes the powerful feature extraction capabilities of the vision foundation model DINOv2, combined with LoRA fine-tuning for robust environment understanding. The policy further employs a causal perceiver for effective token compression and a diffusion-based action head with attention to enhance task-specific fine-grained conditioning. With as few as 50 demonstrations from a single training environment, CAGE achieves robust generalization across diverse visual changes in objects, backgrounds, and viewpoints. Extensive experiments validate that CAGE significantly outperforms existing state-of-the-art RGB/RGB-D-based approaches in various manipulation tasks, especially under large distribution shifts. In similar environments, CAGE offers an average 42% increase in task completion rate. While all baselines fail in unseen environments, CAGE manages to obtain a 43% completion rate and a 51% success rate on average, marking a substantial advancement toward the practical deployment of robots in real-world settings. Project website: cage-policy.github.io.
|
|
10:15-10:20, Paper ThBT22.5 | |
RoCoDA: Counterfactual Data Augmentation for Data-Efficient Robot Learning from Demonstrations |
|
Ameperosa, Ezra | Georgia Institute of Technology |
Collins, Jeremy | Georgia Institute of Technology |
Jain, Mrinal | Georgia Institute of Technology |
Garg, Animesh | Georgia Institute of Technology |
Keywords: Imitation Learning, Bimanual Manipulation, Deep Learning Methods
Abstract: Imitation learning in robotics faces significant challenges in generalization due to the complexity of robotic environments and the high cost of data collection. We introduce RoCoDA, a novel method that unifies the concepts of invariance, equivariance, and causality within a single framework to enhance data augmentation for imitation learning. RoCoDA leverages causal invariance by modifying task-irrelevant subsets of the environment state without affecting the policy's output. Simultaneously, we exploit SE(3) equivariance by applying rigid body transformations to object poses and adjusting corresponding actions to generate synthetic demonstrations. We validate RoCoDA through extensive experiments on five robotic manipulation tasks, demonstrating improvements in policy performance, generalization, and sample efficiency compared to state-of-the-art data augmentation methods. Our policies exhibit robust generalization to unseen object poses, textures, and the presence of distractors. Furthermore, we observe emergent behavior such as re-grasping, indicating policies trained with RoCoDA possess a deeper understanding of task dynamics. By leveraging invariance, equivariance, and causality, RoCoDA provides a principled approach to data augmentation in imitation learning, bridging the gap between geometric symmetries and causal reasoning. Project Page: https://rocoda.github.io
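The SE(3)-equivariance component lends itself to a short illustration. The sketch below applies the same random planar rigid-body transform to an object pose and to the corresponding action (here taken to be an end-effector target pose); the SE(2) restriction and the pose conventions are assumptions of this example, not the paper's exact recipe.

```python
import numpy as np

def random_se2(max_xy=0.1, max_yaw=np.pi / 6):
    # A random planar rigid-body transform as a 4x4 homogeneous matrix.
    yaw = np.random.uniform(-max_yaw, max_yaw)
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    T[:2, 3] = np.random.uniform(-max_xy, max_xy, size=2)
    return T

def augment(obj_pose, ee_target_pose, T):
    # Apply the same world-frame transform to the object pose and to the
    # action (an end-effector target pose) so the pair stays consistent.
    return T @ obj_pose, T @ ee_target_pose

obj_pose = np.eye(4)
ee_target = np.eye(4)
ee_target[:3, 3] = [0.4, 0.0, 0.2]
aug_obj, aug_act = augment(obj_pose, ee_target, random_se2())
```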
|
|
10:20-10:25, Paper ThBT22.6 | |
Conditional Neural Expert Processes for Learning Movement Primitives from Demonstration |
|
Yildirim, Yigit | Bogazici University |
Ugur, Emre | Bogazici University |
Keywords: Learning from Demonstration, Deep Learning Methods
Abstract: Learning from Demonstration (LfD) is a widely used technique for skill acquisition in robotics. However, demonstrations of the same skill may exhibit significant variances, or learning systems may attempt to acquire different means of the same skill simultaneously, making it challenging to encode these motions into movement primitives. To address these challenges, we propose an LfD framework, namely the Conditional Neural Expert Processes (CNEP), that learns to assign demonstrations from different modes to distinct expert networks utilizing the inherent information within the latent space to match experts with the encoded representations. CNEP does not require supervision on which mode the trajectories belong to. We compare the performance of CNEP against widely used and powerful LfD methods such as Gaussian Mixture Models, Probabilistic Movement Primitives, and Stable Movement Primitives and show that our method outperforms these baselines on multimodal trajectory datasets. The results reveal enhanced modeling performance for movement primitives, leading to the synthesis of trajectories that more accurately reflect those demonstrated by experts, particularly when the skill demonstrations include intersection points from various trajectories. We evaluated the CNEP model on two real-robot tasks, namely obstacle avoidance and pick-and-place tasks, that require the robot to learn multi-modal motion trajectories and execute the correct primitives given target environment conditions. We also showed that our system is capable of on-the-fly adaptation to environmental changes via an online conditioning mechanism. Lastly, we believe that CNEP offers improved explainability and interpretability by autonomously finding discrete behavior primitives and providing probability values about its expert selection decisions.
|
|
10:25-10:30, Paper ThBT22.7 | |
PRIME: Scaffolding Manipulation Tasks with Behavior Primitives for Data-Efficient Imitation Learning |
|
Gao, Tian | Stanford University |
Nasiriany, Soroush | The University of Texas at Austin |
Liu, Huihan | University of Texas, Austin |
Yang, Quantao | KTH Royal Institute of Technology |
Zhu, Yuke | The University of Texas at Austin |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation, Deep Learning Methods
Abstract: Imitation learning has shown great potential for enabling robots to acquire complex manipulation behaviors. However, these algorithms suffer from high sample complexity in long-horizon tasks, where compounding errors accumulate over the task horizons. We present PRIME (PRimitive-based IMitation with data Efficiency), a behavior primitive-based framework designed for improving the data efficiency of imitation learning. PRIME scaffolds robot tasks by decomposing task demonstrations into primitive sequences, followed by learning a high-level control policy to sequence primitives through imitation learning. Our experiments demonstrate that PRIME achieves a significant performance improvement in multi-stage manipulation tasks, with 10-34% higher success rates in simulation over state-of-the-art baselines and 20-48% on physical hardware.
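One way to picture the scaffolding step is a greedy segmentation of each demonstration into parameterized primitives, after which a high-level policy is trained to predict the primitive sequence by imitation. The two toy primitives, the fixed segment length, and the residual-based selection rule below are illustrative assumptions only.

```python
import numpy as np

def fit_grasp(chunk):
    # Stationary-hand primitive: residual is the total motion within the chunk.
    return float(np.linalg.norm(chunk - chunk[0]))

def fit_reach(chunk):
    # Straight-line reaching primitive: residual is the deviation from a
    # constant-velocity segment between the chunk's endpoints.
    line = np.linspace(chunk[0], chunk[-1], len(chunk))
    return float(np.linalg.norm(chunk - line))

# Ties favor the simpler (stationary) primitive because it is listed first.
PRIMITIVES = {"grasp": fit_grasp, "reach": fit_reach}

def segment(demo, horizon=5):
    # Label each fixed-length chunk with the primitive that reproduces it best;
    # a high-level policy would then learn to predict these labels (and their
    # parameters) from observations via imitation learning.
    labels = []
    for t in range(0, len(demo) - horizon + 1, horizon):
        chunk = demo[t:t + horizon]
        name = min(PRIMITIVES, key=lambda k: PRIMITIVES[k](chunk))
        labels.append((t, name))
    return labels

# A toy demo that moves for 20 steps, then holds still for 20 steps.
demo = np.concatenate([np.cumsum(np.full((20, 2), 0.02), axis=0),
                       np.full((20, 2), 0.4)])
print(segment(demo))
```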
|
|
ThBT23 |
412 |
Diffusion-Based Visual Perception and Learning |
Regular Session |
Chair: Brandt, Laura Eileen | Massachusetts Institute of Technology |
Co-Chair: Nalpantidis, Lazaros | Technical University of Denmark |
|
09:55-10:00, Paper ThBT23.1 | |
Towards Dense and Accurate Radar Perception Via Efficient Cross-Modal Diffusion Model |
|
Zhang, Ruibin | Zhejiang University |
Xue, Donglai | Huzhou Institute of Zhejiang University |
Wang, Yuhan | Nanyang Technological University |
Geng, Ruixu | University of Science and Technology of China |
Gao, Fei | Zhejiang University |
Keywords: Range Sensing, Deep Learning Methods, Mapping
Abstract: Millimeter wave (mmWave) radars have attracted significant attention from both academia and industry due to their capability to operate in extreme weather conditions. However, they face challenges in terms of sparsity and noise interference, which hinder their application in the field of micro aerial vehicle (MAV) autonomous navigation. To this end, this paper proposes a novel approach to dense and accurate mmWave radar point cloud construction via cross-modal learning. Specifically, we introduce diffusion models, which possess state-of-the-art performance in generative modeling, to predict LiDAR-like point clouds from paired raw radar data. We also incorporate the most recent diffusion model inference accelerating techniques to ensure that the proposed method can be implemented on MAVs. We validate the proposed method through extensive benchmark comparisons and real-world experiments, demonstrating its superior performance and generalization ability. Code and pre-trained models will be available at https://github.com/ZJU-FAST-Lab/Radar-Diffusion.
|
|
10:00-10:05, Paper ThBT23.2 | |
DiffMap: Enhancing Map Segmentation with Map Prior Using Diffusion Model |
|
Jia, Peijin | Tsinghua University |
Wen, Tuopu | Tsinghua University |
Luo, Ziang | TsingHua University |
Yang, Mengmeng | Tsinghua University |
Jiang, Kun | Tsinghua University |
Liu, ZiYuan | Tsinghua University |
Tang, Xuewei | Tsinghua University |
Lei, Zhiquan | Tsinghua University |
Cui, Le | DiDi Inc |
Sheng, Kehua | DiDi Inc |
Zhang, Bo | DiDi Inc |
Yang, Diange | Tsinghua University |
Keywords: Mapping, Computer Vision for Transportation, Deep Learning for Visual Perception
Abstract: Constructing high-definition (HD) maps is a crucial requirement for enabling autonomous driving. In recent years, several map segmentation algorithms have been developed to address this need, leveraging advancements in Bird's-Eye View (BEV) perception. However, existing models still encounter challenges in producing realistic and consistent semantic map layouts. One prominent issue is the limited utilization of structured priors inherent in map segmentation masks. In light of this, we propose DiffMap, a novel approach specifically designed to model the structured priors of map segmentation masks using latent diffusion model. By incorporating this technique, the performance of existing semantic segmentation methods can be significantly enhanced and certain structural errors present in the segmentation outputs can be effectively rectified. Notably, the proposed module can be seamlessly integrated into any map segmentation model, thereby augmenting its capability to accurately delineate semantic information. Furthermore, through extensive visualization analysis, our model demonstrates superior proficiency in generating results that more accurately reflect real-world map layouts, further validating its efficacy in improving the quality of the generated maps.
|
|
10:05-10:10, Paper ThBT23.3 | |
AVD2: Accident Video Diffusion for Accident Video Description |
|
Li, Cheng | The Hong Kong University of Science and Technology |
Zhou, Keyuan | Jilin University |
Liu, Tong | Nanjing University of Science and Technology |
Wang, Yu | Beijing Institute of Technology |
Zhuang, Mingqiao | Fudan University |
Gao, Huan-ang | Tsinghua University |
Jin, Bu | Institute of Automation, Chinese Academy of Sciences |
Zhao, Hao | Tsinghua University |
Keywords: Computer Vision for Transportation, Semantic Scene Understanding
Abstract: Traffic accidents present complex challenges for autonomous driving, often creating unpredictable scenarios that hinder accurate system interpretation and responses. Therefore, understanding accident scenarios is crucial for improving safety and gaining public trust. However, current methods struggle to fully explain accident causes and preventive actions. In this work, we introduce AVD2 (Accident Video Diffusion for Accident Video Description), a novel framework that enhances accident scene understanding by generating detailed natural language descriptions and reasoning. Additionally, we propose a new approach for augmenting accident video datasets by generating accident videos with a customized diffusion model, resulting in the EMM-AU (Enhanced Multi-Modal Accident Video Understanding) dataset, a higher-quality, more diverse version of MM-AU. Experimental results demonstrate that using the AVD2 system and training on the EMM-AU dataset achieves state-of-the-art performance in both automated metrics and human evaluations, significantly advancing accident analysis and prevention. Project resources are available at https://an-answer-tree.github.io
|
|
10:10-10:15, Paper ThBT23.4 | |
LDM-ISP: Enhancing Neural ISP for Low Light with Latent Diffusion Models |
|
Wen, Qiang | The Hong Kong University of Science and Technology |
Rao, Zhefan | HKUST |
Xing, Yazhou | The Hong Kong University of Science and Technology |
Chen, Qifeng | HKUST |
Keywords: Deep Learning for Visual Perception, Visual Learning, Computer Vision for Automation
Abstract: Enhancing a low-light noisy RAW image into a well-exposed and clean sRGB image is a significant challenge for modern digital cameras. Prior approaches have difficulties in recovering fine-grained details and true colors of the scene under extremely low-light environments due to near-to-zero SNR. Meanwhile, diffusion models have shown significant progress towards general domain image generation. In this paper, we propose to leverage the pre-trained latent diffusion model to perform the neural ISP for enhancing extremely low-light images. Specifically, to tailor the pre-trained latent diffusion model to operate on the RAW domain, we train a set of lightweight taming modules to inject the RAW information into the diffusion denoising process via modulating the intermediate features of UNet. We further observe different roles of UNet denoising and decoder reconstruction in the latent diffusion model, which inspires us to decompose the low-light image enhancement task into latent-space low-frequency content generation and decoding-phase high-frequency detail maintenance. Through extensive experiments on representative datasets, we demonstrate that our simple design not only achieves state-of-the-art performance in quantitative evaluations but also shows significant superiority in visual comparisons over strong baselines, highlighting the effectiveness of powerful generative priors for neural ISP under extremely low-light environments.
|
|
10:15-10:20, Paper ThBT23.5 | |
SteeredMarigold: Steering Diffusion towards Depth Completion of Largely Incomplete Depth Maps |
|
Gregorek, Jakub | DTU - Technical University of Denmark |
Nalpantidis, Lazaros | Technical University of Denmark |
Keywords: RGB-D Perception, Deep Learning for Visual Perception
Abstract: Even though the depth maps captured by RGB-D sensors deployed in real environments are often characterized by large areas missing valid depth measurements, the vast majority of depth completion methods still assume depth values covering all areas of the scene. To address this limitation, we introduce SteeredMarigold, a training-free, zero-shot depth completion method capable of producing metric dense depth, even for largely incomplete depth maps. SteeredMarigold achieves this by using the available sparse depth points as conditions to steer a denoising diffusion probabilistic model. Our method outperforms relevant top-performing methods on the NYUv2 dataset, in tests where no depth was provided for a large area, achieving state-of-the-art performance and exhibiting remarkable robustness against depth map incompleteness. Our source code is publicly available at https://steeredmarigold.github.io.
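The steering idea can be caricatured as a data-consistency nudge applied inside the denoising loop: wherever sparse depth exists, the current estimate is pulled toward it, while unobserved pixels are left to the generative prior. The denoiser and the update rule below are crude placeholders, so this is a schematic of the conditioning term only, not of the actual diffusion model.

```python
import numpy as np

def steer(x0_hat, sparse_depth, mask, strength=0.8):
    # Pull the current clean-depth estimate toward the available sparse
    # measurements; pixels without measurements are left untouched.
    return x0_hat - strength * mask * (x0_hat - sparse_depth)

def toy_denoising_loop(noisy, sparse_depth, mask, steps=50):
    x = noisy.copy()
    for _ in range(steps):
        x0_hat = 0.5 * (x + x.mean())      # placeholder for the diffusion denoiser
        x0_hat = steer(x0_hat, sparse_depth, mask)
        x = 0.7 * x0_hat + 0.3 * x         # placeholder reverse-diffusion update
    return x

rng = np.random.default_rng(0)
gt = np.tile(np.linspace(1.0, 3.0, 64), (64, 1))          # a simple depth ramp
mask = (rng.random((64, 64)) < 0.02).astype(float)        # ~2% valid depth
completed = toy_denoising_loop(gt + rng.normal(0, 0.5, gt.shape), gt * mask, mask)
```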
|
|
10:20-10:25, Paper ThBT23.6 | |
DualDiff: Dual-Branch Diffusion Model for Autonomous Driving with Semantic Fusion |
|
Li, Haoteng | Xi'an Jiaotong University |
Yang, Zhao | Xi'an Jiaotong University |
Qian, Zezhong | Xi'an Jiaotong University |
Zhao, Gongpeng | University of Science and Technology of China |
Huang, Yuqi | Xi'an Jiaotong University |
Yu, Jun | University of Science and Technology of China |
Zhou, Huazheng | Xi'an Jiaotong University |
Liu, Longjun | Xi'an Jiaotong University |
Keywords: Computer Vision for Transportation, Deep Learning for Visual Perception, Visual Learning
Abstract: Accurate and high-fidelity driving scene reconstruction relies on fully leveraging scene information as conditioning. However, existing approaches, which primarily use 3D bounding boxes and binary maps for foreground and background control, fall short in capturing the complexity of the scene and integrating multi-modal information. In this paper, we propose DualDiff, a dual-branch conditional diffusion model designed to enhance multi-view driving scene generation. We introduce Occupancy Ray Sampling (ORS), a semantic-rich 3D representation, alongside numerical driving scene representation, for comprehensive foreground and background control. To improve cross-modal information integration, we propose a Semantic Fusion Attention (SFA) mechanism that aligns and fuses features across modalities. Furthermore, we design a foreground-aware masked (FGM) loss to enhance the generation of tiny objects. DualDiff achieves state-of-the-art performance in FID score, as well as consistently better results in downstream BEV segmentation and 3D object detection tasks.
|
|
10:25-10:30, Paper ThBT23.7 | |
Anomalies-By-Synthesis: Anomaly Detection Using Generative Diffusion Models for Off-Road Navigation |
|
Ancha, Siddharth | Massachusetts Institute of Technology |
Jiang, Sunshine | Massachusetts Institute of Technology |
Manderson, Travis | McGill University |
Brandt, Laura Eileen | Massachusetts Institute of Technology |
Du, Yilun | MIT |
Osteen, Philip | U.S. Army Research Laboratory |
Roy, Nicholas | Massachusetts Institute of Technology |
Keywords: Deep Learning for Visual Perception, Deep Learning Methods, Visual Learning
Abstract: In order to navigate safely and reliably in off-road environments, robots must detect anomalies that are out-of-distribution (OOD) with respect to the training data. We present an analysis-by-synthesis approach for pixel-wise anomaly detection without making any assumptions about the nature of OOD data. Given an input image, we use a generative diffusion model to synthesize an edited image that removes anomalies while keeping the remaining image unchanged. Then, we formulate anomaly detection as analyzing which image segments were modified by the diffusion model. We propose a novel inference approach for guided diffusion by analyzing the ideal guidance gradient and deriving a principled approximation that bootstraps the diffusion model to predict guidance gradients. Our editing technique operates purely at test time and can be integrated into existing workflows without the need for retraining or fine-tuning. Finally, we use a combination of vision-language foundation models to compare pixels between the original and synthesized images in a learned feature space and detect semantically meaningful edits. Our diffusion-based analysis-by-synthesis method enables accurate anomaly detection for off-road navigation.
|
|
ThCT1 |
302 |
Mobile Manipulation: Planning and Control |
Regular Session |
Chair: Martín-Martín, Roberto | University of Texas at Austin |
Co-Chair: Tsagarakis, Nikos | Istituto Italiano Di Tecnologia |
|
11:15-11:20, Paper ThCT1.1 | |
EHC-MM: Embodied Holistic Control for Mobile Manipulation |
|
Wang, Jiawen | Peking University |
Jin, Yixiang | Samsung Research China – Beijing (SRC-B) |
Shi, Jun | Samsung Research China – Beijing (SRC-B) |
A, Yong | Samsung Research China – Beijing (SRC-B) |
Li, Dingzhe | Beihang University |
Sun, Fuchun | Tsinghua University |
Luo, Dingsheng | Peking University |
Fang, Bin | Beijing University of Posts and Telecommunications / Tsinghua Un |
Keywords: Mobile Manipulation, Embodied Cognitive Science, Whole-Body Motion Planning and Control
Abstract: Mobile manipulation typically entails the base for mobility, the arm for accurate manipulation, and the camera for perception. The principle of Distant Mobility, Close Grasping (DMCG) is essential for holistic control. We propose Embodied Holistic Control for Mobile Manipulation (EHC-MM) with the embodied function sig(w): by formulating the DMCG principle as a Quadratic Programming (QP) problem, sig(w) dynamically balances the robot's emphasis between movement and manipulation, taking into account the robot's state and environment. In addition, we propose Monitor-Position-Based Servoing (MPBS) with sig(w), enabling tracking of the target during operation. This approach enables coordinated control among the robot's base, arm, and camera, enhancing task efficiency. Through extensive simulations and real-world experiments, our approach significantly improves both the success rate and efficiency of mobile manipulation tasks, achieving a 95.6% success rate in real-world scenarios and a 52.8% increase in time efficiency.
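A rough sketch of how a sigmoid weighting function could arbitrate between base and arm inside a quadratic program: far from the target the arm's velocities are penalized (favoring mobility), and close to the target the base's are (favoring grasping). The damped least-squares form, the 2-DoF-base/3-DoF-arm split, and the weighting direction are assumptions of this illustration, not the paper's formulation.

```python
import numpy as np

def sig(d, d0=1.0, k=5.0):
    # Smoothly approaches 1 far from the target and 0 close to it.
    return 1.0 / (1.0 + np.exp(-k * (d - d0)))

def holistic_velocity(J, v_des, dist_to_target):
    w = sig(dist_to_target)
    # Far from the target: penalize arm motion (favor the base); near the
    # target: penalize base motion (favor the arm). The small offsets keep
    # the problem well conditioned.
    weights = np.concatenate([np.full(2, 1.0 - w) + 1e-3,          # base DoFs
                              np.full(J.shape[1] - 2, w) + 1e-3])  # arm DoFs
    W = np.diag(weights)
    # Closed-form solution of min ||J qdot - v_des||^2 + qdot^T W qdot.
    return np.linalg.solve(J.T @ J + W, J.T @ v_des)

J = np.random.randn(3, 5)                       # 3-D task, 2 base + 3 arm DoFs
qdot = holistic_velocity(J, np.array([0.1, 0.0, 0.05]), dist_to_target=2.0)
```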
|
|
11:20-11:25, Paper ThCT1.2 | |
BUMBLE: Unifying Reasoning and Acting with Vision-Language Models for Building-Wide Mobile Manipulation |
|
Shah, Rutav | The University of Texas at Austin |
Yu, Albert | UT Austin |
Zhu, Yifeng | The University of Texas at Austin |
Zhu, Yuke | The University of Texas at Austin |
Martín-Martín, Roberto | University of Texas at Austin |
Keywords: Mobile Manipulation, Big Data in Robotics and Automation, Continual Learning
Abstract: To operate at a building scale, service robots must perform long-horizon mobile manipulation tasks by navigating to different rooms, accessing multiple floors, and interacting with a wide and unseen range of everyday objects. We refer to these tasks as Building-wide Mobile Manipulation. To tackle these inherently long-horizon tasks, we introduce BUMBLE, a unified Vision-Language Model (VLM)-based framework integrating open-world RGB-D perception, a wide spectrum of gross-to-fine motor skills, and dual-layered memory. Our extensive evaluation (90+ hours) indicates that BUMBLE outperforms competitive baselines in long-horizon building-wide tasks that require sequencing up to 12 skills, spanning 15 minutes per trial. BUMBLE achieves 47.1% success rate averaged over 70 trials in different buildings, tasks, and scene layouts from various starting locations. Our user study shows 22% higher task satisfaction using our framework compared to state-of-the-art VLM-based mobile manipulation methods. Finally, we show the potential of using increasingly capable foundation models to improve the system performance further. For more information, see https://robin-lab.cs.utexas.edu/BUMBLE/
|
|
11:25-11:30, Paper ThCT1.3 | |
DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation |
|
Liu, Peiqi | New York University |
Guo, Zhanqiu | New York University |
Warke, Mohit | New York University |
Chintala, Soumith | Facebook AI Research |
Paxton, Chris | Meta AI |
Shafiullah, Nur Muhammad (Mahi) | New York University |
Pinto, Lerrel | New York University |
Keywords: Semantic Scene Understanding, Mobile Manipulation, Continual Learning
Abstract: Significant progress has been made in open-vocabulary mobile manipulation, where the goal is for a robot to perform tasks in any environment given a natural language description. However, most current systems assume a static environment, which limits the system’s applicability in real-world scenarios where environments frequently change due to human intervention or the robot’s own actions. In this work, we present DynaMem, a new approach to open-world mobile manipulation that uses a dynamic spatio-semantic memory to represent a robot’s environment. DynaMem constructs a 3D data structure to maintain a dynamic memory of point clouds, and answers open-vocabulary object localization queries using multimodal LLMs or open-vocabulary features generated by state-of-the-art vision-language models. Powered by DynaMem, our robots can explore novel environments, search for objects not found in memory, and continuously update the memory as objects move, appear, or disappear in the scene. We run extensive experiments on the Stretch SE3 robots in three real and nine offline scenes, and achieve an average pick-and-drop success rate of 70% on non-stationary objects, a 3X improvement over state-of-the-art static systems.
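A minimal sketch of a dynamic spatio-semantic point memory in the spirit described above: voxels carry an open-vocabulary feature and a timestamp, newer observations overwrite older ones, voxels observed to be free can be deleted, and localization queries are answered by cosine similarity against a query feature. The voxel hashing scheme and the random feature vectors are placeholders for the actual vision-language features.

```python
import numpy as np

class SpatioSemanticMemory:
    def __init__(self, voxel=0.05):
        self.voxel, self.store = voxel, {}          # key -> (feature, time)

    def _key(self, p):
        return tuple(np.floor(np.asarray(p) / self.voxel).astype(int))

    def add(self, point, feature, t):
        # Newer observations overwrite older ones at the same voxel.
        self.store[self._key(point)] = (feature / np.linalg.norm(feature), t)

    def remove_if_free(self, point):
        # Called when a voxel is observed to be empty (object moved/removed).
        self.store.pop(self._key(point), None)

    def query(self, text_feature):
        # Return the (approximate) location of the best-matching voxel.
        text_feature = text_feature / np.linalg.norm(text_feature)
        best = max(self.store.items(),
                   key=lambda kv: float(kv[1][0] @ text_feature))
        return np.array(best[0]) * self.voxel

mem = SpatioSemanticMemory()
mem.add([1.0, 0.2, 0.5], np.random.randn(512), t=0)
loc = mem.query(np.random.randn(512))
```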
|
|
11:30-11:35, Paper ThCT1.4 | |
Whole-Body Model Predictive Control for Mobile Manipulation with Task Priority Transition |
|
Wang, Yushi | Tsinghua University |
Chen, Ruoqu | Tsinghua University |
Zhao, Mingguo | Tsinghua University |
Keywords: Mobile Manipulation, Whole-Body Motion Planning and Control
Abstract: Mobile manipulators enable a wide range of operations with mobility and advanced manipulation capabilities. Despite their potential, existing approaches typically treat the mobile base and the manipulator separately, thereby limiting the optimality of the system for composite whole-body behaviors. In this work, we present a Whole-Body Model Predictive Control framework for mobile manipulation involving tasks with varying timelines. We integrate task priorities across both the task and time dimensions, providing an inherent ability to transition between priorities with enhanced performance. Our approach improves the trajectory tracking performance by up to 36% in terms of manipulability and reduces the maximum velocity during task priority transitions by 53% compared to the existing approach, while maintaining a low computational cost of 4.3 ms, allowing for high reactivity in real-world applications. We demonstrate its effectiveness through a door-opening and traversing behavior, showcasing the first successful implementation of a non-holonomic mobile manipulator in such a scenario. See https://wbmpc.github.io/ for supplemental materials.
|
|
11:35-11:40, Paper ThCT1.5 | |
Dynamic Object Goal Pushing with Mobile Manipulators through Model-Free Constrained Reinforcement Learning |
|
Dadiotis, Ioannis | Italian Institute of Technology |
Mittal, Mayank | ETH Zurich |
Tsagarakis, Nikos | Istituto Italiano Di Tecnologia |
Hutter, Marco | ETH Zurich |
Keywords: Mobile Manipulation, AI-Enabled Robotics, Deep Learning in Grasping and Manipulation
Abstract: Non-prehensile pushing to move and reorient objects to a goal is a versatile loco-manipulation skill. In the real world, the object's physical properties and friction with the floor contain significant uncertainties, which makes the task challenging for a mobile manipulator. In this paper, we develop a learning-based controller for a mobile manipulator to move an unknown object to a desired position and yaw orientation through a sequence of pushing actions. The proposed controller for the robotic arm and the mobile base motion is trained using a constrained Reinforcement Learning (RL) formulation. We demonstrate its capability in experiments with a quadrupedal robot equipped with an arm. The learned policy achieves a success rate of 91.35% in simulation and at least 80% on hardware in challenging scenarios. Through our extensive hardware experiments, we show that the approach demonstrates high robustness against unknown objects of different masses, materials, sizes, and shapes. It reactively discovers the pushing location and direction, thus achieving contact-rich behavior while observing only the pose of the object. Additionally, we demonstrate the adaptive behavior of the learned policy towards preventing the object from toppling.
|
|
11:40-11:45, Paper ThCT1.6 | |
Door-To-Door Parcel Delivery from Supply Point to Users Home with Heterogeneous Robot Team: EuROBIN First Year Robotics Hackathon |
|
Suarez, Alejandro | University of Seville |
Kartmann, Rainer | Karlsruhe Institute of Technology (KIT) |
Leidner, Daniel | German Aerospace Center (DLR) |
Rossini, Luca | Istituto Italiano Di Tecnologia |
Huber, Johann | ISIR, Sorbonne Université |
Azevedo, Carlos | Instituto Superior Técnico - Institute for Systems and Robotics |
Rouxel, Quentin | INRIA |
Bjelonic, Marko | ETH Zurich |
Gonzalez-Morgado, Antonio | Universidad De Sevilla |
Dreher, Christian R. G. | Karlsruhe Institute of Technology (KIT) |
Schmaus, Peter | German Aerospace Center (DLR) |
Laurenzi, Arturo | Istituto Italiano Di Tecnologia |
Hélénon, François | Sorbonne Université |
Serra, Rodrigo | Institute for Systems and Robotics / Instituto Superior Técnico |
Rochel, Olivier | INRIA Institut National De Recherche En Sciences Et Technologies |
Wellhausen, Lorenz | ETH Zürich |
Perez Sanchez, Vicente | University of Seville, GRVC |
Gao, Jianfeng | Karlsruhe Institute of Technology (KIT) |
Bauer, Adrian Simon | German Aerospace Center (DLR) |
De Luca, Alessio | Istituto Italiano Di Tecnologia |
Abrini, Mouad | Sorbonne University |
Bettencourt, Rui | Institute for Systems and Robotics / Instituto Superior Técnico |
Mouret, Jean-Baptiste | Inria |
Lee, Joonho | Neuromeka |
Viana Servan, Pablo | GRVC |
Pohl, Christoph | Karlsruhe Institute of Technology (KIT) |
Batti, Nesrine | German Aerospace Center (DLR) |
Vedelago, Diego | Fondazione Istituto Italiano Di Tecnologia (IIT) |
Guda, Vamsi Krishna | Sorbonne University |
Carlos, Alvarez Cia | Universidad De Sevilla |
Reister, Fabian | Karlsruhe Institute of Technology (KIT) |
Friedl, Werner | German Aerospace Center (DLR) |
Burchielli, Corrado | Fondazione Istituto Italiano Di Tecnologia (IIT) |
Baudry, Aline | CNRS |
Peller-Konrad, Fabian | Karlsruhe Institute of Technology (KIT) |
Gumpert, Thomas | German Aerospace Center (DLR) |
Muratore, Luca | Istituto Italiano Di Tecnologia |
Gauthier, Philippe | Sorbonne Université |
Schedl-Warpup, Rebecca | German Aerospace Center (DLR) |
Hutter, Marco | ETH Zurich |
Ivaldi, Serena | INRIA |
Lima, Pedro U. | Instituto Superior Técnico - Institute for Systems and Robotics |
Doncieux, Stéphane | Sorbonne University |
Tsagarakis, Nikos | Istituto Italiano Di Tecnologia |
Asfour, Tamim | Karlsruhe Institute of Technology (KIT) |
Ollero, Anibal | AICIA. G41099946 |
Albu-Schäffer, Alin | DLR - German Aerospace Center |
Keywords: Cooperating Robots, Mobile Manipulation, Service Robotics
Abstract: Logistics and service operations involving parcel preparation, delivery, and unpacking from a supply point to the user's home could be carried out completely by robots in the near future, taking advantage of the capabilities of the different robot morphologies in logistics, outdoor, and domestic environments. The use of robots for parcel delivery can contribute to the goals of sustainability and reduced emissions by exploiting the different locomotion modalities (wheeled, legged, and aerial). This paper reports the development and results obtained from the first robotics hackathon held as part of the European Robotics and Artificial Intelligence Network (euROBIN), involving eight robotic platforms in three domains: 1) an industrial robotic arm for parcel preparation at the supply point, 2) a Centauro robot, a dual-arm aerial manipulator, and a wheeled-legged quadruped for parcel transportation, and 3) two humanoid robots and two commercial mobile manipulators for parcel delivery and unpacking in domestic scenarios. The paper describes the joint operation and the evaluation scenario, the features and capabilities of the robots, particularly those involved in the realization of the tasks, and the lessons learned.
|
|
ThCT2 |
301 |
Bio-Inspired Robot Learning |
Regular Session |
Chair: Tucker, Maegan | Georgia Institute of Technology |
Co-Chair: Krichmar, Jeffrey | University of California, Irvine |
|
11:15-11:20, Paper ThCT2.1 | |
HSRL: A Hierarchical Control System Based on Spiking Deep Reinforcement Learning for Robot Navigation |
|
Yang, Bo | Zhejiang University |
Zhou, Shibo | Zhejiang Lab |
Lin, Chaohui | Zhejiang Lab |
Chai, Qingao | Zhejiang University |
Yan, Rui | Zhejiang University of Technology |
Ma, De | Zhejiang University |
Pan, Gang | Zhejiang University |
Tang, Huajin | Zhejiang University, China |
Keywords: Bioinspired Robot Learning, Reinforcement Learning, Motion Control
Abstract: Reinforcement Learning (RL) has shown promise in robotic navigation tasks, yet applying it to real-world environments remains challenging due to dynamic complexities and the need for dynamically feasible actions. We propose a hierarchical control framework based on Spiking Deep Reinforcement Learning (SDRL) for robust robot navigation in real environments. Our approach utilizes a two-layer architecture: a high-level decision layer powered by a Spiking GRU network for handling partially observable environments, and a low-level executive layer employing Continuous Attractor Neural Networks (CANNs) to ensure precise and continuous actions. This hierarchical structure allows real-time decision-making that respects the physical constraints of the robot. Experimental results show that our method adapts effectively to new environments without fine-tuning and surpasses existing methods in performance. We also explore the implementation on the Darwin3 chip, paving the way for biologically inspired motion control in future robotic applications.
|
|
11:20-11:25, Paper ThCT2.2 | |
Materials Matter: Investigating Functional Advantages of Bio-Inspired Materials Via Simulated Robotic Hopping |
|
Schulz, Andrew | Max Planck Institute for Intelligent Systems |
Ahmad, Ayah | Georgia Institute of Technology |
Tucker, Maegan | Georgia Institute of Technology |
Keywords: Methods and Tools for Robot System Design, Biologically-Inspired Robots, Simulation and Animation
Abstract: In contrast with the diversity of materials found in nature, most robots are designed with some combination of aluminum, stainless steel, and 3D-printed filament. Additionally, robotic systems are typically assumed to follow basic rigid-body dynamics. However, several examples in nature illustrate how changes in physical material properties yield functional advantages. In this paper, we explore how physical materials (non-rigid bodies) affect the functional performance of a hopping robot. In doing so, we address the practical question of how to model and simulate material properties. Through these simulations we demonstrate that material gradients in the leg of a single-limb hopper provide functional advantages compared to homogeneous designs. For example, when considering incline ramp hopping, a material gradient with increasing density provides a 35% reduction in tracking error and a 23% reduction in power consumption compared to homogeneous stainless steel. By applying bio-inspiration to the rigid limbs of a robotic system, we seek to show that future robot fabrication should leverage the material anisotropies of moduli and density found in nature. This would reduce vibrations in the system and offset joint torques while protecting structural integrity against fatigue and wear. This simulation system could inspire intelligent material gradients in future custom-fabricated robotic locomotion devices.
|
|
11:25-11:30, Paper ThCT2.3 | |
SHIRE: Enhancing Sample Efficiency Using Human Intuition in REinforcement Learning |
|
Joshi, Amogh | Purdue University |
Kosta, Adarsh Kumar | Purdue University |
Roy, Kaushik | Purdue University |
Keywords: Reinforcement Learning, Bioinspired Robot Learning, Probabilistic Inference
Abstract: The ability of neural networks to perform robotic perception and control tasks such as depth and optical flow estimation, simultaneous localization and mapping (SLAM), and automatic control has led to their widespread adoption in recent years. Deep Reinforcement Learning (DeepRL) has been used extensively in these settings, as it does not have the unsustainable training costs associated with supervised learning. However, DeepRL suffers from poor sample efficiency, i.e., it requires a large number of environmental interactions to converge to an acceptable solution. Modern RL algorithms such as Deep Q Learning and Soft Actor-Critic attempt to remedy this shortcoming but cannot provide the explainability required in applications such as autonomous robotics. Humans intuitively understand the long-time-horizon sequential tasks common in robotics. Properly using such intuition can make RL policies more explainable while enhancing their sample efficiency. In this work, we propose SHIRE, a novel framework for encoding human intuition using Probabilistic Graphical Models (PGMs) and using it in the Deep RL training pipeline to enhance sample efficiency. Our framework achieves 25–78% sample efficiency gains across the environments we evaluate, at negligible overhead cost. Additionally, by teaching RL agents the encoded elementary behavior, SHIRE enhances policy explainability. A real-world demonstration further highlights the efficacy of policies trained using our framework.
|
|
11:30-11:35, Paper ThCT2.4 | |
Hyperdimensional Computing-Based Federated Learning in Mobile Robots through Synthetic Oversampling |
|
Lee, Hyunsei | DGIST |
Han, WoongJae | DGIST |
Kim, Hojeong | DGIST |
Kwon, Hyukjun | DGIST |
Jang, Shinhyoung | DGIST |
Suh, Il Hong | Hanyang University |
Kim, Yeseong | DGIST |
Keywords: Bioinspired Robot Learning, Networked Robots, Learning from Demonstration
Abstract: Traditional federated learning frameworks, often reliant on deep neural networks, face challenges related to computational demands and privacy risks. In this paper, we present a novel Hyperdimensional (HD) Computing-based federated learning framework designed for resource-constrained mobile robots. Unlike other HD-based learning approaches, ours introduces dynamic encoding, which improves both model accuracy and privacy by continuously updating hypervector representations. To further address the issue of imbalanced data, especially prevalent in robotics tasks, we propose a hypervector oversampling technique, enhancing model robustness. Extensive evaluations on LiDAR-equipped mobile robots demonstrate that our oversampling method outperforms state-of-the-art HD computing frameworks, achieving up to a 22.9% increase in accuracy while maintaining computational efficiency and privacy protection.
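The hypervector oversampling idea can be illustrated with a few lines of NumPy: samples are encoded into bipolar hypervectors by a random projection, a minority class is balanced by adding slightly perturbed copies of its hypervectors, and a class prototype is formed by bundling. The dimensionality, the projection encoder, and the bit-flip rate are illustrative choices, not the paper's exact settings.

```python
import numpy as np

D, F = 10_000, 32                                  # hypervector and feature dims
rng = np.random.default_rng(0)
proj = rng.standard_normal((F, D))

def encode(x):
    # Random-projection encoding into a bipolar hypervector.
    return np.sign(x @ proj)

def oversample(hvs, target_count, flip=0.02):
    # Balance a minority class by adding slightly perturbed copies.
    out = list(hvs)
    while len(out) < target_count:
        hv = hvs[rng.integers(len(hvs))].copy()
        idx = rng.random(D) < flip                 # flip a small fraction of bits
        hv[idx] *= -1
        out.append(hv)
    return out

def prototype(hvs):
    # Bundling: element-wise majority of the class's hypervectors.
    return np.sign(np.sum(hvs, axis=0))

minority = [encode(rng.standard_normal(F)) for _ in range(5)]
balanced = oversample(minority, target_count=50)
class_hv = prototype(balanced)
```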
|
|
11:35-11:40, Paper ThCT2.5 | |
Brain-Inspired Spatial Continuous State Encoding for Efficient Spiking-Based Navigation |
|
Chai, Qingao | Zhejiang University |
Wang, Jiashuo | Zhejiang University |
Jiang, Runhao | Zhejiang University |
Yang, Bo | Zhejiang University |
Yan, Rui | Zhejiang University of Technology |
Tang, Huajin | Zhejiang University, China |
Keywords: Bioinspired Robot Learning, Reinforcement Learning, Cognitive Modeling
Abstract: Spiking neural networks (SNNs) show great potential in mapless navigation tasks due to their low power consumption, but the continuous representation of spatial information poses a challenge to SNN training. Neuroscience findings reveal that spatial cognition cells encode spatial information through population spike patterns. Inspired by this, we propose a navigation method based on SNNs, leveraging spatial cognition cells, which include grid cells (GCs), head direction cells (HDCs), and boundary vector cells (BVCs). Our method integrates spike-based information to achieve precise navigation goal encoding and egocentric environment perception, significantly improving SNN navigation capabilities in complex environments. Simulation and real-world experiments demonstrate that our method achieves significant improvements in navigation success rate and energy efficiency, showcasing superior adaptability across environments. Our work provides a novel approach to developing efficient brain-inspired navigation systems.
|
|
11:40-11:45, Paper ThCT2.6 | |
A Rapid Adapting and Continual Learning Spiking Neural Network Path Planning Algorithm for Mobile Robots |
|
Espino, Harrison | University of California, Irvine |
Bain, Robert | University of California Irvine |
Krichmar, Jeffrey | University of California, Irvine |
Keywords: Neurorobotics, Learning from Experience, Motion and Path Planning
Abstract: Mapping traversal costs in an environment and planning paths based on this map are important for autonomous navigation. We present a neurorobotic navigation system that utilizes a Spiking Neural Network (SNN) Wavefront Planner and E-prop learning to concurrently map and plan paths in a large and complex environment. We incorporate a novel method for mapping which, when combined with the Spiking Wavefront Planner (SWP), allows for adaptive planning by selectively considering any combination of costs. The system is tested on a mobile robot platform in an outdoor environment with obstacles and varying terrain. Results indicate that the system is capable of discerning features in the environment using three measures of cost, (1) energy expenditure by the wheels, (2) time spent in the presence of obstacles, and (3) terrain slope. In just twelve hours of online training, E-prop learns and incorporates traversal costs into the path planning maps by updating the delays in the SWP. On simulated paths, the SWP plans significantly shorter and lower cost paths than A* and RRT*. The SWP is compatible with neuromorphic hardware and could be used for applications requiring low size, weight, and power.
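The planner can be pictured as wavefront propagation on a cost grid in which each cell's traversal cost acts as a delay (the role played by the learned delays in the spiking implementation); a path is then read out by descending arrival times from the start. The grid, the costs, and the 4-connectivity below are toy assumptions for illustration.

```python
import heapq
import numpy as np

def wavefront(cost, goal):
    # Propagate an arrival-time field from the goal; higher cost = longer delay.
    arrival = np.full(cost.shape, np.inf)
    arrival[goal] = 0.0
    pq = [(0.0, goal)]
    while pq:
        t, (r, c) = heapq.heappop(pq)
        if t > arrival[r, c]:
            continue
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < cost.shape[0] and 0 <= nc < cost.shape[1]:
                nt = t + cost[nr, nc]
                if nt < arrival[nr, nc]:
                    arrival[nr, nc] = nt
                    heapq.heappush(pq, (nt, (nr, nc)))
    return arrival

def extract_path(arrival, start):
    # Follow decreasing arrival times from the start back to the goal.
    path, cur = [start], start
    while arrival[cur] > 0:
        r, c = cur
        nbrs = [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= r + dr < arrival.shape[0] and 0 <= c + dc < arrival.shape[1]]
        cur = min(nbrs, key=lambda p: arrival[p])
        path.append(cur)
    return path

cost = np.ones((20, 20))
cost[5:15, 10] = 50.0                     # a high-cost wall to route around
path = extract_path(wavefront(cost, goal=(19, 19)), start=(0, 0))
```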
|
|
ThCT3 |
303 |
Space Robotics 1 |
Regular Session |
Chair: Naclerio, Nicholas | University of California, Santa Barbara |
Co-Chair: Beksi, William J. | The University of Texas at Arlington |
|
11:15-11:20, Paper ThCT3.1 | |
LuVo: Lunar Visual Odometry Using Homography-Based Image Feature Matching |
|
Soussan, Ryan | Aerodyne Industries |
McCaffery, John | KBR, Inc |
McMichael, Scott | NASA Ames Research Center, KBR Inc |
Deans, Matthew | NASA Ames Research Center |
Keywords: Space Robotics and Automation, Vision-Based Navigation
Abstract: We present LuVo, an initialization-free stereo visual odometry (VO) method developed for the VIPER lunar rover. We provide a novel stereo registration method using LightGlue image feature matching in a warped, locally planar space that improves matching robustness to larger baseline stereo sequences and repetitive terrain that traditionally challenge odometry approaches. We additionally introduce methods that increase the usable image region for matching by estimating a horizon cutoff in image space and enhance robustness to stereo correspondence failures using a Manhattan distance search for valid stereo points during cloud alignment. We evaluate the performance of LuVo on a dataset of 155 simulated lunar stereo sequences and show that it significantly improves registration accuracy and success rates for clouds separated by both expected driving ranges below eight meters and longer distance translations of up to 16 meters. While LuVo is developed for VIPER, it can be used in other environments featuring slip-prone and repetitive terrain that limit rover travel.
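The sketch below illustrates matching in a homography-warped, locally planar space: the right image is warped by an assumed ground-plane homography, features are matched in the warped pair, and matches are mapped back through the inverse homography. ORB with brute-force matching stands in here for the LightGlue matcher named above, and the homography is assumed to be given.

```python
import cv2
import numpy as np

def warped_matches(img_left, img_right, H_ground):
    # Warp the right image into the locally planar space of the left image.
    h, w = img_left.shape[:2]
    right_warped = cv2.warpPerspective(img_right, H_ground, (w, h))

    # Detect and match features in the warped pair (ORB as a stand-in matcher).
    orb = cv2.ORB_create(2000)
    kpl, dl = orb.detectAndCompute(img_left, None)
    kpr, dr = orb.detectAndCompute(right_warped, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(dl, dr)

    pts_l = np.float32([kpl[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    pts_r = np.float32([kpr[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # Undo the warp so correspondences refer to the original right image.
    pts_r = cv2.perspectiveTransform(pts_r, np.linalg.inv(H_ground))
    return pts_l, pts_r
```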
|
|
11:20-11:25, Paper ThCT3.2 | |
Instance Segmentation-Based Hazard Detection with Lunar South Pole Lighting |
|
Cloud, Joseph | NASA Kennedy Space Center |
Buckles, Bradley | NASA Kennedy Space Center |
Muller, Thomas | Bennett Aerospace, NASA Kennedy Space Center |
Beksi, William J. | The University of Texas at Arlington |
Schuler, Jason | NASA Kennedy Space Center |
Keywords: Space Robotics and Automation, Mining Robotics, Object Detection, Segmentation and Categorization
Abstract: This paper addresses rock hazard detection for in-situ resource utilization (ISRU) robotic navigation in the challenging visual environment of the lunar south pole (LSP). We evaluate three state-of-the-art instance segmentation models—Mask R-CNN, YOLOv8, and SAM—using a novel, synthetically generated dataset that simulates LSP-specific illumination challenges at sun angles of 2.5°, 5°, and 7.5°. Additionally, we evaluate these approaches in both up-sun and down-sun driving under low solar-angle light. This study highlights the potential of deep learning-based approaches for improving ISRU operations by reliably identifying visual surface hazards, such as rocks, which may impede robotic navigation and excavation in future lunar missions.
|
|
11:25-11:30, Paper ThCT3.3 | |
Resettable Land Anchor Launcher for Unmanned Rover Rescue and Slope Climbing |
|
Kainth, Aaryan | University of California Santa Barbara |
Krohn, Andrew R. | University of California Santa Barbara |
Johnson, Kyle A. | NASA Glenn Research Center |
Schepelmann, Alexander | Carnegie Mellon University |
Hawkes, Elliot Wright | University of California, Santa Barbara |
Naclerio, Nicholas | University of California, Santa Barbara |
Keywords: Space Robotics and Automation, Mechanism Design
Abstract: Unmanned planetary rovers have traversed kilometers of Lunar and Martian terrain while performing valuable science. However, they still face mobility challenges including steep slopes and unstable soil that can entrap vehicles, as demonstrated by NASA’s Spirit rover. Vehicles on Earth can depend on a human operator or rescue vehicle to tow them out of an entrapment, but remote rovers cannot, limiting their route to highly conservative path selections. To increase rover mobility on slopes and unstable soils, we present a resettable anchor launcher for independent self-rescue. The device launches a tethered land anchor away from the rover and then uses a winch to tow the rover up a hill or out of an entrapment. This paper presents the design of the launcher and its integration into a half-meter-long rover mobility platform with field testing at the NASA Glenn Research Center SLOPE Lab. We demonstrate repeatable launching and winching to help the rover climb a 17° slope of loose GRC-1 Lunar regolith simulant that it otherwise could not climb. Our work presents an alternative method to increase rover mobility, especially up slopes, and enables independent rover rescue, which could eventually increase mission duration and reduce risk of entrapment during extraterrestrial exploration.
|
|
11:30-11:35, Paper ThCT3.4 | |
SOF-E: An Energy Efficient Robot for Collaborative Transport and Placement of Mechanical Meta-Material Modules |
|
Moon, Inchul | Seoul National University |
Sebastianelli, Frank | NASA |
Gregg, Christine | NASA Ames Research Center |
Cheung, Kenneth C. | National Aeronautics and Space Administration (NASA) |
Keywords: Space Robotics and Automation, Robotics and Automation in Construction, Cooperating Robots
Abstract: In-space assembly is a key capability to enable construction of large-scale structures required for sustained human presence in space. Robotic assembly is critical to reduce required crew time and risk, while modularity ensures that solutions are versatile and adaptive to complex mission concepts. NASA’s Automated Reconfigurable Mission Adaptive Digital Assembly Systems (ARMADAS) project demonstrated that robots with relatively low cost, size, and degrees-of-freedom (DoFs) can be used for large-scale modular lattice structure assembly. This is possible by using the structural modules for robotic systems metrology and error mitigation. Robots with reduced complexity may lead to advantages in initial and maintenance cost, offering an alternative to large, complex, and expensive robots. In this paper, we describe the Structure Omni-directional Foldable Explorer (SOF-E), a robot with significantly lower mass and DoF compared to the previous ARMADAS robot architecture. Although SOF-E is a five DoF robot with only two or three control states per actuator, it is capable of transporting and placing structural modules by collaborating with other instances of itself. We discuss the mechanical design and architecture of SOF-E, including analysis of energy usage during each operation. Experiments demonstrate that during locomotion and module transport tasks, SOF-E requires significantly lower energy than the previous cargo transport robot architecture, the Scaling Omni-directional Lattice Locomoting Explorer (SOLL-E). The cost of transport metric is used to compare the energy efficiency of the operation.
|
|
11:35-11:40, Paper ThCT3.5 | |
Quarry-Bot: A Reconfigurable Cable-Suspended Robot for Lunar Site Engineering |
|
Castrejon, Zahir | University of Nevada Las Vegas |
Oh, Paul Y. | University of Nevada, Las Vegas (UNLV) |
Keywords: Field Robots, Robotics and Automation in Agriculture and Forestry, Space Robotics and Automation
Abstract: This paper introduces Quarry-Bot, a Reconfigurable Cable-Suspended Robot developed to support the NASA Artemis program’s efforts in preparing for the long-term colonization of the Moon and Mars. Quarry-Bot autonomously clears debris on the lunar surface, a key step in site preparation for future habitats and infrastructure. The system utilizes active control strategies, combined with the Moon’s lower gravity, to perform underhand rock tosses as a scalable approach to extraterrestrial site preparation. Its reconfigurable structure, including motorized anchor points and a lightweight tripod design, adjusts cable tensions to generate swing motions for debris displacement. The system is driven by two Dynamixel MX-106 motors for movement and steering, along with a NEMA 17 stepper motor for cable adjustments. A decentralized control system, managed by Raspberry Pi units, coordinates these components. Simulations and experiments conducted under both Earth and lunar gravity conditions demonstrate the effectiveness of Linear Quadratic Regulator (LQR) and Model Predictive Control (MPC) strategies in achieving rock throws. Quarry-Bot reaches swing angles and projects rocks over distances that may support lunar site clearing and overall engineering purposes. The paper concludes by discussing potential areas for further system refinement, including adjustments for different terrain conditions and improved actuation strategies for lunar missions.
|
|
11:40-11:45, Paper ThCT3.6 | |
A Tugging Controller That Maximizes Lateral Resistive Force by Mounding Sandy Terrain |
|
Moon, Deaho | University of California Berkeley |
Huang, Chris | University of California Berkeley |
Page, Justin | UC Berkeley Mechanical Engineering |
Stuart, Hannah | UC Berkeley |
Keywords: Space Robotics and Automation, Field Robots, Sensor-based Control
Abstract: Sandy environments present challenges for robotic space rovers and systems due to reduced traction, limiting mobility and tugging force. This paper presents an anchoring method that utilizes a winching system to create a sand mound in front of a mobile agent dragged through the media. The proposed controller is designed to consistently achieve real-time capture of close-to-maximal lateral sand mound resistive force, even when applied to varied uneven terrains, like holes or waves. Notably, tugging is non-reversible, so suitable peaks should be captured before breakdown and without necessarily knowing the global optimum a priori. The controller logic tracks both tugging force and agent pitch gradients to detect terrain conditions and peak force trends. Results show that the controller captures an average 92% of the maximum forces, within the previously winched workspace tested, across three different granular media with four varying structured terrain features. The controller achieves higher resistive force peaks on terrains with geometric features, as opposed to flat sand. We conclude that sand mounding through tugging is a viable means to generate robotic resistive forces for unknown sandy terrains, a simple yet effective anchoring mechanism.
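One plausible reading of the peak-capture logic is sketched below: keep winching while the smoothed tugging force rises or the agent is still pitching up onto the mound, and latch (stop) once the force gradient turns negative after a meaningful force has built up, so a near-peak value is captured before breakdown. The smoothing window and the thresholds are assumptions, not the paper's tuned values.

```python
import numpy as np

def capture_step(force_hist, pitch_hist, window=5, min_force=20.0):
    # Decide whether to keep winching or latch the current (near-peak) force.
    if len(force_hist) < window + 1:
        return "winch"
    f = np.convolve(force_hist, np.ones(window) / window, mode="valid")
    df = f[-1] - f[-2]                        # smoothed force gradient
    dpitch = pitch_hist[-1] - pitch_hist[-window]
    rising_mound = dpitch > 0                 # agent still pitching up onto the mound
    if f[-1] > min_force and df < 0 and not rising_mound:
        return "stop"                         # force gradient turned negative: capture
    return "winch"
```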
|
|
ThCT4 |
304 |
Image and 3D Segmentation 2 |
Regular Session |
Chair: Burgard, Wolfram | University of Technology Nuremberg |
Co-Chair: Marron, Pedro Jose | University of Duisburg-Essen |
|
11:15-11:20, Paper ThCT4.1 | |
RMSeg-UDA: Unsupervised Domain Adaptation for Road Marking Segmentation under Adverse Conditions |
|
Cai, Yi-Chang | National Chung Cheng University |
Hsiao, Heng Chih | National Chung Cheng University |
Chiu, Wei-Chen | National Chiao Tung University |
Lin, Huei-Yung | National Taipei University of Technology |
Chiao-Tung, Chan | Mechanical and Mechatronics Systems Research Laboratories, Indus |
Keywords: Computer Vision for Transportation, Vision-Based Navigation, Intelligent Transportation Systems
Abstract: The segmentation of road markings plays a crucial role in visual perception for the autonomous driving system. It enables vehicles to recognize road markings at the pixel level, and facilitates subsequent path planning, localization, and map construction tasks. Current techniques mainly focus on normal driving scenes (i.e., clear daytime), and performance decreases significantly under adverse weather conditions. This work proposes RMSeg-UDA: an unsupervised domain adaptive road marking segmentation framework. By combining schedule self-training and class-conditioned adversarial training, the network utilizes both labeled normal data and unlabeled data from other domains to train a road marking segmentation model. For the evaluation on adverse conditions, a new image dataset, RLMD-AC, is established with rainy and nighttime driving scenes. The experiments conducted using both public and our datasets have demonstrated the effectiveness of the proposed technique. Code and dataset are available at https://github.com/stu9113611/RMSeg-UDA.
|
|
11:20-11:25, Paper ThCT4.2 | |
Enhancing the Utilization of Color Information in Point Cloud Semantic Segmentation |
|
Guo, Xinyu | Wuhan University |
Gao, Zhi | Temasek Laboratories @ NUS |
Zhou, Zhiyu | Wuhan University |
Wang, Jingshi | 1.School of Aeronautics and Astronautics, Shanghai Jiao Tong Uni |
Tang, Luliang | Wuhan University |
Cao, Min | Wuhan Guanggu Zoyon Science and Technology Company Ltd., Wuhan 4 |
Keywords: Recognition, RGB-D Perception, Sensor Fusion
Abstract: Point cloud semantic segmentation is crucial in various applications such as autonomous driving, robotics, and virtual reality, aiming to assign labels to each point in a cloud to reflect spatial relationships and boundaries. While previous methods primarily focus on geometric features, they often overlook the auxiliary role of color information, especially in scenes where geometric structures are less distinct. In this paper, we propose the Color Point Cloud Enhancement (CPCE) method to effectively leverage color information for improved 3D scene understanding. CPCE introduces a color information enhancement module with multi-scale consistency, enriching point features throughout the encoder stages. Additionally, we develop a novel contrastive learning module that uses relative color coordinates for point cloud serialization, allowing for the capture of positive and negative samples from distant points with similar color textures. Furthermore, we design a contrastive learning module tailored for scenes with weak geometric structures, enhancing feature representation through color-augmented contrast. Our method achieved a 78.1% mIoU on the ScanNet dataset, outperforming existing models trained on a single dataset. These results highlight the effectiveness of CPCE in scenarios where traditional methods struggle, particularly in enhancing segmentation accuracy by utilizing color as a critical feature.
|
|
11:25-11:30, Paper ThCT4.3 | |
UltraFastCrackSeg: A Lightweight Real-Time Crack Segmentation Model with Task-Oriented Pretraining |
|
Qi, Weiqing | HKUST |
Zhao, Guoyang | HKUST(GZ) |
Ma, Fulong | The Hong Kong University of Science and Technology |
Liu, Ming | Hong Kong University of Science and Technology (Guangzhou) |
Yang, Yang | Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Computer Vision for Automation, Object Detection, Segmentation and Categorization, Robotics in Under-Resourced Settings
Abstract: Crack segmentation is pivotal for structural health monitoring, enabling the timely maintenance of critical infrastructure such as bridges and roads. However, existing deep learning models are often too computationally intensive for deployment on resource-constrained devices. To address this limitation, we introduce UltraFastCrackSeg, a lightweight model designed for real-time crack segmentation that effectively balances high accuracy with low computational demands. Featuring an efficient encoder-decoder architecture, our model significantly reduces parameter count and floating point operations (FLOPs) compared to current methods. We further enhance performance through a self-supervised pretraining approach that employs a novel, task-oriented masking strategy, thereby improving feature extraction. Experiments across multiple datasets demonstrate that UltraFastCrackSeg achieves state-of-the-art Intersection over Union (IoU) and F1 scores while maintaining a compact model size and high inference speed. Evaluations on a low-power CPU device confirm its capability to achieve up to 80 frames per second (FPS) with ONNX runtime optimization, making it highly suitable for real-time, on-site applications. These findings establish UltraFastCrackSeg as a robust and efficient solution for practical crack detection tasks.
|
|
11:30-11:35, Paper ThCT4.4 | |
Enhancing 3D Scene Graphs with Real-Time Room Classification |
|
Janzon, Simon | University of Duisburg Essen |
Medina Sanchez, Carlos | Duisburg Essen University |
Golkowski, Alexander Julian | University of Duisburg-Essen |
Handte, Marcus | University of Duisburg-Essen |
Marron, Pedro Jose | University of Duisburg-Essen |
Keywords: Semantic Scene Understanding, Software Architecture for Robotic and Automation
Abstract: In recent years, 3D scene graphs have become a critical tool in robotics and computer vision for enabling systems to understand both the geometric and semantic aspects of their surroundings. These data structures represent spatial and semantic relationships between objects in a three-dimensional environment, supporting tasks like navigation, object manipulation, and scene understanding. This paper presents a real-time pipeline for 3D scene graph generation that offers flexibility in image segmentation techniques while incorporating room classification that is based on a Random Forest model. Our work enables robots to dynamically update their understanding of complex and large-scale environments in real-time. We evaluate our approach systematically on a dataset and in a real-life experiment. The results demonstrate the capability of running our solution at over 10 Hz on an Nvidia Jetson AGX Orin SoC while also scaling favorably in larger environments. Our proposed room classification approach predicts classes with an average accuracy of 80%.
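As a rough illustration of the room-classification idea summarized above (not the authors' implementation), the following sketch trains a scikit-learn Random Forest on hypothetical per-room object-count features and re-classifies a room node as new detections arrive; the feature layout, counts, and labels are invented for illustration.
```python
# Minimal illustrative sketch (not the paper's code): classify rooms from
# object-occurrence features with a Random Forest, as a 3D scene graph
# pipeline might do after aggregating detections per room node.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features: counts of detected object classes per room,
# e.g. [bed, sofa, oven, desk, toilet].
X_train = np.array([
    [2, 0, 0, 1, 0],   # bedroom
    [0, 2, 0, 0, 0],   # living room
    [0, 0, 1, 0, 0],   # kitchen
    [0, 0, 0, 0, 1],   # bathroom
])
y_train = ["bedroom", "living room", "kitchen", "bathroom"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# At runtime, a room node's aggregated object counts can be re-classified
# whenever new detections arrive, keeping the scene graph up to date.
new_room = np.array([[1, 0, 0, 2, 0]])
print(clf.predict(new_room), clf.predict_proba(new_room))
```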
|
|
11:35-11:40, Paper ThCT4.5 | |
MFSeg: Efficient Multi-Frame 3D Semantic Segmentation |
|
Huang, Chengjie | University of Waterloo |
Czarnecki, Krzysztof | University of Waterloo |
Keywords: Deep Learning for Visual Perception
Abstract: We propose MFSeg, an efficient multi-frame 3D semantic segmentation framework. By aggregating point cloud sequences at the feature level and regularizing the feature extraction and aggregation process, MFSeg reduces computational overhead while maintaining high accuracy. Moreover, by employing a lightweight MLP-based point decoder, our method eliminates the need to upsample redundant points from past frames. Experiments on the nuScenes and Waymo datasets show that MFSeg outperforms existing methods, demonstrating its effectiveness and efficiency.
|
|
11:40-11:45, Paper ThCT4.6 | |
A Good Foundation Is Worth Many Labels: Label-Efficient Panoptic Segmentation |
|
Vödisch, Niclas | University of Freiburg |
Petek, Kürsat | University of Freiburg |
Käppeler, Markus | University of Freiburg |
Valada, Abhinav | University of Freiburg |
Burgard, Wolfram | University of Technology Nuremberg |
Keywords: Semantic Scene Understanding, Deep Learning for Visual Perception, Robotics and Automation in Agriculture and Forestry
Abstract: A key challenge for the widespread application of learning-based models for robotic perception is to significantly reduce the required amount of annotated training data while achieving accurate predictions. This is essential not only to decrease operating costs but also to speed up deployment time. In this work, we address this challenge for PAnoptic SegmenTation with fEw Labels (PASTEL) by exploiting the groundwork paved by visual foundation models. We leverage descriptive image features from such a model to train two lightweight network heads for semantic segmentation and object boundary detection, using very few annotated training samples. We then merge their predictions via a novel fusion module that yields panoptic maps based on normalized cut. To further enhance the performance, we utilize self-training on unlabeled images selected by a feature-driven similarity scheme. We underline the relevance of our approach by employing PASTEL to important robot perception use cases from autonomous driving and agricultural robotics. In extensive experiments, we demonstrate that PASTEL significantly outperforms previous methods for label-efficient segmentation even when using fewer annotations. The code of our work is publicly available at https://pastel.cs.uni-freiburg.de.
|
|
ThCT5 |
305 |
Explainable AI in Robotics |
Regular Session |
Chair: Chernova, Sonia | Georgia Institute of Technology |
Co-Chair: Feil-Seifer, David | University of Nevada, Reno |
|
11:15-11:20, Paper ThCT5.1 | |
CE-MRS: Contrastive Explanations for Multi-Robot Systems |
|
Schneider, Ethan | Georgia Institute of Technology |
Wu, Daniel | Georgia Institute of Technology |
Das, Devleena | Georgia Institute of Technology |
Chernova, Sonia | Georgia Institute of Technology |
Keywords: Design and Human Factors, Human Factors and Human-in-the-Loop, Multi-Robot Systems
Abstract: As the complexity of multi-robot systems grows to incorporate a greater number of robots, more complex tasks, and longer time horizons, the solutions to such problems often become too complex to be fully intelligible to human users. In this work, we introduce an approach for generating natural language explanations that justify the validity of the system's solution to the user, or else aid the user in correcting any errors that led to a suboptimal system solution. Toward this goal, we first contribute a generalizable formalism of contrastive explanations for multi-robot systems, and then introduce a holistic approach to generating contrastive explanations for multi-robot scenarios that selectively incorporates data from multi-robot task allocation, scheduling, and motion-planning to explain system behavior. Through user studies with human operators, we demonstrate that our integrated contrastive explanation approach leads to significant improvements in users' ability to identify and solve system errors, which in turn significantly improves overall multi-robot team performance.
|
|
11:20-11:25, Paper ThCT5.2 | |
Affordance-Based Explanations of Robot Navigation |
|
Halilovic, Amar | Ulm University |
Krivic, Senka | University of Sarajevo |
Keywords: Social HRI, Human-Centered Robotics
Abstract: This paper introduces affordance-based explanations of robot navigational decisions. The rationale behind affordance-based explanations draws on the theory of affordances, a principle rooted in ecological psychology that describes potential actions the objects in the environment offer to the robot. We demonstrate how affordances can be incorporated into visual and textual explanations for common robot navigation and path-planning scenarios. Furthermore, we formalize and categorize the concept of affordance-based explanations and connect it to existing explanation types in robotics. We present the results of a user study that shows participants to be, on average, highly satisfied with visual-textual, i.e., multimodal, affordance-based explanations of robot navigation. Furthermore, we investigate the complexity of different types of textual affordance-based explanations. Our research contributes to the expanding domain of explainable robotics, focusing on explaining robot actions in navigation.
|
|
11:25-11:30, Paper ThCT5.3 | |
Explainable Reinforcement Learning Via Dynamic Mixture Policies |
|
Schier, Maximilian | Leibniz Universität Hannover |
Schubert, Frederik | Leibniz University Hannover |
Rosenhahn, Bodo | Institute of Information Processing, Leibniz Universität Hannove |
Keywords: Reinforcement Learning, Acceptability and Trust
Abstract: Learning control policies using deep reinforcement learning has shown great success for a variety of applications, including robotics and automated driving. A key factor limiting the adoption of RL in the real world is the lack of trust in the decision-making process of such policies. Therefore, explainability is a requirement of any RL agent operating in the real world. In this work, we propose a family of control policies that are explainable-by-design regarding individual observation components on object-based scene representations. By estimating diagonal squashed Gaussian and categorical mixture distributions on sub-spaces of the decomposed observations, we develop stochastic policies with easy-to-read explanations of the decision-making process. Our design is generally applicable to any RL algorithm using stochastic policies. We showcase the explainability on an extensive suite of single- and multi-agent simulations, set- and sequence-based high-level scenes, and discrete and continuous action spaces, with performance at least on par with, or better than, standard policy architectures. In additional experiments, we analyze the robustness of our approach to its single additional hyper-parameter and examine its potential for very low computational requirements with tiny policies.
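To make the policy design above concrete, here is a minimal, hypothetical sketch of one squashed-Gaussian head acting on a single decomposed observation component; it is not the authors' code, and the network sizes and clamp ranges are assumptions.
```python
# Illustrative sketch only: a per-component policy head that outputs a
# diagonal Gaussian over a sub-action and squashes samples with tanh.
# Combining several such heads over decomposed observation components is
# the idea behind explainable-by-design mixture policies.
import torch
import torch.nn as nn

class SquashedGaussianHead(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mean = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs_component):
        h = self.net(obs_component)
        mean, log_std = self.mean(h), self.log_std(h).clamp(-5, 2)
        dist = torch.distributions.Normal(mean, log_std.exp())
        raw_action = dist.rsample()
        # The per-component mean is a readable summary of what this
        # observation component "wants" the agent to do.
        return torch.tanh(raw_action), mean

head = SquashedGaussianHead(obs_dim=4, act_dim=2)
action, explanation = head(torch.randn(1, 4))
```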
|
|
11:30-11:35, Paper ThCT5.4 | |
3D Spatial Understanding in MLLMs: Disambiguation and Evaluation |
|
Chang, Chun-Peng | DFKI |
Pagani, Alain | German Research Center for Artificial Intelligence |
Stricker, Didier | German Research Center for Artificial Intelligence |
Keywords: Multi-Modal Perception for HRI, Deep Learning for Visual Perception, Visual Learning
Abstract: Multimodal Large Language Models (MLLMs) have made significant progress in tasks such as image captioning and question answering. However, while these models can generate realistic captions, they often struggle with providing precise instructions, particularly when it comes to localizing and disambiguating objects in complex 3D environments. This capability is critical as MLLMs become more integrated with collaborative robotic systems. In scenarios where a target object is surrounded by similar objects (distractors), robots must deliver clear, spatially-aware instructions to guide humans effectively. We refer to this challenge as contextual object localization and disambiguation, which imposes stricter constraints than conventional 3D dense captioning, especially in ensuring target exclusivity. In response, we propose simple yet effective techniques to enhance the model's ability to localize and disambiguate target objects. Our approach not only achieves state-of-the-art performance on conventional metrics that evaluate sentence similarity, but also demonstrates improved 3D spatial understanding through a 3D visual grounding model.
|
|
11:35-11:40, Paper ThCT5.5 | |
Towards Transparent Multi-Agent Autonomous Systems through Principled Multi-Source Knowledge Distillation |
|
Zhongzheng, Guo | Chinese Academy of Military Science |
Chaoran, Wang | Zhejiang University |
Zhu, Xiaozhou | Chinese Academy of Military Science |
Changju, Wu | Zhejiang University |
Deng, Baosong | Academy of Military Science |
Yao, Wen | Chinese Academy of Military Science |
Keywords: AI-Based Methods, Behavior-Based Systems, Path Planning for Multiple Mobile Robots or Agents
Abstract: Many real-world robotic applications can be formulated as Multi-Agent Path-Finding (MAPF) problems and approximated using Multi-Agent Reinforcement Learning (MARL) algorithms. However, the opaque nature of the black-box neural network models employed by MARL algorithms has impeded their widespread adoption due to concerns over interpretability, debugging, and user trust. To address these limitations, we propose an interpretable MAPF framework that emulates a group of n path-finding agents optimized through reinforcement learning (RL) using behavior trees (BTs), where n is the number of agents in path-finding scenarios. Expert behavior datasets consisting of state-action trajectories from MARL algorithms are generated, and a knowledge distillation approach is employed to reduce the size of the datasets and extract implicit rules. Additionally, a principled rules factorization technique based on Boolean algebra theory is utilized to prune the behavior rules and create more compact BT representations. The proposed framework is evaluated on randomly generated MAPF scenarios and demonstrates superior performance compared to conventional BT generation methods. This paper advances the field of interpretable AI by enabling the extraction of understandable decision-making processes from complex reinforcement learning models in multi-agent systems.
|
|
11:40-11:45, Paper ThCT5.6 | |
Through the Clutter: Exploring the Impact of Complex Environments on the Legibility of Robot Motion |
|
Schmidt-Wolf, Melanie | University of Nevada, Reno |
Becker, Tyler J | University of Nevada, Reno |
Oliva, Denielle | University of Nevada, Reno |
Nicolescu, Monica | University of Nevada, Reno |
Feil-Seifer, David | University of Nevada, Reno |
Keywords: Intention Recognition, Human-Robot Collaboration, Social HRI
Abstract: The environments in which the collaboration of a robot would be the most helpful to a person are frequently uncontrolled and cluttered with many objects present. Legible robot arm motion is crucial in tasks like these in order to avoid possible collisions, improve the workflow and help ensure the safety of the person. Prior work in this area, however, focuses on solutions tested only in uncluttered environments, and few results have been reported for cluttered environments. In this research we present a measure of clutteredness based on an entropic measure of the environment, and a novel motion planner based on potential fields. Both our measure and the planner were tested in a cluttered environment meant to represent a more typical tool-sorting task for which the person would collaborate with a robot. The in-person validation study with Baxter robots shows a significant improvement in legibility of our proposed legible motion planner compared to the current state-of-the-art legible motion planner in cluttered environments. Further, the results show a significant difference in the performance of the planners in cluttered and uncluttered environments, and the need to further explore legible motion in cluttered environments. We argue that the inconsistency of our results in cluttered environments with those obtained from uncluttered environments points out several important issues with the current research performed in the area of legible motion planners.
|
|
ThCT6 |
307 |
Perception for Manipulation 2 |
Regular Session |
Chair: Chrysostomou, Dimitrios | Aalborg University |
Co-Chair: Grotz, Markus | University of Washington (UW) |
|
11:15-11:20, Paper ThCT6.1 | |
OpenSU3D: Open World 3D Scene Understanding Using Foundation Models |
|
Mohiuddin, Rafay | Technical University of Munich |
Prakhya, Sai Manoj | Huawei Technologies Deutschland GmbH |
Collins, Fiona | TUM |
Liu, Ziyuan | Huawei Group |
Borrmann, Andre | Technical University of Munich |
Keywords: Semantic Scene Understanding, Object Detection, Segmentation and Categorization, RGB-D Perception
Abstract: In this paper, we present a novel, scalable approach for constructing open set, instance-level 3D scene representations, advancing open world understanding of 3D environments. Existing methods require pre-constructed 3D scenes, face scalability issues due to per-point feature representation, and additionally struggle with contextual queries. Our method overcomes these limitations by incrementally building instance-level 3D scene representations using 2D foundation models, and efficiently aggregating instance-level details such as masks, feature vectors, names, and captions. We introduce fusion schemes for feature vectors to enhance their contextual knowledge and performance on complex queries. Additionally, we explore large language models for robust automatic annotation and spatial reasoning tasks. We evaluate our proposed approach on multiple scenes from the ScanNet and Replica datasets, demonstrating zero-shot generalization capabilities and exceeding current state-of-the-art methods in open world 3D scene understanding. Project page: https://opensu3d.github.io/
|
|
11:20-11:25, Paper ThCT6.2 | |
Task-Aware Semantic Map: Autonomous Robot Task Assignment Beyond Commands |
|
Choi, Daewon | Hanyang University |
Lee, Ho Sung | Hanyang Univ |
Hwang, Soeun | Hanyang University |
Oh, Yoonseon | Hanyang University |
Keywords: Semantic Scene Understanding, Task Planning, Perception-Action Coupling
Abstract: With recent advancements in Large Language Models, task planning methods that interpret human commands have garnered significant attention. However, as home robots become more common, specifying every daily task could become impractical. This paper introduces a novel semantic map called the Task-Aware Semantic Map (TASMap), which enables robots to autonomously assign and propose necessary tasks in a scene without explicit human commands. The core innovation of this approach is the ability of TASMap to comprehend the context of objects within a scene and autonomously generate task proposals. This capability significantly advances autonomous robotic assistance, reducing the dependency on specific commands and enhancing interaction with environments. We present two key applications of TASMap: contextual task proposal and spatial task proposal. Our results, verified across 35 diverse and realistically disordered scenes, underscore the effectiveness of TASMap in both simulation and real-world environments.
|
|
11:25-11:30, Paper ThCT6.3 | |
High-Quality Unknown Object Instance Segmentation Via Quadruple Boundary Error Refinement |
|
Back, Seunghyeok | Korea Institute of Machinery & Materials |
Lee, Sangbeom | Gwangju Institute of Science and Technology |
Kim, Kangmin | Gwangju Institute of Science and Technology |
Lee, Joosoon | Gwangju Institute of Science and Technology |
Shin, Sungho | Hyundai Motors |
Maeng, Jemo | Gwangju Institute of Science and Technology(GIST) |
Lee, Kyoobin | Gwangju Institute of Science and Technology |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, Perception for Grasping and Manipulation
Abstract: Accurate and efficient segmentation of unknown objects in unstructured environments is essential for robotic manipulation. Unknown Object Instance Segmentation (UOIS), which aims to identify all objects in unknown categories and backgrounds, has become a key capability for various robotic tasks. However, existing methods struggle with over-segmentation and under-segmentation, leading to failures in manipulation tasks such as grasping. To address these challenges, we propose QuBER (Quadruple Boundary Error Refinement), a novel error-informed refinement approach for high-quality UOIS. QuBER first estimates quadruple boundary errors—true positive, true negative, false positive, and false negative pixels—at the instance boundaries of the initial segmentation. It then refines the segmentation using an error-guided fusion mechanism, effectively correcting both fine-grained and instance-level segmentation errors. Extensive evaluations on three public benchmarks demonstrate that QuBER outperforms state-of-the-art methods and consistently improves various UOIS methods while maintaining a fast inference time of less than 0.1 seconds. Furthermore, we show that QuBER improves the success rate of grasping target objects in cluttered environments. Code and supplementary materials are available at https://sites.google.com/view/uois-quber.
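The notion of quadruple boundary errors can be illustrated with a small sketch (not QuBER's actual pipeline): given binary predicted and ground-truth masks, pixels in a band around the instance boundaries are labeled as TP/TN/FP/FN, the kind of signal an error-guided refinement stage could consume. The masks and band width below are invented for illustration.
```python
# Toy illustration of quadruple boundary errors on binary masks.
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

def boundary(mask):
    # Pixels on the mask border: mask minus its erosion.
    return mask & ~binary_erosion(mask)

pred = np.zeros((8, 8), bool); pred[2:6, 2:6] = True   # hypothetical prediction
gt   = np.zeros((8, 8), bool); gt[3:7, 2:6] = True     # hypothetical ground truth

# Restrict the analysis to a band around predicted and true boundaries.
band = binary_dilation(boundary(pred) | boundary(gt), iterations=1)
tp = band & pred & gt
tn = band & ~pred & ~gt
fp = band & pred & ~gt
fn = band & ~pred & gt

# These four maps describe where and how the initial segmentation errs,
# which an error-guided fusion module could use to correct boundaries.
print(int(fp.sum()), int(fn.sum()))
```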
|
|
11:30-11:35, Paper ThCT6.4 | |
Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph |
|
Linok, Sergey | MIPT |
Zemskova, Tatiana | AIRI, MIPT |
Ladanova, Svetlana | MIPT |
Titkov, Roman | Moscow Institute of Physics and Technology |
Yudin, Dmitry | Moscow Institute of Physics and Technology |
Monastyrny, Maxim | Sberbank of Russia |
Valenkov, Aleksei | Sberbank of Russia |
Keywords: Semantic Scene Understanding, Object Detection, Segmentation and Categorization, RGB-D Perception
Abstract: Locating objects described in natural language presents a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object grounding with simple (bare) queries, but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs a 3D scene graph representation with metric and semantic edges and utilizes a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to construct a 3D object-centric map and an advanced raycasting algorithm with a 2D vision-language model to describe them as graph nodes. On the Replica and ScanNet datasets, we have demonstrated that BBQ takes a leading place in open-vocabulary 3D semantic segmentation compared to other zero-shot methods. Also, we show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On the challenging Sr3D+, Nr3D and ScanRefer benchmarks, our deductive approach demonstrates a significant improvement, enabling object grounding via complex queries compared to other state-of-the-art methods. The combination of our design choices and software implementation has resulted in significant data processing speed in experiments on the robot's on-board computer. This promising performance enables the application of our approach in intelligent robotics projects. We made the code publicly available at linukc.github.io/BeyondBareQueries.
|
|
11:35-11:40, Paper ThCT6.5 | |
A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space |
|
He, Yonghao | D-Robotics |
Su, Hu | Institute of Automation, Chinese Academy of Science |
Yu, Haiyong | D-Robotics |
Yang, Cong | Soochow University |
Sui, Wei | Soochow University |
Wang, Cong | D-Robotics |
Liu, Song | ShanghaiTech University |
Keywords: Object Detection, Segmentation and Categorization, Semantic Scene Understanding, Deep Learning for Visual Perception
Abstract: Open-set object detection (OSOD) is highly desirable for robotic manipulation in unstructured environments. However, existing OSOD methods often fail to meet the requirements of robotic applications due to their high computational burden and complex deployment. To address this issue, this paper proposes a light-weight framework called Decoupled OSOD (DOSOD), which is a practical and highly efficient solution for supporting real-time OSOD tasks in robotic systems. Specifically, DOSOD builds upon the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor is developed to transform text embeddings extracted by the VLM into a joint space, within which the detector learns the region representations of class-agnostic proposals. Cross-modality features are directly aligned in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. DOSOD operates like a traditional closed-set detector during the testing phase.
|
|
11:40-11:45, Paper ThCT6.6 | |
LBSNet: Lightweight Joint Boundary Detection and Semantic Segmentation for Transparent and Reflective Objects |
|
Tong, Ling | Southeast University |
Qian, Kun | Southeast University |
Jing, Xingshuo | Southeast University |
Keywords: Deep Learning for Visual Perception, Semantic Scene Understanding, Computer Vision for Automation
Abstract: Accurate visual detection of transparent and reflective objects remains a challenging issue for mobile manipulators. For the most common depth cameras and LiDAR sensors, the distinctive optical attributes inherent in both transparent and reflective objects pose a significant challenge. To address this problem, this study proposes a lightweight joint boundary detection and semantic segmentation network named LBSNet. LBSNet aims to enhance the perception of transparent and reflective objects in complex and dynamic environments, using RGB images only. It leverages the synergy between boundary detection and semantic segmentation through feature fusion and a multitask learning mechanism. The encoder consists of two paths: one captures category-aware semantic information, while the other discerns boundaries from multi-scale features. The gated channel adaptive (GCA) module enhances boundary features by learning channel parameters. The dynamic adaptive feature fusion (DAFF) module dynamically adjusts semantic and boundary information through cross-feature fusion. These methods effectively capture the distinctive characteristics of transparent and reflective objects, such as light refraction, boundary blurring and low contrast. Experimental results show that LBSNet achieves higher accuracy and faster processing speed on multiple public datasets compared with existing methods. Moreover, its lightweight design makes it suitable for resource-constrained mobile manipulators.
|
|
ThCT7 |
309 |
Marine Robotics 6 |
Regular Session |
Chair: Johnson-Roberson, Matthew | Carnegie Mellon University |
Co-Chair: Roznere, Monika | Binghamton University |
|
11:15-11:20, Paper ThCT7.1 | |
Stonefish: Supporting Machine Learning Research in Marine Robotics |
|
Grimaldi, Michele | University of Girona |
Cieslak, Patryk | Universitat De Girona |
Ochoa, Eduardo | Universitat De Girona |
Bharti, Vibhav | Heriot-Watt University |
Rajani, Hayat | University of Girona |
Carlucho, Ignacio | University of Edinburgh |
Koskinopoulou, Maria | Heriot-Watt University |
Petillot, Yvan R. | Heriot-Watt University |
Gracias, Nuno | University of Girona |
Keywords: Marine Robotics, Simulation and Animation
Abstract: Simulations are highly valuable in marine robotics, offering a cost-effective and controlled environment for testing in the challenging conditions of underwater and surface operations. Given the high costs and logistical difficulties of real-world trials, simulators capable of capturing the operational conditions of subsea environments have become key in developing and refining remotely-operated and autonomous underwater vehicles. This paper highlights recent enhancements to the Stonefish simulator, an advanced open-source platform supporting development and testing of marine robotics solutions. Key updates include a suite of additional sensors, such as an event-based camera, a thermal camera, and an optical flow camera, as well as visual light communication, support for tethered operations, improved thruster modelling, more flexible hydrodynamics, and enhanced sonar accuracy. These developments and an automated annotation tool significantly bolster Stonefish’s role in marine robotics research, especially in the field of deep learning, where training data with a known ground truth is hard or impossible to collect. https://github.com/patrykcieslak/stonefish
|
|
11:20-11:25, Paper ThCT7.2 | |
Sea-U-Whale: A Reconfigurable Marine Robot with Multi-Modal Motion |
|
Ding, Wendi | The Chinese University of Hong Kong |
Zhao, Zuoquan | The Chinese University of Hong Kong |
Yan, Ruixin | The Chinese University of Hong Kong |
Gao, Songqun | The Chinese University of Hong Kong |
Guo, Zixuan | The Chinese University of Hong Kong |
Liu, Xuchen | The Chinese University of Hong Kong |
Chen, Ben M. | Chinese University of Hong Kong |
Keywords: Marine Robotics, Actuation and Joint Mechanisms, Product Design, Development and Prototyping
Abstract: As marine exploration becomes increasingly important, marine robots have been extensively studied in recent years. Although some well-designed robots have already accomplished various missions successfully, most existing robots struggle to adapt to diverse demands or tasks due to their fixed structures and the complexity of the marine environment. To address these challenges, we present a novel reconfigurable marine robot named Sea-U-Whale. This system can dynamically adjust its actuator configuration in the marine environment, providing superior environmental adaptability, maneuverability, and versatile mobility. Considering the demands of unmanned ocean exploration, an active reconfiguration mechanism and three distinct vehicle modes are designed for optimal actuation in various marine scenarios. The multi-modal mobility of our system and its robust performance have been validated through extensive field tests and water tank experiments, demonstrating its potential in handling a wide range of mission profiles.
|
|
11:25-11:30, Paper ThCT7.3 | |
MERLION: Marine ExploRation with Language guIded Online iNformative Visual Sampling and Enhancement |
|
Thengane, Shrutika | Singapore University of Technology and Design |
Prasetyo, Marcel Bartholomeus | Singapore University of Technology and Design |
Tan, Yu Xiang | Singapore University of Technology and Design |
Meghjani, Malika | Singapore University of Technology and Design |
Keywords: Marine Robotics, Environment Monitoring and Management, Computer Vision for Automation
Abstract: Autonomous and targeted underwater visual monitoring and exploration using Autonomous Underwater Vehicles (AUVs) can be a challenging task due to both online and offline constraints. The online constraints comprise limited onboard storage capacity and communication bandwidth to the surface, whereas the offline constraints entail the time and effort required for the selection of desired keyframes from the video data. An example use case of targeted underwater visual monitoring is finding the most interesting visual frames of fish in a long sequence of an AUV's visual experience. This challenge of targeted informative sampling is further aggravated in murky waters with poor visibility. In this paper, we present MERLION, a novel framework that provides semantically aligned and visually enhanced summaries for murky underwater marine environment monitoring and exploration. Specifically, our framework integrates (a) an image-text model for semantically aligning the visual samples to the user's needs, (b) an image enhancement model for murky water visual data and (c) an informative sampler for summarizing the monitoring experience. We validate our proposed MERLION framework on real-world data with user studies and present qualitative and quantitative results using our evaluation metric and show improved results compared to the state-of-the-art approaches. The code is available at https://github.com/MARVL-Lab/MERLION.git
|
|
11:30-11:35, Paper ThCT7.4 | |
PoLaRIS Dataset: A Maritime Object Detection and Tracking Dataset in Pohang Canal |
|
Choi, Jiwon | Inha University |
Cho, Dongjin | Inha University |
Lee, Gihyeon | Inha University |
Kim, Hogyun | Inha University |
Yang, Geonmo | Inha University |
Kim, Joowan | Samsung Heavy Industries |
Cho, Younggun | Inha University |
Keywords: Marine Robotics, Data Sets for Robotic Vision, Sensor Fusion
Abstract: Maritime environments often present hazardous situations due to factors such as moving ships or buoys, which become obstacles under the influence of waves. In such challenging conditions, the ability to detect and track potentially hazardous objects is critical for the safe navigation of marine robots, but datasets capturing these scenarios remain limited. To address this limitation, we introduce a new multi-modal dataset that includes image and point-wise annotations of maritime obstacles. Our dataset provides detailed ground truth for obstacle detection and tracking, including objects as small as 10×10 pixels, which are crucial for maritime safety. To validate the dataset’s effectiveness as a reliable benchmark, we conducted evaluations using various methodologies, including state-of-the-art (SOTA) techniques for object detection and tracking. These evaluations are expected to contribute to improving performance, particularly in complex maritime environments. This represents the first demonstration of a dataset offering multi-modal annotations specifically tailored to maritime environments. Our dataset is available at https://github.com/sparolab/PoLaRIS.
|
|
11:35-11:40, Paper ThCT7.5 | |
Confidence-Aware Object Capture for a Manipulator Subject to Floating-Base Disturbances |
|
Xu, Ruoyu | The Chinese University of Hong Kong, Shenzhen |
Jiang, Zixing | The Chinese University of Hong Kong |
Liu, Beibei | The Chinese University of Hongkong, Shenzhen |
Wang, Yuquan | Tencent |
Qian, Huihuan (Alex) | The Chinese University of Hong Kong, Shenzhen |
Keywords: Marine Robotics, Field Robots, Robotics in Hazardous Fields, Floating-Base Manipulator
Abstract: Capturing stationary aerial objects on unmanned surface vehicles (USVs) is challenging due to quasiperiodic and fast floating-base motions caused by wave-induced disturbances. It is hard to (1) maintain high motion prediction accuracy due to the stochastic nature of these disturbances and (2) perform object capture through real-time tracking due to the limited active torque. We introduce confidence analysis in predictive capture. To address prediction inaccuracies, we calculate a real-time confidence tube to evaluate the prediction quality. To overcome tracking difficulties, we plan a trajectory to capture the object at a future moment while maximizing the confidence of the capture position on the predicted trajectory. All calculations are completed within 0.2 seconds to ensure a timely response. We validate our approach through experiments, where we simulate disturbances by executing real USV motions using a servo platform. The results demonstrate that our method achieves an 80% success rate.
|
|
11:40-11:45, Paper ThCT7.6 | |
RecGS: Removing Water Caustic with Recurrent Gaussian Splatting |
|
Zhang, Tianyi | Carnegie Mellon University |
Zhi, Weiming | Carnegie Mellon University |
Meyers, Braden | Brigham Young University |
Durrant, Sterling Nelson | Brigham Young University |
Huang, Kaining | Carnegie Mellon University |
Mangelson, Joshua | Brigham Young University |
Barbalata, Corina | Louisiana State University |
Johnson-Roberson, Matthew | Carnegie Mellon University |
Keywords: Marine Robotics, Deep Learning for Visual Perception
Abstract: Water caustics are commonly observed in seafloor imaging data from shallow-water areas. Traditional methods that remove caustic patterns from images often rely on 2D filtering or pre-training on an annotated dataset, hindering the performance when generalizing to real-world seafloor data with 3D structures. In this paper, we present a novel method Recurrent Gaussian Splatting (RecGS), which takes advantage of today’s photorealistic 3D reconstruction technology, 3D Gaussian Splatting (3DGS), to separate caustics from seafloor imagery. With a sequence of images taken by an underwater robot, we build 3DGS recurrently and decompose the caustic with low-pass filtering in each iteration. In the experiments, we analyze and compare with different methods, including joint optimization, 2D filtering, and deep learning approaches. The results show that our proposed RecGS paradigm can effectively separate the caustic from the seafloor, improving the visual appearance, and can be potentially applied on more problems with inconsistent illumination.
|
|
ThCT8 |
311 |
Aerial Robots: Learning 2 |
Regular Session |
Chair: Robuffo Giordano, Paolo | Irisa Cnrs Umr6074 |
Co-Chair: Shim, David Hyunchul | KAIST |
|
11:15-11:20, Paper ThCT8.1 | |
Learning to Fly in Seconds |
|
Eschmann, Jonas | New York University |
Albani, Dario | Technology Innovation Institute |
Loianno, Giuseppe | New York University |
Keywords: Aerial Systems: Applications, Machine Learning for Robot Control, Reinforcement Learning
Abstract: Learning-based methods, particularly Reinforcement Learning (RL), hold great promise for streamlining deployment, enhancing performance, and achieving generalization in the control of autonomous multirotor aerial vehicles. Deep RL has been able to control complex systems with impressive fidelity and agility in simulation, but the simulation-to-reality transfer often brings a hard-to-bridge reality gap. Moreover, RL is commonly plagued by prohibitively long training times. In this work, we propose a novel asymmetric actor-critic-based architecture coupled with a highly reliable RL-based training paradigm for end-to-end quadrotor control. We show how curriculum learning and a highly optimized simulator reduce sample complexity and lead to fast training times. To precisely discuss the challenges related to low-level/end-to-end multirotor control, we also introduce a taxonomy that classifies the existing levels of control abstractions as well as non-linearities and domain parameters. Our framework enables Simulation-to-Reality (Sim2Real) transfer for direct RPM control after only 18 seconds of training on a consumer-grade laptop as well as its deployment on microcontrollers to control a multirotor under real-time guarantees. Finally, our solution exhibits competitive performance in trajectory tracking, as demonstrated through various experimental comparisons with existing state-of-the-art control solutions using a real Crazyflie nano quadrotor. We open source the code including a very fast multirotor dynamics simulator that can simulate about 5 months of flight per second on a laptop GPU. The fast training times and deployment to a cheap, off-the-shelf quadrotor lower the barriers to entry and help democratize the research and development of these systems.
|
|
11:20-11:25, Paper ThCT8.2 | |
Multi-UAVs End-To-End Distributed Trajectory Generation Over Point Cloud Data |
|
Marino, Antonio | University of Rennes |
Pacchierotti, Claudio | Centre National De La Recherche Scientifique (CNRS) |
Robuffo Giordano, Paolo | Irisa Cnrs Umr6074 |
Keywords: Aerial Systems: Mechanics and Control, Multi-Robot Systems, Deep Learning Methods
Abstract: This paper introduces an end-to-end trajectory planning algorithm tailored for multi-UAV systems that generates collision-free trajectories in environments populated with both static and dynamic obstacles, leveraging point cloud data. Our approach consists of a 2-branch neural network fed with sensing and localization data, able to communicate intermediate learned features among the agents. One network branch crafts an initial collision-free trajectory estimate, while the other devises a neural collision constraint for subsequent optimization, ensuring trajectory continuity and adherence to physical actuation limits. Extensive simulations in challenging cluttered environments, involving up to 25 robots and 25% obstacle density, show a collision avoidance success rate in the range of 85% to 100%. Finally, we introduce a saliency map computation method acting on the point cloud data, offering qualitative insights into our methodology.
|
|
11:25-11:30, Paper ThCT8.3 | |
Lightweight yet High-Performance Defect Detector for UAV-Based Large-Scale Infrastructure Real-Time Inspection |
|
Zhao, Benyun | The Chinese University of Hong Kong |
Duan, Qigeng | The Chinese University of Hong Kong |
Yang, Guidong | The Chinese University of Hong Kong |
Tang, Haoyun (Jerry) | UC Berkeley |
Song, Zhenbo | Nanjing University of Science and Technology |
Wen, Junjie | The Chinese University of Hong Kong |
Liu, Xuchen | The Chinese University of Hong Kong |
Li, Qingxiang | The Chineses University of Hong Kong |
Lei, Lei | City University of Hong Kong |
Zhang, Jihan | Chinese University of Hong Kong |
Chen, Xi | The Chinese University of Hong Kong |
Mueller, Mark Wilfried | University of California, Berkeley |
Chen, Ben M. | Chinese University of Hong Kong |
Keywords: Aerial Systems: Applications, Aerial Systems: Perception and Autonomy, Deep Learning Methods
Abstract: Defect diagnosis in urban infrastructure is crucial for public safety. Traditional manual inspections face significant challenges in terms of accuracy and cost-effectiveness. In this paper, we propose a lightweight and hardware-friendly large-scale infrastructure detector, CUPID, highly suitable for unmanned aerial vehicles (UAVs). Given the significant challenges in automatically detecting defects of varying intensity and size within complex infrastructure, along with the tendency of lightweight models to lose detail and fail to fully capture features during the defect extraction process, we propose the CUPID_Block, a multi-level information fusion block to construct the backbone, featuring the CUPID_Conv module equipped with our proposed CCA (CrissCross Attention). Furthermore, CUPID features an auxiliary training branch that assimilates lower feature maps, helping to recover details lost in deeper convolutional layers. To verify the effectiveness of CUPID and to address the lack of a suitable dataset in the community, we establish a multi-scenario infrastructure defect dataset, CUBIT2024, to conduct extensive experiments. Finally, to assess the efficiency and adaptability of CUPID on UAVs for online infrastructure inspection, we design a compact autonomous drone, CU-Astro, where the proposed CUPID is deployed on the onboard Jetson Orin NX computer to evaluate the speed and power consumption of the inference.
|
|
11:30-11:35, Paper ThCT8.4 | |
ProxFly: Robust Control for Close Proximity Quadcopter Flight Via Residual Reinforcement Learning |
|
Zhang, Ruiqi | University of California, Berkeley |
Zhang, Dingqi | University of California, Berkeley |
Mueller, Mark Wilfried | University of California, Berkeley |
Keywords: Reinforcement Learning, Aerial Systems: Mechanics and Control, Robust/Adaptive Control
Abstract: This paper proposes ProxFly, a residual deep Reinforcement Learning (RL)-based controller for close proximity quadcopter flight. Specifically, we design a residual module on top of a cascaded controller (denoted as the basic controller) to generate high-level control commands, which compensate for external disturbances and thrust loss caused by downwash effects from other quadcopters. First, our method takes only the ego state and controllers' commands as inputs and does not rely on any communication between quadcopters, thereby reducing the bandwidth requirement. Through domain randomization, our method relaxes the requirement for accurate system identification and fine-tuned controller parameters, allowing it to adapt to changing system models. Meanwhile, our method not only reduces the proportion of unexplainable signals from the black box in control commands but also enables the RL training to skip the time-consuming exploration from scratch via guidance from the basic controller. We validate the effectiveness of the residual module in simulation with different proximities. Moreover, we conduct real close-proximity flight tests to compare ProxFly with the basic controller and an advanced model-based controller with complex aerodynamic compensation. Finally, we show that ProxFly can be used for challenging quadcopter mid-air docking, where two quadcopters fly in extreme proximity, and strong airflow significantly disrupts flight. However, our method can stabilize the quadcopter in this case and accomplish docking. The resources are available at https://github.com/ruiqizhang99/ProxFly.
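A minimal conceptual sketch of the residual-control idea (not ProxFly itself) is shown below: a bounded learned correction is added on top of a basic cascaded controller's command; the gains, residual limit, and placeholder policy are all assumptions made for illustration.
```python
# Conceptual sketch of residual control on top of a basic controller.
import numpy as np

def basic_controller(state, setpoint):
    # Placeholder cascaded controller: PD on position error (hypothetical gains).
    kp, kd = 4.0, 2.0
    return kp * (setpoint - state["pos"]) - kd * state["vel"]

def residual_policy(state, base_cmd):
    # Stand-in for the learned network; in practice a trained RL policy
    # compensating e.g. downwash-induced thrust loss.
    return np.zeros_like(base_cmd)

def proximity_control(state, setpoint, residual_limit=0.3):
    base_cmd = basic_controller(state, setpoint)
    residual = np.clip(residual_policy(state, base_cmd),
                       -residual_limit, residual_limit)
    # The residual stays a small, bounded share of the total command.
    return base_cmd + residual

cmd = proximity_control({"pos": np.zeros(3), "vel": np.zeros(3)},
                        np.array([0.0, 0.0, 1.0]))
```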
|
|
11:35-11:40, Paper ThCT8.5 | |
TempFuser: Learning Agile, Tactical, and Acrobatic Flight Maneuvers Using a Long Short-Term Temporal Fusion Transformer |
|
Seong, Hyunki | KAIST |
Shim, David Hyunchul | KAIST |
Keywords: Aerial Systems: Applications, Reinforcement Learning, Machine Learning for Robot Control
Abstract: Dogfighting is a challenging scenario in aerial applications that requires a comprehensive understanding of both strategic maneuvers and the aerodynamics of agile aircraft. The aerial agent needs to not only understand tactically evolving maneuvers of fighter jets from a long-term perspective but also react to rapidly changing aerodynamics of aircraft from a short-term viewpoint. In this paper, we introduce TempFuser, a novel long short-term temporal fusion transformer architecture that can learn agile, tactical, and acrobatic flight maneuvers in complex dogfight problems. Our approach integrates two distinct temporal transition embeddings into a transformer-based network to comprehensively capture both the long-term tactics and short-term agility of aerial agents. By incorporating these perspectives, our policy network generates end-to-end flight commands that secure dominant positions over the long term and effectively outmaneuver agile opponents. After training in a high-fidelity flight simulator, our model successfully learns to execute strategic maneuvers, outperforming baseline policy models against various types of opponent aircraft. Notably, our model exhibits human-like acrobatic maneuvers even when facing adversaries with superior specifications, all without relying on prior knowledge. Moreover, it demonstrates robust pursuit performance in challenging supersonic and low-altitude situations. Demo videos are available at https://sites.google.com/view/tempfuser.
|
|
11:40-11:45, Paper ThCT8.6 | |
Modular Reinforcement Learning for a Quadrotor UAV with Decoupled Yaw Control |
|
Yu, Beomyeol | The George Washington University |
Lee, Taeyoung | George Washington University |
Keywords: Aerial Systems: Mechanics and Control, Reinforcement Learning, AI-Enabled Robotics
Abstract: This paper presents modular reinforcement learning (RL) frameworks for the low-level control of a quadrotor, enabling direct control of yawing motion. While traditional monolithic RL approaches have demonstrated success in real-world autonomous flight, they often struggle to precisely control both the translational and yawing motions due to their distinct dynamic characteristics and strong coupling. Moreover, training a large-scale monolithic network typically demands a wealth of training data for broad generalization. To address these issues, we decompose the quadrotor dynamics into translational and yawing subsystems and assign dedicated modular RL agents for each. This design significantly improves performance, as each RL agent is trained for its specific purpose, and they are integrated in a synergistic way. It further enhances robustness, as potential failures within one module have minimal impact on the other, promoting fault tolerance. These improvements are illustrated by flight experiments achieved via zero-shot sim-to-real transfer. It is shown that the proposed modular policies substantially enhance training efficiency, tracking performance, and adaptability to real-world conditions.
|
|
ThCT9 |
312 |
Task and Motion Planning 2 |
Regular Session |
Chair: Pappas, George J. | University of Pennsylvania |
Co-Chair: Ashur, Stav | University of Illinois |
|
11:15-11:20, Paper ThCT9.1 | |
HBTP: Heuristic Behavior Tree Planning with Large Language Model Reasoning |
|
Cai, Yishuai | National University of Defense Technology |
Chen, Xinglin | National University of Defense Technology |
Mao, Yunxin | National University of Defense Technology |
Li, Minglong | National University of Defense Technology |
Yang, Shaowu | National University of Defense Technology |
Yang, Wenjing | State Key Laboratory of High Performance Computing (HPCL), Schoo |
Wang, Ji | National University of Defense Technology |
Keywords: AI-Enabled Robotics
Abstract: Behavior Trees (BTs) are increasingly becoming a popular control structure in robotics due to their modularity, reactivity, and robustness. In terms of BT generation methods, BT planning shows promise for generating reliable BTs. However, the scalability of BT planning is often constrained by prolonged planning times in complex scenarios, largely due to a lack of domain knowledge. In contrast, pre-trained Large Language Models (LLMs) have demonstrated task reasoning capabilities across various domains, though the correctness and safety of their planning remain uncertain. This paper proposes integrating BT planning with LLM reasoning, introducing Heuristic Behavior Tree Planning (HBTP)—a reliable and efficient framework for BT generation. The key idea in HBTP is to leverage LLMs for task-specific reasoning to generate a heuristic path, which BT planning can then follow to expand efficiently. We first introduce the heuristic BT expansion process, along with two heuristic variants designed for optimal planning and satisficing planning, respectively. Then, we propose methods to address the inaccuracies of LLM reasoning, including action space pruning and reflective feedback, to further enhance both reasoning accuracy and planning efficiency. Experiments demonstrate the theoretical bounds of HBTP, and results from four datasets confirm its practical effectiveness in everyday service robot applications.
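The heuristic-path idea can be sketched roughly as follows (not HBTP's implementation): a best-first expansion that discounts actions appearing on an LLM-suggested action sequence, shown on a toy domain with invented states and actions.
```python
# Toy sketch: prefer expanding actions on an LLM-suggested heuristic path.
import heapq

def heuristic_cost(action, llm_path):
    return 0 if action in llm_path else 1   # favour the suggested actions

def plan(start, goal_test, successors, llm_path):
    frontier = [(0, 0, start, [])]
    seen, tie = set(), 0
    while frontier:
        cost, _, state, plan_so_far = heapq.heappop(frontier)
        if goal_test(state):
            return plan_so_far
        if state in seen:
            continue
        seen.add(state)
        for action, nxt in successors(state):
            tie += 1
            heapq.heappush(frontier, (cost + 1 + heuristic_cost(action, llm_path),
                                      tie, nxt, plan_so_far + [action]))
    return None

# Invented toy domain: states are strings, actions move between them.
successors = lambda s: {"start": [("pick", "holding"), ("wander", "lost")],
                        "holding": [("place", "done")],
                        "lost": []}.get(s, [])
print(plan("start", lambda s: s == "done", successors, llm_path={"pick", "place"}))
```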
|
|
11:20-11:25, Paper ThCT9.2 | |
SPINE: Online Semantic Planning for Missions with Incomplete Natural Language Specifications in Unstructured Environments |
|
Ravichandran, Zachary | University of Pennsylvania |
Murali, Varun | University of Pennsylvania |
Tzes, Mariliza | University of Pennsylvania |
Pappas, George J. | University of Pennsylvania |
Kumar, Vijay | University of Pennsylvania |
Keywords: AI-Enabled Robotics, Autonomous Agents, Field Robots
Abstract: As robots become increasingly capable, users will want to describe high-level missions and have robots infer the relevant details. Because pre-built maps are difficult to obtain in many realistic settings, accomplishing such missions will require the robot to map and plan online. While many semantic planning methods operate online, they are typically designed for well-specified missions such as object search or exploration. Recently, Large Language Models (LLMs) have demonstrated powerful contextual reasoning abilities over a range of robotic tasks described in natural language. However, existing LLM-enabled planners typically do not consider online planning or complex missions; rather, relevant subtasks and semantics are provided by a pre-built map or a user. We address these limitations via SPINE, an online planner for missions with incomplete mission specifications provided in natural language. The planner uses an LLM to reason about subtasks implied by the mission specification and then realizes these subtasks in a receding horizon framework. Tasks are automatically validated for safety and refined online with new map observations. We evaluate SPINE in simulation and real-world settings with missions that require multiple steps of semantic reasoning and exploration in cluttered outdoor environments of over 20,000 square meters. Compared to baselines that use existing LLM-enabled planning approaches, our method is over twice as efficient in terms of time and distance, requires fewer user interactions, and does not require a full map. Additional resources are provided at https://zacravichandran.github.io/SPINE.
|
|
11:25-11:30, Paper ThCT9.3 | |
Closed Loop Interactive Embodied Reasoning for Robot Manipulation |
|
Nazarczuk, Michal | Imperial College London |
Behrens, Jan Kristof | Czech Technical University in Prague, CIIRC |
Stepanova, Karla | Czech Technical University |
Hoffmann, Matej | Czech Technical University in Prague, Faculty of Electrical Engi |
Mikolajczyk, Krystian | Imperial College London |
Keywords: AI-Enabled Robotics, Manipulation Planning, Reactive and Sensor-Based Planning
Abstract: Embodied reasoning systems integrate robotic hardware and cognitive processes to perform complex tasks, typically in response to a natural language query about a specific physical environment. This usually involves changing the belief about the scene or physically interacting and changing the scene (e.g., sort the objects from lightest to heaviest). In order to facilitate the development of such systems, we introduce a new modular Closed Loop Interactive Embodied Reasoning (CLIER) approach that takes into account the measurements of non-visual object properties, changes in the scene caused by external disturbances as well as uncertain outcomes of robotic actions. CLIER performs multi-modal reasoning and action planning and generates a sequence of primitive actions that can be executed by a robot manipulator. Our method operates in a closed loop, responding to changes in the environment. Our approach is developed using the MuBle simulation environment and tested in 10 interactive benchmark scenarios. We extensively evaluate our reasoning approach in simulation and in real-world manipulation tasks, achieving success rates above 76% and 64%, respectively.
|
|
11:30-11:35, Paper ThCT9.4 | |
SayComply: Grounding Field Robotic Tasks in Operational Compliance through Retrieval-Based Language Models |
|
Ginting, Muhammad Fadhil | Stanford University |
Kim, Dong Ki | Massachusetts Institute of Tech |
Kim, Sung-Kyun | NASA Jet Propulsion Laboratory, Caltech |
Bandi, Jai Krishna | Field AI |
Kochenderfer, Mykel | Stanford University |
Omidshafiei, Shayegan | Massachusetts Institute of Technology |
Agha-mohammadi, Ali-akbar | NASA-JPL, Caltech |
Keywords: AI-Enabled Robotics, Field Robots, Task and Motion Planning
Abstract: This paper addresses the problem of task planning for robots that must comply with operational manuals in real-world settings. Task planning under these constraints is essential for enabling autonomous robot operation in domains that require adherence to domain-specific knowledge. Current methods for generating robot goals and plans rely on common sense knowledge encoded in large language models. However, these models lack grounding of robot plans to domain-specific knowledge and are not easily transferable between multiple sites or customers with different compliance needs. In this work, we present SayComply, which enables grounding robotic task planning with operational compliance using retrieval-based language models. We design a hierarchical database of operational, environment, and robot embodiment manuals and procedures to enable efficient retrieval of the relevant context under the limited context length of the LLMs. We then design a task planner using a tree-based retrieval augmented generation (RAG) technique to generate robot tasks that follow user instructions while simultaneously complying with the domain knowledge in the database. We demonstrate the benefits of our approach through simulations and hardware experiments in real-world scenarios that require precise context retrieval across various types of context, outperforming the standard RAG method. Our approach bridges the gap in deploying robots that consistently adhere to operational protocols, offering a scalable and edge-deployable solution for ensuring compliance across varied and complex real-world environments.
|
|
11:35-11:40, Paper ThCT9.5 | |
LiP-LLM: Integrating Linear Programming and Dependency Graph with Large Language Models for Multi-Robot Task Planning |
|
Obata, Kazuma | Osaka University |
Aoki, Tatsuya | Osaka University |
Horii, Takato | Osaka University |
Taniguchi, Tadahiro | Ritsumeikan University |
Nagai, Takayuki | Osaka University |
Keywords: Multi-Robot Systems, Task Planning, Cooperating Robots
Abstract: This study proposes LiP-LLM: integrating linear programming and dependency graph with large language models (LLMs) for multi-robot task planning. In order for multiple robots to perform tasks more efficiently, it is necessary to manage the precedence dependencies between tasks. Although multi-robot decentralized and centralized task planners using LLMs have been proposed, none of these studies focus on precedence dependencies from the perspective of task efficiency or leverage traditional optimization methods. LiP-LLM addresses key challenges in managing dependencies between skills and optimizing task allocation. LiP-LLM consists of three steps: skill list generation and dependency graph generation by LLMs, and task allocation using linear programming. The LLMs are utilized to generate a comprehensive list of skills and to construct a dependency graph that maps the relationships and sequential constraints among these skills. To ensure the feasibility and efficiency of skill execution, the skill list is generated by calculated likelihood, and linear programming is used to optimally allocate tasks to each robot. Experimental evaluations in simulated environments demonstrate that this method outperforms existing task planners, achieving higher success rates and efficiency in executing complex, multi-robot tasks. The results indicate the potential of combining LLMs with optimization techniques to enhance the capabilities of multi-robot systems in executing coordinated tasks accurately and efficiently. In an environment with two robots, a maximum success rate difference of 0.82 is observed in the language instruction group with a change in the object name.
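As a rough sketch of the dependency-graph-plus-assignment idea (not LiP-LLM's implementation), the snippet below topologically orders hypothetical skills and allocates the currently executable ones to robots by solving a min-cost assignment, a special case of a linear program; the graph and cost matrix are invented for illustration.
```python
# Illustrative sketch: order skills via a dependency graph, then allocate the
# currently executable skills to robots with a min-cost assignment.
from graphlib import TopologicalSorter
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical dependency graph: each key maps to its prerequisite skills.
deps = {"place_cup": {"pick_cup"}, "wipe_table": set(), "pick_cup": set()}
order = list(TopologicalSorter(deps).static_order())   # prerequisites first

# Skills with no pending prerequisites can be executed now.
ready = [s for s in order if not deps[s]]

# Hypothetical costs: rows = robots, cols = ready skills.
cost = np.array([[1.0, 3.0],
                 [2.5, 0.5]])
rows, cols = linear_sum_assignment(cost)                # min-cost allocation
allocation = {f"robot{r}": ready[c] for r, c in zip(rows, cols)}
print(order, allocation)
```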
|
|
11:40-11:45, Paper ThCT9.6 | |
Transformer-Based Model Predictive Control: Trajectory Optimization Via Sequence Modeling |
|
Celestini, Davide | Politecnico Di Torino |
Gammelli, Daniele | Stanford |
Guffanti, Tommaso | Stanford University |
D’Amico, Simone | Stanford University |
Capello, Elisa | Politecnico Di Torino CNR IEIIT |
Pavone, Marco | Stanford University |
Keywords: Optimization and Optimal Control, Deep Learning Methods, Machine Learning for Robot Control
Abstract: Model predictive control (MPC) has established itself as the primary methodology for constrained control, enabling general-purpose robot autonomy in diverse real-world scenarios. However, for most problems of interest, MPC relies on the recursive solution of highly non-convex trajectory optimization problems, leading to high computational complexity and strong dependency on initialization. In this work, we present a unified framework to combine the main strengths of optimization-based and learning-based methods for MPC. Our approach entails embedding high-capacity, transformer-based neural network models within the optimization process for trajectory generation, whereby the transformer provides a near-optimal initial guess, or target plan, to a non-convex optimization problem. Our experiments, performed in simulation and the real world onboard a free flyer platform, demonstrate the capabilities of our framework to improve MPC convergence and runtime. Compared to purely optimization-based approaches, results show that our approach can improve trajectory generation performance by up to 75%, reduce the number of solver iterations by up to 45%, and improve overall MPC runtime by 7x without loss in performance.
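The following sketch illustrates the warm-starting pattern the abstract describes: a learned model supplies the initial guess for a non-convex trajectory optimization. The learned_initial_guess function is a hand-coded stand-in for the paper's transformer, and the cost function is a toy, so this shows only the shape of the idea.

```python
# Minimal sketch of warm-starting a non-convex trajectory optimization with a
# learned initial guess. The "network" here is a stand-in function.
import numpy as np
from scipy.optimize import minimize

T, goal = 20, np.array([1.0, 1.0])

def traj_cost(z):
    x = z.reshape(T, 2)
    effort = np.sum(np.diff(x, axis=0) ** 2)                          # control effort
    obstacle = np.sum(np.exp(-20 * np.sum((x - 0.5) ** 2, axis=1)))   # non-convex term
    terminal = 100 * np.sum((x[-1] - goal) ** 2)
    return effort + obstacle + terminal

def learned_initial_guess():
    # Stand-in for the transformer: a straight line warped away from the obstacle.
    s = np.linspace(0, 1, T)[:, None]
    return (s * goal + 0.2 * np.sin(np.pi * s) * np.array([1.0, -1.0])).reshape(-1)

cold = minimize(traj_cost, np.zeros(2 * T), method="L-BFGS-B")
warm = minimize(traj_cost, learned_initial_guess(), method="L-BFGS-B")
print("cold-start cost:", round(cold.fun, 3), "iterations:", cold.nit)
print("warm-start cost:", round(warm.fun, 3), "iterations:", warm.nit)
```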
|
|
ThCT10 |
313 |
Multi-Robot Systems 5 |
Regular Session |
Chair: Saeedi, Sajad | Toronto Metropolitan University |
Co-Chair: Sabattini, Lorenzo | University of Modena and Reggio Emilia |
|
11:15-11:20, Paper ThCT10.1 | |
A Method for Constructing Building Structure Grid Map Based on a Climbing Algorithm |
|
Zhou, Xidong | Hunan University |
Zhong, Hang | Hunan University |
Zhang, Hui | Hunan University |
Chen, MingYuan | Hunan University |
Yu, Haoyang | Hunan University |
Wang, Weizheng | Hunan University |
Wang, Yaonan | Hunan University |
Keywords: Aerial Systems: Perception and Autonomy, Mapping, Motion and Path Planning
Abstract: Aerial-terrestrial amphibious robots excel in search and rescue tasks in unstructured terrains but face challenges in autonomous navigation indoors. Traditional full-mapping methods can degrade global path planning performance, especially when semi-static obstacles shift, leading to suboptimal paths. We propose a method for constructing building structure grid maps that are unaffected by semi-static obstacles. Our approach includes a building structure recognition algorithm based on an octree structure to differentiate between occupied and free grid cells. Experimental results demonstrate that coverage path planning on building structure grid maps produces superior global paths compared to traditional grid maps, offering a more streamlined and robust solution for autonomous navigation of aerial-terrestrial amphibious robots in indoor environments.
|
|
11:20-11:25, Paper ThCT10.2 | |
Efficient Scale-Uniform 3D Visual Coverage Algorithm for UAV Based on Elastic Photogrammetric Constraints |
|
Zong, Jianping | Nankai University |
Cao, Zhongzhi | Nankai University |
Chen, Qi | Nankai University |
Sun, Chuanyu | Nankai University |
Shao, Xiuli | Nankai University |
Li, Haifeng | Civil Aviation University of China |
Wang, Hongpeng | Nankai University |
Keywords: Aerial Systems: Applications, Environment Monitoring and Management, Search and Rescue Robots
Abstract: Unmanned aerial vehicles equipped with modern vision algorithms are crucial for missions such as reconstruction and target acquisition. However, when deployed in the field, undulating terrain can cause significant fluctuations in image scale and degrade the performance of vision algorithms. Instead of developing specialized image processing schemes with limited adaptability, this paper presents a novel 3D visual coverage algorithm that is compatible with existing generic vision algorithms and maintains a uniform image scale for ground targets. In detail, photogrammetric constraints are initially introduced to generate aerial waypoints, and then the negative effects of valley clustering are addressed. Elastic Photogrammetric Constraints (EPC) are further proposed to eliminate valley clustering effects induced by saddle terrain. The experimental results demonstrate that EPC reduces the traversal path length by up to 37.38% compared to the previous work, but with a minor trade-off in scale variations.
|
|
11:25-11:30, Paper ThCT10.3 | |
Target-Aware Viewpoint Generation for Active Robotic Exploration in Unknown Environments |
|
Xu, Pu | Northeastern University |
Liu, Haoming | Northeastern University (CN) |
Li, Zhiheng | Northeastern University |
Bai, Zhaoqiang | Northeastern University |
Fang, Zheng | Northeastern University |
Keywords: Search and Rescue Robots, Constrained Motion Planning, Motion and Path Planning
Abstract: When entering an unfamiliar environment, animals usually scan their surroundings to identify points of interest. In search and rescue robotics, autonomous exploration requires both coarse mapping of unknown areas and detailed target detection, which poses a significant challenge in balancing these tasks. To that end, we propose a target-aware robotic exploration framework that prioritizes both exploration efficiency and search completeness through three components: First, considering the computational limitations of robotic platforms, a lightweight 3D target detection method with post-fusion is introduced to detect target positions in real time. Secondly, we propose a target-aware viewpoint generation approach that integrates information gain and inspection gain to identify promising viewpoints for thorough target searches. Lastly, since a detailed examination of the environment demands numerous viewpoints, we propose a heuristic-based active exploration framework that employs a hierarchical structure to optimize exploration gain, traveling distance, and path smoothness to maximize the utility function of viewpoint sequences and ultimately find the optimal path. Extensive simulations and real-world experiments demonstrate our framework significantly enhances target search capabilities, achieving a 13% average improvement in exploration efficiency over existing methods.
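A minimal, hypothetical version of the viewpoint-scoring step is sketched below: candidate viewpoints are ranked by a weighted combination of information gain, inspection gain, travel distance, and heading smoothness. The weights and gain values are placeholders rather than the paper's utility function.

```python
# Minimal sketch of scoring candidate viewpoints by a weighted utility.
# Illustrative values only; not the paper's actual gains or weights.
import numpy as np

rng = np.random.default_rng(0)
robot = np.array([0.0, 0.0])
prev_dir = np.array([1.0, 0.0])

candidates = rng.uniform(-5, 5, size=(8, 2))        # candidate viewpoint positions
info_gain = rng.uniform(0, 1, size=8)               # unexplored volume seen
inspect_gain = rng.uniform(0, 1, size=8)            # nearby detected targets to inspect

def utility(idx, w=(1.0, 1.5, 0.3, 0.5)):
    p = candidates[idx]
    dist = np.linalg.norm(p - robot)
    heading = (p - robot) / (dist + 1e-9)
    smooth = float(heading @ prev_dir)               # prefer small heading change
    return w[0] * info_gain[idx] + w[1] * inspect_gain[idx] - w[2] * dist + w[3] * smooth

best = max(range(len(candidates)), key=utility)
print("next viewpoint:", candidates[best], "utility:", round(utility(best), 3))
```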
|
|
11:30-11:35, Paper ThCT10.4 | |
Online Multi-Robot Federated Learning for Distributed Coverage Control of Unknown Spatial Processes |
|
Mantovani, Mattia | University of Modena and Reggio Emilia |
Pratissoli, Federico | Università Degli Studi Di Modena E Reggio Emilia |
Sabattini, Lorenzo | University of Modena and Reggio Emilia |
Keywords: Distributed Robot Systems, Multi-Robot Systems, Networked Robots
Abstract: Distributed multi-robot teams are increasingly used for optimal coverage of domains with unknown density distributions, often modeled with Gaussian Processes (GPs). However, current methods rely on data sharing, raising privacy concerns and computational issues. We propose a Federated Learning (FL) approach that enables collaborative training of GP models without sharing raw data. To enhance scalability and efficiency, we introduce a filtering strategy that selects relevant data samples, minimizing computational load. Realistic simulations emulating real scenarios demonstrate the effectiveness of our method in achieving robust environmental estimates with minimal data sharing and reduced complexity.
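The sketch below shows a generic FedAvg-style round for the setting the abstract describes: each robot fits a Gaussian Process on its private samples and only kernel hyperparameters are aggregated, never raw data. It illustrates the general pattern, not the paper's federated or sample-filtering algorithm.

```python
# Minimal sketch of a federated round for GP-based coverage: share and average
# kernel hyperparameters only, keep raw samples on each robot.
# Generic illustration; not the paper's exact scheme.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
density = lambda x: np.exp(-np.sum((x - 0.7) ** 2, axis=1) / 0.05)

local_length_scales = []
for robot in range(3):
    X = rng.uniform(0, 1, size=(25, 2))                    # private local samples
    y = density(X) + 0.01 * rng.standard_normal(25)
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3)).fit(X, y)
    local_length_scales.append(gp.kernel_.length_scale)    # share hyperparameter only

global_ls = float(np.mean(local_length_scales))            # server-side aggregation
print("aggregated RBF length scale:", round(global_ls, 4))
```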
|
|
11:35-11:40, Paper ThCT10.5 | |
Constrained Learning for Decentralized Multi-Objective Coverage Control |
|
Cervino, Juan | MIT |
Agarwal, Saurav | University of Pennsylvania |
Kumar, Vijay | University of Pennsylvania |
Ribeiro, Alejandro | University of Pennsylvania |
Keywords: Deep Learning Methods, Autonomous Vehicle Navigation, Multi-Robot Systems
Abstract: The multi-objective coverage control problem requires a robot swarm to collaboratively provide sensor coverage to multiple heterogeneous importance density fields (IDFs) simultaneously. We pose this as an optimization problem with constraints and study two different formulations: (1) Fair coverage, where we minimize the maximum coverage cost for any field, promoting equitable resource distribution among all fields; and (2) Constrained coverage, where each field must be covered below a certain cost threshold, ensuring that critical areas receive adequate coverage according to predefined importance levels. We study the decentralized setting where robots have limited communication and local sensing capabilities, making the system more realistic, scalable, and robust. Given the complexity, we propose a novel decentralized constrained learning approach that combines primal-dual optimization with a Learnable Perception-Action-Communication (LPAC) neural network architecture. We show that the Lagrangian of the dual problem can be reformulated as a linear combination of the IDFs, enabling the LPAC policy to serve as a primal solver. We empirically demonstrate that the proposed method (i) significantly outperforms state-of-the-art decentralized controllers by 30% on average in terms of coverage cost, (ii) transfers well to larger environments with more robots, and (iii) is scalable in the number of IDFs and robots in the swarm.
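A toy primal-dual loop in the spirit of the abstract is sketched below: dual variables weight the per-field coverage costs, a surrogate primal step stands in for running the LPAC policy, and dual ascent grows the multiplier of any field whose cost exceeds its threshold. All numbers and the surrogate are invented.

```python
# Minimal sketch of the primal-dual pattern for constrained coverage.
# The primal step is a toy surrogate, not the LPAC policy from the paper.
import numpy as np

thresholds = np.array([1.0, 0.6, 0.8])       # per-IDF cost limits
lam = np.zeros(3)                            # dual variables
eta = 0.2                                    # dual step size

def primal_step(lam):
    # Toy surrogate for "run the coverage policy with weights lam":
    # more weight on a field yields lower cost on that field.
    return 1.5 / (1.0 + lam)                 # per-field coverage costs

for it in range(50):
    costs = primal_step(lam)
    lam = np.maximum(0.0, lam + eta * (costs - thresholds))   # dual ascent

print("final per-field costs:", np.round(costs, 3))
print("dual variables:       ", np.round(lam, 3))
```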
|
|
11:40-11:45, Paper ThCT10.6 | |
Di-NeRF: Distributed NeRF for Collaborative Learning with Relative Pose Refinement |
|
Asadi, Mahboubeh | Toronto Metropolitan University |
Zareinia, Kourosh | Toronto Metropolitan University |
Saeedi, Sajad | Toronto Metropolitan University |
Keywords: Distributed Robot Systems, Mapping, Multi-Robot SLAM
Abstract: Collaborative mapping of unknown environments can be done faster and more robustly than with a single robot. However, a collaborative approach requires a distributed paradigm to be scalable and deal with communication issues. This work presents a fully distributed algorithm enabling a group of robots to collectively optimize the parameters of a Neural Radiance Field (NeRF). The algorithm involves the communication of each robot's trained NeRF parameters over a mesh network, where each robot trains its NeRF and has access to its own visual data only. Additionally, the relative poses of all robots are jointly optimized alongside the model parameters, enabling mapping with less accurate relative camera poses. We show that multi-robot systems can benefit from differentiable and robust 3D reconstruction optimized from multiple NeRFs. Experiments on real-world and synthetic data demonstrate the efficiency of the proposed algorithm. See the website of the project for videos of the experiments and supplementary material https://sites.google.com/view/di-nerf/home.
|
|
ThCT11 |
314 |
Haptics 2 |
Regular Session |
Chair: Kyung, Ki-Uk | Korea Advanced Institute of Science & Technology (KAIST) |
Co-Chair: Dills, Patrick | University of Wisconsin - Madison |
|
11:15-11:20, Paper ThCT11.1 | |
A Hybrid Haptic Device for Virtual Car Door Interactions: Design and Implementation |
|
Ma, Jihyeong | Korea Advanced Institute of Science and Technology |
Kim, Ji-Sung | KAIST |
Kyung, Ki-Uk | Korea Advanced Institute of Science & Technology (KAIST) |
Keywords: Haptics and Haptic Interfaces, Virtual Reality and Interfaces, Compliance and Impedance Control
Abstract: As cars evolve from mere modes of transportation into living spaces, the importance of haptic interaction with vehicles is increasing. Here, we introduce a hybrid haptic device for the virtual prototyping of car doors, employing both a motor and a brake. Physical prototyping, which is a conventional method for product design, is often expensive and time-consuming. As a valuable alternative, virtual prototyping with a haptic device that delivers realistic haptic feedback can be utilized. However, replicating the substantial torque of a car door requires a high torque capacity motor, which can potentially pose safety risks to the user during haptic interaction. The proposed hybrid haptic device, combining a servo motor and a magnetic powder brake, effectively renders the dynamics of car doors. We experimentally measured the door's torque profile and confirmed significant friction from the door check mechanism and hinge. The torque profile was divided into active and passive torque, and each torque was distributed to the motor and brake, respectively. Finally, the proposed device and control method demonstrate the capability to accurately render the car door's kinesthetic haptic feedback, confirming its potential as an efficient tool for virtual prototyping in automotive design.
|
|
11:20-11:25, Paper ThCT11.2 | |
RAR-6: An Optimized Reconfigurable Asymmetric 6-DOF Haptic Robot for Gross and Fine Motor Tasks |
|
Zhang, Changqi | SINOPEC Research Institute of Petroleum Engineering Co., Ltd |
Wang, Cui | Southern University of Science and Technology |
Wang, Congzhe | Chongqing University of Posts and Telecommunications |
Zhang, Mingming | Southern University of Science and Technology |
Keywords: Haptics and Haptic Interfaces, Optimization and Optimal Control, Mechanism Design
Abstract: Robot-assisted task-oriented training demonstrates immense potential in the rehabilitation area. Parallel robots, with advantages such as low inertia and high stiffness, facilitate precise haptic feedback, yet their application in rehabilitation is limited by workspace constraints. To this end, we propose a design scheme for a haptic robot based on a reconfigurable asymmetric parallel mechanism. We first introduce a two-stage multi-objective optimization method to obtain the optimal parameter configurations. Then, to achieve precise assembly of the reconfigurable mechanism in each configuration, corresponding positioning mechanisms are designed. System performance tests validate the robot’s capabilities under different configurations: workspace meets design requirements, stiffness output reaches 30 N/mm, force output is 40 N, RMS of maximum back-driven force along x, y, and z axes is 7.5 N, and RMS of maximum back-driven torque around x and y axes is 567.4 N∙mm. Target tracking and virtual channel trajectory tracking experiments demonstrate the system’s haptic rendering ability for gross motor tasks (GMTs) and fine motor tasks (FMTs), respectively. The developed 6-DOF haptic robot holds promise for versatile task-oriented rehabilitation training.
|
|
11:25-11:30, Paper ThCT11.3 | |
Design, Implementation, and Validation of an Ungrounded Visuo-Tactile Haptic Interface for Robotic Teleoperation in High-Risk Steel Production |
|
Park, Jaehyun | Pohang University of Science and Technology |
Choi, Il Seop | POSCO HOLDINGS |
Choi, Sang-Woo | PoscoHoldings |
Kim, Keehoon | POSTECH, Pohang University of Science and Technology |
Keywords: Telerobotics and Teleoperation, Haptics and Haptic Interfaces, Robotics in Hazardous Fields
Abstract: Haptic devices are widely used as control interfaces for robotic teleoperation, offering intuitive rendering of interactions between the remote robot and its environment. In particular, cutaneous feedback devices provide intrinsic stability and a reduced form factor compared to kinesthetic feedback interfaces. However, the implementation of cutaneous feedback devices in industrial settings must be rigorously validated to prevent potential equipment accidents, which could lead to substantial economic losses due to unskilled robot manipulation. This paper presents a novel ungrounded haptic control interface (POstick-VF), designed specifically for high-risk steel production tasks. POstick-VF offers visuo-tactile feedback within an extensive workspace, enabling intuitive robot manipulation through its kinematic similarity with real tools while ensuring safety. The performance of the developed POstick is rigorously validated and compared with a conventional joystick controller through experiments conducted with an on-site hydraulic robot.
|
|
11:30-11:35, Paper ThCT11.4 | |
Enhanced Tiny Haptic Dial with T-Shaped Shaft Based on Magnetorheological Fluid |
|
Heo, Yong Hae | Korea University of Technology and Education |
Kim, Seongho | Korea University of Technology and Education |
Kim, Sang-Youn | Korea Univ. Technology & Education |
Keywords: Haptics and Haptic Interfaces, Touch in HRI
Abstract: This paper introduces a tiny haptic dial utilizing magnetorheological fluid (MRF) to enhance its resistive torque feedback. Moreover, we design the T-shaped rotary shaft with bumps and embed it into the haptic dial to enhance its haptic performance (resistive torque). This structure enables two operation modes (shear and flow) of MRF that contribute to the actuation simultaneously in the proposed haptic dial. This structure allows the magnetic flux to flow towards the MRF, helping further maximize the resistive torque. We conduct a simulation to confirm that the magnetic flux generated from a solenoid forms a closed-loop magnetic path without magnetic saturation or leakage in the proposed haptic dial. The resistive torque of the proposed haptic dial varied from 8 N·mm to 47 N·mm as the input current changed from 0 to 300 mA, thus indicating that the proposed haptic dial can create a variety of haptic sensations in a tiny size (diameter: 20 mm; height: 20 mm).
|
|
11:35-11:40, Paper ThCT11.5 | |
Path-Constrained Haptic Motion Guidance Via Adaptive Phase-Based Admittance Control |
|
Shahriari, Erfan | Boston Dynamics AI Institute |
Svarny, Petr | CTU in Prague, FEE |
Baradaran Birjandi, Seyed Ali | Technical University of Munich |
Hoffmann, Matej | Czech Technical University in Prague, Faculty of Electrical Engi |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Motion and Path Planning, Physical Human-Robot Interaction, Robust/Adaptive Control of Robotic Systems, Motion Control
Abstract: Robots have surpassed humans in terms of strength and precision, yet humans retain an unparalleled ability for decision-making in the face of unpredictable disturbances. This article aims to combine the strengths of both entities within a singular task: human motion guidance under strict geometric constraints, particularly adhering to predetermined paths. To tackle this challenge, a modular haptic guidance law is proposed that takes the human-applied wrench as an input. Using an auxiliary variable called phase, the generated desired motion is guaranteed to consistently adhere to the constraint path. It is demonstrated how the guidance policy can be generalized into physically interpretable terms, adjustable either prior to initiating the task or dynamically while the task is in progress. Additionally, an illustrative guidance adaptation policy is showcased that takes into account the human's manipulability. Leveraging passivity analysis, potential sources of instability are pinpointed, and subsequently, overall system stability is ensured by incorporating an augmented virtual energy tank. Lastly, a comprehensive set of experiments, including a 20-participant user study, explores various aspects of the approach in practice, encompassing both technical and usability considerations.
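The sketch below illustrates the core idea of phase-based guidance on a fixed path: the human wrench can only advance or retract a scalar phase, so the commanded motion never leaves the constraint path. The circular path, gains, and simulated force profile are illustrative and do not reproduce the article's controller, energy tank, or adaptation policy.

```python
# Minimal sketch of phase-based guidance along a fixed geometric path.
# All gains, the path, and the force profile are placeholders.
import numpy as np

def path(s):                       # constraint path: circle of radius 0.3 m
    return np.array([0.3 * np.cos(s), 0.3 * np.sin(s)])

def tangent(s):
    t = np.array([-0.3 * np.sin(s), 0.3 * np.cos(s)])
    return t / np.linalg.norm(t)

s, dt, admittance_gain = 0.0, 0.002, 4.0
for k in range(1000):
    f_human = np.array([1.0, 0.5]) if k < 600 else np.array([-0.5, 0.0])
    s_dot = admittance_gain * float(tangent(s) @ f_human)   # phase dynamics
    s += s_dot * dt
    x_des = path(s)                                          # always on the path

print("final phase:", round(s, 3), "desired position:", np.round(x_des, 3))
```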
|
|
11:40-11:45, Paper ThCT11.6 | |
A Pneumatic-Actuated Feel-Through Wearable Haptic Display for Multi-Cue Delivery |
|
Pagnanelli, Giulia | University of Pisa |
Latella, Giovanni | University of Pisa |
Catalano, Manuel Giuseppe | Istituto Italiano Di Tecnologia |
Bianchi, Matteo | University of Pisa |
Keywords: Haptics and Haptic Interfaces, Wearable Robotics, Mechanism Design
Abstract: Compared to the "Seeing-through" paradigm for the concurrent display of both real and virtual images in vision-enabled Augmented Reality (AR), its haptic counterpart, i.e., the "Feeling-through" via wearable tactile systems, which enables the user to simultaneously experience physical objects and haptically rendered virtual properties, is still largely unexplored. In a previous work, we introduced the Wearable-Fabric Yielding Display (W-FYD), which uses an elastic thin fabric as the interaction surface with the finger, allowing the delivery of softness-related cues both in active and passive exploration mode, together with sliding stimuli. The device was proven effective, but the current design faces form factor issues related to the dimensions and weight of the device, due to the actuation strategy of the lifting mechanism in the passive mode. To tackle this issue, we propose a miniaturized version of the system, named the W-FYD AIR, which allows reducing the overall dimensions of the device, from 100 × 60 × 36 mm to 78 × 45 × 37 mm, and its weight, from 100 g to 54 g, by exploiting pneumatically-actuated chambers for the lifting mechanism. Through careful sizing of each component and a process of characterization and identification, we demonstrated that the new system attained the same characteristics and functionality as the original one.
|
|
ThCT12 |
315 |
Big Data |
Regular Session |
Chair: Xu, Danfei | Georgia Institute of Technology |
Co-Chair: Shi, Guangyao | University of Southern California |
|
11:15-11:20, Paper ThCT12.1 | |
How Generalizable Is My Behavior Cloning Policy? a Statistical Approach to Trustworthy Performance Evaluation |
|
Vincent, Joseph | Stanford University |
Nishimura, Haruki | Toyota Research Institute |
Itkina, Masha | Stanford University |
Shah, Paarth | University of Oxford |
Schwager, Mac | Stanford University |
Kollar, Thomas | Toyota Research Institute |
Keywords: Performance Evaluation and Benchmarking, Probability and Statistical Methods, AI-Enabled Robotics
Abstract: With the rise of stochastic generative models in robot policy learning, end-to-end visuomotor policies are increasingly successful at solving complex tasks by learning from human demonstrations. Nevertheless, since real-world evaluation costs afford users only a small number of policy rollouts, it remains a challenge to accurately gauge the performance of such policies. This is exacerbated by distribution shifts causing unpredictable changes in performance during deployment. To rigorously evaluate behavior cloning policies, we present a framework that provides a tight lower-bound on robot performance in an arbitrary environment, using a minimal number of experimental policy rollouts. Notably, by applying the standard stochastic ordering to robot performance distributions, we provide a worst-case bound on the entire distribution of performance (via bounds on the cumulative distribution function) for a given task. We build upon established statistical results to ensure that the bounds hold with a user-specified confidence level and tightness, and are constructed from as few policy rollouts as possible. In experiments we evaluate policies for visuomotor manipulation in both simulation and hardware. Specifically, we (i) empirically validate the guarantees of the bounds in simulated manipulation settings, (ii) find the degree to which a learned policy deployed on hardware generalizes to new real-world environments, and (iii) rigorously compare two policies tested in out-of-distribution settings. Our experimental data, code, and implementation of confidence bounds are open-source.
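For intuition on distribution-level bounds from a handful of rollouts, the sketch below uses the classical Dvoretzky-Kiefer-Wolfowitz (DKW) inequality to build a confidence band on the CDF of a performance score. This is a textbook construction offered only as an illustration; the paper's bounds are built differently and may be tighter.

```python
# Minimal sketch of a distribution-level bound from few rollouts via the DKW
# inequality. Rollout scores and the confidence level are made-up examples.
import numpy as np

scores = np.array([0.9, 1.0, 0.7, 1.0, 0.8, 0.6, 1.0, 0.9])   # per-rollout returns
n, delta = len(scores), 0.05
eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))                 # DKW half-width

xs = np.sort(scores)
ecdf = np.arange(1, n + 1) / n
cdf_upper = np.minimum(1.0, ecdf + eps)    # F(x) <= ecdf + eps, uniformly, w.p. >= 1 - delta

# A worst-case (stochastically smallest) performance distribution assigns as
# much probability as the band allows to low scores.
for x, u in zip(xs, cdf_upper):
    print(f"P(score <= {x:.2f}) <= {u:.2f}")
```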
|
|
11:20-11:25, Paper ThCT12.2 | |
Fine-Grained Open-Vocabulary Object Detection with Fine-Grained Prompts: Task, Dataset and Benchmark |
|
Liu, Ying | Northeastern University, China |
Hua, Yijing | Northeastern University, China |
Chai, Haojiang | Northeastern University, China |
Wang, Yanbo | Northeastern University, China |
TengQi, Ye | Articul8 AI |
Keywords: Data Sets for Robotic Vision, Computer Vision for Automation, Object Detection, Segmentation and Categorization
Abstract: Open-vocabulary detectors are proposed to locate and recognize objects in novel classes. However, variations in vision-aware language vocabulary data used for open-vocabulary learning can lead to unfair and unreliable evaluations. Recent evaluation methods have attempted to address this issue by incorporating object properties or adding locations and characteristics to the captions. Nevertheless, since these properties and locations depend on the specific details of the images instead of classes, detectors cannot make accurate predictions without precise descriptions provided through human annotation. This paper introduces 3F-OVD, a novel task that extends supervised fine-grained object detection to the open-vocabulary setting. Our task is intuitive and challenging, requiring a deep understanding of fine-grained captions and careful attention to fine-grained details in images in order to accurately detect fine-grained objects. Additionally, due to the scarcity of qualified fine-grained object detection datasets, we have created a new dataset, NEU-171K, tailored for both supervised and open-vocabulary settings. We benchmark state-of-the-art object detectors on our dataset for both settings. Furthermore, we propose a simple yet effective post-processing technique. Our data, annotations, and codes are available at https://github.com/tengerye/3FOVD.
|
|
11:25-11:30, Paper ThCT12.3 | |
GPU-Accelerated Subsystem-Based ADMM for Large-Scale Interactive Simulation |
|
Ji, Harim | Seoul National University |
Kim, Hyunsu | Seoul National University |
Lee, Jeongmin | Seoul National University |
Lee, Somang | Seoul National University |
An, Seoki | Seoul National University |
Heo, Jinuk | Seoul National University |
Lee, Youngseon | Seoul National University |
Lee, Yongseok | Seoul National University |
Lee, Dongjun | Seoul National University |
Keywords: Simulation and Animation, Virtual Reality and Interfaces, Haptics and Haptic Interfaces
Abstract: In this paper, we implement the GPU-accelerated subsystem-based Alternating Direction Method of Multipliers (SubADMM) for interactive simulation. The challenging objective for interactive simulations is to deliver realistic results under tight performance, even for large-scale scenarios. We aim to achieve this by exploiting the parallelizable nature of SubADMM to the fullest extent. We introduce a new subsystem division strategy to make SubADMM 'GPU friendly' along with custom kernel designs and optimization regarding efficient memory access patterns. We successfully implement the GPU-accelerated SubADMM and show the accuracy and speed of the framework for large-scale scenarios, highlighted with an interactive 'Hand demo' scenario. We also show improved robustness and accuracy compared to other state-of-the-art interactive simulators with several challenging scenarios that introduce large-scale ill-conditioned dynamics problems.
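The sketch below shows plain consensus ADMM on toy quadratic subsystems, the algorithmic family whose per-subsystem updates the paper parallelizes on the GPU. The real solver handles contact dynamics and custom kernels, none of which is reproduced here.

```python
# Minimal sketch of consensus ADMM: independent local updates per subsystem,
# an averaging consensus step, and a dual update. Toy quadratics only.
import numpy as np

targets = np.array([[0.0, 1.0], [2.0, -1.0], [1.0, 3.0]])   # per-subsystem data a_i
n, dim, rho = targets.shape[0], targets.shape[1], 1.0

x = np.zeros((n, dim))          # local variables
u = np.zeros((n, dim))          # scaled dual variables
z = np.zeros(dim)               # consensus variable

for _ in range(100):
    # Local updates (independent per subsystem -> embarrassingly parallel):
    # argmin_x 0.5||x - a_i||^2 + (rho/2)||x - z + u_i||^2
    x = (targets + rho * (z - u)) / (1.0 + rho)
    z = np.mean(x + u, axis=0)          # consensus (averaging) step
    u = u + x - z                       # dual update

print("consensus solution:", np.round(z, 3))     # approaches mean(targets)
print("mean of targets:   ", np.round(targets.mean(axis=0), 3))
```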
|
|
11:30-11:35, Paper ThCT12.4 | |
Local Policies Enable Zero-Shot Long-Horizon Manipulation |
|
Dalal, Murtaza | Carnegie Mellon University |
Liu, Min | Carnegie Mellon University |
Talbott, Walter | Apple |
Chen, Chen | Apple |
Pathak, Deepak | Carnegie Mellon University |
Zhang, Jian | Purdue University |
Salakhutdinov, Ruslan | University of Toronto |
Keywords: Big Data in Robotics and Automation, Machine Learning for Robot Control, Deep Learning Methods
Abstract: Sim2real for robotic manipulation is difficult due to the challenges of simulating complex contacts and generating realistic task distributions. To tackle the latter problem, we introduce ManipGen, which leverages a new class of policies for sim2real transfer: local policies. Locality enables a variety of appealing properties including invariances to absolute robot and object pose, skill ordering, and global scene configuration. We combine these policies with foundation models for vision, language and motion planning and demonstrate SOTA zero-shot performance of our method on Robosuite benchmark tasks in simulation (97%). We transfer our local policies from simulation to reality and observe they can solve unseen long-horizon manipulation tasks with up to 8 stages with significant pose, object and scene configuration variation. ManipGen outperforms SOTA approaches such as SayCan, OpenVLA and LLMTrajGen across 50 real-world manipulation tasks by 36%, 76% and 62% respectively. All code, models and datasets will be released. Video results at manipgen.github.io
|
|
11:35-11:40, Paper ThCT12.5 | |
DART: Dexterous Augmented Reality Teleoperation Platform for Large-Scale Robot Data Collection in Simulation |
|
Park, Younghyo | MIT |
Bhatia, Jagdeep | Massachusetts Institute of Technology |
Ankile, Lars | Massachusetts Institute of Technology |
Agrawal, Pulkit | MIT |
Keywords: Data Sets for Robot Learning, Telerobotics and Teleoperation, Virtual Reality and Interfaces
Abstract: The scarcity of diverse and high-quality data impedes the quest to build a generalist robotic system. Current robotics data collection efforts face many challenges: the need for physical robotic hardware, setting up the environment, frequent resets, and the fatigue for data collectors operating real robots. We introduce DART, a teleoperation platform designed for crowdsourcing that reimagines robotic data collection by leveraging cloud-based simulation and augmented reality (AR) to address many limitations of prior data collection efforts. User studies show that DART enables higher data collection throughput and lower physical fatigue than real-world teleoperation. We also demonstrate that policies trained using DART-collected datasets successfully transfer to reality and are robust to unseen visual disturbances. All data collected through DART is automatically stored in a cloud-hosted database, DexHub, paving the path for an ever-growing data hub for robot learning.
|
|
ThCT13 |
316 |
Motion Prediction |
Regular Session |
Chair: Liang, Xiao | Texas A&M University |
Co-Chair: Stiffler, Nicholas | University of Dayton |
|
11:15-11:20, Paper ThCT13.1 | |
TransFusion: A Practical and Effective Transformer-Based Diffusion Model for 3D Human Motion Prediction |
|
Tian, Sibo | Texas A&M University |
Zheng, Minghui | Texas A&M University |
Liang, Xiao | Texas A&M University |
Keywords: Human-Robot Collaboration, Deep Learning for Visual Perception, Computer Vision for Automation
Abstract: Predicting human motion plays a crucial role in ensuring a safe and effective human-robot close collaboration in intelligent remanufacturing systems of the future. Existing works can be categorized into two groups: those focusing on accuracy, predicting a single future motion, and those generating diverse predictions based on observations. The former group fails to address the uncertainty and multi-modal nature of human motion, while the latter group often produces motion sequences that deviate too far from the ground truth or become unrealistic within historical contexts. To tackle these issues, we propose TransFusion, an innovative and practical diffusion-based model for 3D human motion prediction which can generate samples that are more likely to happen while maintaining a certain level of diversity. Our model leverages Transformer as the backbone with long skip connections between shallow and deep layers. Additionally, we employ the discrete cosine transform to model motion sequences in the frequency space, thereby improving performance. In contrast to prior diffusion-based models that utilize extra modules like cross-attention and adaptive layer normalization to condition the prediction on past observed motion, we treat all inputs, including conditions, as tokens to create a more practical and effective model compared to existing approaches. Extensive experimental studies are conducted on benchmark datasets to validate the effectiveness of our human motion prediction model. The project page is available at https://github.com/sibotian96/TransFusion.
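A small example of the frequency-space representation mentioned above: applying the discrete cosine transform along time and keeping only the low-frequency coefficients gives a compact, smooth approximation of each joint trajectory. The motion data here is synthetic.

```python
# Minimal sketch of representing a motion sequence in frequency space with the
# DCT and truncating to low frequencies. Synthetic data only.
import numpy as np
from scipy.fft import dct, idct

T, J = 50, 17                                    # frames, joints (one coordinate each)
t = np.linspace(0, 2 * np.pi, T)[:, None]
motion = np.sin(t + np.arange(J) * 0.1)          # toy (T, J) joint trajectories

coeffs = dct(motion, axis=0, norm="ortho")       # per-joint DCT along time
k = 10                                           # keep the 10 lowest frequencies
coeffs[k:] = 0.0
reconstructed = idct(coeffs, axis=0, norm="ortho")

err = np.max(np.abs(reconstructed - motion))
print(f"max reconstruction error with {k}/{T} DCT coefficients: {err:.4f}")
```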
|
|
11:20-11:25, Paper ThCT13.2 | |
DE-TGN: Uncertainty-Aware Human Motion Forecasting Using Deep Ensembles |
|
Eltouny, Kareem | Simpson Gumpertz & Heger |
Liu, Wansong | University at Buffalo |
Tian, Sibo | Texas A&M University |
Zheng, Minghui | Texas A&M University |
Liang, Xiao | Texas A&M University |
Keywords: Human-Robot Collaboration, Computer Vision for Automation, Deep Learning for Visual Perception
Abstract: Ensuring the safety of human workers in a collaborative environment with robots is of utmost importance. Although accurate pose prediction models can help prevent collisions between human workers and robots, they are still susceptible to critical errors. In this study, we propose a novel approach called deep ensembles of temporal graph neural networks (DE-TGN) that not only accurately forecast human motion but also provide a measure of prediction uncertainty. By leveraging deep ensembles and employing stochastic Monte-Carlo dropout sampling, we construct a volumetric field representing a range of potential future human poses based on covariance ellipsoids. To validate our framework, we conducted experiments using three motion capture datasets including Human3.6M, and two human-robot interaction scenarios, achieving state-of-the-art prediction error. Moreover, we discovered that deep ensembles not only enable us to quantify uncertainty but also improve the accuracy of our predictions.
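The sketch below combines a deep ensemble with Monte-Carlo dropout to produce a predictive mean and covariance for a forecasted 3D joint position, the kind of uncertainty the abstract converts into covariance ellipsoids. The tiny untrained MLPs are placeholders for the temporal graph networks used in the paper.

```python
# Minimal sketch of deep-ensemble + MC-dropout uncertainty for one predicted
# 3D joint position. Placeholder MLPs; not the paper's architecture.
import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Dropout(p=0.2),
                         nn.Linear(64, 3))          # past features -> next 3D position

ensemble = [make_model() for _ in range(5)]
x = torch.randn(1, 6)                               # observed motion features

samples = []
with torch.no_grad():
    for model in ensemble:
        model.train()                               # keep dropout active at inference
        for _ in range(20):                         # MC dropout samples per member
            samples.append(model(x).squeeze(0))

S = torch.stack(samples)                            # (5 * 20, 3)
mean = S.mean(dim=0)
centered = S - mean
cov = centered.T @ centered / (S.shape[0] - 1)      # 3x3 covariance ellipsoid
print("predicted mean:", mean)
print("covariance diagonal:", torch.diag(cov))
```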
|
|
11:25-11:30, Paper ThCT13.3 | |
A Large-Scale Dataset for Humanoid Robotics Enabling a Novel Data-Driven Fall Prediction |
|
Urbann, Oliver | Fraunhofer IML |
Eßer, Julian | Fraunhofer IML |
Kleingarn, Diana | TU Dortmund University |
Moos, Arne | Robotics Research Institute |
Brämer, Dominik | Fraunhofer IML |
Brömmel, Piet | Fraunhofer IML |
Bach, Nicolas | Fraunhofer IML |
Jestel, Christian | Fraunhofer IML |
Larisch, Aaron | TU Dortmund University |
Kirchheim, Alice | TU Dortmund |
Keywords: Humanoid and Bipedal Locomotion, Failure Detection and Recovery, Data Sets for Robot Learning
Abstract: In this paper, we present a comprehensive dataset comprising 37.9 hours of sensor data collected from humanoid robots, including 18.3 hours of walking and 2,519 recorded falls. This extensive dataset is a valuable resource for various robotics and machine learning applications. Leveraging this data, we propose RePro-TCN, a Temporal Convolutional Network (TCN) enhanced with two novel extensions: Relaxed Loss Formulation and Progressive Forecasting. Predicting falls is a critical capability in humanoid robotics for implementing countermeasures such as lunging or stopping the walk. Thanks to the new dataset, we train RePro-TCN and demonstrate its superiority over previous approaches under real-world conditions that were previously unattainable.
|
|
11:30-11:35, Paper ThCT13.4 | |
Social-MAE: Social Masked Autoencoder for Multi-Person Motion Representation Learning |
|
Ehsanpour, Mahsa | University of Adelaide |
Reid, Ian | University of Adelaide |
Rezatofighi, Hamid | Monash University |
Keywords: Deep Learning for Visual Perception, Recognition, Human-Centered Robotics
Abstract: For seamless robot navigation, it’s vital to thoroughly understand multi-person scenes, which requires moving beyond simple tasks such as detection and tracking. Higher-level tasks, such as understanding the interactions and social activities among individuals, are also crucial. Progress towards models that can fully understand scenes involving multiple people is hindered by a lack of sufficient annotated data for such high-level tasks. To address this challenge, we introduce Social-MAE, a simple yet effective transformer-based masked autoencoder framework for multi-person human motion data. The framework uses masked modeling to pre-train the encoder to reconstruct masked human joint trajectories, enabling it to learn generalizable representations of motion in human crowded scenes. Social-MAE comprises a transformer as the MAE encoder and a lighter-weight transformer as the MAE decoder which operates on multi-person joints’ trajectory. After the reconstruction task, the MAE decoder is replaced with a task-specific decoder and the model is fine-tuned end-to-end for a variety of high-level social tasks. Our proposed model combined with our pre-training approach achieves the state-of-the-art results on various high-level social tasks, including multi-person pose forecasting, social grouping, and social action understanding. These improvements are demonstrated across four popular multi-person datasets encompassing both human 2D and 3D body pose.
|
|
11:35-11:40, Paper ThCT13.5 | |
Depth-Temporal Attention with Dual Modality Data for Walking Intention Prediction in Close-Proximity Front-Following |
|
Zhao, Chongyu | The University of Hong Kong |
Guo, Lingyu | The University of Hong Kong |
Wen, Rongwei | The University of Hong Kong |
Wang, Yanrui | The University of Hong Kong |
Wu, Chuan | The University of Hong Kong |
Keywords: Human Detection and Tracking, Intention Recognition, Visual Learning
Abstract: The role of robot following is crucial for effective human-robot collaboration. Traditional methods often rely on maintaining a significant distance between the robot and the human, which limits interaction and responsiveness. In contrast, close-proximity front-following facilitates immediate engagement, enhancing user experience and improving human-robot interaction. Nonetheless, it presents challenges in accurately interpreting human walking intentions due to a restricted observational field. In our paper, we introduce an innovative Depth-Temporal Attention Network that takes lower-limb depth images and robot motor signals as input, to accurately predict human walking intentions. This network leverages a depth attention module to capture essential spatial features and integrates a temporal attention mechanism to analyze movement dynamics. To enhance generalization, we use a domain adversarial module that focuses on shared features across diverse walking data, ensuring consistent performance across users. Experimental results demonstrate that our approach achieves an impressive average intention prediction accuracy of 91.09%, significantly surpassing baseline models by 12.59% to 23.66%. Additionally, an ablation study reveals that the depth-attention module substantially improves the model's understanding of depth features, resulting in an 11.44% increase in accuracy. With this high prediction accuracy, smooth front-following is achieved at close-proximity.
|
|
11:40-11:45, Paper ThCT13.6 | |
UPTor: Unified 3D Human Pose Dynamics and Trajectory Prediction for Human-Robot Interaction |
|
Nilavadi, Nisarga | University of Technology Nuremberg |
Rudenko, Andrey | Robert Bosch GmbH |
Linder, Timm | Robert Bosch GmbH |
Keywords: Human Detection and Tracking, Human Factors and Human-in-the-Loop, Datasets for Human Motion
Abstract: We introduce a unified approach to forecast the dynamics of human keypoints along with the motion trajectory based on a short sequence of input poses. While many studies address either full-body pose prediction or motion trajectory prediction, only a few attempt to merge them. We propose a motion transformation technique to simultaneously predict full-body pose and trajectory key-points in a global coordinate frame. We utilize an off-the-shelf 3D human pose estimation module, a graph attention network to encode the skeleton structure, and a compact, non-autoregressive transformer suitable for real-time motion prediction for human-robot interaction and human-aware navigation. We introduce a human navigation dataset "DARKO" with specific focus on navigational activities that are relevant for human-aware mobile robot navigation. We perform extensive evaluation on Human3.6M, CMU-Mocap, and our DARKO dataset. In comparison to prior work, we show that our approach is compact, real-time, and accurate in predicting human navigation motion across all datasets. Result animations, our dataset, and code will be available at https://nisarganc.github.io/UPTor-page/
|
|
ThCT14 |
402 |
Scene Reconstruction Using Radiance Fields |
Regular Session |
Chair: Schwertfeger, Sören | ShanghaiTech University |
Co-Chair: Zakharov, Sergey | Toyota Research Institute |
|
11:15-11:20, Paper ThCT14.1 | |
Category-Level Neural Field for Reconstruction of Partially Observed Objects in Indoor Environment |
|
Lee, Taekbeom | Seoul National University |
Jang, Youngseok | Seoul National University |
Kim, H. Jin | Seoul National University |
Keywords: Mapping, Semantic Scene Understanding, Visual Learning
Abstract: Neural implicit representation has been attracting attention in 3D reconstruction through various success cases. For further applications such as scene understanding or editing, several works have shown progress towards object-compositional reconstruction. Despite their superior performance in observed regions, their performance is still limited in reconstructing objects that are partially observed. To better treat this problem, we introduce a category-level neural field that learns meaningful common 3D information among objects belonging to the same category present in the scene. Our key idea is to subcategorize objects based on their observed shape for better training of the category-level model. Then we take advantage of the neural field to conduct the challenging task of registering partially observed objects by selecting and aligning against representative objects selected by ray-based uncertainty. Experiments on both simulation and real-world datasets demonstrate that our method improves the reconstruction of unobserved parts for several categories.
|
|
11:20-11:25, Paper ThCT14.2 | |
PlanarNeRF: Online Learning of Planar Primitives with Neural Radiance Fields |
|
Chen, Zheng | Indiana University Bloomington |
Yan, Qingan | Goertek US |
Zhan, Huangying | The University of Adelaide |
Cai, Changjiang | Stevens Institute of Technology |
Xu, Xiangyu | OPPO |
Huang, Yuzhong | University of Southern California |
Wang, Weihan | Stevens Institute of Technology |
Feng, Ziyue | Clemson University |
Xu, Yi | OPPO US Research Center |
Liu, Lantao | Indiana University |
Keywords: RGB-D Perception, Recognition
Abstract: Identifying spatially complete planar primitives from visual data is a crucial task in computer vision. Prior methods are largely restricted to either 2D segment recovery or simplifying 3D structures, even with extensive plane annotations. We present PlanarNeRF, a novel framework capable of detecting dense 3D planes through online learning. Drawing upon the neural field representation, PlanarNeRF brings three major contributions. First, it enhances 3D plane detection with concurrent appearance and geometry knowledge. Second, a lightweight plane fitting module is used to estimate plane parameters. Third, a novel global memory bank structure with an update mechanism is introduced, ensuring consistent cross-frame correspondence. The flexible architecture of PlanarNeRF allows it to function in both 2D-supervised and self-supervised solutions, in each of which it can effectively learn from sparse training signals, significantly improving training efficiency. Through extensive experiments, we demonstrate the effectiveness of PlanarNeRF in various real-world scenarios and remarkable improvement in 3D plane detection over existing works.
|
|
11:25-11:30, Paper ThCT14.3 | |
FreeDriveRF: Monocular RGB Dynamic NeRF without Poses for Autonomous Driving Via Point-Level Dynamic-Static Decoupling |
|
Wen, Yue | Shanghai Jiao Tong University |
Song, Liang | Dimension |
Liu, Yijia | China University of Mining and Technology |
Zhu, Siting | Shanghai Jiao Tong University |
Miao, Yanzi | China University of Mining and Technology |
Han, Lijun | Shanghai Jiao Tong University |
Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Computer Vision for Automation
Abstract: Dynamic scene reconstruction for autonomous driving enables vehicles to perceive and interpret complex scene changes more precisely. Dynamic Neural Radiance Fields (NeRFs) have recently shown promising capability in scene modeling. However, many existing methods rely heavily on accurate pose inputs and multi-sensor data, leading to increased system complexity. To address this, we propose FreeDriveRF, which reconstructs dynamic driving scenes using only sequential RGB images without requiring pose inputs. We innovatively decouple dynamic and static parts at the early sampling level, avoiding image blurring and artifacts. To overcome the challenges posed by object motion and occlusion with a monocular camera, we introduce a warped ray-guided dynamic object rendering consistency loss, utilizing optical flow to better constrain the dynamic modeling process. Additionally, we incorporate estimated dynamic flow to constrain the pose optimization process, improving the stability and accuracy of unbounded scene reconstruction. Extensive experiments conducted on the KITTI and Waymo datasets demonstrate the superior performance of our method in dynamic scene modeling for autonomous driving.
|
|
11:30-11:35, Paper ThCT14.4 | |
LLGS: Unsupervised Gaussian Splatting for Image Enhancement and Reconstruction in Pure Dark Environment |
|
Wang, Haoran | The University of Sussex |
Huang, Jingwei | University of Electronic Science and Technology of China |
Yang, Lu | University of Electronic Science and Technology of China |
Deng, Tianchen | Shanghai Jiao Tong University |
Zhang, Gaojing | University of Sussex |
Li, Mingrui | Dalian University of Technology |
Keywords: Visual Learning, Deep Learning for Visual Perception, Computer Vision for Automation
Abstract: 3D Gaussian Splatting has shown remarkable capabilities in novel view rendering tasks and exhibits significant potential for multi-view optimization. However, the original 3D Gaussian Splatting lacks color representation for inputs in low-light environments. Simply using enhanced images as inputs would lead to issues with multi-view consistency, and current single-view enhancement systems rely on pre-trained data, lacking scene generalization. These problems limit the application of 3D Gaussian Splatting in low-light conditions in the field of robotics, including high-fidelity modeling and feature matching. To address these challenges, we propose an unsupervised multi-view stereoscopic system based on Gaussian Splatting, called Low-Light Gaussian Splatting (LLGS). This system aims to enhance images in low-light environments while reconstructing the scene. Our method introduces a decomposable Gaussian representation called M-Color, which separately characterizes color information for targeted enhancement. Furthermore, we propose an unsupervised optimization method with zero-knowledge priors, using direction-based enhancement to ensure multi-view consistency. Experiments conducted on real-world datasets demonstrate that our system outperforms state-of-the-art methods in both low-light enhancement and 3D Gaussian Splatting.
|
|
11:35-11:40, Paper ThCT14.5 | |
Hash-GS: Anchor-Based 3D Gaussian Splatting with Multi-Resolution Hash Encoding for Efficient Scene Reconstruction |
|
Xie, Yijia | Zhejiang University |
Lin, Yuhang | Zhejiang University |
Li, Laijian | Zhejiang University |
Liu, Lina | Zhejiang University |
Wei, Xiaobin | Wasu Media & Network Co., Ltd |
Liu, Yong | Zhejiang University |
Lv, Jiajun | Zhejiang University |
Keywords: Visual Learning, Mapping, Deep Learning Methods
Abstract: Realistic 3D object and scene reconstruction is pivotal in advancing fields such as world model simulation and embodied intelligence. In this paper, we introduce Hash-GS, a storage-efficient method for large-scale scene reconstruction using anchor-based 3D Gaussian Splatting (3DGS). The vanilla 3DGS struggles with high memory demands due to the large number of primitives, especially in complex or extensive scenes. Hash-GS addresses these challenges with a compact representation by leveraging high-dimensional features to parameterize primitive properties, stored in compact hash tables, which reduces memory usage while preserving rendering quality. It also incorporates adaptive anchor management to efficiently control the number of anchors and neural Gaussians. Additionally, we introduce an analytic 3D smoothing filter to mitigate aliasing and support Level-of-Detail for optimized rendering across varying intrinsic parameters. Experimental results on several datasets demonstrate that Hash-GS improves storage efficiency while maintaining competitive rendering performance, especially in large-scale scenes.
|
|
11:40-11:45, Paper ThCT14.6 | |
Elite-EvGS: Learning Event-Based 3D Gaussian Splatting by Distilling Event-To-Video Priors |
|
Zhang, Zixin | HKUST-GZ |
Chen, Kanghao | Hong Kong University of Science and Technology (Guangzhou) |
Wang, Lin | Nanyang Technological University (NTU) |
Keywords: Visual Learning, Deep Learning for Visual Perception, Mapping
Abstract: Event cameras are bio-inspired sensors that output asynchronous and sparse event streams, instead of fixed frames. Benefiting from their distinct advantages, such as high dynamic range and high temporal resolution, event cameras have been applied to address 3D reconstruction, important for robotic mapping. Recently, neural rendering techniques, such as 3D Gaussian splatting (3DGS), have been shown successful in 3D reconstruction. However, it still remains under-explored how to develop an effective event-based 3DGS pipeline. In particular, as 3DGS typically depends on high-quality initialization and dense multiview constraints, a potential problem appears for the 3DGS optimization with events given its inherent sparse property. To this end, we propose a novel event-based 3DGS framework, named Elite-EvGS. Our key idea is to distill the prior knowledge from the off-the-shelf event-to-video (E2V) models to effectively reconstruct 3D scenes from events in a coarse-to-fine optimization manner. Specifically, to address the complexity of 3DGS initialization from events, we introduce a novel warm-up initialization strategy that optimizes a coarse 3DGS from the frames generated by E2V models and then incorporates events to refine the details. Then, we propose a progressive event supervision strategy that employs the window-slicing operation to progressively reduce the number of events used for supervision. This subtly relieves the temporal randomness of the event frames, benefiting the optimization of local textural and global structural details. Experiments on the benchmark datasets demonstrate that Elite-EvGS can reconstruct 3D scenes with better textural and structural details. Meanwhile, our method yields plausible performance on the captured real-world data, including diverse challenging conditions, such as fast motion and low light scenes. For demo and more results, please check our project page: https://vlislab22.github.io/elite-evgs/
|
|
ThCT15 |
403 |
Continuum Robots 2 |
Regular Session |
Chair: Krishnan, Girish | University of Illinois Urbana Champaign |
Co-Chair: Alambeigi, Farshid | University of Texas at Austin |
|
11:15-11:20, Paper ThCT15.1 | |
Hysteresis Compensation of Flexible Continuum Manipulator Using RGBD Sensing and Temporal Convolutional Network |
|
Park, Junhyun | DGIST |
Jang, Seonghyeok | DGIST |
Park, Hyojae | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Bae, Seongjun | DGIST |
Hwang, Minho | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Keywords: Tendon/Wire Mechanism, Machine Learning for Robot Control, Modeling, Control, and Learning for Soft Robots
Abstract: Flexible continuum manipulators are valued for minimally invasive surgery, offering access to confined spaces through nonlinear paths. However, cable-driven manipulators face control difficulties due to hysteresis from cabling effects such as friction, elongation, and coupling. These effects are difficult to model due to nonlinearity, and the difficulties become even more evident when dealing with a long, coupled, multi-segmented manipulator. This paper proposes a data-driven approach based on Deep Neural Networks (DNN) to capture these nonlinear, previous-state-dependent characteristics of cable actuation. We collect physical joint configurations according to commanded joint configurations using RGBD sensing and 7 fiducial markers to model the hysteresis of the proposed manipulator. Results of a study comparing the estimation performance of four DNN models show that the Temporal Convolution Network (TCN) demonstrates the highest predictive capability. Leveraging trained TCNs, we build a control algorithm to compensate for hysteresis. Tracking tests in task space using unseen trajectories show that the proposed control algorithm reduces the average position and orientation error by 61.39% (from 13.7 mm to 5.29 mm) and 64.04% (from 31.17° to 11.21°), respectively. This result implies that the proposed calibrated controller effectively reaches the desired configurations by estimating the hysteresis of the manipulator. Applying this method in real surgical scenarios has the potential to enhance control precision and improve surgical performance.
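As a rough sketch of the regression described above, the code below defines a small causal temporal convolutional network that maps a window of commanded joint configurations to the measured physical configuration at the last time step. The layer sizes, 7-joint dimension, and window length are illustrative choices, not the paper's architecture.

```python
# Minimal sketch of a causal TCN for hysteresis regression: commanded joint
# history in, estimated physical joint configuration out. Illustrative sizes.
import torch
import torch.nn as nn

class CausalBlock(nn.Module):
    def __init__(self, ch, dilation):
        super().__init__()
        self.pad = (3 - 1) * dilation                 # left padding keeps causality
        self.conv = nn.Conv1d(ch, ch, kernel_size=3, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):
        x_p = nn.functional.pad(x, (self.pad, 0))     # pad only the past
        return self.act(self.conv(x_p)) + x           # residual connection

class TCN(nn.Module):
    def __init__(self, joints=7, hidden=32):
        super().__init__()
        self.inp = nn.Conv1d(joints, hidden, kernel_size=1)
        self.blocks = nn.Sequential(*[CausalBlock(hidden, d) for d in (1, 2, 4)])
        self.out = nn.Conv1d(hidden, joints, kernel_size=1)

    def forward(self, cmd):                            # cmd: (batch, joints, time)
        h = self.blocks(self.inp(cmd))
        return self.out(h)[..., -1]                    # physical config at last step

model = TCN()
commands = torch.randn(8, 7, 50)                       # batch of command histories
predicted_physical = model(commands)                   # (8, 7) estimated configurations
print(predicted_physical.shape)
```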
|
|
11:20-11:25, Paper ThCT15.2 | |
Command Filtered Cartesian Impedance Control for Tendon Driven Continuum Manipulators with Actuator Fault Compensation |
|
Zheng, Xianjie | Nanjing University of Science and Technology |
Yu, Zhaobao | Nanjing University of Science and Technology |
Ding, Meng | Nanjing University of Science and Technology |
Liu, Liaoxue | Nanjing University of Science and Technology |
Guo, Jian | Nanjing Univ. of Sci. & Tech |
Guo, Yu | Nanjing University of Science and Technology |
Keywords: Modeling, Control, and Learning for Soft Robots, Compliance and Impedance Control
Abstract: Continuum robots are well-suited for constrained environments due to their superior flexibility and structural compliance. However, relying solely on passive compliance may lead to damage to both the robot and the surrounding environment. This work proposes a finite-time Cartesian impedance control scheme for tendon-driven continuum manipulators (TDCMs), where a second-order low-pass filter is used to adjust the reference trajectory according to the external robot tip force. The controller is designed using the command filtered backstepping method, and finite-time stability is established via the designed Lyapunov function. In TDCM systems, the tendons operate antagonistically, and actuators often fail to quickly reach the desired tendon tension, leading to partial failures. To address this, we propose an actuator fault compensation algorithm to enhance system performance and reliability. We conducted trajectory tracking experiments on a multi-segment TDCM prototype; the results demonstrate that the designed Cartesian impedance controller achieves effective compliance control and high position control accuracy.
|
|
11:25-11:30, Paper ThCT15.3 | |
A Synergistic Framework for Learning Shape Estimation and Shape-Aware Whole-Body Control Policy for Continuum Robots |
|
Kasaei, Mohammadreza | University of Edinburgh |
Alambeigi, Farshid | University of Texas at Austin |
Khadem, Mohsen | University of Edinburgh |
Keywords: Modeling, Control, and Learning for Soft Robots, Machine Learning for Robot Control, Soft Robot Applications
Abstract: In this paper, we present a novel synergistic framework for learning shape estimation and a shape-aware whole-body control policy for continuum robots. Our approach leverages the interaction between two Augmented Neural Ordinary Differential Equations (ANODEs) - the Shape-NODE and Control-NODE - to achieve continuous shape estimation and shape-aware control. The Shape-NODE integrates prior knowledge from Cosserat rod theory, allowing it to adapt and account for model mismatches, while the Control-NODE uses this shape information to optimize a whole-body control policy, trained in a Model Predictive Control (MPC) fashion. This unified framework effectively overcomes limitations of existing data-driven methods, such as poor shape awareness and challenges in capturing complex nonlinear dynamics. Extensive evaluations in both simulation and real-world environments demonstrate the framework’s robust performance in shape estimation, trajectory tracking, and obstacle avoidance. The proposed method consistently outperforms state-of-the-art end-to-end, Neural-ODE, and Recurrent Neural Network (RNN) models, particularly in terms of tracking accuracy and generalization capabilities.
|
|
11:30-11:35, Paper ThCT15.4 | |
On the Benefits of Hysteresis in Tendon Driven Continuum Robots |
|
Hanley, David | University of Edinburgh |
Alambeigi, Farshid | University of Texas at Austin |
Khadem, Mohsen | University of Edinburgh |
Keywords: Soft Robot Materials and Design, Modeling, Control, and Learning for Soft Robots, Surgical Robotics: Steerable Catheters/Needles
Abstract: Hysteresis in the tendons driving continuum robots is frequently regarded as a nuisance and a problem that is best avoided. Some prior work seeks to ameliorate the effects of hysteresis through the selection of materials; others propose models of hysteresis to compensate for its effects. In this work, we present an empirically validated model of hysteresis in tendon-driven continuum robots. We demonstrate that hysteresis contributes to the stability of these robots by mitigating undesirable tensions in the robot's backbone. As a result, a model-based approach to hysteresis can be used not just to compensate for a nuisance, but to enhance the utility of continuum robots in safety-critical applications such as medical robots.
|
|
11:35-11:40, Paper ThCT15.5 | |
Automating Tension Calibration for Tendon-Driven Continuum Robots: A Low-Cost Approach towards Consistent Teleoperation |
|
Lee, Kyum | University of Toronto |
Shentu, Chengnan | University of Toronto |
Pogue, Chloe | University of Toronto |
Burgner-Kahrs, Jessica | University of Toronto |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Applications
Abstract: We present a low-cost method to automate tension calibration for tendon-driven continuum robots (TDCRs), particularly those lacking tension sensing. Our method utilizes Hall effect sensors to localize the robot's tip with respect to the one-dimensional trajectory it follows under individual tendon actuation. We propose two workflows for robots with and without a static model, making the method generalizable to other tendon-driven soft robots. We demonstrate our method's ability to repeatably tension the tendons through associated tendon displacements. The calibration approach's measured repeatability (±0.03 mm) is also benchmarked against manual calibration on a TDCR prototype, and its accuracy in achieving target tensions is assessed (0.06 ± 0.20 N). We further investigate how tension calibration impacts open-loop tracking accuracy, confirming the effectiveness of our method in enhancing motion consistency in open-loop control and teleoperation.
|
|
11:40-11:45, Paper ThCT15.6 | |
A Neural Network-Based Framework for Fast and Smooth Posture Reconstruction of a Soft Continuum Arm |
|
Wang, Tixian | University of Illinois at Urbana-Champaign |
Chang, Heng-Sheng | University of Illinois Urbana-Champaign |
Kim, Seung Hyun | University of Illinois at Urbana-Champaign |
Guo, Jiamiao | University of Illinois Urbana-Champaign |
Akcal, M. Ugur | University of Illinois Urbana-Champaign |
Walt, Benjamin | University of Illinois Urbana-Champaign |
Biskup, Darren | Columbia University |
Halder, Udit | University of South Florida |
Krishnan, Girish | University of Illinois Urbana Champaign |
Chowdhary, Girish | University of Illinois at Urbana Champaign |
Gazzola, Mattia | University of Illinois at Urbana-Champaign |
Mehta, Prashant | University of Illinois Urbana-Champaign |
Keywords: Soft Robot Applications, Modeling, Control, and Learning for Soft Robots, Software-Hardware Integration for Robot Systems
Abstract: A neural network-based framework is developed and experimentally demonstrated for the problem of estimating the shape of a soft continuum arm (SCA) from noisy measurements of the pose at a finite number of locations along the length of the arm. The neural network takes as input these measurements and produces as output a finite-dimensional approximation of the strain, which is further used to reconstruct the infinite-dimensional smooth posture. This problem is important for various soft robotic applications. It is challenging because the arm's flexibility leads to an infinite-dimensional reconstruction problem for the continuous posture and strains; as a result, past solutions to this problem are computationally intensive. The proposed fast smooth reconstruction method is shown to be five orders of magnitude faster while having comparable accuracy. The framework is evaluated on two testbeds: a simulated octopus muscular arm and a physical BR2 pneumatic soft manipulator.
|
|
ThCT16 |
404 |
Grasping 4 |
Regular Session |
Chair: Chakraborty, Nilanjan | Stony Brook University |
Co-Chair: Harada, Kensuke | Osaka University |
|
11:15-11:20, Paper ThCT16.1 | |
GraspSAM: When Segment Anything Model Meets Grasp Detection |
|
Noh, Sangjun | Gwangju Institute of Science and Technology |
Kim, Jong-Won | GIST(Gwangju Institute of Science and Technology) |
Nam, Dongwoo | Gwangju Institute of Science and Technology |
Back, Seunghyeok | Korea Institute of Machinery & Materials |
Kang, Raeyoung | Gwangju Institute of Science and Technology |
Lee, Kyoobin | Gwangju Institute of Science and Technology |
Keywords: Deep Learning Methods, Grasping, Transfer Learning
Abstract: Grasp detection requires flexibility to handle objects of various shapes without relying on prior object knowledge, while also offering intuitive, user-guided control. In this paper, we introduce GraspSAM, an innovative extension of the Segment Anything Model (SAM) designed for prompt-driven and category-agnostic grasp detection. Unlike previous methods, which are often limited by small-scale training data, GraspSAM leverages SAM’s large-scale training and prompt-based segmentation capabilities to efficiently support both target-object and category-agnostic grasping. By utilizing adapters, learnable token embeddings, and a lightweight modified decoder, GraspSAM requires minimal fine-tuning to integrate object segmentation and grasp prediction into a unified framework. Our model achieves state-of-the-art (SOTA) performance across multiple datasets, including Jacquard, Grasp-Anything, and Grasp-Anything++. Extensive experiments demonstrate GraspSAM’s flexibility in handling different types of prompts (such as points, boxes, and language), highlighting its robustness and effectiveness in real-world robotic applications. Robot demonstrations, additional results, and code can be found at https://gistailab.github.io/GraspSAM/.
|
|
11:20-11:25, Paper ThCT16.2 | |
Dexterous Ungrasping Manipulation in Three Dimensions |
|
Kang, Taewoong | Pusan National University |
Kim, Joonyoung | Pusan National University |
Oh, Seung Hwa | Pusan National University |
Lim, WooSung | Pusan National University |
Lee, Junwoo | Pusan National University |
Yi, Seung-Joon | Pusan National University |
Seo, Jungwon | Pusan National University |
Keywords: Dexterous Manipulation, Grasping, Assembly
Abstract: This study focuses on the robotic capability of ungrasping, or releasing, an object in a grasp from the gripper to the robot’s environment. The presented technique enables the delicate release of a grasped object using non-static contacts, allowing for rolling and/or sliding. This dexterous manipulation capability is particularly relevant when ungrasping thin or slender objects, as will be demonstrated with real examples. We initially discuss the establishment of three-dimensional stability during ungrasping manipulation, ensuring robustness. Subsequently, we present a planning and control solution for three-dimensional ungrasping, building upon our previous planar version. A series of experiments across various test scenarios, ranging from precision placement to puzzle tiling, showcase the viability and effectiveness of our approach.
|
|
11:25-11:30, Paper ThCT16.3 | |
RTAGrasp: Learning Task-Oriented Grasping from Human Videos Via Retrieval, Transfer, and Alignment |
|
Dong, Wenlong | Southern University of Science and Technology |
Huang, Dehao | Southern University of Science and Technology |
Liu, Jiangshan | Southern University of Science and Technology |
Tang, Chao | Southern University of Science and Technology |
Zhang, Hong | SUSTech |
Keywords: Grasping, Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation
Abstract: Task-oriented grasping (TOG) is crucial for robots to accomplish manipulation tasks, requiring the determination of TOG positions and directions. Existing methods either rely on costly manual TOG annotations or only extract coarse grasping positions or regions from human demonstrations, limiting their practicality in real-world applications. To address these limitations, we introduce RTAGrasp, a Retrieval, Transfer, and Alignment framework inspired by human grasping strategies. Specifically, our approach first effortlessly constructs a robot memory from human grasping demonstration videos, extracting both TOG position and direction constraints. Then, given a task instruction and a visual observation of the target object, RTAGrasp retrieves the most similar human grasping experience from its memory and leverages semantic matching capabilities of vision foundation models to transfer the TOG constraints to the target object in a training-free manner. Finally, RTAGrasp aligns the transferred TOG constraints with the robot's action for execution. Evaluations on the public TOG benchmark, TaskGrasp dataset, show the competitive performance of RTAGrasp on both seen and unseen object categories compared to existing baseline methods. Real-world experiments further validate its effectiveness on a robotic arm. Our code, appendix, and video are available at https://sites.google.com/view/rtagrasp/home.
|
|
11:30-11:35, Paper ThCT16.4 | |
You Only Estimate Once: Unified, One-Stage, Real-Time Category-Level Articulated Object 6D Pose Estimation for Robotic Grasping |
|
Huang, Jingshun | Fudan University |
Lin, Haitao | Tencent |
Wang, Tianyu | Fudan University |
Fu, Yanwei | Fudan University |
Jiang, Yu-Gang | Fudan University |
Xue, Xiangyang | Fudan University |
Keywords: Deep Learning for Visual Perception, Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation
Abstract: This paper addresses the problem of category-level pose estimation for articulated objects in robotic manipulation tasks. Recent works have shown promising results in estimating part pose and size at the category level. However, these approaches primarily follow a complex multi-stage pipeline that first segments part instances in the point cloud and then estimates the Normalized Part Coordinate Space (NPCS) representation for 6D poses. These approaches suffer from high computational costs and low performance in real-time robotic tasks. To address these limitations, we propose YOEO, a single-stage method that simultaneously outputs instance segmentation and NPCS representations in an end-to-end manner. We use a unified network to generate point-wise semantic labels and centroid offsets, allowing points from the same part instance to vote for the same centroid. We further utilize a clustering algorithm to distinguish points based on their estimated centroid distances. Finally, we separate the NPCS region of each instance and align the separated regions with the real point cloud to recover the final pose and size. Experimental results on the GAPart dataset demonstrate the pose estimation capabilities of our proposed single-shot method. We also deploy our synthetically-trained model in a real-world setting, providing real-time visual feedback at 200 Hz and enabling a physical Kinova robot to interact with unseen articulated objects. This showcases the utility and effectiveness of our proposed method.
|
|
11:35-11:40, Paper ThCT16.5 | |
Point Cloud Decomposition for Task-Oriented Grasping |
|
Phi, Khiem | Stony Brook University |
Patankar, Aditya | Stony Brook University |
Mahalingam, Dasharadhan | Stony Brook University |
Chakraborty, Nilanjan | Stony Brook University |
Ramakrishnan, Iv | Stony Brook University |
Keywords: Grasping, Perception for Grasping and Manipulation
Abstract: Accurate localization of graspable regions within a single object point cloud is critical to enable task-based robot grasps. State-of-the-art task-based robot grasp synthesis methods fit over-approximated 3D bounding boxes that fail to isolate graspable regions even if they exist. While deep learning or geometrical shape decomposition methods can offer improved approximations, they lack guarantees for the graspability of segmented regions, require prior knowledge of the object, and/or demand large annotated datasets for fine-tuning. In this paper, we overcome these limitations by introducing ITSI (Iterative Slicing), a complete, task-oriented grasp synthesis approach that functions independently of object-specific knowledge. ITSI effectively segments multiple graspable regions that conform to the constraints of robot grippers, thereby enabling compatibility with any object a robot seeks to grasp and any robot gripper size. Our extensive real-world and simulation experiments on diverse object datasets demonstrate that ITSI dramatically increases the number of discoverable robot grasps, by up to 44% compared to the state of the art. We also expand ITSI's capabilities beyond task-based robot grasp synthesis to highlight its performance in human affordance segmentation, outperforming fully supervised deep learning methods by 1%.
|
|
11:40-11:45, Paper ThCT16.6 | |
Adaptive Grasping of Moving Objects in Dense Clutter Via Global-To-Local Detection and Static-To-Dynamic Planning |
|
Chen, Hao | Osaka University |
Kiyokawa, Takuya | Osaka University |
Wan, Weiwei | Osaka University |
Harada, Kensuke | Osaka University |
Keywords: Grasping, Dexterous Manipulation, Planning under Uncertainty
Abstract: Robotic grasping faces a variety of real-world uncertainties caused by non-static object states, unknown object properties, and cluttered object arrangements. The difficulty of grasping increases with the presence of more uncertainties, where commonly used learning-based approaches struggle to perform stably across varying conditions. In this study, we extend the idea of using similarity matching to tackle the challenge of grasping novel objects that are simultaneously in motion and densely cluttered, where multiple uncertainties coexist, using only a single in-hand camera. We achieve this difficult task by shifting visual detection from global to local states and grasp planning from static to dynamic states. We propose several methods and algorithms to optimize planning efficiency and accuracy. Our system adapts to different object types, arrangements, and movement speeds without additional training, as demonstrated by our real-world experiments.
|
|
ThCT17 |
405 |
Localization 6 |
Regular Session |
Chair: Scherer, Sebastian | Carnegie Mellon University |
Co-Chair: Costante, Gabriele | University of Perugia |
|
11:15-11:20, Paper ThCT17.1 | |
UASTHN: Uncertainty-Aware Deep Homography Estimation for UAV Satellite-Thermal Geo-Localization |
|
Xiao, Jiuhong | New York University |
Loianno, Giuseppe | New York University |
Keywords: Deep Learning for Visual Perception, Aerial Systems: Applications, Localization
Abstract: Geo-localization is an essential component of Unmanned Aerial Vehicle (UAV) navigation systems to ensure precise absolute self-localization in outdoor environments. To address the challenges of GPS signal interruptions or low illumination, Thermal Geo-localization (TG) employs aerial thermal imagery to align with reference satellite maps to accurately determine the UAV's location. However, existing TG methods lack uncertainty measurement in their outputs, compromising system robustness in the presence of textureless or corrupted thermal images, self-similar or outdated satellite maps, geometric noise, or thermal images that extend beyond the satellite map coverage. To overcome these limitations, this paper presents UASTHN, a novel approach for Uncertainty Estimation (UE) in Deep Homography Estimation (DHE) tasks for TG applications. Specifically, we introduce a novel Crop-based Test-Time Augmentation (CropTTA) strategy, which leverages the homography consensus of cropped image views to effectively measure data uncertainty. This approach is complemented by Deep Ensembles (DE) employed for model uncertainty, offering comparable performance with improved efficiency and seamless integration with any DHE model. Extensive experiments across multiple DHE models demonstrate the effectiveness and efficiency of CropTTA in TG applications. Analysis of detected failure cases underscores the improved reliability of CropTTA under challenging conditions. Finally, we demonstrate the capability of combining CropTTA and DE for a comprehensive assessment of both data and model uncertainty. Our research provides profound insights into the broader intersection of localization and uncertainty estimation. The code and models are publicly available.
|
|
11:20-11:25, Paper ThCT17.2 | |
Enhancing Feature Tracking Reliability for Visual Navigation Using Real-Time Safety Filter |
|
Kim, Dabin | Seoul National University |
Jang, Inkyu | Seoul National University |
Han, Youngsoo | Seoul National University |
Hwang, Sunwoo | Seoul National University |
Kim, H. Jin | Seoul National University |
Keywords: Sensor-based Control, Reactive and Sensor-Based Planning, View Planning for SLAM
Abstract: Vision sensors are extensively used for localizing a robot's pose, particularly in environments where global localization tools such as GPS or motion capture systems are unavailable. In many visual navigation systems, localization is achieved by detecting and tracking visual features or landmarks, which provide information about the sensor's relative pose. For reliable feature tracking and accurate pose estimation, it is crucial to maintain visibility of a sufficient number of features. This requirement can sometimes conflict with the robot's overall task objective. In this paper, we approach it as a constrained control problem. By leveraging the invariance properties of visibility constraints within the robot's kinematic model, we propose a real-time safety filter based on quadratic programming. This filter takes a reference velocity command as input and produces a modified velocity that minimally deviates from the reference while ensuring the information score from the currently visible features remains above a user-specified threshold. Numerical simulations demonstrate that the proposed safety filter preserves the invariance condition and ensures the visibility of more features than the required minimum. We also validated its real-world performance by integrating it into a visual simultaneous localization and mapping (SLAM) algorithm, where it maintained high estimation quality in challenging environments, outperforming a simple tracking controller.
|
|
11:25-11:30, Paper ThCT17.3 | |
SuperLoc: The Key to Robust LiDAR-Inertial Localization Lies in Predicting Alignment Risks |
|
Zhao, Shibo | Carnegie Mellon University |
Zhu, Honghao | CMU |
Gao, Yuanjun | Carnegie Mellon University |
Kim, Beomsoo | Hanyang University |
Qiu, Yuheng | Carnegie Mellon University |
Johnson, Aaron M. | Carnegie Mellon University |
Scherer, Sebastian | Carnegie Mellon University |
Keywords: Localization, SLAM, Mapping
Abstract: Map-based LiDAR localization, while widely used in autonomous systems, faces significant challenges in degraded environments due to the lack of distinct geometric features. This paper introduces SuperLoc, a robust LiDAR localization package that addresses key limitations in existing methods. SuperLoc features a novel predictive alignment risk assessment technique, enabling early detection and mitigation of potential failures before optimization. This approach significantly improves performance in challenging scenarios such as corridors, tunnels, and caves. Unlike existing degeneracy mitigation algorithms that rely on post-optimization analysis and heuristic thresholds, SuperLoc evaluates the localizability of raw sensor measurements. Experimental results demonstrate significant performance improvements over state-of-the-art methods across various degraded environments. Our approach achieves a 49.7% increase in accuracy and exhibits the highest robustness. To facilitate further research, we release our implementation along with datasets from eight challenging scenarios.
|
|
11:30-11:35, Paper ThCT17.4 | |
Active Illumination for Visual Ego-Motion Estimation in the Dark |
|
Crocetti, Francesco | University of Perugia |
Dionigi, Alberto | University of Perugia |
Brilli, Raffaele | University of Perugia |
Costante, Gabriele | University of Perugia |
Valigi, Paolo | Universita' Di Perugia |
Keywords: Vision-Based Navigation, Perception-Action Coupling, Localization
Abstract: Visual Odometry (VO) and Visual SLAM (V-SLAM) systems often struggle in low-light and dark environments due to the lack of robust visual features. In this paper, we propose a novel active illumination framework to enhance the performance of VO and V-SLAM algorithms in these challenging conditions. The developed approach dynamically controls a moving light source to illuminate highly textured areas, thereby improving feature extraction and tracking. Specifically, a detector block, which incorporates a deep learning-based enhancing network, identifies regions with relevant features. Then, a pan-tilt controller guides the light beam toward these areas so as to provide information-rich images to the ego-motion estimation algorithm. Experimental results on a real robotic platform demonstrate the effectiveness of the proposed method, showing a reduction in pose estimation error of up to 75% with respect to a traditional fixed-lighting technique.
|
|
11:35-11:40, Paper ThCT17.5 | |
Intensity Triangle Descriptor Constructed from High-Resolution Spinning LiDAR Intensity Image for Loop Closure Detection |
|
Zhang, Yanfeng | Institute of Automation, Chinese Academy of Sciences |
Tian, Yunong | Institute of Automation, Chinese Academy of Sciences |
Yang, Guodong | Institute of Automation, Chinese Academy of Sciences; Beijing Zh |
Li, Zhishuo | Chinese Academy of Sciences |
Luo, Mingrui | Institute of Automation, Chinese Academy of Sciences |
Li, En | Institute of Automation, Chinese Academy of Sciences |
Jing, Fengshui | Institute of Automation, CAS |
Keywords: SLAM, Recognition, Localization
Abstract: LiDAR-based loop closure detection is a crucial part of realizing robust SLAM algorithms for intelligent vehicles with LiDAR sensors. Existing methods often reduce the keypoint dimension to encode the global descriptor, which sacrifices the freedom of loop detection and correction. Based on the 6-DOF rigid transformation property of spatial triangles, we propose an algorithm for extracting and describing 3D keypoints from high-resolution spinning LiDAR intensity images to encode triangle descriptors, termed the intensity triangle descriptor (ITD). In comparison to the direct extraction of keypoints from the point cloud, the use of image-derived feature points provides additional photometric texture information and better handles the uneven spatial density of the point cloud, which is advantageous in unstructured and geometrically degraded scenes. To enhance the stability of keypoints, the spatial positions of multi-frame image feature points are registered to a keyframe using odometry for voxel downsampling and non-maximum suppression, with the objective of reducing unstable feature points. For high discrimination, the neighboring image patches of each vertex (keypoint) are aggregated to estimate a Gaussian mixture model (GMM) as the keypoint signature. An efficient two-stage loop closure detection method is then proposed for ITD, consisting of candidate retrieval based on triangle side lengths and vertex GMMs, followed by geometric verification of matched descriptor pairs. The effectiveness of the proposed method is evaluated on the STheReO, FusionPortable, and our self-collected datasets.
|
|
11:40-11:45, Paper ThCT17.6 | |
IBTC: An Image-Assisting Binary and Triangle Combined Descriptor for Place Recognition by Fusing LiDAR and Camera Measurements |
|
Zou, Zuhao | The University of Hong Kong |
Zheng, Chunran | The University of Hong Kong |
Yuan, Chongjian | The University of Hong Kong |
Zhou, Shunbo | Huawei |
Xue, Kaiwen | The Chinese University of Hong Kong, Shenzhen |
Zhang, Fu | University of Hong Kong |
Keywords: SLAM, Localization, Mapping
Abstract: In this work, we introduce a novel multimodal descriptor, the image-assisting binary and triangle combined (iBTC) descriptor, which fuses LiDAR (Light Detection and Ranging) and camera measurements for 3D place recognition. The inherent invariance of a triangle to rigid transformations inspires us to design triangle-based descriptors. We first extract distinct 3D key points from both LiDAR and camera measurements and organize them into triplets to form triangles. By utilizing the lengths of the sides of these triangles, we can create triangle descriptors, enabling the rapid retrieval of similar triangles from a database. By encoding the geometric and visual details at the triangle vertices into binary descriptors, we augment the triangle descriptors with richer local information. This enrichment process empowers our descriptors to reject mismatched triangle pairs. Consequently, the remaining matched triangle pairs yield accurate loop closure place indices and relative poses. In our experiments, we conduct a thorough comparison of our proposed method with several SOTA methods across public and self-collected datasets. The results demonstrate that our method exhibits superior performance in place recognition and overcomes the limitations associated with unimodal methods like BTC, RING++, ORB-DBoW2, and NetVLAD. Additionally, a time-cost benchmark indicates that our method's time consumption is reasonable compared with baseline methods. A demonstration video is available at https://www.youtube.com/watch?v=fe1Q0eR2fWk.
|
|
ThCT18 |
406 |
Planning under Uncertainty 2 |
Regular Session |
Chair: Yamane, Katsu | Path Robotics Inc |
Co-Chair: Gammell, Jonathan | Queen's University |
|
11:15-11:20, Paper ThCT18.1 | |
Belief Roadmaps with Uncertain Landmark Evanescence |
|
Fuentes, Erick | Massachusetts Institute of Technology |
Strader, Jared | Massachusetts Institute of Technology |
Fahnestock, Ethan | MIT |
Roy, Nicholas | Massachusetts Institute of Technology |
Keywords: Planning under Uncertainty
Abstract: We would like a robot to navigate to a goal location while minimizing state uncertainty. To aid the robot in this endeavor, maps provide a prior belief over the location of objects and regions of interest. To localize itself within the map, a robot identifies mapped landmarks using its sensors. However, as the time between map creation and robot deployment increases, portions of the map can become stale, and landmarks, once believed to be permanent, may disappear. We refer to the propensity of a landmark to disappear as landmark evanescence. Reasoning about landmark evanescence during path planning, and the associated impact on localization accuracy, requires analyzing the presence or absence of each landmark, leading to an exponential number of possible outcomes of a given motion plan. To address this complexity, we develop BRULE, an extension of the Belief Roadmap. During planning, we replace the belief over future robot poses with a Gaussian mixture which is able to capture the effects of landmark evanescence. Furthermore, we show that belief updates can be made efficient, and that maintaining a random subset of mixture components is sufficient to find high quality solutions. We demonstrate performance in simulated and real-world experiments. Software is available at https://bit.ly/BRULE.
|
|
11:20-11:25, Paper ThCT18.2 | |
Safe and Efficient Path Planning under Uncertainty Via Deep Collision Probability Fields |
|
Herrmann, Felix | Technische Universität Darmstadt |
Zach, Sebastian Bernhard | Technische Universität Darmstadt |
Banfi, Jacopo | Amazon |
Peters, Jan | Technische Universität Darmstadt |
Chalvatzaki, Georgia | Technische Universität Darmstadt |
Tateo, Davide | Technische Universität Darmstadt |
Keywords: Planning under Uncertainty, Deep Learning Methods, Motion and Path Planning
Abstract: Estimating collision probabilities between robots and environmental obstacles or other moving agents is crucial to ensure safety during path planning. This is an important building block of modern planning algorithms in many application scenarios such as autonomous driving, where noisy sensors perceive obstacles. While many approaches exist, they either provide too conservative estimates of the collision probabilities or are computationally intensive due to their sampling-based nature. To deal with these issues, we introduce Deep Collision Probability Fields, a neural-based approach for computing collision probabilities of arbitrary objects with arbitrary unimodal uncertainty distributions. Our approach relegates the computationally intensive, sampling-based estimation of collision probabilities to the training step, allowing for fast neural network inference of the constraints during planning. In extensive experiments, we show that Deep Collision Probability Fields can produce reasonably accurate collision probabilities (up to 10^{-3}) for planning and that our approach can be easily plugged into standard path planning approaches to plan safe paths on 2-D maps containing uncertain static and dynamic obstacles. Additional material, code, and videos are available at https://sites.google.com/view/ral-dcpf.
|
|
11:25-11:30, Paper ThCT18.3 | |
Safe POMDP Online Planning among Dynamic Agents Via Adaptive Conformal Prediction |
|
Sheng, Shili | University of Virginia |
Yu, Pian | University College London |
Parker, David | University of Oxford |
Kwiatkowska, Marta | University of Oxford |
Feng, Lu | University of Virginia |
Keywords: Formal Methods in Robotics and Automation, Planning under Uncertainty, Collision Avoidance
Abstract: Online planning for partially observable Markov decision processes (POMDPs) provides efficient techniques for robot decision-making under uncertainty. However, existing methods fall short of preventing safety violations in dynamic environments. This work presents a novel safe POMDP online planning approach that maximizes expected returns while providing probabilistic safety guarantees amidst environments populated by multiple dynamic agents. Our approach utilizes data-driven trajectory prediction models of dynamic agents and applies Adaptive Conformal Prediction (ACP) to quantify the uncertainties in these predictions. Leveraging the obtained ACP-based trajectory predictions, our approach constructs safety shields on-the-fly to prevent unsafe actions within POMDP online planning. Through experimental evaluation in various dynamic environments using real-world pedestrian trajectory data, the proposed approach has been shown to effectively maintain probabilistic safety guarantees while accommodating up to hundreds of dynamic agents.
|
|
11:30-11:35, Paper ThCT18.4 | |
Rao-Blackwellized POMDP Planning |
|
Lee, Jiho | University of Colorado Boulder |
Ahmed, Nisar | University of Colorado Boulder |
Wray, Kyle | N/a |
Sunberg, Zachary | University of Colorado |
Keywords: Planning under Uncertainty, Reinforcement Learning, Probabilistic Inference
Abstract: Partially Observable Markov Decision Processes (POMDPs) provide a structured framework for decision-making under uncertainty, but their application requires efficient belief updates. Sequential Importance Resampling Particle Filters (SIRPF), also known as Bootstrap Particle Filters, are commonly used as belief updaters in large approximate POMDP solvers, but they face challenges such as particle deprivation and high computational costs as the system's state dimension grows. To address these issues, this study introduces Rao-Blackwellized POMDP (RB-POMDP) approximate solvers and outlines generic methods to apply Rao-Blackwellization in both belief updates and online planning. We compare the performance of SIRPF and Rao-Blackwellized Particle Filters (RBPF) in a simulated localization problem where an agent navigates toward a target in a GPS-denied environment using POMCPOW and RB-POMCPOW planners. Our results not only confirm that RBPFs maintain efficient belief approximations over time with fewer particles, but also show that, more surprisingly, RBPFs combined with quadrature-based integration significantly improve planning quality compared to SIRPF-based planning under the same computational limits.
|
|
11:35-11:40, Paper ThCT18.5 | |
Nearest-Neighbourless Asymptotically Optimal Motion Planning with Fully Connected Informed Trees (FCIT*) |
|
Wilson, Tyler S. | Queen's University |
Thomason, Wil | The AI Institute |
Kingston, Zachary | Purdue University |
Kavraki, Lydia | Rice University |
Gammell, Jonathan | Queen's University |
Keywords: Motion and Path Planning, Manipulation Planning, Constrained Motion Planning
Abstract: Improving the performance of motion planning algorithms for high-degree-of-freedom robots usually requires reducing the cost or frequency of computationally expensive operations. Traditionally, and especially for asymptotically optimal sampling-based motion planners, the most expensive operations are local motion validation and querying the nearest neighbours of a configuration. Recent advances have significantly reduced the cost of motion validation by using single instruction/multiple data (SIMD) parallelism to improve solution times for satisficing motion planning problems. These advances have not yet been applied to asymptotically optimal motion planning. This paper presents Fully Connected Informed Trees (FCIT*), the first fully connected, informed, anytime almost-surely asymptotically optimal (ASAO) algorithm. FCIT* exploits the radically reduced cost of edge evaluation via SIMD parallelism to build and search fully connected graphs. This removes the need for nearest-neighbours structures, which are a dominant cost for many sampling-based motion planners, and allows it to find initial solutions faster than state-of-the-art ASAO (VAMP, OMPL) and satisficing (OMPL) algorithms on the MotionBenchMaker dataset while converging towards optimal plans in an anytime manner.
|
|
11:40-11:45, Paper ThCT18.6 | |
Efficient Path Planning in Complex Environments with Trust Region Continuous Belief Tree Search |
|
Nunez, Andre Julio | University of Technology Sydney |
Kong, Felix Honglim | The University of Technology Sydney |
González-Cantos, Alberto | Navantia |
Fitch, Robert | University of Technology Sydney |
Keywords: Constrained Motion Planning, Motion and Path Planning, Marine Robotics
Abstract: Real-world applications of path planning must contend with complicated constraint and objective functions imposed by the surrounding operational and regulatory environment. Traditional methods such as PRM* and RRT* have asymptotic guarantees, but often struggle in practice with complex black-box objective/constraint functions, especially in compute-limited situations. Continuous Belief Tree Search (CBTS) addresses these limitations by maintaining local estimates of the objective function in order to sample new nodes from continuous space, often giving high-quality solutions more quickly. However, CBTS requires careful tuning of a control duration parameter, which introduces a tradeoff between compute time and path cost/feasibility. In environments with complex costs and constraints, there may be no single control duration that gives good paths in short compute time. This paper proposes Trust Region CBTS (TR-CBTS), an extension of CBTS with an adaptive control duration parameter inspired by trust region methods. TR-CBTS adjusts the control duration based on information from recently sampled candidate nodes, allowing a longer control duration where possible to speed up computation, and shortening it when precise navigation is required in environments with complex, unknown constraint and objective functions. We show that TR-CBTS outperforms existing comparable planners for a realistic robotic path planning application in autonomous ship routing.
|
|
ThCT19 |
407 |
Active Perception |
Regular Session |
Chair: Bezzo, Nicola | University of Virginia |
Co-Chair: Lopez, Brett | University of California, Los Angeles |
|
11:15-11:20, Paper ThCT19.1 | |
PRIMER: Perception-Aware Robust Learning-Based Multiagent Trajectory Planner |
|
Kondo, Kota | Massachusetts Institute of Technology |
Tewari, Claudius Taroon | Massachusetts Institute of Technology |
Tagliabue, Andrea | Massachusetts Institute of Technology |
Tordesillas Torres, Jesus | ICAI School of Engineering, Comillas Pontifical University |
Lusk, Parker C. | Massachusetts Institute of Technology |
Peterson, Mason B. | Massachusetts Institute of Technology |
How, Jonathan | Massachusetts Institute of Technology |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Imitation Learning, Aerial Systems: Applications
Abstract: In decentralized multiagent trajectory planners, agents need to communicate and exchange their positions to generate collision-free trajectories. However, due to localization errors/uncertainties, trajectory deconfliction can fail even if trajectories are perfectly shared between agents. To address this issue, we first present PARM and PARM*, perception-aware, decentralized, asynchronous multiagent trajectory planners that enable a team of agents to navigate uncertain environments while deconflicting trajectories and avoiding obstacles using perception information. PARM* differs from PARM as it is less conservative, using more computation to find closer-to-optimal solutions. While these methods achieve state-of-the-art performance, they suffer from high computational costs as they need to solve large optimization problems onboard, making it difficult for agents to replan at high rates. To overcome this challenge, we present our second key contribution, PRIMER, a learning-based planner trained with imitation learning (IL) using PARM* as the expert demonstrator. PRIMER leverages the low computational requirements of neural networks at deployment and achieves computation speeds up to 5614 times faster than optimization-based approaches.
|
|
11:20-11:25, Paper ThCT19.2 | |
HGS-Planner: Hierarchical Planning Framework for Active Scene Reconstruction Using 3D Gaussian Splatting |
|
Xu, Zijun | Fudan University |
Jin, Rui | Zhejiang University |
Wu, Ke | Fudan University |
Zhao, Yi | Fudan University |
Zhang, Zhiwei | Fudan University |
Zhao, Jieru | Shanghai Jiao Tong University |
Gao, Fei | Zhejiang University |
Gan, Zhongxue | Fudan University |
Ding, Wenchao | Fudan University |
Keywords: View Planning for SLAM, Deep Learning for Visual Perception
Abstract: In complex missions such as search and rescue, robots must make intelligent decisions in unknown environments, relying on their ability to perceive and understand their surroundings. High-quality and real-time reconstruction enhances situational awareness and is crucial for intelligent robotics. Traditional methods often struggle with poor scene representation or are too slow for real-time use. Inspired by the efficacy of 3D Gaussian Splatting (3DGS), we propose a hierarchical planning framework for fast and high-fidelity active reconstruction. Our method evaluates completion and quality gain to adaptively guide reconstruction, integrating global and local planning for efficiency. Experiments in simulated and real-world environments show our approach outperforms existing real-time methods.
|
|
11:25-11:30, Paper ThCT19.3 | |
An Active Perception Game for Robust Information Gathering |
|
He, Siming | University of Pennsylvania |
Tao, Yuezhan | University of Pennsylvania |
Spasojevic, Igor | University of Pennsylvania |
Kumar, Vijay | University of Pennsylvania |
Chaudhari, Pratik | University of Pennsylvania |
Keywords: Mapping, Probability and Statistical Methods, Vision-Based Navigation
Abstract: Active perception approaches select future viewpoints by using some estimate of the information gain. An inaccurate estimate can be detrimental in critical situations, e.g., locating a person in distress. However, the true information gain can only be calculated post hoc, i.e., after the observation is realized. We present an approach to estimate the discrepancy between the estimated information gain (which is the expectation over putative future observations while neglecting correlations among them) and the true information gain. The key idea is to analyze the mathematical relationship between active perception and the estimation error of the information gain in a game-theoretic setting. Using this, we develop an online estimation approach that achieves sub-linear regret (in the number of time-steps) for the estimation of the true information gain and reduces the sub-optimality of active perception systems. We demonstrate our approach for active perception using a comprehensive set of experiments on: (a) different types of environments, including a quadrotor in a photorealistic simulation, real-world robotic data, and real-world experiments with ground robots exploring indoor and outdoor scenes; (b) different types of robotic perception data; and (c) different map representations. On average, our approach reduces information gain estimation errors by 42%, increases the information gain by 7%, PSNR by 5%, and semantic accuracy (measured as the number of objects that are localized correctly) by 6%. In real-world experiments with a Jackal ground robot, our approach demonstrated complex trajectories to explore occluded regions.
|
|
11:30-11:35, Paper ThCT19.4 | |
Take Your Best Shot: Sampling-Based Planning for Autonomous Photography |
|
Gao, Shijie | University of Virginia |
Bramblett, Lauren | University of Virginia |
Bezzo, Nicola | University of Virginia |
Keywords: Vision-Based Navigation, Planning under Uncertainty, Reactive and Sensor-Based Planning
Abstract: Autonomous mobile robots (AMRs) equipped with high-quality cameras are revolutionizing the field of autonomous photography by delivering efficient and cost-effective methods for capturing dynamic visual content. As AMRs are deployed in increasingly diverse environments, the challenge of consistently producing high-quality photographic content remains. Traditional approaches often involve AMRs following a predetermined path while capturing data-intensive imagery, which can be suboptimal, especially in environments with limited connectivity or physical obstructions. These drawbacks necessitate intelligent decision-making to pinpoint optimal vantage points for image capture. Inspired by Next Best View studies, we propose a novel autonomous photography framework that enhances image quality and minimizes the number of photos needed. This framework incorporates a proposed evaluation metric that leverages ray-tracing and Gaussian process interpolation, enabling the assessment of potential visual information from the target in partially known environments. A derivative-free optimization (DFO) method is then proposed to sample candidate views and identify the optimal viewpoint. The effectiveness of our approach is demonstrated by comparing it with existing methods and further validated through simulations and experiments with various vehicles.
|
|
11:35-11:40, Paper ThCT19.5 | |
An Addendum to NeBula: Toward Extending Team CoSTAR’s Solution to Larger Scale Environments (I) |
|
Morrell, Benjamin | Jet Propulsion Laboratory, California Institute of Technology |
Otsu, Kyohei | California Institute of Technology |
Agha-mohammadi, Ali-akbar | NASA-JPL, Caltech |
Fan, David D | NASA Jet Propulsion Laboratory |
Kim, Sung-Kyun | NASA Jet Propulsion Laboratory, Caltech |
Ginting, Muhammad Fadhil | Stanford University |
Lei, Xianmei | NASA JPL |
Edlund, Jeffrey | Jet Propulsion Lab |
Fakoorian, Seyed Abolfazl | Cleveland State University |
Bouman, Amanda | Caltech |
Chavez, Fernando | Jet Propulsion Laboratory |
Kim, Taeyeon | Korea Advanced Institute of Science and Technology |
Correa, Gustavo J. | University of California Riverside |
Saboia Da Silva, Maira | NASA Jet Propulsion Laboratory |
Santamaria-Navarro, Angel | Universitat Politècnica de Catalunya |
Lopez, Brett | University of California, Los Angeles |
Kim, Boseong | Korea Advanced Institute of Science and Technology (KAIST) |
Jung, Chanyoung | KAIST |
Sobue, Mamoru | The University of Tokyo |
Peltzer, Oriana | Stanford University |
Ott, Joshua | Stanford University |
Trybula, Robert | University of Southern California |
Touma, Thomas | Caltech |
Kaufmann, Marcel | Polytechnique Montreal |
Vaquero, Tiago | JPL, Caltech |
Pailevanian, Torkom | Jet Propulsion Laboratory |
Palieri, Matteo | NASA Jet Propulsion Laboratory |
Chang, Yun | MIT |
Reinke, Andrzej | University of Bonn |
Spieler, Patrick | JPL |
Clark, Lillian | University of Southern California |
Archanian, Avak | Jet Propulsion Laboratory, California Institute of Technology |
Chen, Kenny | University of California, Los Angeles |
Melikyan, Hovhannes | Jet Propulsion Laboratory, California Institute of Technology |
Dixit, Anushri | University of California, Los Angeles |
Delecki, Harrison | Stanford University |
Pastor, Daniel | Caltech |
Ridge, Barry | NASA Jet Propulsion Laboratory, California Institute of Technology |
Marchal, Nicolas Paul | ETH Zurich |
Uribe, Jose | Jet Propulsion Laboratory |
Kochenderfer, Mykel | Stanford University |
Beltrame, Giovanni | Ecole Polytechnique De Montreal |
Nikolakopoulos, George | Luleå University of Technology |
Shim, David Hyunchul | KAIST |
Carlone, Luca | Massachusetts Institute of Technology |
Burdick, Joel | California Institute of Technology |
Keywords: Field Robots, Multi-Robot Systems, Software-Hardware Integration for Robot Systems
Abstract: This article presents an appendix to the original NeBula autonomy solution developed by the Team Collaborative SubTerranean Autonomous Robots (CoSTAR), participating in the DARPA Subterranean Challenge. Specifically, this article presents extensions to NeBula’s hardware, software, and algorithmic components that focus on increasing the range and scale of the exploration environment. From the algorithmic perspective, we discuss the following extensions to the original NeBula framework: 1) large-scale geometric and semantic environment mapping; 2) an adaptive positioning system; 3) probabilistic traversability analysis and local planning; 4) large-scale partially observable Markov decision process (POMDP)-based global motion planning and exploration behavior; 5) large-scale networking and decentralized reasoning; 6) communication-aware mission planning; and 7) multimodal ground–aerial exploration solutions. We demonstrate the application and deployment of the presented systems and solutions in various large-scale underground environments, including limestone mine exploration scenarios as well as deployment in the DARPA Subterranean challenge.
|
|
11:40-11:45, Paper ThCT19.6 | |
InstanceVO: Self-Supervised Semantic Visual Odometry by Using Metric Learning to Incorporate Geometrical Priors in Instance Objects |
|
Xie, Yuanyan | Tsinghua University |
Yang, Junzhe | University of Science and Technology Beijing |
Zhou, Huaidong | Tsinghua University |
Sun, Fuchun | Tsinghua University |
Keywords: Localization, Semantic Scene Understanding, Autonomous Agents
Abstract: Visual odometry is one of the key technologies for unmanned ground vehicles. To improve the robustness of these systems and enable intelligent tasks, researchers have introduced learning-based recognition modules into visual odometry systems, but have not achieved tight coupling between the visual odometry systems and the recognition modules. This paper proposes a self-supervised semantic visual odometry method that completes the tasks of ego-motion estimation, depth prediction, and instance segmentation with a shared encoder. Potential dynamic regions are removed and the image reconstruction loss is rectified using the instance detection results. Moreover, an instance-guided triplet loss and cross-task self-attention modules are devised to learn the geometric relationships among pixels that are implied in instance object priors. The proposed method is validated on the KITTI and ComplexUrban datasets. The experimental results show that our method outperforms baseline models in both pose estimation and depth prediction. We also discuss the efficacy of evaluation metrics for pose estimation and consider the accumulated errors of trajectories.
|
|
ThCT20 |
408 |
In-Hand Manipulation |
Regular Session |
Chair: Mason, Matthew T. | Carnegie Mellon University |
Co-Chair: Iba, Soshi | Honda Research Institute USA |
|
11:15-11:20, Paper ThCT20.1 | |
GET-Zero: Graph Embodiment Transformer for Zero-Shot Embodiment Generalization |
|
Patel, Austin | Stanford University |
Song, Shuran | Stanford University |
Keywords: Transfer Learning, Dexterous Manipulation, Multifingered Hands
Abstract: This paper introduces GET-Zero, a model architecture and training procedure for learning an embodiment-aware control policy that can immediately adapt to new hardware changes without retraining. To do so, we present Graph Embodiment Transformer (GET), a transformer model that leverages the embodiment graph connectivity as a learned structural bias in the attention mechanism. We use behavior cloning to distill demonstration data from embodiment-specific expert policies into an embodiment-aware GET model that conditions on the hardware configuration of the robot to make control decisions. We conduct a case study on a dexterous in-hand object rotation task using different configurations of a four-fingered robot hand with joints removed and with link length extensions. Using the GET model along with a self-modeling loss enables GET-Zero to zero-shot generalize to unseen variation in graph structure and link length, yielding a 20% improvement over baseline methods. All code and qualitative video results are on our project website https://get-zero-paper.github.io.
|
|
11:20-11:25, Paper ThCT20.2 | |
Proprioceptive Object Shape and Size Extraction Via In-Hand-Manipulation with a Variable Friction Robot Gripper |
|
Bodnar, Igor | Imperial College London |
Spiers, Adam | Imperial College London |
Keywords: In-Hand Manipulation, Grippers and Other End-Effectors, Force and Tactile Sensing
Abstract: Robotic manipulation tasks commonly rely on computer vision or tactile sensing to extract the physical characteristics of an object. However, this additional sensing capability adds complexity and financial cost to a robotic system. Our work investigates the inexpensive alternative of feature extraction via proprioceptive sensing. Our goal is to determine whether proprioceptive data combined with in-hand manipulation provides sufficient information to enable geometric reconstruction of object profiles. We use a newly designed 3-DOF robotic gripper with variable-friction finger surfaces to perform model-free in-hand manipulation on a set of test objects comprised of two-dimensional convex prisms. We have devised a manipulation sequence based on the rotation and sliding of test objects that allows side counting and successful measurement of shapes and sizes, with average angle and size errors of 1.64% and 6.76%, respectively. In addition, we have outlined potential research directions aimed at resolving inherent limitations of proprioceptive approaches and making our algorithm generalisable to any arbitrary shape.
|
|
11:25-11:30, Paper ThCT20.3 | |
Diffusion-Informed Probabilistic Contact Search for Multi-Finger Manipulation |
|
Kumar, Abhinav | University of Michigan |
Power, Thomas | Robotics Institute, University of Michigan |
Yang, Fan | University of Michigan |
Aguilera, Sergio | Georgia Institute of Technology |
Iba, Soshi | Honda Research Institute USA |
Soltani Zarrin, Rana | Honda Research Institute - USA |
Berenson, Dmitry | University of Michigan |
Keywords: Dexterous Manipulation, Manipulation Planning, Deep Learning in Grasping and Manipulation
Abstract: Planning contact-rich interactions for multi-finger manipulation is challenging due to the high-dimensionality and hybrid nature of dynamics. Recent advances in data-driven methods have shown promise, but are sensitive to the quality of training data. Combining learning with classical methods like trajectory optimization and search adds additional structure to the problem and domain knowledge in the form of constraints, which can lead to outperforming the data on which models are trained. We present Diffusion-Informed Probabilistic Contact Search (DIPS), which uses an A* search to plan a sequence of contact modes informed by a diffusion model. We train the diffusion model on a dataset of demonstrations consisting of contact modes and trajectories generated by a trajectory optimizer given those modes. In addition, we use a particle filter-inspired method to reason about variability in diffusion sampling arising from model error, estimating likelihoods of trajectories using a learned discriminator. We show that our method outperforms ablations that do not reason about variability and can plan contact sequences that outperform those found in training data across multiple tasks. We evaluate on simulated tabletop card sliding and screwdriver turning tasks, as well as the screwdriver task in hardware to show that our combined learning and planning approach transfers to the real world.
|
|
11:30-11:35, Paper ThCT20.4 | |
Variable-Friction In-Hand Manipulation for Arbitrary Objects Via Diffusion-Based Imitation Learning |
|
Yan, Qiyang | Imperial College London |
Ding, Zihan | Princeton University |
Zhou, Xin | Imperial College London |
Spiers, Adam | Imperial College London |
Keywords: In-Hand Manipulation, Imitation Learning, Machine Learning for Robot Control
Abstract: Dexterous in-hand manipulation (IHM) for arbitrary objects is challenging due to the rich and subtle contact process. Variable-friction manipulation is an alternative approach to dexterity, previously demonstrating robust and versatile 2D IHM capabilities with only two single-joint fingers. However, the hard-coded manipulation methods for variable friction hands are restricted to regular polygon objects and limited target poses, as well as requiring the policy to be tailored for each object. This paper proposes an end-to-end learning-based manipulation method to achieve arbitrary object manipulation for any target pose on real hardware, with minimal engineering efforts and data collection. The method features a diffusion policy-based imitation learning method with co-training from simulation and a small amount of real-world data. With the proposed framework, arbitrary objects including polygons and non-polygons can be precisely manipulated to reach arbitrary goal poses within 2 hours of training on an A100 GPU and only 1 hour of real-world data collection. The precision is higher than that of previous customized object-specific policies, achieving an average success rate of 71.3% with average pose errors of 2.676 mm and 1.902°. Code and videos can be found at: https://sites.google.com/view/vf-ihm-il/home.
|
|
11:35-11:40, Paper ThCT20.5 | |
From Simple to Complex Skills: The Case of In-Hand Object Reorientation |
|
Qi, Haozhi | UC Berkeley |
Yi, Brent | University of California, Berkeley |
Lambeta, Mike Maroje | Facebook |
Ma, Yi | University of Illinois at Urbana-Champaign |
Calandra, Roberto | TU Dresden |
Malik, Jitendra | UC Berkeley |
Keywords: In-Hand Manipulation, Dexterous Manipulation, Reinforcement Learning
Abstract: Learning policies in simulation and transferring them to the real world has become a promising approach in dexterous manipulation. However, bridging the sim-to-real gap for each new task requires substantial human effort, such as careful reward engineering, hyperparameter tuning, and system identification. In this work, we present a system that leverages low-level skills to address these challenges for more complex tasks. Specifically, we introduce a hierarchical policy for in-hand object reorientation based on previously acquired rotation skills. This hierarchical policy learns to select which low-level skill to execute based on feedback from both the environment and the low-level skill policies themselves. Compared to learning from scratch, the hierarchical policy is more robust to out-of-distribution changes and transfers easily from simulation to real-world environments. Additionally, we propose a generalizable object pose estimator that uses proprioceptive information, low-level skill predictions, and control errors as inputs to estimate the object's pose over time. We demonstrate that our system can reorient objects, including symmetrical and textureless ones, to a desired pose.
|
|
11:40-11:45, Paper ThCT20.6 | |
DROP: Dexterous Reorientation Via Online Planning |
|
Li, Albert H. | California Institute of Technology |
Culbertson, Preston | Cornell University |
Kurtz, Vincent | California Institute of Technology |
Ames, Aaron | California Institute of Technology |
Keywords: In-Hand Manipulation, Dexterous Manipulation, Manipulation Planning
Abstract: Achieving human-like dexterity is a longstanding challenge in robotics, in part due to the complexity of planning and control for contact-rich systems. In reinforcement learning (RL), one popular approach has been to use massively-parallelized, domain-randomized simulations to learn a policy offline over a vast array of contact conditions, allowing robust sim-to-real transfer. Inspired by recent advances in real-time parallel simulation, this work considers instead the viability of online planning methods for contact-rich manipulation by studying the well-known in-hand cube reorientation task. We propose a simple architecture that employs a sampling-based predictive controller and vision-based pose estimator to search for contact-rich control actions online. We conduct thorough experiments to assess the real-world performance of our method, architectural design choices, and key factors for robustness, demonstrating that our simple sampling-based approach achieves performance comparable to prior RL-based works. Supplemental material: https://caltech-amber.github.io/drop.
|
|
ThCT21 |
410 |
Safety and Control in HRI |
Regular Session |
Chair: He, Hongsheng | The University of Alabama |
Co-Chair: Kim, Wansoo | Hanyang University ERICA |
|
11:15-11:20, Paper ThCT21.1 | |
Uncertainty-Aware Probabilistic 3D Human Motion Forecasting Via Invertible Networks |
|
Ma, Yue | Beihang University |
Zhou, Kanglei | Beihang University |
Yu, Fuyang | Beihang University |
Li, Frederick W. B. | University of Durham |
Xiaohui, Liang | State Key Laboratory of Virtual Reality Technology and Systems, Beihang University |
Keywords: Human and Humanoid Motion Analysis and Synthesis, Modeling and Simulating Humans, Safety in HRI
Abstract: 3D human motion forecasting aims to enable autonomous applications. Estimating uncertainty for each prediction (i.e., confidence based on probability density or quantile) is essential for safety-critical contexts like human-robot collaboration to minimize risks. However, existing diverse motion forecasting approaches struggle with uncertainty quantification due to implicit probabilistic representations hindering uncertainty modeling. We propose ProbHMI, which introduces invertible networks to parameterize poses in a disentangled latent space, enabling probabilistic dynamics modeling. A forecasting module then explicitly predicts future latent distributions, allowing effective uncertainty quantification. Evaluated on benchmarks, ProbHMI achieves strong performance for both deterministic and diverse prediction while validating uncertainty calibration, critical for risk-aware decision making.
|
|
11:20-11:25, Paper ThCT21.2 | |
MonLog: MONotonic-Constrained LOGistic Regressions for Automated Safety Curve Design |
|
Melone, Alessandro | Technical University of Munich |
Kirschner, Robin Jeanne | TU Munich, Institute for Robotics and Systems Intelligence |
Müller, Dirk | Department of Orthopaedics and Sports Orthopaedics, Klinikum Rechts der Isar |
Swikir, Abdalla | Mohamed Bin Zayed University of Artificial Intelligence |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Safety in HRI, Physical Human-Robot Interaction, Human-Centered Robotics
Abstract: The increasing integration of robots in close human environments necessitates robust safety measures that can adapt to evolving tasks and conditions. Current standards rely on task-specific safety evaluations that are often inflexible, requiring repeated assessments whenever task parameters change. This work proposes MonLog, a data-driven, probabilistic method to automatically derive safety curves (SCs) from recent injury protection data sets. By leveraging non-linear modeling techniques, our approach addresses the limitations of conventional linear SCs, which often result in overly conservative speed restrictions. We present a comprehensive test routine to validate our method, highlighting improvements in both compliance with safety constraints and operational efficiency. Our findings demonstrate that the proposed approach not only enhances safety but also optimizes robotic performance, making it suitable for a wide range of applications.
|
|
11:25-11:30, Paper ThCT21.3 | |
Passivity Filters for Bilateral Teleoperation with Variable Impedance Control |
|
Alyousef Almasalmah, Fadi | University of Strasbourg |
Poignonec, Thibault | University of Strasbourg, Icube Laboratory |
Omran, Hassan | ICube Laboratory, University of Strasbourg, Strasbourg |
Liu, Chao | LIRMM |
Bayle, Bernard | University of Strasbourg |
Keywords: Telerobotics and Teleoperation, Compliance and Impedance Control, Safety in HRI
Abstract: In robotic teleoperation, it is crucial to be able to dynamically adjust interactions with the environment. Drawing inspiration from human behavior during interactions, Variable Impedance Control (VIC) has been widely adopted to enhance robotic flexibility and adaptability. However, maintaining the passivity of such control systems remains a critical safety concern. This paper introduces an optimization-based framework for passive variable impedance control in bilateral teleoperation, combining the advantages of Passivity Filters (PFs), Time-Domain Passivity (TDP) control, and Passive-Set-Position-Modulation (PSPM). The method solves an optimization problem aimed at dissipating the energy that could lead to a lack of passivity. The proposed method is assessed through experiments, illustrating its ability to keep the teleoperation system passive and safe under a variable impedance profile.
|
|
11:30-11:35, Paper ThCT21.4 | |
Robots That Learn to Safely Influence Via Prediction-Informed Reach-Avoid Dynamic Games |
|
Pandya, Ravi | Carnegie Mellon University |
Liu, Changliu | Carnegie Mellon University |
Bajcsy, Andrea | Carnegie Mellon University |
Keywords: Human-Robot Collaboration, Safety in HRI, Robot Safety
Abstract: Robots can influence people to accomplish their tasks more efficiently: autonomous cars can inch forward at an intersection to pass through, and tabletop manipulators can go for an object on the table first. However, a robot's ability to influence can also compromise the physical safety of nearby people if naively executed. In this work, we pose and solve a novel robust reach-avoid dynamic game which enables robots to be maximally influential, but only when a safety backup control exists. On the human side, we model the human's behavior as goal-driven but conditioned on the robot's plan, enabling us to capture influence. On the robot side, we solve the dynamic game in the joint physical and belief space, enabling the robot to reason about how its uncertainty in human behavior will evolve over time. We instantiate our method, called SLIDE (Safely Leveraging Influence in Dynamic Environments), in a high-dimensional (39-D) simulated human-robot collaborative manipulation task solved via offline game-theoretic reinforcement learning. We compare our approach to a robust baseline that treats the human as a worst-case adversary, a safety controller that does not explicitly reason about influence, and an energy-function-based safety shield. We find that SLIDE consistently enables the robot to leverage the influence it has on the human when it is safe to do so, ultimately allowing the robot to be less conservative while still ensuring a high safety rate during task execution.
|
|
11:35-11:40, Paper ThCT21.5 | |
Multi-Layered Safety of Redundant Robot Manipulators Via Task-Oriented Planning and Control |
|
Jia, Xinyu | National University of Singapore |
Wang, Wenxin | National University of Singapore |
Yang, Jun | National University of Singapore |
Pan, Yongping | Peng Cheng Laboratory |
Yu, Haoyong | National University of Singapore |
Keywords: Safety in HRI, Collision Avoidance, Motion Control
Abstract: Ensuring safety is crucial to promote the application of robot manipulators in open workspaces. Factors such as sensor errors or unpredictable collisions make the environment full of uncertainties. In this work, we investigate these potential safety challenges on redundant robot manipulators, and propose a task-oriented planning and control framework to achieve multi-layered safety while maintaining efficient task execution. Our approach consists of two main parts: a task-oriented trajectory planner based on multiple-shooting model predictive control (MPC) method, and a torque controller that allows safe and efficient collision reaction using only proprioceptive data. Through extensive simulations and real-hardware experiments, we demonstrate that the proposed framework can effectively handle uncertain static or dynamic obstacles, and perform disturbance resistance in manipulation tasks when unforeseen contacts occur.
|
|
11:40-11:45, Paper ThCT21.6 | |
A Multi-Task Energy-Aware Impedance Controller for Enhanced Safety in Physical Human-Robot Interaction |
|
Choi, SeungMin | Hanyang University |
Ha, Seongmin | Hanyang University |
Kim, Wansoo | Hanyang University ERICA |
Keywords: Safety in HRI, Physical Human-Robot Interaction, Human-Robot Collaboration
Abstract: In physical human-robot interaction (pHRI), ensuring human safety in all tasks conducted by the robot is crucial. Traditional compliance control strategies, such as admittance and impedance control, often lead to unpredictable robot behavior due to incidents like contact loss or unexpected external forces, which can cause significant harm to humans. To overcome these limitations, this study introduces a multi-task energy-aware impedance controller for kinematically redundant robots. This controller extends the energy-aware impedance control strategy, which ensures the passivity and safety of a single task using a virtual global energy tank, to kinematically redundant robots performing multiple tasks. The proposed controller effectively regulates the power flow of all tasks performed by the robot through a single global energy tank, ensuring the safety and passivity of the tasks. Experimental results in a shared environment, where external forces are simultaneously applied to the end-effector and the third joint of the Franka Emika Panda, showed that the robot's energy and power, as well as the power of all tasks, consistently remained within predefined thresholds. Additionally, when comparing the proposed controller with a controller that does not consider null-space projection in the power regulation stage and a controller that does not regulate the robot's power, our approach effectively managed the robot's energy and power, as well as the power of all tasks, ensuring passivity and enhanced safety.
|
|
ThCT22 |
411 |
Learning for Manipulation |
Regular Session |
Chair: Zambelli, Martina | Google DeepMind |
Co-Chair: Meißner, Pascal | Wuerzburg-Schweinfurt Technical University of Applied Sciences |
|
11:15-11:20, Paper ThCT22.1 | |
A Parameter-Efficient Tuning Framework for Language-Guided Object Grounding and Robot Grasping |
|
Yu, Houjian | University of Minnesota, Twin Cities |
Li, Mingen | University of Minnesota Twin Cities |
Rezazadeh, Alireza | University of Minnesota |
Yang, Yang | Meta |
Choi, Changhyun | University of Minnesota, Twin Cities |
Keywords: Perception for Grasping and Manipulation, Semantic Scene Understanding, Deep Learning in Grasping and Manipulation
Abstract: The language-guided robot grasping task requires a robot agent to integrate multimodal information from both visual and linguistic inputs to predict actions for target-driven grasping. While recent approaches utilizing Multimodal Large Language Models (MLLMs) have shown promising results, their extensive computation and data demands limit the feasibility of local deployment and customization. To address this, we propose a novel CLIP-based multimodal parameter-efficient tuning (PET) framework designed for three language-guided object grounding and grasping tasks: (1) Referring Expression Segmentation (RES), (2) Referring Grasp Synthesis (RGS), and (3) Referring Grasp Affordance (RGA). Our approach introduces two key innovations: a bi-directional vision-language adapter that aligns multimodal inputs for pixel-level language understanding and a depth fusion branch that incorporates geometric cues to facilitate robot grasping predictions. Experiment results demonstrate superior performance in the RES object grounding task compared with existing CLIP-based full-model tuning or PET approaches. In the RGS and RGA tasks, our model not only effectively interprets object attributes based on simple language descriptions but also shows strong potential for comprehending complex spatial reasoning scenarios, such as multiple identical objects present in the workspace.
|
|
11:20-11:25, Paper ThCT22.2 | |
Cascaded Diffusion Models for Neural Motion Planning |
|
Sharma, Mohit | Carnegie Mellon University |
Fishman, Adam | OpenAI |
Kumar, Vikash | Meta AI |
Paxton, Chris | Meta AI |
Kroemer, Oliver | Carnegie Mellon University |
Keywords: AI-Based Methods, Deep Learning in Grasping and Manipulation
Abstract: Robots in the real world need to perceive and move to goals in complex environments without collisions. Avoiding collisions is especially difficult when relying on sensor perception and when goals are among clutter. Diffusion policies and other generative models have shown strong performance in solving local planning problems, but often struggle at avoiding all of the subtle constraint violations that characterize truly challenging global motion planning problems. In this work, we propose an approach for learning global motion planning using diffusion policies, allowing the robot to generate full trajectories through complex scenes and reasoning about multiple obstacles along the path. Our approach uses cascaded hierarchical models which unify global prediction and local refinement together with online plan repair to ensure the trajectories are collision free. Our method outperforms a wide variety of baselines by approximately 5% on challenging tasks in multiple domains including navigation and manipulation.
|
|
11:25-11:30, Paper ThCT22.3 | |
Reinforcement Learning with Lie Group Orientations for Robotics |
|
Schuck, Martin | Technical University of Munich |
Bruedigam, Jan | Technical University of Munich |
Hirche, Sandra | Technische Universität München |
Schoellig, Angela P. | TU Munich |
Keywords: Deep Learning in Grasping and Manipulation, Reinforcement Learning, Grasping
Abstract: Handling orientations of robots and objects is a crucial aspect of many applications. Yet, ever so often, there is a lack of mathematical correctness when dealing with orientations, especially in learning pipelines involving, for example, artificial neural networks. In this paper, we investigate reinforcement learning with orientations and propose a simple modification of the network's input and output that adheres to the Lie group structure of orientations. As a result, we obtain a practically efficient implementation that is directly usable with existing learning libraries and achieves significantly better performance than other common orientation representations. We briefly introduce Lie theory specifically for orientations in robotics to motivate and outline our approach. Subsequently, a thorough empirical evaluation of different combinations of orientation representations for states and actions demonstrates the superior performance of our proposed approach in different scenarios, including: direct orientation control, end effector orientation control, and pick-and-place tasks.
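To make the idea above concrete, the following hedged sketch (not the paper's exact parameterization) encodes orientations as flattened rotation matrices for the network input and maps an unconstrained 3-vector action through the SO(3) exponential map, so network outputs always correspond to valid rotations.

```python
# Minimal sketch, assuming a policy that consumes orientations and emits small
# orientation increments; the specific encoding choices are illustrative only.
import numpy as np
from scipy.spatial.transform import Rotation as R

def encode_orientation(q_xyzw):
    """Network input: flattened rotation matrix, a common Lie-group-friendly choice."""
    return R.from_quat(q_xyzw).as_matrix().reshape(-1)   # shape (9,)

def apply_action(current_q_xyzw, tangent_action):
    """Network output: 3-vector in the Lie algebra, mapped back via the exp map."""
    delta = R.from_rotvec(np.asarray(tangent_action))    # exp: so(3) -> SO(3)
    return (delta * R.from_quat(current_q_xyzw)).as_quat()

obs = encode_orientation([0.0, 0.0, 0.0, 1.0])            # identity orientation
new_q = apply_action([0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.1])
print(obs.shape, new_q)
```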
|
|
11:30-11:35, Paper ThCT22.4 | |
DexTouch: Learning to Seek and Manipulate Objects with Tactile Dexterity |
|
Lee, Kang-Won | Dongguk University |
Qin, Yuzhe | UC San Diego |
Wang, Xiaolong | UC San Diego |
Lim, Soo-Chul | Dongguk University |
Keywords: Dexterous Manipulation, Reinforcement Learning, AI-Enabled Robotics
Abstract: The sense of touch is an essential ability for skillfully performing a variety of tasks, providing the capacity to search and manipulate objects without relying on visual information. In this paper, we introduce a multi-finger robot system designed to manipulate objects using the sense of touch, without relying on vision. For tasks that mimic daily life, the robot uses its sense of touch to manipulate randomly placed objects in the dark. The objective of this study is to enable robots to perform manipulation without vision by using tactile sensation to compensate for the information gap caused by the absence of vision, given the presence of prior information. By training the policy through reinforcement learning in simulation and transferring it to the real environment, we demonstrate that manipulation without visual input can be achieved on real robots. In addition, the experiments showcase the importance of tactile sensing in tasks performed without vision. Our project page is available at https://lee-kangwon.github.io/dextouch/
|
|
11:35-11:40, Paper ThCT22.5 | |
Catch It! Learning to Catch in Flight with Mobile Dexterous Hands |
|
Zhang, Yuanhang | Carnegie Mellon University |
Liang, Tianhai | Tsinghua University |
Chen, Zhenyang | Georgia Institute of Technology |
Ze, Yanjie | Stanford University |
Xu, Huazhe | Tsinghua University |
Keywords: Mobile Manipulation, Deep Learning in Grasping and Manipulation, Reinforcement Learning
Abstract: Catching objects in flight (i.e., thrown objects) is a common daily skill for humans, yet it presents a significant challenge for robots. This task requires a robot with agile and accurate motion, a large spatial workspace, and the ability to interact with diverse objects. In this paper, we build a mobile manipulator composed of a mobile base, a 6-DoF arm, and a 12-DoF dexterous hand to tackle such a challenging task. We propose a two-stage reinforcement learning framework to efficiently train a whole-body-control catching policy for this high-DoF system in simulation. The objects' throwing configurations, shapes, and sizes are randomized during training to enhance policy adaptivity to various trajectories and object characteristics in flight. The results show that our trained policy catches diverse objects with randomly thrown trajectories, at a high success rate of about 80% in simulation, with a significant improvement over the baselines. The policy trained in simulation can be directly deployed in the real world with onboard sensing and computation, which achieves catching sandbags in various shapes, randomly thrown by humans. Our project page is available at https://mobile-dex-catch.github.io/
|
|
ThCT23 |
412 |
Legged Robots |
Regular Session |
Chair: Johnson, Aaron M. | Carnegie Mellon University |
Co-Chair: Zhao, Ding | Carnegie Mellon University |
|
11:15-11:20, Paper ThCT23.1 | |
Adaptive Complexity Model Predictive Control |
|
Norby, Joseph | Apptronik |
Tajbakhsh, Ardalan | Carnegie Mellon University |
Yang, Yanhao | Oregon State University |
Johnson, Aaron M. | Carnegie Mellon University |
Keywords: Optimization and Optimal Control, Legged Robots, Underactuated Robots, Dynamics
Abstract: This work introduces a formulation of model predictive control (MPC) which adaptively reasons about the complexity of the model while maintaining feasibility and stability guarantees. Existing approaches often handle computational complexity by shortening prediction horizons or simplifying models, both of which can result in instability. Inspired by related approaches in behavioral economics, motion planning, and biomechanics, our method solves MPC problems with a simple model for dynamics and constraints over regions of the horizon where such a model is feasible and a complex model where it is not. The approach leverages an interleaving of planning and execution to iteratively identify these regions, which can be safely simplified if they satisfy an exact template/anchor relationship. We show that this method does not compromise the stability and feasibility properties of the system, and measure performance in simulation experiments on a quadrupedal robot executing agile behaviors over terrains of interest. We find that this adaptive method enables more agile motion (55% increase in top speed) and expands the range of executable tasks compared to fixed-complexity implementations.
|
|
11:20-11:25, Paper ThCT23.2 | |
Benchmarking Different QP Formulations and Solvers for Dynamic Quadrupedal Walking |
|
Stark, Franek | Robotics Innovation Center, DFKI GmbH |
Middelberg, Jakob | German Research Center for Artificial Intelligence |
Mronga, Dennis | University of Bremen, German Research Center for Artificial Intelligence (DFKI) |
Vyas, Shubham | Robotics Innovation Center, DFKI GmbH |
Kirchner, Frank | University of Bremen |
Keywords: Performance Evaluation and Benchmarking, Whole-Body Motion Planning and Control, Legged Robots
Abstract: Quadratic Programs (QPs) are widely used in the control of walking robots, especially in Model Predictive Control (MPC) and Whole-Body Control (WBC). In both cases, the controller design requires the formulation of a QP and the selection of a suitable QP solver, both requiring considerable time and expertise. While computational performance benchmarks exist for QP solvers, studies comparing optimal combinations of computational hardware (HW), QP formulation, and solver performance are lacking. In this work, we compare dense and sparse QP formulations, and multiple solving methods on different HW architectures, focusing on their computational efficiency in dynamic walking of four-legged robots using MPC. We introduce the Solve Frequency per Watt (SFPW) as a performance measure to enable a cross-hardware comparison of the efficiency of QP solvers. We also benchmark different QP solvers for WBC that we use for trajectory stabilization in quadrupedal walking. As a result, this paper recommends a starting point for practitioners on the selection of QP formulations and solvers for different HW architectures in walking robots and indicates which problems deserve greater technical effort.
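For readers unfamiliar with the proposed measure, the following toy snippet illustrates one plausible reading of Solve Frequency per Watt, assuming SFPW is the number of QP solves per second divided by the average power draw; the numbers are hypothetical.

```python
# Tiny illustration only; the exact definition used in the paper may differ.
n_solves = 5000            # QP instances solved during a benchmark run
wall_time_s = 10.0         # elapsed wall-clock time in seconds
avg_power_w = 15.0         # average power draw of the compute hardware in watts

solve_frequency_hz = n_solves / wall_time_s
sfpw = solve_frequency_hz / avg_power_w
print(f"SFPW = {sfpw:.1f} solves per second per watt")
```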
|
|
11:25-11:30, Paper ThCT23.3 | |
Indoor and Outdoor Multi-Terrain Stair-Climbing Robot Design |
|
Chen, Wei-Ting | National Taiwan University |
Tsui, En-Chieh | National Taiwan University |
Yu, Wei-Shun | National Taiwan University |
Lin, Pei-Chun | National Taiwan University |
Keywords: Wheeled Robots, Legged Robots, Mechanism Design
Abstract: This paper introduces an autonomous mobile robot (IOMT) designed for indoor and outdoor multi-terrain environments. The robot features a four-wheel independent drive and steering system (4WID-4WIS), allowing it to maintain high maneuverability on smooth surfaces. Additionally, while keeping control complexity low, the IOMT addresses the challenges associated with stair climbing by providing stable pitch control, which effectively reduces the impact of stairs on the robot's posture, in particular its pitch angle. The design also incorporates a special mechanism that reduces energy consumption through a self-locking worm gear system and combines steering with shock absorption to reduce mechanism complexity. This paper not only proposes a stair-climbing strategy for the IOMT configuration but also explores the impact of various design parameters on the robot's pitch angle, ultimately validating the feasibility and development potential of the design for multi-terrain mobility.
|
|
11:30-11:35, Paper ThCT23.4 | |
WaLTER: A Wheel and Leg Tumbling Expedition Robot |
|
Jay, David | FAMU-FSU College of Engineering |
Hackett, Jacob | Florida State University |
Bosscher, Paul | Harris Corporation |
Hubicki, Christian | Florida State University |
Clark, Jonathan | Florida State University |
Keywords: Wheeled Robots, Legged Robots, Field Robots
Abstract: For effective operation in challenging outdoor environments, mobile unmanned robots face stiff and competing demands including payload capacity, driving speed, and range, as well as the ability to traverse rough terrain. To address these issues, we introduce the hybrid wheel-leg quadrupedal robot WaLTER. WaLTER utilizes a unique combination of continuously rotating distal leg joints, actuated wheels, and a roll body DOF to efficiently drive on flat ground and effectively tumble over stairs and difficult, broken terrain. We developed an intuitive teleoperation scheme and employed deep reinforcement learning as proof-of-concept control techniques for the novel morphology. To test its capabilities, we constructed a multi-body simulation in MuJoCo and a 2.1-kg physical prototype for experimentation on traversability and energy economy. Our testing demonstrated rougher terrain negotiation relative to larger-wheeled counterparts and reliable stair climbing while maintaining a 4 km range on a 24.4 Wh battery (COT: 1.21).
|
|
11:35-11:40, Paper ThCT23.5 | |
Deformable Multibody Modeling for Model Predictive Control in Legged Locomotion with Embodied Compliance |
|
Ye, Keran | University of California, Riverside |
Karydis, Konstantinos | University of California, Riverside |
Keywords: Dynamics, Legged Robots, Compliant Joints and Mechanisms
Abstract: The paper presents a method to stabilize dynamic gait for a legged robot with embodied compliance. Our approach introduces a unified description for rigid and compliant bodies to approximate their deformation and a formulation for deformable multibody systems. We develop the centroidal composite predictive deformed inertia (CCPDI) tensor of a deformable multibody system and show how to integrate it with the standard-of-practice model predictive controller (MPC). Simulation shows that the resultant control framework can stabilize trot stepping on a quadrupedal robot with both rigid and compliant spines under the same MPC configurations. Compared to standard MPC, the developed CCPDI-enabled MPC distributes the ground reactive forces closer to the heuristics for body balance, and it is thus more likely to stabilize the gaits of the compliant robot. A parametric study shows that our method preserves some robustness within a suitable envelope of key parameter values.
|
|
11:40-11:45, Paper ThCT23.6 | |
Learning Multi-Agent Loco-Manipulation for Long-Horizon Quadrupedal Pushing |
|
Feng, Yuming | Peking University |
Hong, Chuye | Tsinghua University |
Niu, Yaru | Carnegie Mellon University |
Liu, Shiqi | Carnegie Mellon University |
Yang, Yuxiang | Google Deepmind |
Zhao, Ding | Carnegie Mellon University |
Keywords: Multi-Robot Systems, Legged Robots, Reinforcement Learning
Abstract: Recently, quadrupedal robots have achieved significant success in locomotion, but their manipulation capabilities, particularly in handling large objects, remain limited, restricting their usefulness in demanding real-world applications such as search and rescue, construction, industrial automation, and room organization. This paper tackles the task of obstacle-aware, long-horizon pushing by multiple quadrupedal robots. We propose a hierarchical multi-agent reinforcement learning framework with three levels of control. The high-level controller integrates an RRT planner and a centralized adaptive policy to generate subgoals, while the mid-level controller uses a decentralized goal-conditioned policy to guide the robots toward these subgoals. A pre-trained low-level locomotion policy executes the movement commands. We evaluate our method against several baselines in simulation, demonstrating significant improvements over baseline approaches, with 36.0% higher success rates and a 24.5% reduction in completion time compared to the best baseline. Our framework successfully enables long-horizon, obstacle-aware manipulation tasks like Push-Cuboid and Push-T on Go1 robots in the real world. The videos and code of this work can be found at: https://collaborative-mapush.github.io/.
|
|
ThLB2R |
Hall A1/A2 |
Late Breaking Results 6 |
Poster Session |
|
14:45-15:15, Paper ThLB2R.1 | |
Large Language Models As Natural Selector for Embodied Soft Robot Design |
|
Chen, Changhe | University of Michigan |
Xu, Xiaohao | University of Michigan, Ann Arbor |
Wang, Xiangdong | University of Michigan |
Huang, Xiaonan | University of Michigan |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Materials and Design, Performance Evaluation and Benchmarking
Abstract: Designing soft robots is a complex and iterative process that demands cross-disciplinary expertise in materials science, mechanics, and control, often relying on intuition and extensive experimentation. While Large Language Models (LLMs) have demonstrated impressive reasoning abilities, their capacity to learn and apply embodied design principles—crucial for creating functional robotic systems—remains largely unexplored. This paper introduces RoboCrafter-QA, a novel benchmark to evaluate whether LLMs can learn representations of soft robot designs that effectively bridge the gap between high-level task descriptions and low-level morphological and material choices. RoboCrafter-QA leverages the EvoGym simulator to generate a diverse set of soft robot design challenges, spanning robotic locomotion, manipulation, and balancing tasks. Our experiments with state-of-the-art multi-modal LLMs reveal that while these models exhibit promising capabilities in learning design representations, they struggle with fine-grained distinctions between designs with subtle performance differences. We further demonstrate the practical utility of LLMs for robot design initialization. Our code and benchmark will be available to encourage the community to foster this exciting research direction.
|
|
14:45-15:15, Paper ThLB2R.2 | |
Toward Embedded LLM-Guided Navigation and Object Detection for Aerial Robots |
|
Suganda, Richie Ryulie | University of Houston |
Hu, Bin | University of Houston |
Keywords: Aerial Systems: Applications, Aerial Systems: Perception and Autonomy, AI-Enabled Robotics
Abstract: This work presents a novel framework that integrates natural language understanding with autonomous quadrotor navigation and object detection, aiming toward fully embedded, language-driven autonomy. Our long-term objective is to develop efficient and lightweight Large Language Models (LLMs) deployable on edge platforms such as the ModalAI Seeker drone. We leverage Low-Rank Adaptation (LoRA) to fine-tune a pre-trained LLaMA model on a custom dataset composed of natural language commands for exploration and object localization. The proposed system adopts a hierarchical architecture: the LLM interprets high-level language instructions and translates them into task-level goals, which are then executed via onboard modules including visual-inertial odometry (VIO)-based control, path planning, and real-time object detection. To evaluate the framework, we design a hardware-in-the-loop (HIL) testbed using the ModalAI Seeker platform, enabling closed-loop validation in realistic scenarios. In our current setup, LLM inference runs on an offboard workstation equipped with an RTX 4090 GPU, while the drone autonomously handles the perception and control stack. This poster demonstrates a proof-of-concept toward scalable, natural language-based human-robot interaction in real-world environments.
|
|
14:45-15:15, Paper ThLB2R.3 | |
Deep Learning Model Enables Prediction of Future Physiological States During Perturbed Locomotion |
|
Arvelo Rojas, Sofia | Georgia Institute of Technology |
Leestma, Jennifer | Georgia Institute of Technology |
Sawicki, Gregory | Georgia Institute of Technology |
Young, Aaron | Georgia Tech |
Keywords: Prosthetics and Exoskeletons, Wearable Robotics, Human Performance Augmentation
Abstract: Falling is the leading cause of injury-related death in older adults. Exoskeletons have the potential to augment balance to reduce fall rate; however, classical control strategies are not well suited for transient movements. Recently, a deep learning-based task-agnostic controller based on the user’s physiological state was shown to decrease metabolic cost across cyclic and non-cyclic tasks, showing this controller’s ability to generalize. However, the effectiveness of this approach for perturbation recovery has not been investigated. We anticipate that the timing of joint moment-driven exoskeleton assistance will differ for approaches that aim to augment balance. Metabolic cost-reducing approaches apply joint moment assistance that is delayed relative to the user’s joint moment, while faster-than-human assistance has proven beneficial for balance augmentation. The aim of this study is to optimize a deep learning model that predicts future joint moments, enabling faster-than-human control. We trained a series of temporal convolutional networks that used wearable IMUs (pelvis, torso, and bilateral dorsal foot, shank, and thigh) to predict joint moments. We trained models that forecasted joint moments 0 ms (estimation at the current time), 40 ms, 80 ms, and 120 ms into the future. We found that model R² values were above 0.8 for most forecasts, a threshold previously shown to allow for highly controllable exoskeletons. This controller shows potential to assist perturbed locomotion.
|
|
14:45-15:15, Paper ThLB2R.4 | |
FireantV3 Geometry and Locomotion Simplifies Localization in Self-Assembled Structures |
|
Rashidioun, Mohammadali | New Jersey Institute of Technology |
Sosa, Michael | New Jersey Institute of Technology |
Swissler, Petras | New Jersey Institute of Technology |
Keywords: Swarm Robotics, Localization, Climbing Robots
Abstract: Accurate localization plays a key role in swarm robotics, especially for algorithms targeting modular and self-assembling robotic (MSR) systems, where knowing the shape of the self-assembled structure is often important. However, most methods for achieving localization in MSR systems rely on known offsets of robots within a lattice, and thus are not useful for free-form MSR systems. Work towards localizing in free-form systems has thus far required complex sensor arrays and extensive global coordination. In this work, we introduce a 3D localization approach for FireAntV3 robots that relies solely on local communication between robots. The robots rely on contact-based sensing using vibration to identify neighboring robots and communicate information. In the simulation-based investigation we present here, the first robot serves as an origin reference from which robots localize as they move about the structure while executing the ReactiveBuild algorithm. The ability to localize contacts between the three spheres of the FireAntV3 robots and the discrete nature of robot locomotion provide ample information for robots to use in a gradient descent localization algorithm. With the presented method, we find that the average absolute localization error across a 100-robot structure is no larger than two dock radii, demonstrating the utility of the presented approach.
|
|
14:45-15:15, Paper ThLB2R.5 | |
Let's Make a Splan: Risk-Aware Trajectory Optimization in a Normalized Gaussian Splat |
|
Michaux, Jonathan | University of Michigan |
Isaacson, Seth | University of Michigan |
Enninful Adu, Challen | University of Michigan |
Li, Adam | University of Michigan |
Swayampakula, Rahul Kashyap | University of Michigan, Ann Arbor |
Ewen, Parker | University of Michigan |
Rice, Sean | University of Michigan |
Skinner, Katherine | University of Michigan |
Vasudevan, Ram | University of Michigan |
Keywords: Manipulation Planning, Sensor-based Control, RGB-D Perception
Abstract: Neural Radiance Fields and Gaussian Splatting have recently transformed computer vision by enabling photo-realistic representations of complex scenes. However, they have seen limited application in real-world robotics tasks such as trajectory optimization. This is due to the difficulty in reasoning about collisions in radiance models and the computational complexity associated with operating in dense models. This paper addresses these challenges by proposing SPLANNING, a risk-aware trajectory optimizer operating in a Gaussian Splatting model. This paper first derives a method to rigorously upper-bound the probability of collision between a robot and a radiance field. Then, this paper introduces a normalized reformulation of Gaussian Splatting that enables efficient computation of this collision bound. Finally, this paper presents a method to optimize trajectories that avoid collisions in a Gaussian Splat. Experiments show that SPLANNING outperforms state-of-the-art methods in generating collision-free trajectories in cluttered environments. The proposed system is also tested on a real-world robot manipulator. A project page is available at https://roahmlab.github.io/splanning.
|
|
14:45-15:15, Paper ThLB2R.6 | |
Analyzing the Limitations of Imitation Learning for Autonomous Surgical Robot Environments |
|
Jairam, Andrew | University of Toronto |
Sun, Jinjie | University of Toronto |
Pore, Ameya | University of Toronto |
Kahrs, Lueder Alexander | University of Toronto Mississauga |
Keywords: Imitation Learning, Surgical Robotics: Laparoscopy
Abstract: As the field of supervised machine learning continues to evolve, imitation learning (IL) emerges as a promising alternative to reinforcement learning (RL) for training autonomous surgical robotic agents. In particular, Transformer-based models such as the Action Chunking Transformer (ACT) show potential to robustly clone expert demonstrations that require fine-grained manipulation while bypassing the need for bridging the simulation-to-reality gap experienced in training RL policies. This study systematically investigates the limitations and challenges associated with using ACT-based IL for learning surgically relevant tasks on a simulated and real da Vinci Research Kit (dVRK). Through a series of controlled experiments using simulated and real dVRK setups, this work evaluates the sensitivity of ACT models to environmental and architectural factors such as lighting, demonstration quality, and chunk size hyperparameter choices. Baseline accuracies of manipulation tasks are presented, along with experiments assessing model robustness under distractors and visual inconsistencies. The findings highlight the critical importance of dataset quality and lighting consistency in achieving successful behaviour cloning. While IL proves viable for certain structured tasks under controlled conditions, the study outlines clear limitations and identifies future directions to improve generalization and robustness of learned policies for real-world surgical deployment.
|
|
14:45-15:15, Paper ThLB2R.7 | |
Structured Noise for Better State-Conditioned Diffusion in Robotic Motion Planning |
|
Kim, Minji (Amelie) | Georgia Tech |
Zhao, Ye | Georgia Institute of Technology |
Keywords: Motion and Path Planning, Probabilistic Inference
Abstract: In diffusion-based motion planning, the standard use of zero-mean isotropic Gaussian noise may overlook an opportunity to encode task-relevant structure into the generative process. We explore a simple yet effective modification: replacing the standard noise with a non-standard Gaussian distribution—specifically, a Gaussian mixture—whose mode-specific mean and covariance encode desirable properties of the trajectory, such as smoothness and state conditioning. We begin by deriving a forward diffusion formulation using general, non-standard Gaussian noise. We then derive corresponding reverse denoising formulations, both for neural-network-based and model-based diffusion models. Finally, we propose a specific design for the mean and covariance structure—realized through the Gaussian mixture components—to encode trajectory priors. When applied to robot motion planning with both types of diffusion models, our structured noise formulation improves the quality of the generated trajectories compared to the baselines using standard Gaussian noise.
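As a minimal sketch of the core idea (not the authors' implementation), the snippet below replaces standard i.i.d. Gaussian diffusion noise with a two-component Gaussian-mixture noise whose covariances encode temporal smoothness; the lengthscales, weights, and trajectory length are hypothetical.

```python
# Structured noise for forward diffusion: sample from a Gaussian mixture whose
# components differ in how smooth (temporally correlated) the noise is.
import numpy as np

T = 32                                   # trajectory length (waypoints)
t = np.arange(T)

def rbf_cov(lengthscale, var=1.0):
    # Smoothness-inducing covariance over time steps.
    d = t[:, None] - t[None, :]
    return var * np.exp(-0.5 * (d / lengthscale) ** 2) + 1e-6 * np.eye(T)

# Two mixture modes: smoother vs. more agile trajectory noise.
weights = np.array([0.6, 0.4])
means = [np.zeros(T), np.zeros(T)]
covs = [rbf_cov(6.0), rbf_cov(2.0)]

def sample_structured_noise(rng):
    k = rng.choice(len(weights), p=weights)
    return rng.multivariate_normal(means[k], covs[k])

rng = np.random.default_rng(0)
eps = sample_structured_noise(rng)       # used in place of N(0, I) in forward diffusion
print(eps.shape)
```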
|
|
14:45-15:15, Paper ThLB2R.8 | |
Task-Dependent Grasp Metric with Object and Manipulator Dynamics |
|
Patankar, Aditya | Stony Brook University |
Mahalingam, Dasharadhan | Stony Brook University |
Chakraborty, Nilanjan | Stony Brook University |
Keywords: Grasping, Manipulation Planning
Abstract: Grasp evaluation is a key aspect of many grasp planning algorithms. This paper introduces a novel optimization framework as a second-order cone program (SOCP) for evaluating a grasp's ability to perform the task. We formalize the notion of a manipulation task as a constant screw motion (or a sequence of constant screw motions) to be applied to the object after grasping. Our proposed task-dependent grasp metric evaluates a grasp's ability to generate force/moment to impart the desired constant screw motion. A key distinguishing feature of our grasp metric is its ability to incorporate the manipulator and object dynamics. It also considers nonlinear friction cone constraints at the object-robot and the object-environment contact. Since our proposed quality measure is an SOCP, it can be solved optimally using interior point methods. We present simulation results for the task of pivoting, showing the effectiveness of our approach.
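The following hedged example (not the paper's formulation) shows what a small second-order cone program of this flavor looks like: two planar contacts with Coulomb friction cones must produce a required wrench, and the minimal contact-force effort serves as a crude quality value. The grasp geometry and wrench are hypothetical, and the paper's metric additionally accounts for manipulator and object dynamics. Requires `cvxpy`.

```python
# Toy SOCP: can a two-contact planar grasp produce a required wrench within
# friction cones, and at what force cost?
import numpy as np
import cvxpy as cp

mu = 0.5                                   # friction coefficient
# Columns of G map local contact forces (ft1, fn1, ft2, fn2) to the planar
# wrench (fx, fy, tau): contact 1 presses up at (0, -0.5), contact 2 presses
# down at (0, 0.5).
G = np.array([
    [1.0, 0.0,  1.0,  0.0],
    [0.0, 1.0,  0.0, -1.0],
    [0.5, 0.0, -0.5,  0.0],
])
w_req = np.array([0.0, 1.0, 0.2])          # required task wrench (fx, fy, tau)

f = cp.Variable(4)                         # local contact forces
constraints = [
    G @ f == w_req,                        # wrench balance
    cp.SOC(mu * f[1], f[0:1]),             # |ft1| <= mu * fn1
    cp.SOC(mu * f[3], f[2:3]),             # |ft2| <= mu * fn2
]
prob = cp.Problem(cp.Minimize(cp.sum_squares(f)), constraints)
prob.solve()
print("grasp quality proxy (smaller forces = better):", prob.value)
```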
|
|
14:45-15:15, Paper ThLB2R.9 | |
STITCHER: Real-Time Trajectory Planning with Motion Primitive Search |
|
Levy, Helene J. | University of California, Los Angeles |
Lopez, Brett | University of California, Los Angeles |
Keywords: Motion and Path Planning, Aerial Systems: Perception and Autonomy, Collision Avoidance
Abstract: Autonomous high-speed navigation through large, complex environments requires real-time generation of agile trajectories that are dynamically feasible, collision-free, and satisfy state or actuator constraints. Most modern trajectory planning techniques rely on numerical optimization because high-quality, expressive trajectories that satisfy various constraints can be systematically computed. However, meeting computation time constraints and the potential for numerical instabilities can limit the use of optimization-based planners in safety-critical scenarios. This work presents an optimization-free planning framework that stitches short trajectory segments together with graph search to compute long range, expressive, and near-optimal trajectories in real-time. Our STITCHER algorithm is shown to outperform modern optimization-based planners through our innovative planning architecture and several algorithmic developments that make real-time planning possible. Extensive simulation testing is conducted to analyze the algorithmic components that make up STITCHER, and a thorough comparison with two state-of-the-art optimization planners is performed. It is shown STITCHER can generate trajectories through complex environments over long distances (tens of meters) with low computation times (milliseconds).
|
|
14:45-15:15, Paper ThLB2R.10 | |
STATE-NAV: Stability-Aware Traversability Estimation for Bipedal Navigation Over Rough Terrain |
|
Yoon, Ziwon | Georgia Institute of Technology |
Zhu, Yunzhou | University of Edinburgh |
Gan, Lu | Georgia Institute of Technology |
Zhao, Ye | Georgia Institute of Technology |
Keywords: Humanoid and Bipedal Locomotion, Legged Robots, Motion and Path Planning
Abstract: Bipedal robots have advantages in human-centered environments but face greater failure risk compared to more stable morphologies such as wheeled or quadrupedal robots. While learning-based traversability has been widely explored for these platforms, bipedal traversability has largely been addressed using heuristic methods, with limited consideration for instability on rough terrain. In this work, we present the first learning-based traversability estimation and safe navigation framework tailored to bipedal robots operating in diverse, uneven environments. We train our transformer-based neural network to predict bipedal locomotion instability in a risk-sensitive manner, enabling safer locomotion. Based on the predicted instability, we define traversability as stability-aware command velocity—the fastest velocity that can keep instability below a user-defined limit. This velocity-based traversability is integrated into a hierarchical planner, combining RRT* for time-efficient global path planning with MPC for safe execution. We validate our method in MuJoCo simulation, demonstrating improved safety and navigation performance, with enhanced robustness across varying terrains compared to traditional baselines.
|
|
14:45-15:15, Paper ThLB2R.11 | |
Congestion Mitigation for Foraging Robot Swarms Using Trajectory Analysis and Adaptive Spiral Path Strategies |
|
Gonzalez, Arturo | University of Texas at Rio Grande Valley |
Trevino, Artemisa | University of Texas Rio Grande Valley |
Lu, Qi | The University of Texas Rio Grande Valley |
Keywords: Swarm Robotics, Multi-Robot Systems, Biologically-Inspired Robots
Abstract: As foraging robot swarms grow in size, congestion and inter-robot collisions become critical challenges, especially in high-traffic areas like the resource collection zone where robots drop off collected resources. This work aims to mitigate congestion through trajectory analysis and adaptive spiral path strategies. The focus was to enhance resource collection efficiency while reducing inter-robot collisions. A data-driven method was developed using 112 simulated robot trajectories, which were analyzed to extract key movement features. These features fed into a logistic regression model that detects congestion with approximately 93% accuracy, enabling real-time behavior adjustments. Initial results demonstrate substantial improvements through the use of adaptive spiral paths, like square and circular spirals, where robots enter these paths when congestion is detected. Resource delivery rates increased by over 120%, and collision time decreased by 21%, validating the method’s effectiveness. This research underscores the potential of dynamic behavior modification to improve scalability and coordination in swarm robotics.
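As an illustrative sketch of the detection step only (not the authors' pipeline), the snippet below fits a logistic-regression congestion classifier to simple per-trajectory features; the feature names and synthetic data are hypothetical stand-ins for the 112 analyzed trajectories.

```python
# Congestion detection from trajectory features with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 112
# Hypothetical per-trajectory features: mean speed, stop fraction, turn rate.
X = rng.normal(size=(n, 3))
# Synthetic label: slow, stop-and-go, high-turn trajectories look congested.
y = ((-X[:, 0] + X[:, 1] + 0.5 * X[:, 2]) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))

# At run time, a robot flagged as congested would switch to a spiral exit path.
p_congested = clf.predict_proba(X_te[:1])[0, 1]
```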
|
|
14:45-15:15, Paper ThLB2R.12 | |
Vision-Language Action Models for Autonomous Liver Ultrasound: A Comparative Study of π0 and Transformer-Based Policies |
|
McLeod, Angus Jonathan | NVIDIA |
Nelson, Nigel | NVIDIA |
Varvak, Peter | Eonite Perception Inc |
Simson, Walter | Technical University Munich |
Moghani, Masoud | University of Toronto |
Hom, Wendell | Georgia Institute of Technology |
Ofir, Maximilian | NVIDIA |
Alle, Sachidanand | NVIDIA |
Johannsmeier, Lars | Franka Robotics GmbH |
Azizian, Mahdi | Nvidia |
Keywords: Medical Robots and Systems, Imitation Learning
Abstract: Building on advancements in Vision Language Models, researchers have adapted these into Vision Language Action (VLA) models to achieve state-of-the-art results on various robotic manipulation benchmarks. We present the application of the recently released π0 model pretrained on 10,000 hours of robotic data and fine-tune it to perform a cursory survey of the liver. As a baseline, we compare against trained Action Chunking Transformers (ACT). Both methods were trained on 175 ultrasound sweeps representing approximately an hour of data acquired on a tissue mimicking phantom. The methods were then evaluated based on their ability to achieve targeted views during initial contact as well as during the middle and end of the scan. The π0 policy had a higher success rate in acquiring each of these three views. The most challenging target was the initial contact where π0 was successful 47% of the time, compared to only 7% with ACT. The success rate for π0 and ACT was 67% vs 33% and 60% vs 33% for the middle and end scan views respectively. These preliminary results show the potential value of foundational VLA models in autonomous ultrasound acquisitions, an area characterized by many highly complex tasks and where obtaining large numbers of expert demonstrations is difficult. While imitation learning has been applied to robotic ultrasound acquisitions before, to our knowledge, this work represents the first use of VLA in robotic ultrasound.
|
|
14:45-15:15, Paper ThLB2R.13 | |
Open-PAV: An Open Platform Designed to Facilitate Data, Model, and Simulation of Product AV |
|
Ma, Ke | University of Wisconsin-Madison |
Zhou, Hang | University of Wisconsin-Madison |
Li, Xiaopeng | University of Wisconsin-Madison |
Keywords: Data Sets for Robot Learning, Autonomous Vehicle Navigation
Abstract: Open-PAV (Open Product Automated Vehicle) is an open platform designed to facilitate data collection, model calibration, and simulation of product automated vehicle behaviors. It integrates diverse datasets and calibrated vehicle models, making it an essential tool for researchers and developers aiming to study product automated vehicle (PAV) dynamics and their impacts. The project encourages contributions from the research community and provides ready-to-use model parameters for seamless integration with simulation tools.
|
|
14:45-15:15, Paper ThLB2R.14 | |
Learning Precise Robot Motion from Demonstration with Constraint-Aware Refinement |
|
Joo, Sungmoon | Korea Atomic Energy Research Institute |
Keywords: Learning from Demonstration, Imitation Learning, Deep Learning Methods
Abstract: We present a modular system for generating precise robot motion from human demonstrations with constraint-aware refinement. Targeting flexible, precision assembly tasks in small-to-medium factories, our system addresses key challenges in motion planning, data collection, and execution reliability. Demonstrations are captured in real time using a portable teaching interface equipped with a GoPro camera, HTC Vive tracking, and encoder-based gripper sensing, supporting physical, simulated, and real-robot teleoperation modes. Motion generation follows a two-step approach: (1) learning an initial trajectory using GMM/GMR or a diffusion policy, and (2) refining it via constrained nonlinear optimization that ensures goal accuracy and adherence to robot limits. Compared to GMM/GMR, diffusion better captures multi-modal behavior and generalizes to unseen conditions. The refinement stage significantly improves precision and constraint satisfaction. Experiments show that the GMM/GMR method learned only unimodal behavior and failed on test cases far from the demonstrations. The optimization-based refinement improved the success rate but did not guarantee success for all test cases. The diffusion model, however, learned multimodal behavior, and the refinement ensured success in all test cases, highlighting the system’s effectiveness for generalizable, high-precision robotic tasks.
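A minimal sketch of the refinement step only, under simplifying assumptions (a 1-DoF toy joint with hypothetical limits): the learned trajectory is nudged so it reaches the goal exactly and respects bounds while staying close to the demonstration prior.

```python
# Constraint-aware refinement of a learned trajectory via SLSQP.
import numpy as np
from scipy.optimize import minimize

T = 20
learned = np.linspace(0.0, 0.9, T)          # stand-in for a GMR/diffusion output
goal = 1.0                                  # required final joint position
lo, hi = -1.2, 1.2                          # hypothetical joint limits

def objective(q):
    return np.sum((q - learned) ** 2)       # stay close to the demonstration prior

constraints = [{"type": "eq", "fun": lambda q: q[-1] - goal}]   # exact goal reach
bounds = [(lo, hi)] * T

res = minimize(objective, learned, method="SLSQP",
               bounds=bounds, constraints=constraints)
refined = res.x
print("final position:", refined[-1])
```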
|
|
14:45-15:15, Paper ThLB2R.15 | |
Understanding Human Fall from Standing Via Whole-Body Musculoskeletal Simulation |
|
Ma, Chengtian | Tsinghua University |
Wei, Yunyue | Tsinghua University |
Zuo, Chenhui | Tsinghua University |
Zhang, Chen | Tsinghua University |
Sui, Yanan | Tsinghua University |
Keywords: Modeling and Simulating Humans, Body Balancing, Robotics and Automation in Life Sciences
Abstract: Understanding human balancing and falling is crucial for advancing the design of assistive robots and human-like bipedal robots. This study uses a high-dimensional musculoskeletal model with a comprehensive set of skeletal muscles to simulate the transition from standing to falling. We investigate the intrinsic temporal dynamics, motor responses, and the distribution of related metrics across the human body during loss of balance. Through large-scale simulations, we identify a central stability region in front of the feet where center-of-mass trajectories converge, and demonstrate that backward and lateral leaning are associated with increased fall risk. Under fall-prone conditions, such as muscle weakness and neural control latency, we observe shorter fall durations and more consistent fall patterns. Contact analysis reveals frequent collisions at the pelvis, thigh, and head, consistent with clinical observations of hip and head injuries. These findings contribute to a more quantitative understanding of human balancing and falling, and provide insights for the design of bio-inspired control strategies and fall prevention systems.
|
|
14:45-15:15, Paper ThLB2R.16 | |
Spin Swimmer: A Fast, Efficient and Agile Fish-Like Robot |
|
Chivkula, Prashanth | Clemson University |
Tallapragada, Phanindra | Clemson University |
Keywords: Marine Robotics, Biologically-Inspired Robots, Mechanism Design
Abstract: Underwater roboticists have long aimed to replicate the remarkable swimming capabilities of fish, combining speed, efficiency, and agility. To achieve this, engineers and roboticists have focused their efforts on the design of soft or articulated multi-body tails that can oscillate or undulate at frequencies and amplitudes similar to those of the fish they seek to mimic. Such kinematic approaches do not account for the complex interaction of the efficiency of actuation, dynamic response of flexible appendages, and hydrodynamic forces. This work presents a fundamentally novel means of mechanical actuation: a fast-spinning unbalanced rotor internal to the body of the robot that transfers a periodic axial force to an otherwise passive flexible tail. The net result is that the tail acts as a parametric oscillator that undergoes a 2:1 subharmonic resonance. High tail-beat frequencies are achieved with minimal input power due to this parametric resonance. The resulting robot has the lowest cost of transport among free-swimming robots while also being fast, extremely agile, and gyroscopically roll- and pitch-stable. The results demonstrate the importance of exploiting parametric resonances in designing efficient fish-like robots.
|
|
14:45-15:15, Paper ThLB2R.17 | |
Multi-Sensor Wearable Systems for Proactive Fall Risk Assessment |
|
Im, Nathaniel | Portsmouth Abbey School |
Keywords: Whole-Body Motion Planning and Control
Abstract: This study presents an AI-based wearable system for proactive fall risk prediction in older adults. The system integrates IMUs and pressure sensors at key anatomical locations to continuously monitor gait patterns. Using temporal convolutional networks (TCNs) and LSTM neural networks, the system analyzes spatiotemporal features to predict fall risk. Cross-dataset validation demonstrated 91% mean accuracy (±3%) with an AUC of 0.92, while a pilot study with 20 older adults showed 88% accuracy. The TCN model achieved 89% sensitivity and 93% specificity, with a 30% reduction in false alarms compared to single-sensor systems. This technology enables early detection of fall risk indicators weeks or months before incidents occur, allowing for timely interventions that can improve health outcomes and extend independence for older adults, ultimately reducing healthcare costs associated with fall-related injuries.
|
|
14:45-15:15, Paper ThLB2R.18 | |
Myoelectric Spatiotemporal Fusion for Gesture Recognition in Human-Robot Interaction |
|
Ying, Zhenzhi | The University of Tokyo |
Li, Jiaxuan | Dalian University of Technology |
Zhao, Yuxin | Dalian University of Technology |
Zhang, Xianyu | The University of Tokyo |
Li, Shihao | The University of Tokyo |
Sugita, Naohiko | The University of Tokyo |
Shu, Liming | Dalian University of Technology |
Keywords: Gesture, Posture and Facial Expressions, Prosthetics and Exoskeletons, Physical Human-Robot Interaction
Abstract: The application of gesture recognition in human-robot interaction (HRI) demands high levels of robustness and accuracy, while controlling model computational speed and memory usage. Currently, deep learning-based methods for muscle activity feature extraction fail to effectively capture spatiotemporal information from electromyography (EMG) signals, potentially leading to the curse of dimensionality. This paper introduces a novel temporal and cross-channel convolutional neural network (TCCNN) algorithm to process EMG signals and enhance the accuracy and robustness of gesture recognition. TCCNN utilizes a circular concatenation and cross-channel learning approach to extract EMG data correlations between muscle fibers and their adjacent fibers. By integrating parallel and serial processing methods to obtain fused spatiotemporal features, the proposed algorithm improves feature extraction capabilities. In offline experiments containing 17 gestures with 10 subjects, TCCNN achieved a gesture recognition accuracy of 95.21% and significantly outperformed baseline models in recall rates. Online tests further confirmed the system's low latency and natural interaction quality, demonstrating its suitability for real-world HRI applications.
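The following hedged sketch (not the released TCCNN) illustrates the circular cross-channel idea: because EMG electrodes form a ring around the forearm, the electrode axis is circularly padded before a 2D convolution over (electrode, time), with a parallel temporal branch convolving over time only; all shapes and layer sizes are hypothetical.

```python
# Parallel spatial (cross-channel, circular) and temporal branches whose
# outputs are fused into spatiotemporal features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossChannelBlock(nn.Module):
    def __init__(self, out_ch=16):
        super().__init__()
        self.conv = nn.Conv2d(1, out_ch, kernel_size=(3, 5), padding=(0, 2))

    def forward(self, x):                  # x: (batch, 1, electrodes, time)
        x = F.pad(x, (0, 0, 1, 1), mode="circular")   # wrap the electrode ring
        return torch.relu(self.conv(x))

class TemporalBlock(nn.Module):
    def __init__(self, electrodes=8, out_ch=16):
        super().__init__()
        self.conv = nn.Conv1d(electrodes, out_ch, kernel_size=5, padding=2)

    def forward(self, x):                  # x: (batch, electrodes, time)
        return torch.relu(self.conv(x))

emg = torch.randn(4, 8, 200)               # 4 windows, 8 electrodes, 200 samples
spatial = CrossChannelBlock()(emg.unsqueeze(1))        # (4, 16, 8, 200)
temporal = TemporalBlock()(emg)                        # (4, 16, 200)
fused = torch.cat([spatial.mean(dim=2), temporal], dim=1)   # fused features
print(fused.shape)                                     # (4, 32, 200)
```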
|
|
14:45-15:15, Paper ThLB2R.19 | |
Integration of Vision, Language, and Video-Based Learning for Robotic Tasks |
|
Ganatra, Shyam | Arizona State University |
Jeong, Heejin | Arizona State University |
Keywords: Visual Learning, Computer Vision for Automation, Human Factors and Human-in-the-Loop
Abstract: This research presents a novel framework for integrating vision, language, and video-based learning to enhance robotic task execution in dynamic environments. Current robotics systems struggle with generalizing task execution across different scenarios due to limitations in multimodal learning capabilities. Our approach addresses this challenge by developing a comprehensive architecture that enables robots to learn task representations from diverse input modalities autonomously. The proposed framework consists of three primary components: (1) a Vision-Language Model incorporating object detection, semantic segmentation, and visual question answering for scene understanding; (2) a Video Understanding Module extracting action sequences, object interactions, and workflows from demonstrations; and (3) a Task Representation Framework synthesizing multimodal inputs into unified representations for planning. Large Language Models provide reasoning capabilities to generate adaptive task plans, while reinforcement learning optimizes execution based on simulation and real-world feedback. Our experimental methodology includes system validation using the Reachy humanoid robot with camera and microphone inputs, as well as human-robot interaction studies to evaluate adaptation to user feedback. The system is implemented using PyTorch, OpenCV, and ROS2, and once the models are trained it will be evaluated in both Isaac Sim environments and real-world scenarios.
|
|
14:45-15:15, Paper ThLB2R.20 | |
Gaussian Mixture Koopman Kalman Filter: Application to a Closed Koopman System |
|
Van Heck, Cedric | UGent - University of Ghent |
Coene, Annelies | Ghent University |
Crevecoeur, Guillaume | Ghent University |
Keywords: Probabilistic Inference, Probability and Statistical Methods, Dynamics
Abstract: This work extends the Koopman Kalman Filter (KKF), a method that leverages Koopman operator theory to represent nonlinear dynamical systems in a lifted space with linear evolution. By modeling the extended state as a joint Gaussian distribution, KKF enables the use of classical linear filtering tools for nonlinear estimation. Building on this foundation, we explore richer probabilistic representations within the same Koopman-based framework. In particular, we introduce a Gaussian Mixture Model variant (GMMKKF), where the extended state is modeled as a mixture of Gaussians. Each component evolves under globally linear models, enabling efficient inference in systems with significant nonlinearity. Compared to traditional approaches like the Extended Kalman Filter—which relies on local linearizations and unimodal Gaussian beliefs—our method captures multimodal uncertainty for nonlinear evolving systems, while retaining computational tractability. This filtering perspective offers a principled and scalable approach to nonlinear state estimation.
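As a rough illustration of the filtering structure this abstract describes, the sketch below runs one predict/update cycle in which every Gaussian component of the lifted state evolves under the same linear Koopman operator and is corrected by a standard Kalman update; the lifted operator A, noise covariances Q and R, observation model H, and all variable names are assumptions for illustration, not the paper's implementation.

import numpy as np

def gmm_kkf_step(weights, means, covs, A, Q, H, R, z):
    """One predict/update cycle of a Gaussian-mixture Kalman filter in the
    lifted (Koopman) space: each component evolves under the shared linear
    operator A, is corrected with measurement z, and is re-weighted by its
    measurement likelihood."""
    new_w, new_m, new_P = [], [], []
    for w, m, P in zip(weights, means, covs):
        # Predict in the lifted space: globally linear dynamics.
        m_pred = A @ m
        P_pred = A @ P @ A.T + Q
        # Standard Kalman update.
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        innov = z - H @ m_pred
        new_m.append(m_pred + K @ innov)
        new_P.append((np.eye(len(m)) - K @ H) @ P_pred)
        # Re-weight by the Gaussian measurement likelihood.
        lik = np.exp(-0.5 * innov @ np.linalg.solve(S, innov)) / np.sqrt(np.linalg.det(2 * np.pi * S))
        new_w.append(w * lik)
    new_w = np.array(new_w)
    return new_w / new_w.sum(), new_m, new_P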
|
|
14:45-15:15, Paper ThLB2R.21 | |
Infrastructure Inspection with Robots and 3D Reconstruction Methods: A Comparative Study of Computer Vision Algorithms |
|
Ramírez Vázquez, Hortencia Alejandra | Tecnologico De Monterrey |
Pacheco Ramírez, Max | Tecnologico De Monterrey |
Salcedo Vazquez, Andrea Marisol | Tecnológico De Monterrey |
Ceron Lopez, Arturo Eduardo | Tecnologico De Monterrey |
Keywords: Field Robots, Environment Monitoring and Management, Robotics in Hazardous Fields
Abstract: This study evaluates the feasibility of using 3D reconstruction methods based on Structure-from-Motion (SfM), such as 3D Meshing and 3D Gaussian Splatting, as complementary tools for surface inspection in industrial and civil infrastructure settings. Our focus is on detecting small-scale and fine defects (e.g. cracks, scuffs, edge chips) with high accuracy. Our approach consists of taking several pictures of an object using a single RGB camera mounted on an Omnidirectional Mobile Robot. Then, we benchmarked the defect representation accuracy of the 3D reconstruction methods through a set of feature detection methods based on Computer Vision (CV), covering both Classical CV (e.g. Canny edge detection, contour detection, etc.) as well as Modern CV (e.g. Class-Activation Map (CAM) based on CNNs).
|
|
14:45-15:15, Paper ThLB2R.22 | |
Flow Matching Architecture for Navigation |
|
Hernandez, Eduardo | Tecnologico De Monterrey |
Diaz, Alan Ulises | Tecnologico De Monterrey |
Ceron Lopez, Arturo Eduardo | Tecnologico De Monterrey |
Keywords: Vision-Based Navigation, Planning under Uncertainty, Motion and Path Planning
Abstract: This work explores a multimodal approach for generating instantaneous, collision-free trajectories for robotic navigation in novel environments. By utilizing RGB and sparse point cloud data, we aim to enhance environmental awareness and navigation performance. Our focus is on developing a model that maintains robust performance across unseen settings by integrating both Efficient Channel Attention (ECA) and Transformer-based bottleneck models. Initial results demonstrate the potential of fusing RGB and point cloud modalities for effective navigation. The model is initially validated in the PushT simulation environment, lowering the number of parameters of existing baselines in typical Diffusion Policies. However, the absence of a joint multi-modal encoder for native fusion of raw modalities presents a key limitation. Future work will address this by integrating such a component and incorporating online reinforcement learning to enable real-time trajectory correction. We plan to validate our approach through deployment on a modified Shuttle Personnel Carrier EG6088K, equipped with a Velodyne HDL-32 and a Multisense S21 camera, in a dynamic outdoor environment.
|
|
14:45-15:15, Paper ThLB2R.23 | |
Generalized Gimbal Construction: Algorithmic Design of Kinematic Chains with No Self-Collision in Any Configuration |
|
Feshbach, Daniel | University of Pennsylvania |
Schaumburg, Emil | University of Pennsylvania |
Chen, Wei-Hsi | University of Pennsylvania |
Sung, Cynthia | University of Pennsylvania |
Keywords: Kinematics, Computational Geometry, Motion and Path Planning
Abstract: We present a linear-time design algorithm mapping any sequence of axes of motion to a kinematic chain design implementing those axes, arranged such that no configuration has self-collision. Specifically, the algorithm is responsible for placing each joint in a specific pose on its axis of motion and finding link shapes connecting them. The core idea generalizes the structure of gimbals to deal with axes of motion that do not all intersect at a point: the algorithm maintains a bounding sphere about everything generated so far, centered on the previous axis of motion, then routes outside this sphere to place and connect to the next joint. The algorithm is enabled by abstracting mechanism thickness as a bounding radius, and thus link shapes as tubes defined by their centerline paths. Since a tube cannot bend more tightly than its own radius, the link design problem becomes a Dubins planning problem. Proof of the algorithm provides justification for using this tubular abstraction in computational design of linkages, showing that restricting the design space in this way makes it tractable to explore but still fully kinematically expressive. The algorithm is also useful as an initialization to be further optimized for compactness or other properties. We have a preliminary approach to sequentially re-arrange joints (preserving their axes of motion) and link waypoints to minimize chain length while maintaining self-collision avoidance in the neutral configuration.
|
|
14:45-15:15, Paper ThLB2R.24 | |
Towards Mobile Robotic Optical Coherence Tomography for Practical Clinical Imaging |
|
Pan, Haochi | University of Michigan |
Zhou, Genggeng | Stanford University |
Staudinger, Samantha | University of Michigan |
Liu, Jiawei | University of Michigan |
Fleifil, Salma | University of Michigan |
Jin, Catherine | University of Michigan |
Valikodath, Nita | University of Michigan |
Draelos, Mark | University of Michigan |
Keywords: Medical Robots and Systems
Abstract: Optical coherence tomography (OCT) is an indispensable technology in ophthalmology for diagnosing and managing eye disease, but it requires patients who can sit upright and participate in imaging. To overcome this barrier, we introduce a mobile and motion-tolerant robotic OCT system that is capable of versatile use in both outpatient and inpatient clinical environments and is suitable for imaging in diverse clinical configurations. Our system includes a robot arm and a vertical lift, and is equipped with real-time nested MPPI motion planning algorithms for face and pupil movement tracking and obstacle collision avoidance during the imaging process. To enhance motion tolerance, we implement model-free, autoregressive-filter-based active cancellation of eye movement, in which motion-cancellation aiming offsets are derived from filter predictions instead of relying on the last observed eye position. We validate the system's workspace, dynamic tracking, and obstacle avoidance capabilities and then evaluate the effectiveness of predictive cancellation using a retinal eye phantom on a motorized stage to simulate jerk nystagmus. Predictive scan aiming stabilized repeated B-scans and retained contrast within retinal structures, corresponding to higher mean brightness, while residual motion was observed with traditional scan aiming. The combination of these results enables us to move toward clinical deployment in the near future.
|
|
14:45-15:15, Paper ThLB2R.25 | |
RKHS Gaussian Splatting SLAM |
|
Wu, Junzhe | University of Michigan |
Zhang, Ray | University of Michigan |
Ghaffari, Maani | University of Michigan |
Keywords: SLAM, Mapping
Abstract: Recent advancements in 3D Gaussian Splatting (3D-GS) have enabled visually appealing and computationally efficient scene representations for robotics applications. However, while these approaches deliver high visual fidelity, they often fall short in preserving the geometric accuracy required for robust Simultaneous Localization and Mapping (SLAM), particularly in tasks such as manipulation and autonomous driving. In this work, we introduce RKHS GS SLAM, a novel RGBD SLAM system that leverages a 3D-GS-based non-correspondence registration algorithm, augmented with a robust loop closure module and global pose optimization framework. Unlike conventional point cloud registration methods, our approach integrates covariance and feature information inherent to the 3D-GS representation, thereby preserving both visual and geometric properties even when employing a sparser mapping strategy. Extensive experiments on simulated and real-world datasets reveal that our sparse 3D-GS mapping retains most of the performance of dense mappings while reducing the number of Gaussians by 80%. Furthermore, our registration algorithm demonstrates improved tracking accuracy when compared to traditional point cloud registration and rendering-based pose optimization techniques. These results highlight the potential of 3D-GS for efficient and precise SLAM in complex robotic applications.
|
|
14:45-15:15, Paper ThLB2R.26 | |
Learning PID Gains for Planar Tracking of Bio-Inspired Swimming Robots Using Simplified Models |
|
Loya, Kartik | Clemson University |
Chivkula, Prashanth | Clemson University |
Tallapragada, Phanindra | Clemson University |
Keywords: Incremental Learning, Reinforcement Learning, Biologically-Inspired Robots
Abstract: The rising demand for autonomous systems capable of operating in complex marine environments has led to growing interest in underwater and bio-inspired fish-like robots, with applications in environmental monitoring, search and rescue, and offshore inspection. However, accurately modeling and controlling these robots remains a major challenge due to nonlinear fluid-structure interactions, high degrees of freedom, and the unsteady nature of underwater dynamics. To address these complexities, reduced-order models that capture the essential system dynamics offer a promising balance between accuracy and computational efficiency for real-time control. While kinematic models have been widely used in fish-like robots, dynamic models remain relatively scarce, particularly in the context of control. In this work, we develop a reduced-order dynamic model based on a modified Chaplygin Sleigh tailored to an underactuated, internally actuated fish-like robot driven by a reaction wheel. This model serves as the basis for a control framework in which we use data-driven methods and reinforcement learning to tune control gains in order to track a planar trajectory in an efficient and optimal manner. The approach is then validated through implementation and testing on the robotic platform.
|
|
14:45-15:15, Paper ThLB2R.27 | |
Imitation Balancing Control Using Human Balancing Skills for a Wheeled Inverted Pendulum Robot with a Fan |
|
Dohyeon, Kim | Jeonbuk National University |
Jung, Yeongtae | Jeonbuk National University |
Keywords: Telerobotics and Teleoperation, Imitation Learning, Wheeled Robots
Abstract: Wheeled Inverted Pendulum (WIP) robots offer agile mobility but suffer from unstable, underactuated dynamics. To address this, we are developing a Wheeled Inverted Pendulum with a Fan (WIPF), which uses bidirectional fan thrust to achieve full actuation. This study proposes an imitation balancing controller that leverages human balancing skills. A human-machine interface (HMI) synchronizes the human and robot by sending reference trajectories from the human to the robot and providing force feedback based on their state differences. During teleoperation, external disturbances are applied to the robot, and the user responds by balancing it in real time. A 1D Convolutional Neural Network (1D-CNN) was trained to imitate the human-generated reference trajectory. The proposed controller outperformed traditional Linear Quadratic Regulator (LQR) in both simulations and experiments. Future work will extend to locomotion control and more dynamic movements beyond balancing.
|
|
14:45-15:15, Paper ThLB2R.28 | |
Universal Control Barrier Functions for Agile, but Safe, Multi-Robot Control in Dynamic, Cluttered Environments |
|
Chandra, Rohan | University of Virginia |
Keywords: Multi-Robot Systems
Abstract: Robots operating in complex human environments must balance two often conflicting objectives: agility and safety. Traditional safety guarantees, while effective at preventing collisions and unsafe behaviors, tend to restrict agile movement and can result in deadlocks in crowded, multi-agent scenarios. This poster introduces Universal Control Barrier Functions (UCBFs), a new class of control-theoretic tools designed to simultaneously ensure safety and prevent deadlocks by introducing Liveness Sets—a novel extension of traditional safety sets. The work demonstrates how UCBFs generalize classical Control Barrier Functions to provide both safety (set invariance) and liveness (ensuring progress), allowing for safe yet fluid motion in real-world dynamic environments. Preliminary simulations and hardware results on quadrupeds and other robots show that UCBFs enable agile behavior without compromising on safety. The framework is further extended to accommodate challenges such as limited actuation, lidar-only perception, decentralized multi-agent coordination, and neural approximations of control. These results underscore the versatility of UCBFs as a foundational control primitive for next-generation robotic systems deployed in cluttered, human-centric spaces.
|
|
14:45-15:15, Paper ThLB2R.29 | |
Autonomous Robotic Assistant for Knee Replacement Surgery |
|
Baweja, Paramjit Singh | Carnegie Mellon University |
Gupta, Shivangi | Carnegie Mellon University |
Yang, Liwei | Carnegie Mellon University |
Warrier, Abhishek | Carnegie Mellon University |
Wu, Qilin | Carnegie Mellon University |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics, Surgical Robotics: Planning
Abstract: The emergence of computer-assisted total knee arthroplasty (TKA) aims to reduce post-operative complications by improving surgical accuracy. Leveraging implant geometry and patient-specific bone shape, modern technology enables more precise bone preparation. However, current systems rely on infrared-based tracking for registration, which is invasive, cumbersome, and introduces additional hardware into the sterile field. These systems also impose strict line-of-sight constraints, obstructing critical areas within the surgical field and limiting surgeon mobility. In contrast, our work proposes a vision-based robotic alternative capable of performing accurate registration and autonomously drilling pilot holes for surgical pin placement, without reliance on external tracking hardware. This system represents a step toward fully self-contained surgical robots that simplify operating room setup, reduce intra-operative errors, and improve the accessibility of robotic assistance in orthopedic procedures.
|
|
14:45-15:15, Paper ThLB2R.30 | |
AURA: Painted Heart Beats |
|
Adhya, Angshu | University of Michigan, Ann Arbor |
Yang, Cindy | University of Michigan |
Wu, Emily | University of Michigan |
Hasan, Rishad | University of Michigan |
Narula, Abhishek | University of Michigan |
Alves-Oliveira, Patrícia | Amazon Lab126 |
Keywords: Human-Centered Robotics, Human Detection and Tracking, Art and Entertainment Robotics
Abstract: In this work we present AURA, a framework for synergetic human-artist painting. We developed a robot arm that collaboratively paints with a human artist. The robot has an awareness of the artist's heartbeat through the EmotiBit sensor, which provides the arousal levels of the painter. Given the heartbeat detected, the robot decides to increase proximity to the artist's workspace or retract. If a higher heartbeat is detected, which is associated with increased arousal in human artists, the robot will move away from that area of the canvas. If the artist's heart rate is detected as neutral, indicating the human artist's baseline state, the robot will continue its painting actions across the entire canvas. We also demonstrate and propose alternative robot-artist interactions using natural language and physical touch. This work combines the biometrics of a human artist to inform fluent artistic interactions.
|
|
ThDT1 |
302 |
Model Predictive Control |
Regular Session |
Chair: Lin, Ming C. | University of Maryland at College Park |
Co-Chair: Ding, Yanran | University of Michigan |
|
15:15-15:20, Paper ThDT1.1 | |
Time-Correlated Model Predictive Path Integral: Smooth Action Generation for Sampling-Based Control |
|
Lee, Minhyeong | Seoul National University |
Lee, Dongjun | Seoul National University |
Keywords: Motion and Path Planning, Integrated Planning and Control, Optimization and Optimal Control
Abstract: In this paper, we introduce time-correlated model predictive path integral (TC-MPPI), a novel approach to mitigate action noise in sampling-based control methods. Unlike conventional smoothing techniques that rely on post-processing or additional state variables, TC-MPPI directly incorporates temporal correlation of actions into stochastic optimal control, effectively enforcing quadratic costs on action derivatives. This reformulation enables us to generate smooth action sequences without extra modifications, using a time-correlated and conditional Gaussian sampling distribution. We demonstrate the effectiveness of our approach through simulations on various robotic platforms, including a pendulum, cart-pole, 2D bicopter, 3D quadcopter, and autonomous vehicle. Simulation videos are available at https://youtu.be/nWfJ2MAV2JI.
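As a rough illustration of what a time-correlated Gaussian sampling distribution can look like when it encodes quadratic costs on action derivatives, the sketch below draws noise trajectories whose precision matrix penalizes first differences; the construction, parameter names, and values are illustrative assumptions rather than the paper's exact distribution.

import numpy as np

def sample_time_correlated_noise(horizon, n_samples, lam=10.0, eps=1e-3, rng=None):
    """Draw action-noise sequences whose inverse covariance penalizes first
    differences, i.e. an assumed smoothness prior of the kind the abstract
    describes; each returned row is one smooth noise trajectory."""
    rng = np.random.default_rng() if rng is None else rng
    # First-difference operator D of shape (horizon-1, horizon): (D @ u)[i] = u[i+1] - u[i].
    D = np.diff(np.eye(horizon), axis=0)
    precision = lam * D.T @ D + eps * np.eye(horizon)   # quadratic penalty on derivatives
    cov = np.linalg.inv(precision)
    L = np.linalg.cholesky(cov)
    return (L @ rng.standard_normal((horizon, n_samples))).T

# Example: 64 smooth noise trajectories over a 30-step horizon (hypothetical sizes).
noise = sample_time_correlated_noise(horizon=30, n_samples=64)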
|
|
15:20-15:25, Paper ThDT1.2 | |
Gradient-Based Trajectory Optimization with Parallelized Differentiable Traffic Simulation |
|
Son, Sanghyun | University of Maryland |
Zheng, Laura | University of Maryland, College Park |
Clipp, Brian | Kitware Inc |
Greenwell, Connor | Kitware Inc |
Philip, Sujin | Kitware Inc |
Lin, Ming C. | University of Maryland at College Park |
Keywords: Simulation and Animation, Optimization and Optimal Control
Abstract: We present a parallelized differentiable traffic simulator based on the Intelligent Driver Model (IDM), a car-following framework that incorporates driver behavior as key variables. Our vehicle simulator efficiently models vehicle motion, generating trajectories that can be supervised to fit real-world data. By leveraging its differentiable nature, IDM parameters are optimized using gradient-based methods. With the capability to simulate up to 2 million vehicles in real time, the system is scalable for large-scale trajectory optimization. We show that we can use the simulator to filter noise in the input trajectories (trajectory filtering), reconstruct dense trajectories from sparse ones (trajectory reconstruction), and predict future trajectories (trajectory prediction), with all generated trajectories adhering to physical laws. We validate our simulator and algorithm on several datasets including NGSIM and Waymo Open Dataset. The code is publicly available at: https://github.com/SonSang/diffidm.
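The Intelligent Driver Model at the core of the simulator is a standard closed-form car-following law, so a minimal differentiable version is easy to sketch; the snippet below uses generic textbook parameter defaults and is not the authors' implementation (which is available at the repository linked above).

import torch

def idm_acceleration(v, dv, s, v0=30.0, T=1.5, a=1.5, b=2.0, s0=2.0, delta=4.0):
    """Differentiable Intelligent Driver Model acceleration for a batch of vehicles.
    v: ego speed, dv: approach rate to the leader (v - v_lead), s: bumper-to-bumper gap.
    Parameter values are generic defaults, not the ones fitted in the paper; all
    operations stay in torch so gradients can flow back to the inputs/parameters."""
    s_star = s0 + v * T + v * dv / (2.0 * (a * b) ** 0.5)     # desired dynamic gap
    return a * (1.0 - (v / v0) ** delta - (s_star / s) ** 2)

# Example: three followers with different gaps and approach rates (hypothetical values).
v = torch.tensor([25.0, 25.0, 25.0], requires_grad=True)
dv = torch.tensor([2.0, 0.0, -1.0])
s = torch.tensor([30.0, 50.0, 15.0])
acc = idm_acceleration(v, dv, s)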
|
|
15:25-15:30, Paper ThDT1.3 | |
Swept Volume-Aware Trajectory Planning and MPC Tracking for Multi-Axle Swerve-Drive AMRs |
|
Hu, Tianxin | Nanyang Technological University |
Yuan, Shenghai | Nanyang Technological University |
Bai, Ruofei | Nanyang Technological University |
Xu, Xinhang | Nanyang Technological University |
Liao, Yuwen | Nanyang Technological University |
Liu, Fen | Nanyang Technological University |
Xie, Lihua | NanyangTechnological University |
Keywords: Integrated Planning and Control, Motion and Path Planning, Computational Geometry
Abstract: Multi-axle autonomous mobile robots (AMRs) are set to revolutionize the future of robotics in logistics. As the backbone of next-generation solutions, these robots face a critical challenge: managing and minimizing swept volume during turns while maintaining precise control. Traditional systems designed for standard vehicles often struggle with the complex dynamics of multi-axle configurations, leading to inefficiency and increased safety risk in confined spaces. Our innovative framework overcomes these limitations by combining swept volume minimization with Signed Distance Field (SDF) path planning and model predictive control (MPC) for independent wheel steering. This approach not only plans paths with an awareness of the swept volume, but actively minimizes it in real-time, allowing each axle to follow a precise trajectory while significantly reducing the space the vehicle occupies. By predicting future states and adjusting the turning radius of each wheel, our method enhances both maneuverability and safety, even in the most constrained environments. Unlike previous works, our solution goes beyond basic path calculation and tracking, offering real-time path optimization with minimal swept volume and efficient individual axle control. To our knowledge, this is the first comprehensive approach to tackle these challenges, delivering life-saving improvements in control, efficiency, and safety for multi-axle AMRs. Furthermore, we will open-source our work to foster collaboration and enable others to advance safer and more efficient autonomous systems.
|
|
15:30-15:35, Paper ThDT1.4 | |
Efficient Trajectory Generation Based on Traversable Planes in 3D Complex Architectural Spaces |
|
Zhang, Mengke | Zhejiang University |
Tian, Zhihao | Nanjing Institute of Technology |
Xia, Yaoguang | China Tobacco Zhejiang Industrial Co., Ltd |
Xu, Chao | Zhejiang University |
Gao, Fei | Zhejiang University |
Cao, Yanjun | Zhejiang University, Huzhou Institute of Zhejiang University |
Keywords: Motion and Path Planning, Field Robots, Nonholonomic Motion Planning
Abstract: With the increasing integration of robots into human life, their role in architectural spaces, where people spend most of their time, has become more prominent. While motion capabilities and accurate localization for automated robots have rapidly developed, the challenge remains to generate efficient, smooth, comprehensive, and high-quality trajectories in these spaces. In this paper, we propose a novel efficient planner for ground robots to autonomously navigate in large, complex, multi-layered architectural spaces. Considering that traversable regions typically include ground, slopes, and stairs, which are planar or nearly planar structures, we simplify the problem to navigation within and between complex intersecting planes. We first extract traversable planes from 3D point clouds through segmenting, merging, classifying, and connecting to build a plane-graph, which is lightweight but fully represents the traversable regions. We then formulate a trajectory optimization based on the motion-state trajectory and fully consider the special constraints that arise when crossing multi-layer planes to maximize the robot's maneuverability. We conduct experiments in simulated environments and test on a CubeTrack robot in real-world scenarios, validating the method's effectiveness and practicality.
|
|
15:35-15:40, Paper ThDT1.5 | |
Model Predictive Control with Visibility Graphs for Humanoid Path Planning and Tracking against Adversarial Opponents |
|
Hou, Ruochen | UCLA |
Fernandez, Gabriel Ikaika | University of California Los Angeles |
Zhu, Mingzhang | University of California, Los Angeles |
Hong, Dennis | UCLA |
Keywords: Motion and Path Planning, Collision Avoidance, Optimization and Optimal Control
Abstract: In this paper we detail the methods used for obstacle avoidance, path planning, and trajectory tracking that helped us win the adult-sized, autonomous humanoid soccer league in RoboCup 2024. Our team was undefeated in all seated matches and scored 45 goals over 6 games, winning the championship game 6 to 1. During the competition, a major challenge for collision avoidance was the measurement noise coming from bipedal locomotion and a limited field of view (FOV). Furthermore, obstacles would sporadically jump in and out of our planned trajectory, and at times our estimator would place our robot inside a hard constraint. Any planner in this competition must also be computationally efficient enough to re-plan and react in real time. This motivated our approach to trajectory generation and tracking. In many scenarios both long-term and short-term planning are needed. To efficiently find a long-term general path that avoids all obstacles we developed DAVG (Dynamic Augmented Visibility Graphs). DAVG focuses on essential path planning by setting certain regions to be active based on obstacles and the desired goal pose. By augmenting the states in the graph, turning angles are considered, which is crucial for a large soccer-playing robot as turning may be more costly. A trajectory is formed by linearly interpolating between discrete points generated by DAVG. A modified version of model predictive control (MPC), called cf-MPC (Collision-Free MPC), is then used to track this trajectory and handle short-term planning. Without having to switch formulations, cf-MPC takes into account the robot dynamics and collision-free constraints, and without a hard switch the control input can transition smoothly in cases where the noise places our robot inside a constraint boundary. The nonlinear formulation runs at approximately 120 Hz, while the quadratic version achieves around 400 Hz.
|
|
15:40-15:45, Paper ThDT1.6 | |
Learning Time-Optimal Online Replanning for Distributed Model Predictive Contouring Control of Quadrotors |
|
Guan, Xin | Zhejiang University |
Zhao, Fangguo | Zhejiang University |
Tian, Shunxin | Zhejiang University |
Li, Shuo | Zhejiang University |
Keywords: Motion and Path Planning, Aerial Systems: Mechanics and Control
Abstract: Achieving time-optimal flight in real time for multi-drone systems presents significant challenges, particularly in scenarios requiring rapid responses or aggressive maneuvers. This paper introduces a novel framework that bridges the gap between time-optimal polynomial trajectory generation and optimal control, facilitating efficient online replanning (100 Hz onboard) for multiple quadrotors. Specifically, the proposed method leverages a neural network to learn optimal time allocations for polynomial trajectories, which are then integrated with Model Predictive Contouring Control to fully exploit the dynamics of quadrotors. We further extend this approach to multi-drone systems, enabling collaborative high-speed flight with reciprocal collision avoidance. We benchmark the time-optimal performance and computational efficiency of our method in a drone racing scenario and demonstrate its effectiveness in agile cooperative flight within more constrained simulation and real-world environments. The results demonstrate that the proposed method achieves agile waypoint traversal at speeds of up to 19 m/s in simulation and up to 9 m/s in a two-drone real-world scenario.
|
|
15:45-15:50, Paper ThDT1.7 | |
Predictive Control with Indirect Adaptive Laws for Payload Transportation by Quadrupedal Robots |
|
Amanzadeh, Leila | Virginia Tech University |
Chunawala, Taizoon Aliasgar | Virginia Polytechnic Institute and State University |
Fawcett, Randall | Virginia Polytechnic Institute and State University |
Leonessa, Alexander | Virginia Tech |
Akbari Hamed, Kaveh | Virginia Tech |
Keywords: Legged Robots, Motion Control, Multi-Contact Whole-Body Motion Planning and Control
Abstract: This paper formally develops a novel hierarchical planning and control framework for robust payload transportation by quadrupedal robots, integrating a model predictive control (MPC) algorithm with a gradient-descent-based adaptive updating law. At the framework's high level, an indirect adaptive law estimates the unknown parameters of the reduced-order (template) locomotion model under varying payloads. These estimated parameters feed into an MPC algorithm for real-time trajectory planning, incorporating a convex stability criterion within the MPC constraints to ensure the stability of the template model's estimation error. The optimal reduced-order trajectories generated by the high-level adaptive MPC (AMPC) are then passed to a low-level nonlinear whole-body controller (WBC) for tracking. Extensive numerical investigations validate the framework's capabilities, showcasing the robot's proficiency in transporting unmodeled, unknown static payloads of up to 109% of its mass in experiments on flat terrains and 91% on rough experimental terrains. The robot also successfully manages dynamic payloads of 73% of its mass on rough terrains. Performance comparisons with a normal MPC and an L1-MPC indicate a significant improvement. Furthermore, comprehensive hardware experiments conducted in indoor and outdoor environments confirm the method's efficacy on rough terrains despite uncertainties such as payload variations, push disturbances, and obstacles.
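A minimal sketch of what a gradient-descent-based indirect adaptive law can look like is given below; the parameter vector, the reduced-order prediction function, the finite-difference gradient, and the step size are all hypothetical placeholders, since the abstract does not specify the exact update.

import numpy as np

def adaptive_update(theta, x, x_pred_fn, x_meas, gamma=0.05):
    """One gradient-descent step on the squared one-step prediction error of a
    reduced-order (template) model. theta: unknown parameters (e.g., payload
    terms), x_pred_fn(x, theta): predicted next state, x_meas: measured next
    state. The gradient is taken by forward finite differences for brevity."""
    err = x_pred_fn(x, theta) - x_meas
    grad = np.zeros_like(theta)
    h = 1e-6
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = h
        err_p = x_pred_fn(x, theta + d) - x_meas
        grad[i] = (0.5 * err_p @ err_p - 0.5 * err @ err) / h
    return theta - gamma * grad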
|
|
ThDT2 |
301 |
Learning-Based SLAM 3 |
Regular Session |
Chair: Leutenegger, Stefan | Technical University of Munich |
Co-Chair: Papalia, Alan | Massachusetts Institute of Technology |
|
15:15-15:20, Paper ThDT2.1 | |
M3DSS: A Multi-Platform, Multi-Sensor, and Multi-Scenario Dataset for SLAM System |
|
Huang, Shulei | Northeastern University |
Zhang, Haotian | Northeastern University |
Xu, Kang | Northeastern University |
Lv, Xianwei | Northeastern University |
Ma, Xiaoguang | Northeastern University |
Keywords: Data Sets for SLAM, SLAM, Visual-Inertial SLAM
Abstract: This paper proposes M3DSS, a multi-platform, multi-sensor, and multi-scenario dataset for Simultaneous Localization and Mapping (SLAM) systems. Fifty-five sequences were collected from multiple platforms, including handheld equipment, an unmanned ground vehicle, a quadruped robot, a car, and an unmanned aerial vehicle. The sensors used in M3DSS include two pairs of stereo event cameras with resolutions of 640×480 and 346×260, one infrared camera, four RGB cameras, two visual-inertial sensors, four mechanical LiDARs and one solid-state LiDAR, three inertial measurement units, and two global navigation satellite and inertial navigation systems with real-time kinematic signals. In total, 21 sensors were used on 5 different platforms under various challenging scenarios, including extreme illumination, aggressive motion, low-texture, and high-speed driving scenarios. To the best of our knowledge, M3DSS offers the richest event-based sensory information for SLAM to date. We comprehensively evaluate state-of-the-art SLAM approaches and identify their limitations on M3DSS. Details can be found at https://neufs-ma.github.io/M3DSS.
|
|
15:20-15:25, Paper ThDT2.2 | |
Uncertainty-Aware Visual-Inertial SLAM with Volumetric Occupancy Mapping |
|
Jung, Jaehyung | Technical University of Munich |
Boche, Simon | Technical University of Munich |
Barbas Laina, Sebastián | TU Munich |
Leutenegger, Stefan | Technical University of Munich |
Keywords: Visual-Inertial SLAM, SLAM, Mapping
Abstract: We propose visual-inertial simultaneous localization and mapping that tightly couples sparse reprojection errors, inertial measurement unit pre-integrals, and relative pose factors with dense volumetric occupancy mapping. Hereby depth predictions from a deep neural network are fused in a fully probabilistic manner. Specifically, our method is rigorously uncertainty-aware: first, we use depth and uncertainty predictions from a deep network not only from the robot's stereo rig, but we further probabilistically fuse motion stereo that provides depth information across a range of baselines, therefore drastically increasing mapping accuracy. Next, predicted and fused depth uncertainty propagates not only into occupancy probabilities but also into alignment factors between generated dense submaps that enter the probabilistic nonlinear least squares estimator. This submap representation offers globally consistent geometry at scale. Our method is thoroughly evaluated in two benchmark datasets, resulting in localization and mapping accuracy that exceeds the state of the art, while simultaneously offering volumetric occupancy directly usable for downstream robotic planning and control in real-time.
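Per pixel, the probabilistic fusion of learned depth with motion-stereo depth described above reduces, under a Gaussian assumption, to inverse-variance weighting; the snippet below is a minimal sketch of that step and is not the authors' implementation.

import numpy as np

def fuse_depths(d_net, var_net, d_stereo, var_stereo):
    """Per-pixel inverse-variance fusion of two depth estimates with their
    predicted uncertainties (all arrays share the same shape); the fused
    variance is what would then propagate into occupancy probabilities."""
    w_net = 1.0 / var_net
    w_stereo = 1.0 / var_stereo
    d_fused = (w_net * d_net + w_stereo * d_stereo) / (w_net + w_stereo)
    var_fused = 1.0 / (w_net + w_stereo)
    return d_fused, var_fused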
|
|
15:25-15:30, Paper ThDT2.3 | |
Real-Time 3D Reconstruction Via Camera-LIDAR (2D) Fusion for Mobile Robots: A Gaussian Splatting Approach |
|
Sandula, Ajay Kumar | Indian Institute of Science, Bengaluru |
Damodaran, Shriram | National Institute of Technology, Jalandhar, India |
Nagaraj, Suhas | University of Maryland, College Park |
Ghose, Debasish | Indian Institute of Science |
Biswas, Pradipta | Indian Institute of Science |
Keywords: Visual-Inertial SLAM, Mapping, Sensor Fusion
Abstract: We present a novel 3D reconstruction-based SLAM (Simultaneous Localization and Mapping) approach for robots that leverages multimodal sensory input data, including a camera and a 2D LiDAR. By integrating these inputs with the Gaussian splatting technique, our method significantly enhances performance over traditional SLAM approaches. Traditional SLAM techniques often struggle with the limitations of monocular vision and fail to accurately map and locate objects in dynamic and cluttered environments. Relying purely on a camera to localize the robot and create the map is challenging in the presence of dynamic obstacles in the scene. To address this, we propose a multimodal sensor-fusion-based 3D reconstruction approach. Our approach employs LiDAR-based localization to achieve precise positioning of both the camera and the robot, while utilizing the Gaussian splatting technique for robust environmental mapping and 3D reconstruction. This approach is robust to dynamic obstacles in the scene. We have conducted extensive experiments in various real-world and simulated environments, demonstrating that our method not only outperforms traditional monocular SLAM approaches but also achieves higher accuracy in terms of localization and the constructed map. Our results demonstrate substantial improvements in 3D reconstruction for mobile robots, achieving reduced computational load, higher FPS, and enhanced scaling accuracy.
|
|
15:30-15:35, Paper ThDT2.4 | |
DVN-SLAM: Dynamic Visual Neural SLAM Based on Local-Global Encoding |
|
Wu, Wenhua | Shang Hai Jiao Tong University |
Wang, Guangming | University of Cambridge |
Deng, Ting | Imperial College London |
Aegidius, Sebastian | University College London |
Shanks, Stuart | University College London |
Modugno, Valerio | University College London |
Kanoulas, Dimitrios | University College London |
Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: SLAM, Mapping, Localization
Abstract: Recent research on Simultaneous Localization and Mapping (SLAM) based on implicit representation has shown promising results in indoor environments. However, some challenges remain: the limited scene representation capability of implicit encoding, the uncertainty in the rendering process from implicit representations, and the disruption of consistency by dynamic objects. To address these challenges, we propose a dynamic visual SLAM system based on local-global fusion neural implicit representation, named DVN-SLAM. To improve the scene representation capability, we introduce a local-global fusion neural implicit representation that enables the construction of an implicit map while considering both global structure and local details. To tackle uncertainties arising from the rendering process, we design an information concentration loss for optimization, aiming to concentrate scene information on object surfaces. The proposed DVN-SLAM achieves competitive performance in localization and mapping across multiple datasets. More importantly, DVN-SLAM demonstrates robustness in dynamic scenes without semantic or optical-flow priors, which sets it apart from other NeRF-based methods.
|
|
15:35-15:40, Paper ThDT2.5 | |
Dy3DGS-SLAM: Monocular 3DGS-SLAM System for Dynamic Environments |
|
Li, Mingrui | Dalian University of Technology |
Zhou, Yiming | Saarland University of Applied Science |
Zhou, Hongxing | Beijing University of Chemical Technology |
Hu, Xinggang | Dalian University of Technology |
Roemer, Florian | Fraunhofer IZFP |
Wang, Hongyu | Dalian University of Technology |
Osman, Ahmad | Htw Saar |
Keywords: SLAM, Mapping, Localization
Abstract: The current SLAM methods based on NeRF or 3DGS have shown impressive results in reconstructing ideal static 3D scenes. However, they perform poorly in tracking and reconstruction when facing more challenging dynamic environments, such as real-world scenes involving dynamic elements. Although some NeRF-based SLAM methods have attempted to address these dynamic challenges, they rely on RGB-D inputs, and there is a lack of methods that work with pure RGB input. To address these challenges, we introduce Dy3DGS-SLAM, the first 3DGS-SLAM method for dynamic scenes using monocular RGB input. For tracking, our method first acquires dynamic object masks through an optical flow estimation system, then combines them with a monocular depth estimation system to obtain merged masks and recover scale. This allows us to remove dynamic objects from non-predefined scenes, enabling dense frame-to-frame mapping. For rendering, we prune the Gaussians generated by pixels with dynamic masks, while applying a scale regularizer to avoid Gaussian artifacts. We impose additional photometric, geometric, and uncertainty losses on the proxy depth to improve rendering accuracy. Experimental results show that our method achieves state-of-the-art (SOTA) tracking and rendering results in dynamic environments, while also being competitive with or outperforming RGB-D methods.
|
|
15:40-15:45, Paper ThDT2.6 | |
SGBA: Semantic Gaussian Mixture Model-Based LiDAR Bundle Adjustment |
|
Ji, Xingyu | Nanyang Technological University |
Yuan, Shenghai | Nanyang Technological University |
Li, Jianping | Nanyang Technological University |
Yin, Pengyu | Nanyang Technological University |
Cao, Haozhi | Nanyang Technological University |
Xie, Lihua | NanyangTechnological University |
Keywords: Mapping, Localization, SLAM
Abstract: LiDAR bundle adjustment (BA) is an effective approach to reduce the drifts in pose estimation from the front-end. Existing works on LiDAR BA usually rely on predefined geometric features for landmark representation. This reliance restricts generalizability, as the system will inevitably deteriorate in environments where these specific features are absent. To address this issue, we propose SGBA, a LiDAR BA scheme that models the environment as a semantic Gaussian mixture model (GMM) without predefined feature types. This approach encodes both geometric and semantic information, offering a comprehensive and general representation adaptable to various environments. Additionally, to limit computational complexity while ensuring generalizability, we propose an adaptive semantic selection framework that selects the most informative semantic clusters for optimization by evaluating the condition number of the cost function. Lastly, we introduce a probabilistic feature association scheme that considers the entire probability density of assignments, which can manage uncertainties in measurement and initial pose estimation. We have conducted various experiments and the results demonstrate that SGBA can achieve accurate and robust pose refinement even in challenging scenarios with low-quality initial pose estimation and limited geometric features. We plan to open source the work for the benefit of the community @ https://github.com/Ji1Xinyu/SGBA.
|
|
15:45-15:50, Paper ThDT2.7 | |
GeoRecon: Geometric Coherence for Online 3D Scene Reconstruction from Monocular Video |
|
Wang, Yanmei | Chinese Academy of Sciences |
Chu, Fupeng | Chinese Academy of Sciences |
Han, Zhi | Shenyang Institute of Automation, Chinese Academy of Sciences |
Tang, Yandong | Shenyang Institute of Automation, CAS |
Keywords: Mapping, Cognitive Modeling
Abstract: Online 3D scene reconstruction from monocular video aims to incrementally recover a 3D mesh from monocular RGB videos. It enables robots to accomplish tasks involving interactions with the environment. Due to the high memory consumption of 3D data, almost all existing methods adopt a coarse-to-fine architecture, in which the voxels are progressively sparsified and split across levels. However, these methods overlook alignment between different levels, resulting in poor geometric properties of the reconstructed scene. Furthermore, the whole framework relies on voxel features for supervision, lacking effective supervision of the image geometric features extracted by the feature extraction network. These geometric features are essential for further 3D scene reconstruction. To tackle the above problems, we propose GeoRecon, which achieves geometrically coherent reconstruction through keyframe 2D representation self-regression and cross-level 3D voxel feature alignment. Specifically, in the 2D image space, to alleviate the lack of supervision in 2D feature extraction, an image reconstruction self-supervised regression constraint is introduced on the input 2D keyframes to ensure that the extracted features learn accurate geometric features and, in turn, accurate voxel features. In the 3D voxel feature space, to achieve consistent alignment between different levels, the high-level voxel features are used to constrain the low-level voxel features, achieving alignment from coarse (i.e., low-level) voxel features to fine (i.e., high-level) voxel features. With these two components, the proposed method effectively reconstructs the geometric structures of the scene. The experimental results demonstrate the effectiveness of the proposed method.
|
|
ThDT3 |
303 |
Space Robotics 2 |
Regular Session |
Chair: Janabi-Sharifi, Farrokh | Ryerson University |
Co-Chair: Vidal-Calleja, Teresa A. | University of Technology Sydney |
|
15:15-15:20, Paper ThDT3.1 | |
AstroLoc2: Fast Sequential Depth-Enhanced Localization for Free-Flying Robots |
|
Soussan, Ryan | Aerodyne Industries |
Moreira, Marina | Instituto Superior Técnico, Lisbon University |
Coltin, Brian | Carnegie Mellon University |
Smith, Trey | NASA Ames Research Center |
Keywords: Space Robotics and Automation, Vision-Based Navigation, Localization
Abstract: We present AstroLoc2, a monocular and time-of-flight (ToF) visual-inertial graph-based localizer used by the Astrobee free-flying robots on the International Space Station (ISS). AstroLoc2 sequentially performs odometry and absolute localization in a single process to decouple map noise from velocity and IMU bias estimation and run efficiently on resource constrained platforms. It improves monocular visual-inertial odometry robustness by adding ToF correspondence factors and uses adaptive map-matching to increase image registration reliability in dynamic environments while preserving fast matching in static ones. We evaluate the performance of AstroLoc2 on a public dataset of 10 ISS activities and show that it improves localization accuracy by 16% and success rates by 5.5% while maintaining a faster runtime than leading methods. AstroLoc2 has enabled the Astrobee robots to perform higher precision maneuvers in changing environments on the ISS. It can be configured for other limited computation platforms and we release the source code to the public.
|
|
15:20-15:25, Paper ThDT3.2 | |
Mixing Data-Driven and Geometric Models for Satellite Docking Port State Estimation Using an RGB or Event Camera |
|
Le Gentil, Cedric | University of Toronto |
Naylor, Jack | University of Sydney |
Munasinghe, Nuwan | University of Technology Sydney (UTS) |
Mehami, Jasprabhjit | University of Technology Sydney |
Dai, Benny | University of Technology Sydney |
Asavkin, Mikhail | ANT61 |
Dansereau, Donald | University of Sydney |
Vidal-Calleja, Teresa A. | University of Technology Sydney |
Keywords: Space Robotics and Automation, Visual Tracking, Deep Learning for Visual Perception
Abstract: In-orbit automated servicing is a promising path towards lowering the cost of satellite operations and reducing the amount of orbital debris. For this purpose, we present a pipeline for automated satellite docking port detection and state estimation using monocular vision data from standard RGB sensing or an event camera. Rather than taking snapshots of the environment, an event camera has independent pixels that asynchronously respond to light changes, offering advantages such as high dynamic range, low power consumption, and low latency. This work focuses on satellite-agnostic operations (only a geometric knowledge of the actual port is required) using the recently released Lockheed Martin Mission Augmentation Port (LM-MAP) as the target. By leveraging shallow data-driven techniques to preprocess the incoming data to highlight the LM-MAP's reflective navigational aids and then using basic geometric models for state estimation, we present a lightweight and data-efficient pipeline that can be used independently with either RGB or event cameras. We demonstrate the soundness of the pipeline and perform a quantitative comparison of the two modalities based on data collected with a photometrically accurate test bench that includes a robotic arm to simulate the target satellite's uncontrolled motion. The data has been made publicly available: https://uts-ri.github.io/rgb_event_docking_port/
|
|
15:25-15:30, Paper ThDT3.3 | |
A Visual Servo System for Robotic On-Orbit Servicing Based on 3D Perception of Non-Cooperative Satellite |
|
Zhao, Panpan | Shandong University |
Jin, Li | Shandong University |
Chen, Yeheng | Zhejiang Lab |
Li, Jiachen | Zhejiang University |
Song, Xiuqiang | Shandong University, China; Engineering Research Center of Digit |
Chen, Wenxuan | Zhejiang Lab |
Li, Nan | Technology and Engineering Center for Space Utilization, Chinese |
Du, Wenjuan | Zhejiang Lab |
Ma, Ke | Zhejiang Lab |
Wang, Xiaokun | Zhejianglab |
Li, Yuehua | Zhejiang Lab |
Xiangxu, Xiangxu | Shandong University |
Qin, Xueying | Shandong University |
Keywords: Space Robotics and Automation, Perception for Grasping and Manipulation, Visual Servoing
Abstract: The 3D perception of satellites, including both their shape and pose, is a key foundation for robotic on-orbit servicing. However, the demanding space environment—such as intense and dim illumination—presents significant challenges. Previous non-cooperative methods focus on specific geometric features like solar panel brackets or docking rings, overlooking the satellite's overall shape and increasing the risk of collisions during grasping. Additionally, satellites are often weakly textured, limiting the accuracy of 3D perception. To address these issues, we propose, for the first time, a 3D perception-based visual servo system of non-cooperative satellites. This system combines reconstruction and tracking to enhance shape perception and pose estimation accuracy in orbital conditions. Specifically, we employ an alternating iterative strategy to simultaneously reconstruct and track the satellite and introduce a novel constraint to fuse different cues under extreme conditions. Further, we develop a simulation environment platform, a dual-arm microgravity grasping system, and an online monitoring module to enhance system capabilities for on-orbit servicing. Synthetic and real-world datasets from the simulation environment are also created for experimental validation. Results show that each module of our system achieves state-of-the-art performance.
|
|
15:30-15:35, Paper ThDT3.4 | |
A Control Strategy for an Orbital Manipulator Equipped with an External Actuator at the End-Effector |
|
Sena, Francesco | German Aerospace Center (DLR) |
Mishra, Hrishik | German Aerospace Center (DLR) |
Vijayan, Ria | German Aerospace Center (DLR) |
De Stefano, Marco | German Aerospace Center (DLR) |
Keywords: Space Robotics and Automation, Motion Control, Dynamics
Abstract: This paper exploits the robotic capabilities of an orbital manipulator equipped with an actuation module at its end-effector to perform close-proximity robotic operations. The proposed control strategy enables repositioning the system’s center-of-mass by reconfiguring the manipulator configuration and using the end-effector-mounted thrusting mechanism to achieve displacement. The key advantage of the proposed method is that the plume impingement due to thruster firing of the servicer satellite in close-proximity operations towards the client is mitigated. This is achieved by regulating the internal motion of the manipulator such that the thrust firing does not occur near the space asset. The effectiveness of the controller is verified through a multibody dynamic simulation of an orbital manipulator.
|
|
15:35-15:40, Paper ThDT3.5 | |
Robotic Space Simulator: Controls Implementation for Auxiliary Axes and Zero-G Dynamics |
|
Hilburn, Eddie | Texas A&M University |
Pettinger, Adam | Texas A&M University |
Wilkinson, Emily | Texas A&M University |
Lansdowne, Ian | Texas A&M University |
Ambrose, Robert | Texas A&M University |
Keywords: Space Robotics and Automation, Force Control, Parallel Robots
Abstract: The Robotic Space Simulator was developed as a physical simulation for in-space manipulation tasks. It incorporates external inputs into its dynamics simulation via force/torque sensors mounted on the two 6-DoF Stewart platforms that compose its primary structure. Each platform is augmented with an additional degree of freedom in the form of an auxiliary axis: one in translation and one in rotation. Previous work has not effectively included the additional workspace provided by these auxiliary axes. Additionally, it limited the use of external force/torque inputs to the case of platform translation only, because the external forces/torques due to platform motion and gravitational force were not removed from the sensor inputs prior to inclusion in the dynamic simulation. In this work, we address each of these limitations. We develop and test two methods of auxiliary axis control, Cartesian Workspace and Joint Cost-Function, and find that both methods are an improvement over the existing system. Additionally, we develop and test a method for calculating the mass properties of hardware mounted to the force/torque sensors and a dynamics compensation method for this hardware. Using this technique, we are able to effectively compensate for gravitational force in different platform orientations and achieve zero-g behavior of the system.
|
|
15:40-15:45, Paper ThDT3.6 | |
Dynamics, Simulation & Control of Orbital Modules for On-Orbit Assembly |
|
Mishra, Hrishik | German Aerospace Center (DLR) |
Vicariotto, Tommaso | Politecnico Di Milano |
De Stefano, Marco | German Aerospace Center (DLR) |
Keywords: Space Robotics and Automation, Motion Control, Multi-Robot Systems
Abstract: In the context of in-orbit assembly, modular building blocks offer the advantage of distributed launches. After the orbit injection, the overall motion control requires the individual modules to approach each other while regulating their relative shape and total formation. This kind of formation control has already been addressed for rigid body modules. However, in practical cases, each module might be a multibody (with rotors) system. To address the control problem for such a fleet of fixed-inertia multibody modules, we propose a novel dynamics formulation that is inertia-decoupled, singularity-free, and invariant of their absolute poses. We extend the passive decomposition theory for deriving new representative systems corresponding to the total momentum (locked) and relative shape variations. We exploit the dynamics to design two distinct control laws with complementary mission benefits to regulate the locked and relative motions. We also leverage the proposed formulation to design a Hardware-in-the-Loop (HIL) framework, in which the facility reproduced the relative motions while total momentum was propagated in software. Furthermore, the proposed HIL framework and the motion control are experimentally validated.
|
|
15:45-15:50, Paper ThDT3.7 | |
Int-Ball2: On-Orbit Demonstration of Autonomous Intravehicular Flight and Docking for Image Capturing and Recharging |
|
Hirano, Daichi | Japan Aerospace Exploration Agency |
Mitani, Shinji | JAXA |
Watanabe, Keisuke | Japan Aerospace Exploration Agency |
Nishishita, Taisei | Japan Aerospace Exploration Agency |
Yamamoto, Tatsuya | Japan Aerospace Exploration Agency (JAXA) |
Yamaguchi, Seiko Piotr | Japan Aerospace Exploration Agency (JAXA) |
Keywords: Space Robotics and Automation, Aerial Systems: Mechanics and Control, Motion Control
Abstract: This article presents the system architecture and the orbital demonstration results of the Int-Ball2, a free-flying camera robot developed by the Japan Aerospace Exploration Agency (JAXA). The purpose of the Int-Ball2 project is to assist astronauts and reduce their workload on the International Space Station (ISS). This robot is an upgrade from the first Int-Ball, enhancing the propulsion subsystem for greater maneuverability and adding a new docking station (DS) for autonomous battery recharging. This study performed comprehensive ground tests for autonomous maneuvering and docking, employing a combination of a fully software-based simulator, a hardware-in-the-loop (HIL) simulator, and a planar air-bearing facility. After a successful launch to the ISS, the Int-Ball2 demonstrated its ability to work in microgravity without relying on astronaut support. The results obtained from ground and orbital tests underscored the effectiveness of our system design and ground verification approach. Further, we present key technologies essential for the Int-Ball2's successful implementation on board the ISS. We expect the insights from this project to be invaluable to future missions involving free-flying robots in microgravity.
|
|
ThDT4 |
304 |
Bioinspiration and Biomimetics 2 |
Regular Session |
Chair: Hasegawa, Yasuhisa | Nagoya University |
Co-Chair: Ozkan-Aydin, Yasemin | University of Notre Dame |
|
15:15-15:20, Paper ThDT4.1 | |
Harnessing Flagella Dynamics for Enhanced Robot Locomotion at Low Reynolds Number |
|
Chikere, Nnamdi | University of Notre Dame |
Ozkan-Aydin, Yasemin | University of Notre Dame |
Keywords: Biologically-Inspired Robots, Biomimetics, Soft Robot Applications
Abstract: Navigating environments with low Reynolds numbers (Re), where viscous forces dominate, presents unique challenges, such as the need for non-reciprocal motion dynamics. Microorganisms like algae and bacteria, with their specialized structures such as asymmetrical and flexible cilia and flagella, inspire efficient propulsion in such media. However, the mechanism for enhancing the propulsion speed of these microorganisms remains not fully understood. This study introduces a quadriflagellated, algae-inspired, cable-driven robot that mirrors these biological locomotion mechanisms. A single DC motor actuates four multi-segmented flagella, modulating their stiffness throughout the propulsion cycle. We focus on enhancing propulsion speed, hypothesizing that strategic flexibility alterations in flagella—increased during the backward stroke and decreased during the forward stroke—significantly improve propulsion speed. Our experimental results confirm this, showing a marked improvement in propulsion speed, achieving a rate of 0.7 ± 0.11 cm/cycle. Additionally, we explore the impact of flagella length and number on propulsion, providing valuable insights for biomedical and microfluidic research applications.
|
|
15:20-15:25, Paper ThDT4.2 | |
Development of Multi-Joint Biohybrid Soft Robot by Using Skeletal Muscle Tissue |
|
Kim, Eunhye | Nagoya University |
Takeuchi, Masaru | Nagoya University |
Hasegawa, Yasuhisa | Nagoya University |
Fukuda, Toshio | Nagoya University |
Keywords: Biological Cell Manipulation, Micro/Nano Robots, Soft Sensors and Actuators
Abstract: Various forms of biohybrid robots have been developed; however, creating robots with multiple degrees of freedom remains a challenging task. In this paper, we developed a multi-joint biohybrid robot by using skeletal muscle tissue. To achieve this, we first developed a modular bio-actuator actuated by skeletal muscle tissues. The objective of this study was to enhance the contraction force of the actuator and establish optimal experimental conditions for creating high-performance robots. By applying continuous electrical stimulation for five days during culture of the bio-actuator, we were able to increase the contraction force more than threefold. Additionally, we determined the appropriate electric field based on the electrode distance, which enabled us to establish an optimal experimental setup. We also confirmed that connecting the actuators in series can significantly increase the moving distance. Connecting two actuators in series resulted in a total movement distance equivalent to the sum of the distances of each actuator. This finding suggests the potential to create robots with a larger operational workspace. Using these actuators, we first constructed a manipulator with a rotational joint. This research is expected to contribute not only to the development of various robots utilizing bio-actuators but also to advancements in biotechnology.
|
|
15:25-15:30, Paper ThDT4.3 | |
A Novel Underwater Robot with Carangiform Locomotion Achieved Via Single Degree of Actuation and Magnetically Transmitted Traveling Wave |
|
Manduca, Gianluca | Scuola Superiore Sant'Anna |
Padovani, Luca | Sapienza |
Santaera, Gaspare | Sant'Anna School of Advanced Studies |
Graziani, Giorgio | Sapienza University, Rome |
Dario, Paolo | Scuola Superiore Sant'Anna |
Romano, Donato | Scuola Superiore Sant’Anna |
Stefanini, Cesare | Scuola Superiore Sant'Anna |
Keywords: Biologically-Inspired Robots, Marine Robotics, Mechanism Design
Abstract: The phenomenon of the “traveling wave,” commonly observed in various organisms, involves a wave that propagates along the body, serving as a locomotion mechanism. Particularly, in aquatic environments, organisms such as fish and cetaceans utilize traveling waves to propel themselves through water, minimizing fluid drag and maximizing movement efficiency. Inspired by nature, robotics has extensively explored replicating such locomotion strategies. This work presents a fish robot with an innovative magnetic transmission system. The mechanism transforms the unidirectional rotation of a single motor into an oscillatory, phase-shifted movement across the modules of the kinematic chain, generating a traveling wave along the body. The robot’s design and functionality are detailed, highlighting advancements in bio-inspired robotics for underwater applications, such as efficient and non-invasive monitoring and exploration of marine ecosystems. The fish robot achieved a swimming speed of approximately 2 body lengths per second (BL/s) with a tail-beat frequency of 3.24 Hz and a minimum Cost of Transport (CoT) of 5.33 J/(kg·m). Biomimetic robotics can play a key role in sustainable aquafarming, biodiversity conservation, and animal-robot interaction research, offering the potential to minimize ecosystem disruption and advance marine science.
|
|
15:30-15:35, Paper ThDT4.4 | |
AquaMILR: Mechanical Intelligence Simplifies Control of Undulatory Robots in Cluttered Fluid Environments |
|
Wang, Tianyu | Georgia Institute of Technology |
Mankame, Nishanth | Georgia Institute of Technology |
Fernandez, Matthew | Georgia Institute of Technology |
Kojouharov, Velin | Georgia Institute of Technology |
Goldman, Daniel | Georgia Institute of Technology |
Keywords: Biologically-Inspired Robots, Redundant Robots, Search and Rescue Robots
Abstract: While undulatory swimming of elongate limbless robots has been extensively studied in open hydrodynamic environments, less research has been focused on limbless locomotion in complex, cluttered aquatic environments. Motivated by the concept of mechanical intelligence, where controls for obstacle navigation can be offloaded to passive body mechanics in terrestrial limbless locomotion, we hypothesize that principles of mechanical intelligence can be extended to cluttered hydrodynamic regimes. To test this, we developed an untethered limbless robot capable of undulatory swimming on water surfaces, utilizing a bilateral cable-driven mechanism inspired by organismal muscle actuation morphology to achieve programmable anisotropic body compliance. We demonstrated through robophysical experiments that, similar to terrestrial locomotion, an appropriate level of body compliance can facilitate emergent swimming through complex hydrodynamic environments under pure open-loop control. Moreover, we found that swimming performance depends on undulation frequency, with effective locomotion achieved only within a specific frequency range. This contrasts with highly damped terrestrial regimes, where inertial effects can often be neglected. Further, to enhance performance and address the challenges posed by nondeterministic obstacle distributions, we incorporated computational intelligence by developing a real-time body compliance tuning controller based on cable tension feedback. This controller improves the robot's robustness and overall speed in heterogeneous hydrodynamic environments.
|
|
15:35-15:40, Paper ThDT4.5 | |
Ambient Flow Perception of Freely Swimming Robotic Fish Using an Artificial Lateral Line System |
|
Dai, Hongru | Shanghaitech University |
Lin, Xiaozhu | ShanghaiTech University |
Chao, Kaitian | ShanghaiTech University |
Wang, Yang | Shanghaitech University |
Keywords: Biologically-Inspired Robots, Bioinspired Robot Learning, Marine Robotics
Abstract: Robotic fish hold significant promise as efficient underwater systems, yet their inability to accurately perceive ambient flow hinders their deployment in real-world scenarios. Inspired by the natural lateral line system (LLS), a flow-responsive organ in fish that plays a crucial role in behaviors such as rheotaxis, this paper introduces the first Artificial Lateral Line System (ALLS)-based ambient flow classifier that allows robotic fish to perceive flow fields while swimming freely. To be specific, using just 5 pressure sensors and 3.5 minutes of swimming data, we trained a Long Short-Term Memory (LSTM) network, achieving a classification accuracy of 81.25% across 8 flow speed categories, ranging from 0.08 m/s to 0.18 m/s. A key innovation of this work is the formulation of ambient flow perception as a classification task, which not only enables the robotic fish to extract meaningful information but also enhances the robustness and generalizability of the perception framework. Extensive experiments further identify critical factors affecting the effectiveness of the ambient flow classifier, offering valuable insights for future development.
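As an illustration of the classification stage described in this abstract, the following is a minimal sketch (not the authors' code) of an LSTM network that maps short windows of multi-channel pressure readings to one of eight flow-speed classes; the window length, hidden size, and synthetic training data are assumptions made purely for the example.

```python
# Minimal sketch: LSTM classifier from pressure-sensor windows to flow-speed bins.
import torch
import torch.nn as nn

NUM_SENSORS = 5        # pressure channels, per the abstract
NUM_CLASSES = 8        # flow-speed bins from 0.08 to 0.18 m/s
WINDOW = 50            # assumed number of time steps per sample

class FlowLSTM(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(NUM_SENSORS, hidden, batch_first=True)
        self.head = nn.Linear(hidden, NUM_CLASSES)

    def forward(self, x):              # x: (batch, WINDOW, NUM_SENSORS)
        _, (h, _) = self.lstm(x)       # h: (num_layers, batch, hidden)
        return self.head(h[-1])        # class logits

model = FlowLSTM()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on synthetic data (real data would be the logged pressure
# windows mentioned in the abstract, labeled by flow-speed category).
x = torch.randn(32, WINDOW, NUM_SENSORS)
y = torch.randint(0, NUM_CLASSES, (32,))
loss = loss_fn(model(x), y)
optim.zero_grad()
loss.backward()
optim.step()
```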
|
|
15:40-15:45, Paper ThDT4.6 | |
Leader-Follower Formation Enabled by Pressure Sensing in Free-Swimming Undulatory Robotic Fish |
|
Panta, Kundan | The Pennsylvania State University |
Deng, Hankun | Penn State University |
DeLattre, Micah | Penn State University |
Cheng, Bo | Pennsylvania State University |
Keywords: Biologically-Inspired Robots, Imitation Learning, Marine Robotics
Abstract: Fish use their lateral lines to sense flows and pressure gradients, enabling them to detect nearby objects and organisms. Towards replicating this capability, we demonstrated successful leader-follower formation swimming using flow pressure sensing in our undulatory robotic fish (µBot/MUBot). The follower µBot is equipped at its head with bilateral pressure sensors to detect signals excited by both its own and the leader's movements. First, using experiments with static formations between an undulating leader and a stationary follower, we determined the formation that resulted in strong pressure variations measured by the follower. This formation was then selected as the desired formation in free swimming for obtaining an expert policy. Next, a long short-term memory neural network was used as the control policy that maps the pressure signals along with the robot motor commands and the Euler angles (measured by the onboard IMU) to the steering command. The policy was trained to imitate the expert policy using behavior cloning and Dataset Aggregation (DAgger). The results show that with merely two bilateral pressure sensors and less than one hour of training data, the follower effectively tracked the leader within distances of up to 200 mm (≈ 1 body length) while swimming at speeds of 155 mm/s (≈ 0.8 body lengths/s). This work highlights the potential of fish-inspired robots to effectively navigate fluid environments and achieve formation swimming through the use of flow pressure feedback.
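The imitation pipeline described above can be illustrated, in a much simplified form, by the following DAgger-style loop; a ridge regressor stands in for the paper's LSTM policy, and the one-dimensional follower dynamics, expert rule, and feature layout are hypothetical placeholders, not the paper's setup.

```python
# Simplified DAgger sketch: the learner maps sensed features to a steering
# command, and the expert relabels states visited under the learner's policy.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def expert(obs):
    # hypothetical expert: steer to cancel the lateral offset (obs[0])
    return -1.5 * obs[0]

def rollout(policy, steps=100):
    obs = rng.normal(size=4)          # [lateral offset, pressure L, pressure R, yaw]
    traj = []
    for _ in range(steps):
        a = policy(obs)
        traj.append(obs.copy())
        obs[0] += 0.1 * a + 0.01 * rng.normal()   # toy follower dynamics
        obs[1:] = rng.normal(size=3)              # toy sensor readings
    return np.array(traj)

# Seed with expert demonstrations, then aggregate expert-labeled learner rollouts.
X = rollout(expert)
Y = np.array([expert(o) for o in X])
model = Ridge().fit(X, Y)
for _ in range(5):
    new_X = rollout(lambda o: float(model.predict(o[None])[0]))
    X = np.vstack([X, new_X])
    Y = np.concatenate([Y, [expert(o) for o in new_X]])
    model = Ridge().fit(X, Y)
```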
|
|
15:45-15:50, Paper ThDT4.7 | |
Analysis of Kinematics and Propulsion of a Self-Sensing Multi-DoF Undulating Soft Robotic Fish |
|
Park, Myungsun | University of California San Diego |
Cervera Torralba, Jacobo | University of California, San Diego |
Adibnazari, Iman | University of California, San Deigo |
Pawlak, Geno | UC San Diego |
Tolley, Michael T. | University of California, San Diego |
Keywords: Biologically-Inspired Robots, Soft Robot Applications, Marine Robotics
Abstract: In this paper we explore kinematics ranging from anguilliform to thunniform in a self-sensing multi-degree-of-freedom soft robotic fish and analyze their effect on swimming. First, we examine the characteristics of the bending actuators of the robotic fish. Then, we express the kinematics of the fish as a propagating wave parameterized by three bending amplitudes and a wavelength, which are determined by the flow rates and phase shift of the pumps. We capture various motion patterns generated by different actuator inputs and directly measure the thrust generated by each pattern. We observe that the robotic swimmer can reproduce two different modes of propulsion that are embodied by two distinct morphological patterns in nature: anguilliform and thunniform. When neither mode is activated, propulsion is zero or even negative. Finally, we estimate the stationary swimming speed by towing the undulating fish, which satisfies the slip condition (with the speed of the body wave matching the swimming velocity). The analysis of a wide range of kinematic patterns in this study, including the two extreme cases of anguilliform and thunniform modes, will provide insights for a comprehensive understanding of the mechanics of efficient swimming.
|
|
ThDT5 |
305 |
Model Predictive Control for Legged Robots 2 |
Regular Session |
Chair: Wensing, Patrick M. | University of Notre Dame |
Co-Chair: Park, Hae-Won | Korea Advanced Institute of Science and Technology |
|
15:15-15:20, Paper ThDT5.1 | |
Model Predictive Parkour Control of a Monoped Hopper in Dynamically Changing Environments |
|
Albracht, Maximilian | German Aerospace Center (DLR) |
Kumar, Shivesh | DFKI GmbH |
Vyas, Shubham | Robotics Innovation Center, DFKI GmbH |
Kirchner, Frank | University of Bremen |
Keywords: Legged Robots, Optimization and Optimal Control, Underactuated Robots
Abstract: A great advantage of legged robots is their ability to operate on particularly difficult and obstructed terrain, which demands dynamic, robust, and precise movements. The study of obstacle courses provides invaluable insights into the challenges legged robots face, offering a controlled environment to assess and enhance their capabilities. Traversing such a course with a one-legged hopper introduces intricate challenges, such as planning over contacts and dealing with flight phases, which necessitates a sophisticated controller. A novel model predictive parkour controller is introduced that finds an optimal path through an obstacle course that changes in real time, using mixed-integer motion planning. The execution of this optimized path is then achieved through a state machine employing a PD control scheme with feedforward torques, ensuring robust and accurate performance.
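The execution layer mentioned above (a PD scheme with feedforward torques) can be sketched as follows; the gain values, joint dimension, and reference quantities are illustrative assumptions rather than the paper's settings.

```python
# Minimal sketch of joint-space PD tracking with feedforward torques.
import numpy as np

KP = np.diag([80.0, 80.0, 60.0])   # assumed gains for a 3-joint hopper leg
KD = np.diag([2.0, 2.0, 1.5])

def pd_feedforward(q, dq, q_ref, dq_ref, tau_ff):
    """tau = tau_ff + Kp (q_ref - q) + Kd (dq_ref - dq)"""
    return tau_ff + KP @ (q_ref - q) + KD @ (dq_ref - dq)

# Example call with placeholder states (a real controller would read these
# from the robot and from the optimized parkour plan each control cycle).
tau = pd_feedforward(q=np.zeros(3), dq=np.zeros(3),
                     q_ref=np.array([0.1, -0.2, 0.3]),
                     dq_ref=np.zeros(3), tau_ff=np.zeros(3))
```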
|
|
15:20-15:25, Paper ThDT5.2 | |
Humanoid Walking Stabilization Via Model Predictive Control with Step Adjustment Based on the 3D Divergent Component of Motion |
|
Park, Gyeongjae | Seoul National University |
Kim, Myeong-Ju | Hyundai Motor Company |
Lee, Kwanwoo | Seoul National University |
Park, Jaeheung | Seoul National University |
Keywords: Humanoid and Bipedal Locomotion, Body Balancing, Legged Robots
Abstract: In this paper, as an approach to stabilizing humanoid walking in which the CoM height varies, a novel Model Predictive Control framework based on the three-dimensional Divergent Component of Motion (3D-DCM) is proposed. To ensure the feasible utilization of contact forces for maintaining humanoid balance, constraints on the control inputs, Virtual Repellent Point (VRP) and footstep adjustment, and their correlation are analytically formulated in quadratic form, resulting in a Quadratically Constrained Quadratic Program. Additionally, to enable the humanoid robot to withstand disturbances over a broader range of strides or safely navigate various terrains without encountering knee stretch, the distance between the CoM and the foot is constrained in the 3D-CoM trajectory planner. The effectiveness of the proposed method is validated through simulations and real-robot experiments in scenarios involving external disturbances and step-down motions.
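For readers unfamiliar with the DCM quantities used above, a minimal sketch of the standard DCM relations (xi = x + xdot/omega and xidot = omega (xi - v_vrp)) is given below; the nominal height and example states are placeholders, not values from the paper, and the constant omega assumes a fixed CoM-to-VRP height.

```python
# Sketch of the divergent component of motion (DCM) and its VRP-driven dynamics.
import numpy as np

GRAVITY = 9.81
DELTA_Z_VRP = 0.9                      # assumed nominal CoM-to-VRP height [m]
OMEGA = np.sqrt(GRAVITY / DELTA_Z_VRP)

def dcm(com, com_vel):
    """3D DCM: xi = x + xdot / omega."""
    return com + com_vel / OMEGA

def dcm_rate(xi, vrp):
    """Divergent dynamics: xidot = omega * (xi - v_vrp)."""
    return OMEGA * (xi - vrp)

com = np.array([0.0, 0.0, 0.9])
com_vel = np.array([0.3, 0.0, 0.0])
xi = dcm(com, com_vel)
print(xi, dcm_rate(xi, vrp=np.array([0.0, 0.0, 0.0])))
```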
|
|
15:25-15:30, Paper ThDT5.3 | |
MPC-QP-Based Control Framework for Compliant Behavior of Humanoid Robots in Physical Collaboration with Humans |
|
Kumbhar, Shubham | University of Delaware |
Artemiadis, Panagiotis | University of Delaware |
Keywords: Legged Robots, Human-Robot Collaboration
Abstract: We present a control framework specifically for physical human-humanoid collaboration involving the transportation and manipulation of heavy objects. Using this framework, the humanoid can exhibit desired levels of compliance with the object to be co-transported. This desired compliance is achieved through an admittance model. A Model Predictive Control (MPC) problem, based on a novel Interaction Linear Inverted Pendulum (I-LIP) model, generates footstep patterns that facilitate this desired compliant behavior while keeping the robot stable. Subsequently, an object-informed low-level quadratic program (QP) sends control inputs to realize the high-level plans on the robot. The stiffness parameters of the I-LIP are modulated in real time for better compliance tracking performance of the robot. We verify all results through simulation on the Digit humanoid platform, showing the capability of the framework to collaboratively transport heavy objects with a human.
|
|
15:30-15:35, Paper ThDT5.4 | |
Real-Time Whole-Body Control of Legged Robots with Model-Predictive Path Integral Control |
|
Alvarez Padilla, Juan Rodolfo | Carnegie Mellon University |
Zhang, John | Carnegie Mellon University |
Kwok, Sofia | Carnegie Mellon University |
Dolan, John M. | Carnegie Mellon University |
Manchester, Zachary | Carnegie Mellon University |
Keywords: Legged Robots, Multi-Contact Whole-Body Motion Planning and Control, Motion Control
Abstract: This paper presents a system for enabling real-time synthesis of whole-body locomotion and manipulation policies for real-world legged robots. Motivated by recent advancements in robot simulation, we leverage the efficient parallelization capabilities of the MuJoCo simulator on a multi-core CPU to achieve fast sampling over the robot state and action trajectories. Our results show surprisingly effective real-world locomotion and manipulation capabilities with a very simple control strategy. We demonstrate our approach on several hardware and simulation experiments: robust locomotion over flat and uneven terrains, climbing over a box whose height is comparable to the robot, and pushing a box to a goal position. To our knowledge, this is the first successful deployment of whole-body sampling-based MPC on real-world legged robot hardware.
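As a schematic of the sampling-based MPC idea referred to above, the following toy MPPI-style update rolls out many perturbed control sequences on a point mass and averages them by cost; it is serial and low-dimensional, unlike the paper's parallel whole-body rollouts in MuJoCo, and all parameters are assumptions.

```python
# Toy MPPI-style update: sample noisy control sequences, weight by cost, average.
import numpy as np

rng = np.random.default_rng(0)
H, K, SIGMA, LAMBDA = 20, 256, 0.5, 1.0    # horizon, samples, noise, temperature
GOAL = np.array([1.0, 0.0])

def rollout_cost(x0, controls, dt=0.05):
    x = x0.copy()                           # state: [px, py, vx, vy]
    cost = 0.0
    for u in controls:                      # double-integrator point mass
        x[:2] += dt * x[2:]
        x[2:] += dt * u
        cost += np.sum((x[:2] - GOAL) ** 2) + 1e-3 * np.sum(u ** 2)
    return cost

def mppi_step(x0, u_nom):
    noise = rng.normal(0.0, SIGMA, size=(K, H, 2))
    costs = np.array([rollout_cost(x0, u_nom + noise[k]) for k in range(K)])
    w = np.exp(-(costs - costs.min()) / LAMBDA)
    w /= w.sum()
    return u_nom + np.tensordot(w, noise, axes=1)   # cost-weighted noise average

u = np.zeros((H, 2))
u = mppi_step(np.zeros(4), u)                       # one MPC update
```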
|
|
15:35-15:40, Paper ThDT5.5 | |
Wallbounce: Push Wall to Navigate with Contact-Implicit MPC |
|
Liu, Xiaohan | Carnegie Mellon University |
Dai, Cunxi | Carnegie Mellon University |
Zhang, John | Carnegie Mellon University |
Bishop, Arun | Carnegie Mellon University |
Manchester, Zachary | Carnegie Mellon University |
Hollis, Ralph | Carnegie Mellon University |
Keywords: Multi-Contact Whole-Body Motion Planning and Control, Optimization and Optimal Control, Body Balancing
Abstract: In this work, we introduce a framework that enables highly maneuverable locomotion using non-periodic contacts. This task is challenging for traditional optimization and planning methods to handle due to difficulties in specifying contact mode sequences in real-time. To address this, we use a bi-level contact-implicit planner and hybrid model predictive controller to draft and execute a motion plan. We investigate how this method allows us to plan arm contact events on the shmoobot, a smaller ballbot, which uses an inverse mouse-ball drive to achieve dynamic balancing with a low number of actuators. Through multiple experiments we show how the arms allow for acceleration, deceleration and dynamic obstacle avoidance that are not achievable with the mouse-ball drive alone. This demonstrates how a holistic approach to locomotion can increase the control authority of unique robot morphologies without additional hardware by leveraging robot arms that are typically used only for manipulation. Project website: https://cmushmoobot.github.io/Wallbounce
|
|
15:40-15:45, Paper ThDT5.6 | |
Reduced-Order Model Guided Contact-Implicit Model Predictive Control for Humanoid Locomotion |
|
Esteban, Sergio | California Institute of Technology |
Kurtz, Vincent | California Institute of Technology |
Ghansah, Adrian | California Institute of Technology |
Ames, Aaron | Caltech |
Keywords: Multi-Contact Whole-Body Motion Planning and Control, Whole-Body Motion Planning and Control, Humanoid and Bipedal Locomotion
Abstract: Humanoid robots have great potential for real-world applications due to their ability to operate in environments built for humans, but their deployment is hindered by the challenge of controlling their underlying high-dimensional nonlinear hybrid dynamics. While reduced-order models like the Hybrid Linear Inverted Pendulum (HLIP) are simple and computationally efficient, they lose whole-body expressiveness. Meanwhile, recent advances in Contact-Implicit Model Predictive Control (CI-MPC) enable robots to plan through multiple hybrid contact modes, but remain vulnerable to local minima and require significant tuning. We propose a control framework that combines the strengths of HLIP and CI-MPC. The reduced-order model generates a nominal gait, while CI-MPC manages the whole-body dynamics and modifies the contact schedule as needed. We demonstrate the effectiveness of this approach in simulation with a novel 24 degree-of-freedom humanoid robot: Achilles. Our proposed framework achieves rough terrain walking, disturbance recovery, robustness under model and state uncertainty, and allows the robot to interact with obstacles in the environment, all while running online in real-time at 50 Hz.
|
|
15:45-15:50, Paper ThDT5.7 | |
CAFE-MPC: A Cascaded-Fidelity Model Predictive Control Framework with Tuning-Free Whole-Body Control |
|
Li, He | University of Notre Dame |
Wensing, Patrick M. | University of Notre Dame |
Keywords: Legged Robots, Optimization and Optimal Control, Humanoid and Bipedal Locomotion, Whole-Body Control
Abstract: This work introduces an optimization-based locomotion control framework for on-the-fly synthesis of complex dynamic maneuvers. At the core of the proposed framework is a cascaded-fidelity model predictive controller (CAFE-MPC). CAFE-MPC strategically relaxes the planning problem along the prediction horizon (i.e., with descending model fidelity, increasingly coarse time steps, and relaxed constraints) for computational and performance gains. This problem is numerically solved with an efficient customized multiple-shooting iLQR (MS-iLQR) solver. The action-value function from CAFE-MPC is then used as the basis for a new value-function-based whole-body control (VWBC) technique that avoids additional tuning for the WBC. We show that CAFE-MPC, if configured appropriately, advances the performance of whole-body MPC without necessarily increasing computational cost. Further, we show the superior performance of the proposed VWBC over the Riccati feedback controller in terms of constraint handling. The proposed framework enables accomplishing, for the first time, a gymnastic-style running barrel roll on the MIT Mini Cheetah.
|
|
ThDT6 |
307 |
Perception for Manipulation 3 |
Regular Session |
Chair: Wachs, Juan | Purdue University |
Co-Chair: Ogata, Tetsuya | Waseda University |
|
15:15-15:20, Paper ThDT6.1 | |
Accurate Robotic Pushing Manipulation through Online Model Estimation under Uncertain Object Properties |
|
Lee, Yongseok | Pohang University of Science and Technology |
Kim, Keehoon | POSTECH, Pohang University of Science and Technology |
Keywords: Model Learning for Control, Manipulation Planning
Abstract: Robotic pushing is a fundamental non-prehensile manipulation skill essential for handling objects that are difficult to grasp. This letter proposes a highly accurate robotic pushing framework that utilizes an online estimated model to push objects along a given nominal trajectory, despite uncertain object properties such as friction coefficients, mass distribution, and the position of the center of friction (CoF). The core concept involves estimating an optimal pushing motion model capable of representing observed local motions. A generalized form of the conventional analytical model, coupled with a moving-window Unscented Kalman Filter (UKF), serves as the online estimated model. It captures the local behavior of the pushed objects and is integrated with a model predictive control-based pushing strategy to achieve precise pushing performance. In experiments, the proposed robotic pushing framework demonstrated superior accuracy in tracking the given nominal trajectory compared to the conventional analytical model and data-driven model approaches, even when the motion model was perturbed. Additionally, the practicality of the proposed framework was showcased through a demonstration involving an autonomous robot collecting dishes, illustrating its applicability in various real-world applications.
|
|
15:20-15:25, Paper ThDT6.2 | |
Exploring the Domain-Invariant Flow Representation in Vision-Based Tactile Sensors for Omni-Hardness Perception |
|
Yang, Xuewen | Ocean University of China |
Wang, Nan | Ocean University of China |
Gu, Jiayang | Ocean University of China |
Zhang, Yugang | Ocean University of China |
Wang, Guoyu | Ocean University of China |
Song, Aiguo | Southeast University |
Keywords: Perception for Grasping and Manipulation, Force and Tactile Sensing
Abstract: Vision-based tactile sensors have recently gained prominence due to their superior resolution and ability to capture multi-dimensional contact information. However, even when sensors share the same sensing principle, variations in production factors can lead to differences in the color patterns of tactile signals. Unlike common vision tasks, vision-based tactile perception depends on tracking light variation in colorful signals, making it more susceptible to lighting conditions and thus more prone to domain gaps. In this paper, we propose an Omni-hardness perception framework that enables adaptation across various vision-based tactile sensors. Firstly, in-depth analyses of the factors influencing the generalization of hardness perception are presented. Furthermore, the light balance module and the force scale module are coupled to regulate network learning of generalized representations. Experimental results across multiple sensors demonstrate the transferability of learned representations. Additionally, downstream tasks in natural object perception, tumor detection, and grasping stability prediction are proposed to evaluate potential applications. The framework's performance shows promise for advancing general tactile sensing and embodied tactile perception.
|
|
15:25-15:30, Paper ThDT6.3 | |
Focused Blind Switching Manipulation Based on Constrained and Regional Touch States of Multi-Fingered Hand Using Deep Learning |
|
Funabashi, Satoshi | Waseda University |
Hiramoto, Atsumu | Waseda University |
Chiba, Naoya | Osaka University |
Schmitz, Alexander | Waseda University |
Kulkarni, Shardul | Waseda University |
Ogata, Tetsuya | Waseda University |
Keywords: Deep Learning in Grasping and Manipulation, Force and Tactile Sensing, Multifingered Hands
Abstract: To achieve a desired grasping posture (including object position and orientation), multi-finger motions need to be conducted according to the current touch state. Specifically, when subtle changes occur while correcting the object state, not only proprioception but also tactile information from the entire hand can be beneficial. However, switching motions with the high DOFs of multiple fingers and abundant tactile information is still challenging. In this study, we propose a loss function with constraints on touch states and an attention mechanism for focusing on important modalities depending on the touch states. The policy model is an AE-LSTM, which consists of an autoencoder (AE) that compresses the abundant tactile information and a long short-term memory (LSTM) network that switches the motion depending on the touch states. Cap-opening was chosen as the target task, which consists of the subtasks of sliding an object and opening its cap. As a result, the proposed method achieved the best success rates with a variety of objects for real-time cap-opening manipulation. Furthermore, we confirmed that the proposed model acquired the features of each subtask and attended to specific modalities.
|
|
15:30-15:35, Paper ThDT6.4 | |
A Magnetic-Actuated Vision-Based Whisker Array for Contact Perception and Grasping |
|
Hu, Zhixian | Purdue University |
Wachs, Juan | Purdue University |
She, Yu | Purdue University |
Keywords: Perception for Grasping and Manipulation, Grippers and Other End-Effectors, Force and Tactile Sensing
Abstract: Tactile sensing and the manipulation of delicate objects are critical challenges in robotics. This study presents a vision-based magnetic-actuated whisker array sensor that integrates these functions. The sensor features eight whiskers arranged circularly, supported by an elastomer membrane and actuated by electromagnets and permanent magnets. A camera tracks whisker movements, enabling high-resolution tactile feedback. The sensor's performance was evaluated through object classification and grasping experiments. In the classification experiment, the sensor approached objects from four directions and accurately identified five distinct objects with a classification accuracy of 99.17% using a Multi-Layer Perceptron model. In the grasping experiment, the sensor tested configurations of eight, four, and two whiskers, achieving the highest success rate of 87% with eight whiskers. These results highlight the sensor's potential for precise tactile sensing and reliable manipulation.
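The object-classification stage can be illustrated with a small MLP of the kind mentioned above; the feature layout (2D tip displacement per whisker) and the synthetic data are assumptions, and the 99.17% figure refers to the paper's own sensor data, not this sketch.

```python
# Minimal sketch: MLP classifier from tracked whisker deflections to object labels.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
NUM_WHISKERS, NUM_OBJECTS = 8, 5

# Placeholder features: 2D image-plane displacement per whisker tip.
X = rng.normal(size=(500, NUM_WHISKERS * 2))
y = rng.integers(0, NUM_OBJECTS, size=500)

clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500).fit(X, y)
print(clf.predict(X[:3]))
```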
|
|
15:35-15:40, Paper ThDT6.5 | |
GAPartManip: A Large-Scale Part-Centric Dataset for Material-Agnostic Articulated Object Manipulation |
|
Cui, Wenbo | Institute of Automation, Chinese Academy of Sciences |
Zhao, Chengyang | Carnegie Mellon University |
Wei, Songlin | Soochow University |
Zhang, Jiazhao | Peking University |
Geng, Haoran | University of California, Berkeley |
Chen, Yaran | Institute of Automation, Chinese Academy of Sciences |
Li, Haoran | Institute of Automation, Chinese Academy of Sciences |
Wang, He | Peking University |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation
Abstract: Effectively manipulating articulated objects in household scenarios is a crucial step toward achieving general embodied artificial intelligence. Mainstream research in 3D vision has primarily focused on manipulation through depth perception and pose detection. However, in real-world environments, these methods often face challenges due to imperfect depth perception, such as with transparent lids and reflective handles. Moreover, they generally lack the diversity in part-based interactions required for flexible and adaptable manipulation. To address these challenges, we introduced a large-scale part-centric dataset for articulated object manipulation that features both photo-realistic material randomizations and detailed annotations of part-oriented, scene-level actionable interaction poses. We evaluated the effectiveness of our dataset by integrating it with several state-of-the-art methods for depth estimation and interaction pose prediction. Additionally, we proposed a novel modular framework that delivers superior and robust performance for generalizable articulated object manipulation. Our extensive experiments demonstrate that our dataset significantly improves the performance of depth perception and actionable interaction pose prediction in both simulation and real-world scenarios. More information and demos can be found at: https://pku-epic.github.io/GAPartManip/.
|
|
15:40-15:45, Paper ThDT6.6 | |
High-Precision Object Pose Estimation Using Visual-Tactile Information for Dynamic Interactions in Robotic Grasping |
|
Peng, Zicai | Beijing Institute of Technology |
Cui, Te | Beijing Institute of Technology |
Chen, Guangyan | Beijing Institute of Technology |
Lu, Haoyang | Beijing Institute of Techonology |
Yang, Yi | Beijing Institute of Technology |
Yue, Yufeng | Beijing Institute of Technology |
Keywords: Force and Tactile Sensing, Manipulation Planning, Grasping
Abstract: In various robotic applications, accurate understanding of object poses is essential for high-precision tasks such as factory assembly or daily insertions. Tactile sensing, which complements visual information, offers rich texture-based or force-based data for object pose estimation. However, previous methods for pose estimation typically overlook dynamic situations, such as slippage of grasped objects or movement of contacted objects during interactions with the environment, thus increasing the complexity of pose estimation. To address these challenges, we propose an efficient method that utilizes visual and tactile sensing to estimate object poses through particle filtering. We leverage visual information to track the pose of the contacted object in real-time and estimate the pose changes of the grasped object using displacement data obtained from tactile sensors. Our experimental evaluation on 13 objects with diverse geometric shapes demonstrated the ability to estimate high-precision poses, revealing the robot's ability to cope with dynamic scenes in which objects are forced to move and proving our framework's adaptability in practical scenarios with uncertainty.
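A much simplified sketch of the particle-filter idea in this abstract is given below: particles over a planar object pose are propagated with a tactile-estimated displacement and reweighted against a visual pose measurement; the noise models and measurement values are illustrative assumptions, not the paper's.

```python
# Simplified planar (x, y, yaw) particle filter fusing tactile motion and visual pose.
import numpy as np

rng = np.random.default_rng(0)
N = 500
particles = rng.normal(scale=0.01, size=(N, 3))      # particles around a prior pose
weights = np.full(N, 1.0 / N)

def predict(particles, tactile_delta, noise=(0.002, 0.002, 0.01)):
    """Propagate with the tactile-estimated displacement plus process noise."""
    return particles + tactile_delta + rng.normal(scale=noise, size=particles.shape)

def update(particles, weights, visual_pose, sigma=(0.005, 0.005, 0.02)):
    """Reweight by a Gaussian likelihood of the visual pose measurement."""
    err = (particles - visual_pose) / sigma
    w = weights * np.exp(-0.5 * np.sum(err ** 2, axis=1))
    return w / w.sum()

def resample(particles, weights):
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

particles = predict(particles, tactile_delta=np.array([0.001, 0.0, 0.005]))
weights = update(particles, weights, visual_pose=np.array([0.001, 0.0, 0.004]))
particles, weights = resample(particles, weights)
estimate = np.average(particles, weights=weights, axis=0)
```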
|
|
15:45-15:50, Paper ThDT6.7 | |
Object-Aware Impedance Control for Human-Robot Collaborative Task with Online Object Parameter Estimation (I) |
|
Park, Jinseong | Korea Institute of Machinery and Materials |
Shin, Young-Sik | KIMM |
Kim, Sanghyun | Kyung Hee University |
Keywords: Physical Human-Robot Interaction, Human-Robot Collaboration, Compliance and Impedance Control
Abstract: Physical human-robot interactions (pHRIs) can improve robot autonomy and reduce physical demands on humans. In this paper, we consider a collaborative task with a considerably long object and no prior knowledge of the object's parameters. An integrated control framework with an online object parameter estimator and a Cartesian object-aware impedance controller is proposed to realize complicated scenarios. During the transportation task, the object parameters are estimated online while a robot and human keep lifting an object. The perturbation motion is incorporated into the null space of the desired trajectory to enhance the estimator precision. An object-aware impedance controller is designed by incorporating the real-time estimation results to effectively transmit the intended human motion to the robot through the object. Experimental demonstrations of collaborative tasks, including object transportation and assembly, are implemented to show the effectiveness of our proposed method. The proposed controller was also compared to a conventional impedance controller through subjective testing and found to be more sensitive, requiring less human effort.
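The admittance model referred to above can be sketched as a discrete-time Cartesian mass-damper-spring, M dv/dt + D v + K x = F_ext, integrated at the control rate; the parameter values below are placeholders, not the paper's tuned settings.

```python
# Minimal discrete-time admittance model: external force -> compliant reference motion.
import numpy as np

M = np.diag([5.0, 5.0, 5.0])      # virtual mass
D = np.diag([40.0, 40.0, 40.0])   # virtual damping
K = np.diag([0.0, 0.0, 0.0])      # zero stiffness -> free-floating compliance
DT = 0.002                        # assumed 500 Hz control rate

def admittance_step(x, v, f_ext):
    """Return updated offset x and velocity v of the compliant reference."""
    acc = np.linalg.solve(M, f_ext - D @ v - K @ x)
    v = v + DT * acc
    x = x + DT * v
    return x, v

x, v = np.zeros(3), np.zeros(3)
x, v = admittance_step(x, v, f_ext=np.array([2.0, 0.0, 0.0]))  # e.g., a human push
```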
|
|
ThDT7 |
309 |
Navigation Planning |
Regular Session |
Chair: Andersson, Olov | KTH Royal Institute of Technology |
Co-Chair: Petit, Louis | Université De Sherbrooke |
|
15:15-15:20, Paper ThDT7.1 | |
SARO: Space-Aware Robot System for Terrain Crossing Via Vision-Language Model |
|
Zhu, Shaoting | Tsinghua University |
Li, Derun | Shanghai Jiao Tong University |
Mou, Linzhan | University of Pennsylvania |
Liu, Yong | Zhejiang University |
Xu, Ningyi | Shanghai Jiao Tong University |
Zhao, Hang | Tsinghua University |
Keywords: AI-Enabled Robotics, Legged Robots, Autonomous Agents
Abstract: The application of vision-language models (VLMs) has achieved impressive success in various robotics tasks. However, there have been few explorations of foundation models for quadruped robot navigation across terrains in 3D environments. We introduce SARO (Space-Aware Robot System for Terrain Crossing), an innovative system composed of a high-level reasoning module, a closed-loop sub-task execution module, and a low-level control policy. It enables the robot to navigate across 3D terrains and reach the goal position. For high-level reasoning and execution, we propose a novel algorithmic system that takes advantage of a VLM, with a task-decomposition design and a closed-loop sub-task execution mechanism. For low-level locomotion control, we utilize the Probability Annealing Selection (PAS) method to effectively train a control policy by reinforcement learning. Numerous experiments show that our whole system can accurately and robustly navigate across several 3D terrains, and its generalization ability supports applications in diverse indoor and outdoor scenarios and terrains. The appendix and videos can be found on the project page: https://saro-vlm.github.io/
|
|
15:20-15:25, Paper ThDT7.2 | |
Lab2Car: A Versatile Wrapper for Deploying Experimental Planners in Complex Real-World Environments |
|
Heim, Marc | Motional AD |
Suárez-Ruiz, Francisco | Motional Inc |
Bhuiyan, Ishraq | Motional |
Brito, Bruno | TU Delft |
Tomov, Momchil | Motional |
Keywords: Autonomous Agents, Motion and Path Planning, Machine Learning for Robot Control
Abstract: Human-level autonomous driving is an ever-elusive goal, with planning and decision making -- the cognitive functions that determine driving behavior -- posing the greatest challenge. Despite a proliferation of promising approaches, progress is stifled by the difficulty of deploying experimental planners in naturalistic settings. In this work, we propose Lab2Car, an optimization-based wrapper that can take a trajectory sketch from an arbitrary motion planner and convert it to a safe, comfortable, dynamically feasible trajectory that the car can follow. This allows motion planners that do not provide such guarantees to be safely tested and optimized in real-world environments. We demonstrate the versatility of Lab2Car by using it to deploy a machine learning (ML) planner and a classical planner on self-driving cars in Las Vegas. The resulting systems handle challenging scenarios, such as cut-ins, overtaking, and yielding, in complex urban environments like casino pick-up/drop-off areas. Our work paves the way for quickly deploying and evaluating candidate motion planners in realistic settings, ensuring rapid iteration and accelerating progress towards human-level autonomy.
|
|
15:25-15:30, Paper ThDT7.3 | |
One Map to Find Them All: Real-Time Open-Vocabulary Mapping for Zero-Shot Multi-Object Navigation |
|
Busch, Finn Lukas | KTH Royal Institute of Technology |
Homberger, Timon | KTH Royal Institute of Technology |
Ortega Peimbert, Jesús Gerardo | KTH Royal Institute of Technology |
Yang, Quantao | KTH Royal Institute of Technology |
Andersson, Olov | KTH Royal Institute |
Keywords: Semantic Scene Understanding, AI-Enabled Robotics, Autonomous Agents
Abstract: The capability to efficiently search for objects in complex environments is fundamental for many real-world robot applications. Recent advances in open-vocabulary vision models have resulted in semantically-informed object navigation methods that allow a robot to search for an arbitrary object without prior training. However, these zero-shot methods have so far treated the environment as unknown for each consecutive query. In this paper we introduce a new benchmark for zero-shot multi-object navigation, allowing the robot to leverage information gathered from previous searches to more efficiently find new objects. To address this problem we build a reusable open-vocabulary feature map tailored for real-time object search. We further propose a probabilistic-semantic map update that mitigates common sources of errors in semantic feature extraction and leverage this semantic uncertainty for informed multi-object exploration. We evaluate our method on a set of object navigation tasks in both simulation as well as with a real robot, running in real-time on a Jetson Orin AGX. We demonstrate that it outperforms existing state-of-the-art approaches both on single and multi-object navigation tasks. Additional videos, code and the multi-object navigation benchmark will be available on https://finnbsch.github.io/OneMap.
|
|
15:30-15:35, Paper ThDT7.4 | |
Exploring Adversarial Obstacle Attacks in Search-Based Path Planning for Autonomous Mobile Robots |
|
Szvoren, Adrian | University College London |
Liu, Jianwei | University College London |
Kanoulas, Dimitrios | University College London |
Tuptuk, Nilufer | University College London |
Keywords: Autonomous Agents, Constrained Motion Planning, Performance Evaluation and Benchmarking
Abstract: Path planning algorithms, such as the search-based A*, are a critical component of autonomous mobile robotics, enabling robots to navigate from a starting point to a destination efficiently and safely. We investigated the resilience of the A* algorithm in the face of potential adversarial interventions known as obstacle attacks. The adversary’s goal is to delay the robot’s timely arrival at its destination by introducing obstacles along its original path. We developed malicious software to execute the attacks and conducted experiments to assess their impact, both in simulation using TurtleBot in Gazebo and in real-world deployment with the Unitree Go1 robot. In simulation, the attacks resulted in an average delay of 36%, with the most significant delays occurring in scenarios where the robot was forced to take substantially longer alternative paths. In real-world experiments, the delays were even more pronounced, with all attacks successfully rerouting the robot and causing measurable disruptions. These results highlight that the algorithm’s robustness is not solely an attribute of its design but is significantly influenced by the operational environment. For example, in constrained environments like tunnels, the delays were maximized due to the limited availability of alternative routes.
|
|
15:35-15:40, Paper ThDT7.5 | |
Topological Mapping for Traversability-Aware Long-Range Navigation in Off-Road Terrain |
|
Tremblay, Jean-François | McGill University |
Alhosh, Julie | McGill University |
Petit, Louis | Université De Sherbrooke |
Lotfi, Faraz | McGill University |
Landauro, Lara | McGill University |
Meger, David Paul | McGill University |
Keywords: Field Robots, Integrated Planning and Learning, Vision-Based Navigation
Abstract: Autonomous robots navigating in off-road terrain like forests open new opportunities for automation. While off-road navigation has been studied, existing work often relies on clearly delineated pathways. We present a method allowing for long-range planning, exploration and low-level control in unknown off-trail forest terrain, using vision and GPS only. We represent outdoor terrain with a topological map, which is a set of panoramic snapshots connected with edges containing traversability information. A novel traversability analysis method is demonstrated, predicting the existence of a safe path towards a target in an image. Navigating between nodes is done using goal-conditioned behavior cloning, leveraging the power of a pretrained vision transformer. An exploration planner is presented, efficiently covering an unknown off-road area with unknown traversability using a frontiers-based approach. The approach is successfully deployed to autonomously explore two 400 m² forest sites unseen during training, in difficult conditions for navigation.
|
|
15:40-15:45, Paper ThDT7.6 | |
GPU-Enabled Parallel Trajectory Optimization Framework for Safe Motion Planning of Autonomous Vehicles |
|
Lee, Yeongseok | Korea Advanced Institute of Science and Technology |
Choi, Keun Ha | Korea Advanced Institute of Science and Technology |
Kim, Kyung-Soo | KAIST(Korea Advanced Institute of Science and Technology) |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Integrated Planning and Control
Abstract: This paper presents a GPU-enabled parallel trajectory optimization framework for model predictive control (MPC) in complex urban environments. It fuses the advantages of sampling-based MPC, which can cope with nonconvex costmaps through random sampling of trajectories, with the advantages of gradient-based MPC, which can generate smooth trajectories. In addition, we leverage a generalized safety-embedded MPC problem definition with a discrete barrier state (DBaS). The proposed framework has three steps: 1) a costmap builder to generate the barrier function map, 2) a seed trajectory generator to choose randomly generated trajectories to send to the optimizers, and 3) a batch trajectory optimizer to optimize each of the seed trajectories and select the best trajectory. Experiments with real-time simulations compare the effectiveness of the proposed framework, sampling-based MPC, and gradient-based MPC, which optimizes a single trajectory. The experiments also compare the application of two different control sequence sampling schemes to the proposed framework. The results show that the proposed framework performs gradient-based optimization but can plan a better trajectory even in complex environments by providing various initial guesses. We also show that the proposed framework can perform more accurate control actions than sampling-based MPC.
|
|
15:45-15:50, Paper ThDT7.7 | |
A Real-Time Spatio-Temporal Trajectory Planner for Autonomous Vehicles with Semantic Graph Optimization |
|
He, Shan | Beihang University |
Ma, Yalong | Beihang University |
Song, Tao | Beihang University |
Jiang, Yongzhi | Beihang University |
Wu, Xinkai | Beihang University |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Intelligent Transportation Systems
Abstract: Planning a safe and feasible trajectory for autonomous vehicles in real-time by fully utilizing perceptual information in complex urban environments is challenging. In this paper, we propose a spatio-temporal trajectory planning method based on graph optimization. It efficiently extracts the multi-modal information of the perception module by constructing a semantic spatio-temporal map, processing static and dynamic obstacles separately, and then quickly generates feasible trajectories via sparse graph optimization based on a semantic spatio-temporal hypergraph. Extensive experiments have proven that the proposed method can effectively handle complex urban public road scenarios and perform in real time. We will also release our code to facilitate benchmarking by the research community.
|
|
ThDT8 |
311 |
Collision Avoidance 1 |
Regular Session |
Chair: Christensen, Henrik | University of California, San Diego |
Co-Chair: Hereid, Ayonga | Ohio State University |
|
15:15-15:20, Paper ThDT8.1 | |
Sailing through Point Clouds: Safe Navigation Using Point Cloud Based Control Barrier Functions |
|
Dai, Bolun | New York University |
Khorrambakht, Rooholla | New York University |
Krishnamurthy, Prashanth | New York University Tandon School of Engineering |
Khorrami, Farshad | New York University Tandon School of Engineering |
Keywords: Robot Safety, Collision Avoidance, Motion and Path Planning
Abstract: The capability to navigate safely in an unstructured environment is crucial when deploying robotic systems in real-world scenarios. Recently, control barrier function (CBF) based approaches have been highly effective in synthesizing safety-critical controllers. In this work, we propose a novel CBF-based local planner comprised of two components: Vessel and Mariner. The Vessel is a novel scaling factor based CBF formulation that synthesizes CBFs using only point cloud data. The Mariner is a CBF-based preview control framework that is used to mitigate getting stuck in spurious equilibria during navigation. To demonstrate the efficacy of our proposed approach, we first compare the proposed point cloud based CBF formulation with other point cloud based CBF formulations. Then, we demonstrate the performance of our proposed approach and its integration with global planners using experimental studies on the Unitree B1 and Unitree Go2 quadruped robots in various environments.
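The general CBF-QP safety-filter pattern underlying approaches like this one can be sketched as follows for a single-integrator robot and one barrier built from the closest point-cloud point; this is a generic illustration under assumed values, not the Vessel/Mariner formulation itself.

```python
# Generic CBF-QP safety filter: minimally modify a nominal command so that
# the barrier condition h_dot >= -alpha * h holds for one obstacle point.
import numpy as np
import cvxpy as cp

ALPHA = 1.0
robot = np.array([0.0, 0.0])
obstacle_pt = np.array([1.0, 0.5])        # assumed closest point-cloud point
r_safe = 0.4

# h(x) = ||x - p||^2 - r^2, grad_h = 2 (x - p); single-integrator xdot = u
h = np.sum((robot - obstacle_pt) ** 2) - r_safe ** 2
grad_h = 2.0 * (robot - obstacle_pt)

u_nom = np.array([0.5, 0.3])              # command from a global planner (assumed)
u = cp.Variable(2)
prob = cp.Problem(cp.Minimize(cp.sum_squares(u - u_nom)),
                  [grad_h @ u >= -ALPHA * h])
prob.solve()
print(u.value)                            # filtered, safety-respecting command
```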
|
|
15:20-15:25, Paper ThDT8.2 | |
Parallel-Constraint Model Predictive Control: Exploiting Parallel Computation for Improving Safety |
|
Fontanari, Elias | University of Trento |
Lunardi, Gianni | University of Trento |
Saveriano, Matteo | University of Trento |
Del Prete, Andrea | University of Trento |
Keywords: Robot Safety, Optimization and Optimal Control, Motion Control
Abstract: Ensuring constraint satisfaction is a key requirement for safety-critical systems, which include most robotic platforms. For example, constraints can be used for modeling joint position/velocity/torque limits and collision avoidance. Constrained systems are often controlled using Model Predictive Control, because of its ability to naturally handle constraints relying on numerical optimization. However, ensuring constraint satisfaction is challenging for nonlinear systems/constraints. A well-known tool to make controllers safe is the so-called control-invariant set (a.k.a. safe set). In our previous work we have shown that safety can be improved by letting the safe set constraint recede along the horizon. In this paper we push that idea further. We suggest to exploit parallel computation for solving several MPC problems at the same time. Each problem instantiates the safe set constraint at a different time step along the horizon. Finally, the controller can select the best solution according to some user-defined criteria. We validated this idea through extensive simulations with a 3-joint robotic arm, showing that significant improvements can be achieved, even using as little as 4 computational cores.
|
|
15:25-15:30, Paper ThDT8.3 | |
Dual-AEB: Synergizing Rule-Based and Multimodal Large Language Models for Effective Emergency Braking |
|
Zhang, Wei | Harbin Institute of Techonolgy |
Li, Pengfei | Institute for AI Industry Research (AIR), Tsinghua University |
Wang, Junli | Institute of Automation, Chinese Academy of Sciences |
Sun, Bingchuan | Lenovo |
Jin, Qihao | Fudan University |
Bao, Guangjun | Lenovo |
Yu, Yang | Lenovo |
Ding, Wenchao | Fudan University |
Li, Peng | Tsinghua University |
Chen, Yilun | Tsinghua University |
Keywords: Autonomous Agents, Semantic Scene Understanding
Abstract: Automatic Emergency Braking (AEB) systems are a crucial component in ensuring the safety of passengers in autonomous vehicles. Conventional AEB systems primarily rely on closed-set perception modules to recognize traffic conditions and assess collision risks. To enhance the adaptability of AEB systems in open scenarios, we propose Dual-AEB, a system that combines an advanced multimodal large language model (MLLM) for comprehensive scene understanding and a conventional rule-based rapid AEB to ensure quick response times. To the best of our knowledge, Dual-AEB is the first method to incorporate MLLMs within AEB systems. Through extensive experimentation, we have validated the effectiveness of our method.
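The rule-based rapid-braking half of such a system can be illustrated with a simple time-to-collision trigger; the thresholds and the interface to the MLLM-based scene understanding are assumptions made only for illustration.

```python
# Toy rule-based AEB trigger using time-to-collision (TTC).
def rule_based_aeb(gap_m, closing_speed_mps, ttc_brake_s=1.5, ttc_warn_s=2.5):
    """Return 'brake', 'warn', or 'none' given range and closing speed."""
    if closing_speed_mps <= 0.0:          # opening gap: not on a collision course
        return "none"
    ttc = gap_m / closing_speed_mps
    if ttc < ttc_brake_s:
        return "brake"
    if ttc < ttc_warn_s:
        return "warn"
    return "none"

print(rule_based_aeb(gap_m=12.0, closing_speed_mps=10.0))   # -> 'brake'
```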
|
|
15:30-15:35, Paper ThDT8.4 | |
Estimating Control Barriers from Offline Data |
|
Yu, Hongzhan | University of California San Diego |
Farrell, Seth | University of California San Diego |
Yoshimitsu, Ryo | IHI Corporation |
Qin, Zhizhen | University of California, San Diego |
Christensen, Henrik | University of California, San Diego |
Gao, Sicun | UCSD |
Keywords: AI-Based Methods, Robot Safety, Collision Avoidance
Abstract: Learning-based methods for constructing control barrier functions (CBFs) are gaining popularity for enforcing safety in practical robot control under complex dynamics and uncertainty that are hard to model. A major limitation of existing methods is their reliance on extensive sampling over the state space, making it hard to construct CBFs on real robots. In this work we introduce methods for learning neural CBFs from a fixed, sparsely labeled dataset collected prior to training either the CBFs or the controllers. We propose novel annotation techniques based on out-of-distribution analysis to effectively propagate the information from the limited labeled data to the unlabeled data. We evaluate the proposed algorithm on real-world platforms. With a limited amount of offline data, the proposed methods can achieve state-of-the-art performance for dynamic obstacle avoidance, with statistically safer and less conservative maneuvers compared to existing methods.
|
|
15:35-15:40, Paper ThDT8.5 | |
Real-Time Safe Bipedal Robot Navigation Using Linear Discrete Control Barrier Functions |
|
Peng, Chengyang | The Ohio State University |
Paredes, Victor | The Ohio State University |
Castillo, Guillermo A. | The Ohio State University |
Hereid, Ayonga | Ohio State University |
Keywords: Humanoid and Bipedal Locomotion, Integrated Planning and Control, Collision Avoidance
Abstract: Safe navigation in real time is an essential task for humanoid robots in real-world deployment. Since humanoid robots are inherently underactuated due to unilateral ground contacts, a path is considered safe if it is obstacle-free and respects the robot's physical limitations and underlying dynamics. Existing approaches often decouple path planning from gait control due to the significant computational challenge caused by the full-order robot dynamics. In this work, we develop a unified, safe path and gait planning framework that can be evaluated online in real-time, allowing the robot to navigate cluttered environments while sustaining stable locomotion. Our approach uses the popular Linear Inverted Pendulum (LIP) model as a template model to represent walking dynamics. It incorporates heading angles in the model to properly evaluate kinematic constraints essential for physically feasible gaits. In addition, we leverage discrete control barrier functions (DCBF) for obstacle avoidance, ensuring that the subsequent foot placement provides a safe navigation path within cluttered environments. To guarantee real-time computation, we use a novel approximation of the DCBF to produce linear DCBF constraints. We validate our proposed approach in simulation using a Digit robot in randomly generated environments. The results demonstrate that the proposed approach can generate safe gaits for a non-trivial humanoid robot to navigate a cluttered environment in real time.
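A minimal sketch of the LIP template and a discrete-CBF-style foot-placement check, in the spirit of the framework above, is given below; the step time, obstacle, and decay rate are illustrative assumptions, and the sketch omits the heading angles and the linearization that the paper introduces.

```python
# Sketch: analytic LIP step evolution plus a discrete CBF safety condition.
import numpy as np

G, Z0, T_STEP = 9.81, 0.9, 0.4
OMEGA = np.sqrt(G / Z0)

def lip_step(x, v, foot, T=T_STEP):
    """Analytic LIP evolution of CoM position/velocity about a fixed foot."""
    c, s = np.cosh(OMEGA * T), np.sinh(OMEGA * T)
    dx = x - foot
    return foot + c * dx + (s / OMEGA) * v, OMEGA * s * dx + c * v

def dcbf_ok(foot_next, obstacle, r=0.35, gamma=0.5, h_prev=None):
    """Discrete CBF condition h_{k+1} >= (1 - gamma) * h_k for a disk obstacle."""
    h_next = np.linalg.norm(foot_next - obstacle) - r
    return h_next >= (1.0 - gamma) * (h_prev if h_prev is not None else h_next)

x, v = np.array([0.0, 0.0]), np.array([0.4, 0.0])
foot_candidate = np.array([0.35, 0.1])
x, v = lip_step(x, v, foot_candidate)
print(dcbf_ok(foot_candidate, obstacle=np.array([1.0, 0.0]), h_prev=0.6))
```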
|
|
15:40-15:45, Paper ThDT8.6 | |
FuzzRisk: Online Collision Risk Estimation for Autonomous Vehicles Based on Depth-Aware Object Detection Via Fuzzy Inference |
|
Liao, Brian Hsuan-Cheng | DENSO AUTOMOTIVE Deutschland GmbH |
Xu, Yingjie | Technical University of Munich |
Cheng, Chih-Hong | Chalmers University of Technology |
Esen, Hasan | DENSO AUTOMOTIVE Deutschland GmbH |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: Object Detection, Segmentation and Categorization, Robot Safety, Intelligent Transportation Systems
Abstract: This paper presents a novel monitoring framework that infers the level of collision risk for autonomous vehicles (AVs) based on their object detection performance. The framework takes two sets of predictions from different algorithms and associates their inconsistencies with the collision risk via fuzzy inference. The first set of predictions is obtained by retrieving safety-critical 2.5D objects from a depth map, and the second set comes from the ordinary AV's 3D object detector. We experimentally validate that, based on Intersection-over-Union (IoU) and a depth discrepancy measure, the inconsistencies between the two sets of predictions strongly correlate to the error of the 3D object detector against ground truths. This correlation allows us to construct a fuzzy inference system and map the inconsistency measures to an AV collision risk indicator. In particular, we optimize the fuzzy inference system towards an existing offline metric that matches AV collision rates well. Lastly, we validate our monitor's capability to produce relevant risk estimates with the large-scale nuScenes dataset and demonstrate that it can safeguard an AV in closed-loop simulations.
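The fuzzy-inference mapping described above can be illustrated with a toy rule base over an IoU-based inconsistency and a depth discrepancy; the membership breakpoints and consequent risk levels below are invented for the example and are not the paper's tuned system.

```python
# Toy fuzzy inference: two inconsistency measures -> scalar collision risk.
def tri(x, a, b, c):
    """Triangular membership with peak at b over support [a, c]."""
    return max(0.0, min((x - a) / (b - a + 1e-9), (c - x) / (c - b + 1e-9)))

def collision_risk(iou_inconsistency, depth_discrepancy_m):
    low_i  = tri(iou_inconsistency, -0.1, 0.0, 0.5)
    high_i = tri(iou_inconsistency,  0.3, 1.0, 1.1)
    low_d  = tri(depth_discrepancy_m, -0.5, 0.0, 2.0)
    high_d = tri(depth_discrepancy_m,  1.0, 5.0, 6.0)
    # Rules: (antecedent firing strength, consequent risk level in [0, 1])
    rules = [(min(low_i, low_d), 0.1),
             (max(high_i, high_d), 0.6),
             (min(high_i, high_d), 0.95)]
    num = sum(w * r for w, r in rules)
    den = sum(w for w, _ in rules) + 1e-9
    return num / den                      # weighted-average defuzzification

print(collision_risk(iou_inconsistency=0.8, depth_discrepancy_m=3.0))
```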
|
|
15:45-15:50, Paper ThDT8.7 | |
Adaptive Deadlock Avoidance for Decentralized Multi-Agent Systems Via CBF-Inspired Risk Measurement |
|
Zhang, Yanze | University of Illinois Chicago |
Lyu, Yiwei | Carnegie Mellon University |
Jo, Siwon | University of North Carolina at Charlotte |
Yang, Yupeng | University of North Carolina at Charlotte |
Luo, Wenhao | University of Illinois Chicago |
Keywords: Autonomous Agents, Agent-Based Systems, Multi-Robot Systems
Abstract: Decentralized safe control plays an important role in multi-agent systems given the scalability and robustness without reliance on a central authority. However, without an explicit global coordinator, the decentralized control methods are often prone to deadlock --- a state where the system reaches equilibrium, causing the robots to stall. In this paper, we propose a generalized decentralized framework that unifies the Control Lyapunov Function (CLF) and Control Barrier Function (CBF) to facilitate efficient task execution and ensure deadlock-free trajectories for the multi-agent systems. As the agents approach the deadlock-related undesirable equilibrium, the framework can detect the equilibrium and drive agents away before that happens. This is achieved by a secondary deadlock resolution design with an auxiliary CBF to prevent the multi-agent systems from converging to the undesirable equilibrium. To avoid dominating effects due to the deadlock resolution over the original task-related controllers, a deadlock indicator function using CBF-inspired risk measurement is proposed and encoded in the unified framework for the agents to adaptively determine when to activate the deadlock resolution. This allows the agents to follow their original control tasks and seamlessly unlock or deactivate deadlock resolution as necessary, effectively improving task efficiency. We demonstrate the effectiveness of the proposed method through theoretical analysis, numerical simulations, and real-world experiments.
|
|
ThDT9 |
312 |
Task and Motion Planning 3 |
Regular Session |
Chair: Park, Shinkyu | KAUST |
Co-Chair: Mukherjee, Koena | NIT Silchar |
|
15:15-15:20, Paper ThDT9.1 | |
SEAL: A Sample-Efficient Adjustment-Learning Method for Table Tennis Robot Serve |
|
Guo, Qitong | The University of Tokyo |
Shi, Xiaohang | The University of Tokyo |
Murakami, Kenichi | The University of Tokyo |
Jia, Ruoyu | The University of Tokyo |
Yamakawa, Yuji | The University of Tokyo |
Keywords: Machine Learning for Robot Control, Task and Motion Planning, Manipulation Planning
Abstract: Table tennis robots have significantly advanced in performance owing to the rapid progress in deep learning and reinforcement learning technologies. However, these advancements often require a large number of training samples. Moreover, research focused on the robot serve task remains relatively limited. In response to these problems, this paper proposes a sample-efficient adjustment-learning (SEAL) method for the serve task inspired by human experience in table tennis, which can inherently augment the available training samples without the need for additional sample collection. Adjustment learning does not require complex network structures but demonstrates superior performance. The models trained by adjustment learning generalize well and are robust, adapting to different serve styles and reducing system transfer errors very efficiently. In addition, a random interpolation method is introduced during the dataset generation stage, and the effectiveness of simultaneous learning in both joint space and Cartesian space is also demonstrated. For the specific serve task, an accuracy of less than 30 mm to any designated position is achieved at the first shot.
|
|
15:20-15:25, Paper ThDT9.2 | |
Inference Based Multi-Object Reactive Search in a Partially Known Environment with Temporal Logic Specifications |
|
Kang, Yaohui | University of Science and Technology of China |
Chen, Ziyang | University of Science and Technology of China |
Xia, Yanjie | University of Science and Technology of China |
Kan, Zhen | University of Science and Technology of China |
Keywords: Task and Motion Planning, Formal Methods in Robotics and Automation
Abstract: Efficiently searching for multiple objects in a partially known environment, where only the names and locations of landmarks are available, presents significant challenges. Existing search algorithms in the literature fail to fully utilize prior knowledge to improve search efficiency, and exhibit significantly diminished efficiency when extended to multi-object search. To address these limitations, we propose an inference-based multi-object reactive search framework. This framework utilizes the COMET inference model to reason about co-occurrence values between the target objects and known landmarks, thereby enhancing search efficiency. These co-occurrence values are integrated into a reactive temporal logic motion planning strategy, which allows the robot to search for multiple objects under temporal logic constraints specified in LTL and to adapt dynamically if the inferred reasoning differs from the actual object arrangement encountered during the search. Extensive simulations were conducted to evaluate the feasibility and efficiency of the proposed motion planning algorithm. Results demonstrate that the integration of commonsense reasoning with reactive temporal logic planning significantly improves multi-object search efficiency. Project website: https://sites.google.com/view/imors.
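A minimal sketch of the co-occurrence-guided idea described above: landmarks are visited in decreasing order of an inferred co-occurrence score with the target object. The landmark names and scores are placeholders standing in for what a commonsense model such as COMET would provide; this is not the paper's implementation.

def plan_visit_order(target, landmarks, cooccurrence):
    """Sort landmarks by how likely the target co-occurs with them."""
    return sorted(landmarks, key=lambda lm: cooccurrence.get((target, lm), 0.0),
                  reverse=True)

landmarks = ["kitchen_counter", "bookshelf", "bathroom_sink"]
cooccurrence = {("mug", "kitchen_counter"): 0.9,     # placeholder inferred scores
                ("mug", "bookshelf"): 0.3,
                ("mug", "bathroom_sink"): 0.2}
print(plan_visit_order("mug", landmarks, cooccurrence))
# ['kitchen_counter', 'bookshelf', 'bathroom_sink']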
|
|
15:25-15:30, Paper ThDT9.3 | |
Planning with Adaptive World Models for Autonomous Driving |
|
Vasudevan, Arun Balajee | Carnegie Mellon University |
Peri, Neehar | Carnegie Mellon University |
Schneider, Jeff | Carnegie Mellon University |
Ramanan, Deva | Carnegie Mellon University |
Keywords: Task and Motion Planning, Behavior-Based Systems, Robust/Adaptive Control
Abstract: Motion planning is crucial for safe navigation in complex urban environments. Historically, motion planners (MPs) have been evaluated with procedurally-generated simulators like CARLA. However, such synthetic benchmarks do not capture real-world multi-agent interactions. nuPlan, a recently released MP benchmark, addresses this limitation by augmenting real-world driving logs with closed-loop simulation logic, effectively turning the fixed dataset into a reactive simulator. We analyze the characteristics of nuPlan's recorded logs and find that each city has its own unique driving behaviors, suggesting that robust planners must adapt to different environments. We learn to model such unique behaviors with BehaviorNet, a graph convolutional neural network (GCNN) that predicts reactive agent behaviors using features derived from recently-observed agent histories; intuitively, some aggressive agents may tailgate lead vehicles, while others may not. To model such phenomena, BehaviorNet predicts the parameters of an agent's motion controller rather than directly predicting its spacetime trajectory (as most forecasters do). Finally, we present AdaptiveDriver, a model-predictive control (MPC) based planner that unrolls different world models conditioned on BehaviorNet's predictions. Our extensive experiments demonstrate that AdaptiveDriver achieves state-of-the-art results on the nuPlan closed-loop planning benchmark, improving over prior work by 2% on Test-14 Hard R-CLS, and generalizes even when evaluated on never-before-seen cities.
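To illustrate the design choice of predicting controller parameters rather than trajectories, the sketch below uses the Intelligent Driver Model (IDM) as the per-agent reactive controller; IDM is an assumption for illustration, and the parameter values stand in for what a network like BehaviorNet could output.

# IDM as an assumed example of a parameterized motion controller; p holds the
# per-agent parameters a behavior network could predict.
import math

def idm_acceleration(v, v_lead, gap, p):
    dv = v - v_lead                                       # approach rate to lead vehicle
    s_star = p["s0"] + v * p["T"] + v * dv / (2.0 * math.sqrt(p["a_max"] * p["b"]))
    return p["a_max"] * (1.0 - (v / p["v0"]) ** 4 - (s_star / max(gap, 0.1)) ** 2)

aggressive = {"v0": 15.0, "T": 0.8, "s0": 1.0, "a_max": 2.0, "b": 2.5}  # tailgater-like
cautious   = {"v0": 12.0, "T": 2.0, "s0": 3.0, "a_max": 1.0, "b": 1.5}
print(idm_acceleration(10.0, 8.0, 12.0, aggressive))
print(idm_acceleration(10.0, 8.0, 12.0, cautious))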
|
|
15:30-15:35, Paper ThDT9.4 | |
Subassembly to Full Assembly: Effective Assembly Sequence Planning through Graph-Based Reinforcement Learning |
|
Shu, Chang | KAUST |
Kim, Anton | KAUST |
Park, Shinkyu | KAUST |
Keywords: Task and Motion Planning, Assembly, Manipulation Planning
Abstract: This paper proposes an assembly sequence planning framework, named Subassembly to Assembly (S2A). The framework is designed to enable a robotic manipulator to assemble multiple parts in a prespecified structure by leveraging object manipulation actions. The primary technical challenge lies in the exponentially increasing complexity of identifying a feasible assembly sequence as the number of parts grows. To address this, we introduce a graph-based reinforcement learning approach, where a graph attention network is trained using a delayed reward assignment strategy. In this strategy, rewards are assigned only when an assembly action contributes to the successful completion of the assembly task. We validate the framework's performance through physics-based simulations, comparing it against various baselines to emphasize the significance of the proposed reward assignment approach. Additionally, we demonstrate the feasibility of deploying our framework in a real-world robotic assembly scenario.
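A minimal sketch of the delayed reward assignment idea: intermediate assembly actions receive no immediate reward, and credit is assigned only once the full assembly succeeds. The function names and discount value are assumptions, not the paper's exact scheme.

def assign_delayed_rewards(episode_actions, assembly_succeeded, gamma=0.95):
    """One reward per action; all zero unless the episode ended in success."""
    n = len(episode_actions)
    if not assembly_succeeded:
        return [0.0] * n
    # Terminal reward of 1 propagated backwards with discounting.
    return [gamma ** (n - 1 - t) for t in range(n)]

print(assign_delayed_rewards(["place_leg", "place_leg", "place_top"], True))
print(assign_delayed_rewards(["place_leg", "remove_leg"], False))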
|
|
15:35-15:40, Paper ThDT9.5 | |
Fuel-Optimal Operational Speed Planning for Autonomous Trucking on Highways |
|
Li, Wei | Inceptio |
Wu, Bin | Inceptio |
Xiang, Jiahao | Tongji University, Inceptio Technology |
Ren, Jiaping | Inceptio Technology |
Wu, Yi | Nanjing University of Posts and Telecommunications |
Yang, Ruigang | University of Kentucky |
Keywords: Task and Motion Planning, Logistics, Planning, Scheduling and Coordination
Abstract: The rapid advancement of autonomous driving technology, particularly in autonomous trucking on highways, shows great value for enhancing efficiency and reducing costs in the logistics industry. In this work, we define the full-trip speed planning problem for autonomous trucks under delivery time and fuel consumption constraints, referred to as the Operational Speed Planning (OSP) problem. To support and accelerate research on the OSP problem, we have developed a comprehensive dataset using a fleet of over 400 trucks. The dataset contains rich, diverse information covering more than 22 million kilometers of real-world highway driving data. In addition to this static dataset, we have developed a closed-loop simulator that allows for the interactive evaluation of OSP solutions, enabling researchers to test speed planning strategies in a realistic environment. Furthermore, we provide an OSP baseline method based on dynamic programming to optimize speed planning, balancing the delivery time requirements and fuel consumption. Our extensive experiments demonstrate both the accuracy of the simulation and the effectiveness of the OSP baseline in planning optimal speeds, proving its capability to meet time constraints while improving fuel efficiency. The dataset, simulator, and baseline will be made publicly available to foster further research and innovation in this area.
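A toy dynamic-programming sketch of the kind of full-trip speed planning described above: one speed per road segment is chosen to trade off fuel against trip time, with a small penalty on speed changes. The fuel model, segment data, and weights are all assumptions, not the Inceptio baseline.

# Toy DP over discretized speeds: lam penalizes trip time, mu penalizes speed changes.
def fuel_rate(v, grade):                      # placeholder fuel-rate model
    return 0.05 + 0.002 * v ** 2 + 0.5 * max(grade, 0.0)

def plan_speeds(segments, speeds, lam=1.0, mu=0.05):
    """segments: (length_m, grade) pairs; speeds: candidate speeds in m/s."""
    best = {v: 0.0 for v in speeds}           # cost-to-come keyed by current speed
    picks = []
    for length, grade in segments:
        nxt, pick = {}, {}
        for v in speeds:
            seg_cost = (fuel_rate(v, grade) + lam) * (length / v)
            cand = {vp: best[vp] + mu * (v - vp) ** 2 for vp in speeds}
            vp_best = min(cand, key=cand.get)
            nxt[v] = cand[vp_best] + seg_cost
            pick[v] = vp_best
        best = nxt
        picks.append(pick)
    v = min(best, key=best.get)               # backtrace the optimal speed profile
    profile = [v]
    for pick in reversed(picks):
        v = pick[v]
        profile.append(v)
    return list(reversed(profile))[1:]        # drop the artificial pre-trip speed

print(plan_speeds([(1000, 0.0), (1000, 0.05), (1000, -0.02)], [20.0, 25.0, 30.0]))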
|
|
15:40-15:45, Paper ThDT9.6 | |
Verifiably Following Complex Robot Instructions with Foundation Models |
|
Quartey, Benedict | Brown University |
Rosen, Eric | The AI Institute |
Tellex, Stefanie | Brown |
Konidaris, George | Brown University |
Keywords: Task and Motion Planning, Mobile Manipulation, Semantic Scene Understanding
Abstract: When instructing robots, users want to flexibly express constraints, refer to arbitrary landmarks, and verify robot behavior, while robots must disambiguate instructions into specifications and ground instruction referents in the real world. To address this problem, we propose Language Instruction grounding for Motion Planning (LIMP), an approach that enables robots to verifiably follow complex, open-ended instructions in real-world environments without prebuilt semantic maps. LIMP constructs a symbolic instruction representation that reveals the robot’s alignment with an instructor’s intended motives and affords the synthesis of correct-by-construction robot behaviors. We conduct a large-scale evaluation of LIMP on 150 instructions across five real-world environments, demonstrating its versatility and ease of deployment in diverse, unstructured domains. LIMP performs comparably to state-of-the-art baselines on standard open-vocabulary tasks and additionally achieves a 79% success rate on complex spatiotemporal instructions, significantly outperforming baselines that only reach 38%.
|
|
15:45-15:50, Paper ThDT9.7 | |
A Hierarchical Approach for Joint Task Allocation and Path Planning |
|
Ho, Florence | NEC Corporation, National Institute of Advanced Industrial Scien |
Higa, Ryota | NEC Corporation, National Institute of Advanced Industrial Scien |
Kato, Takuro | National Institute of Advanced Industrial Science and Technology |
Nakadai, Shinji | NEC Corporation |
Keywords: Task Planning, Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents
Abstract: This paper addresses the joint task allocation and path planning problem, whereby a fleet of vehicles must be optimally assigned to service multiple given tasks while their planned paths must be collision-free. This problem, composed of two tightly coupled optimization problems, has a complexity that grows rapidly with the number of tasks and vehicles, so optimal solvers do not scale to large instances. Therefore, we propose a novel method to solve this problem, HTAPPS, which introduces a hierarchical resolution framework. Our approach decomposes a given instance into three levels of abstraction, each with an associated amount of detail, that progressively filter the search space. This reduces the computational effort required when performing task allocation and multi-agent path planning jointly. We perform simulations on automated warehouse scenarios and compare our approach to baseline solvers. The obtained results show that our proposed approach is able to solve large instances within a limited time.
|
|
ThDT10 |
313 |
Multi-Robot Systems 6 |
Regular Session |
Chair: Tsiotras, Panagiotis | Georgia Tech |
Co-Chair: Sartoretti, Guillaume Adrien | National University of Singapore (NUS) |
|
15:15-15:20, Paper ThDT10.1 | |
Multi S-Graphs: An Efficient Distributed Semantic-Relational Collaborative SLAM |
|
Fernandez-Cortizas, Miguel | Universidad Politécnica De Madrid |
Bavle, Hriday | University of Luxembourg |
Perez Saura, David | Computer Vision and Aerial Robotics Group (CVAR), Universidad Po |
Sanchez-Lopez, Jose Luis | University of Luxembourg |
Campoy, Pascual | Computer Vision & Aerial Robotics Group, Universidad Politécnica |
Voos, Holger | University of Luxembourg |
Keywords: Multi-Robot SLAM, SLAM, Multi-Robot Systems
Abstract: Collaborative Simultaneous Localization and Mapping (CSLAM) is critical to enable multiple robots to operate in complex environments. Most CSLAM techniques rely on raw sensor measurements or low-level features such as keyframe descriptors, which can lead to wrong loop closures due to the lack of deep understanding of the environment. Moreover, the exchange of these measurements and low-level features among the robots requires the transmission of a significant amount of data, which limits the scalability of the system. To overcome these limitations, we present Multi S-Graphs, a decentralized CSLAM system that utilizes high-level semantic-relational information embedded in the four-layered hierarchical and optimizable situational graphs for cooperative map generation and localization in structured environments while minimizing the information exchanged between the robots. To support this, we present a novel room-based descriptor which, along with its connected walls, is used to perform inter-robot loop closures, addressing the challenges of initialization in the multi-robot kidnapped-robot problem. Multiple experiments in simulated and real environments validate the improvement in accuracy and robustness of the proposed approach while reducing the amount of data exchanged between robots compared to other state-of-the-art approaches.
|
|
15:20-15:25, Paper ThDT10.2 | |
Language-Conditioned Offline RL for Multi-Robot Navigation |
|
Morad, Steven | The University of Cambridge |
Shankar, Ajay | University of Cambridge, UK |
Blumenkamp, Jan | University of Cambridge |
Prorok, Amanda | University of Cambridge |
Keywords: Multi-Robot Systems, Networked Robots, Reinforcement Learning
Abstract: We present a method for synthesizing navigation policies for multi-robot teams that interpret and follow natural language instructions. We condition these policies on embeddings from pretrained Large Language Models (LLMs), and train them via offline reinforcement learning with as little as 20 minutes of randomly-collected real-world data. Experiments on a team of five real robots show that these policies generalize well to unseen commands, indicating an understanding of the LLM latent space. Our method requires no simulators or environment models, and produces low-latency control policies that can be deployed directly to real robots without finetuning. We provide videos of our experiments at https://sites.google.com/view/llm-marl.
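A minimal sketch (assumptions throughout) of a policy conditioned on a frozen language-instruction embedding, as described above: the embedding is concatenated with the robot observation and fed to a small MLP that outputs a bounded velocity command.

import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    """Observation and a frozen instruction embedding are concatenated and mapped
    to a bounded velocity command; the dimensions here are assumptions."""
    def __init__(self, obs_dim=16, embed_dim=384, act_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs, instruction_embedding):
        return self.net(torch.cat([obs, instruction_embedding], dim=-1))

policy = LanguageConditionedPolicy()
obs = torch.zeros(1, 16)
emb = torch.zeros(1, 384)        # placeholder for a pretrained LLM embedding
print(policy(obs, emb).shape)    # torch.Size([1, 2])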
|
|
15:25-15:30, Paper ThDT10.3 | |
Deep Reinforcement Learning for Coordinated Payload Transport in Biped-Wheeled Robots |
|
Mehta, Dhruv K | Clemson University |
Joglekar, Ajinkya | Clemson University |
Krovi, Venkat | Clemson University |
Keywords: Cooperating Robots, Multi-Robot Systems, Reinforcement Learning
Abstract: Coordinated payload transport via a fleet of modular wheeled mobile robots offers flexibility for handling larger loads in indoor and outdoor environments. Biped-wheeled robots have recently emerged as a viable architecture for an independent/stand-alone wheeled mobile robot. In this work, we explore the use of two biped-wheeled robots that can leverage their mobility and maneuverability for enhanced spatial pose control and stabilization in various payload transport tasks. However, coordinated control of multiple articulated wheeled robots for path tracking of a payload presents significant (and potentially competing) challenges, including kinematic redundancy, stability concerns, relative motion between the payload and robots, and precise motion control to achieve effective coordination. To address these challenges, we propose a Deep Reinforcement Learning (DRL) framework to develop motion plans for the system. In particular, this approach generates the ego robot's body twist and the follower robot's relative twist with respect to the ego robot. By formulating the action space of the follower robot as a relative twist, our approach facilitates pairwise interactions between robots. Furthermore, we use only relative pose information and the corresponding errors as states for the DRL controller, thereby making it agnostic to initial conditions and avoiding explicit dependency on absolute pose. We validate our approach through simulations conducted in Isaac Sim and on hardware using Diablo biped-wheeled robots with zero-shot transfer, demonstrating effective payload path tracking across varying parameters.
|
|
15:30-15:35, Paper ThDT10.4 | |
Reinforcement Learning within the Classical Robotics Stack: A Case Study in Robot Soccer |
|
Labiosa, Adam | University of Wisconsin-Madison |
Wang, Zhihan | The University of Texas at Austin |
Agarwal, Siddhant | The University of Texas at Austin |
Cong, William | University of Wisconsin-Madison |
Hemkumar, Geethika | The University of Texas at Austin |
Harish, Abhinav Narayan | University of Wisconsin Madison |
Hong, Benjamin | University of Wisconsin - Madison |
Kelle, Josh | University of Texas at Austin |
Li, Chen | UW-Madison |
Li, Yuhao | University of Wisconsin–Madison |
Shao, Zisen | University of Wisconsin–Madison |
Stone, Peter | University of Texas at Austin |
Hanna, Josiah | University of Wisconsin -- Madison |
Keywords: Reinforcement Learning, Machine Learning for Robot Control, Multi-Robot Systems
Abstract: Robot decision-making in partially observable, real-time, dynamic, and multi-agent environments remains a difficult and unsolved challenge. Model-free reinforcement learning (RL) is a promising approach to learning decision-making in such domains; however, end-to-end RL in complex environments is often intractable. To address this challenge in the RoboCup Standard Platform League (SPL) domain, we developed a novel architecture integrating RL within a classical robotics stack, while employing a multi-fidelity sim2real approach and decomposing behavior into learned sub-behaviors with heuristic selection. Our architecture led to victory in the 2024 RoboCup SPL Challenge Shield Division. In this work, we fully describe our system's architecture and empirically analyze key design decisions that contributed to its success. Our approach demonstrates how RL-based behaviors can be integrated into complete robot behavior architectures.
|
|
15:35-15:40, Paper ThDT10.5 | |
Residual Descent Differential Dynamic Game (RD3G) -- a Fast Newton Solver for Constrained General Sum Games |
|
Zhang, Zhiyuan | Georgia Institute of Technology |
Tsiotras, Panagiotis | Georgia Tech |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Integrated Planning and Control, Optimization and Optimal Control
Abstract: We present Residual Descent Differential Dynamic Game (RD3G), a Newton-based solver for constrained multi-agent game-control problems. The proposed solver seeks a local Nash equilibrium for games where agents are coupled through their rewards and state constraints. By maintaining a dynamic set of active constraints, combined with a barrier function on satisfied constraints and a backtracking line search, the proposed method is able to satisfy state constraints while keeping the dimension of the Newton descent direction problem to a minimum. We compare the proposed method against state-of-the-art techniques and showcase the computational benefits of the RD3G algorithm on several example problems. The RD3G is up to 4X faster and has 2X higher convergence rate than existing approaches in higher dimensional games.
|
|
15:40-15:45, Paper ThDT10.6 | |
MARLadona - towards Cooperative Team Play Using Multi-Agent Reinforcement Learning |
|
Li, Zichong | ANYbotics |
Bjelonic, Filip | ETH Zürich, Switzerland |
Klemm, Victor | ETH Zurich |
Hutter, Marco | ETH Zurich |
Keywords: Cooperating Robots, Multi-Robot Systems, Reinforcement Learning
Abstract: Robot soccer, in its full complexity, poses an unsolved research challenge. Current solutions heavily rely on engineered heuristic strategies, which lack robustness and adaptability. Deep reinforcement learning has gained significant traction in various complex robotics tasks such as locomotion, manipulation, and competitive games (e.g., AlphaZero, OpenAI Five), making it a promising solution to the robot soccer problem. This paper introduces MARLadona, a decentralized multi-agent reinforcement learning (MARL) training pipeline capable of producing agents with sophisticated team play behavior, bridging the shortcomings of heuristic methods. Furthermore, we created an open-source multi-agent soccer environment. Utilizing our MARL framework and a modified global entity encoder (GEE) as our core architecture, our approach achieves a 66.8% win rate against the HELIOS agent, which employs a state-of-the-art heuristic strategy. In addition, we provide an in-depth analysis of the policy behavior and interpret the agent’s intention using the critic network.
|
|
15:45-15:50, Paper ThDT10.7 | |
Multi-Agent Inverse Q-Learning from Demonstrations |
|
Haynam, Nathaniel | UC Berkeley |
Khoja, Adam | UC Berkeley |
Kumar, Dhruv | UC Berkeley |
Myers, Vivek | UC Berkeley |
Bıyık, Erdem | University of Southern California |
Keywords: Multi-Robot Systems, Imitation Learning
Abstract: When reward functions are hand-designed, deep reinforcement learning algorithms often suffer from reward misspecification, causing them to learn suboptimal policies. In the single-agent case, Inverse Reinforcement Learning (IRL) techniques attempt to address this issue by inferring the reward function from expert demonstrations. However, in multi-agent problems, misalignment between the learned and true objectives is exacerbated due to increased environment non-stationarity and variance that scale with multiple agents. As such, in multi-agent general-sum games, multi-agent IRL algorithms have difficulty balancing cooperative and competitive objectives. To address these issues, we propose Multi-Agent Marginal Q-Learning from Demonstrations (MAMQL), a novel sample-efficient framework for multi-agent IRL. For each agent, MAMQL learns a critic marginalized over the other agents' policies, allowing for a well-motivated use of Boltzmann policies in the multi-agent context. We identify a connection between optimal marginalized critics and single-agent soft-Q IRL, allowing us to apply a direct, simple optimization criterion from the single-agent domain. Across our experiments on three different simulated domains, MAMQL significantly outperforms previous multi-agent methods in average reward, sample efficiency, and reward recovery by often more than 2-5x. We make our code available at https://sites.google.com/view/mamql .
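A small sketch of the construction the abstract leans on: turning a per-agent marginalized critic Q_i(s, a_i) into a Boltzmann policy via a softmax. The Q-values and temperature below are placeholders.

import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    """Softmax over an agent's marginalized action values."""
    z = q_values / temperature
    z = z - np.max(z)            # for numerical stability
    p = np.exp(z)
    return p / p.sum()

print(boltzmann_policy(np.array([1.0, 2.0, 0.5]), temperature=0.5))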
|
|
ThDT11 |
314 |
Robot Vision 2 |
Regular Session |
Chair: Wang, Lin | Nanyang Technological University (NTU) |
Co-Chair: Le Gentil, Cedric | University of Toronto |
|
15:15-15:20, Paper ThDT11.1 | |
LoGS: Visual Localization for Mobile Robots with Gaussian Splatting |
|
Cheng, Yuzhou | University College London |
Jiao, Jianhao | University College London |
Wang, Yue | Zhejiang University |
Kanoulas, Dimitrios | University College London |
Keywords: Localization, Mapping, RGB-D Perception
Abstract: Visual localization involves estimating a query image’s 6-DoF (degrees of freedom) camera pose, which is a fundamental component in various computer vision and robotic tasks. This paper presents LoGS, a vision-based localization pipeline utilizing the 3D Gaussian Splatting (GS) technique as scene representation. This novel representation allows high-quality novel view synthesis. During the mapping phase, structure-from-motion (SfM) is applied first, followed by the generation of a GS map. During localization, the initial position is obtained through image retrieval, local feature matching coupled with a PnP solver, and then a high-precision pose is achieved through the analysis-by-synthesis manner on the GS map. Experimental results on four large-scale datasets demonstrate the proposed approach’s SoTA accuracy in estimating camera poses and robustness under challenging few-shot conditions.
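The coarse localization step described above (2D-3D matches fed to a RANSAC PnP solver) can be sketched as follows; the correspondences and camera intrinsics are placeholders, and the GS-based analysis-by-synthesis refinement is not shown.

import numpy as np
import cv2

# Placeholder 2D-3D correspondences; in a real pipeline these would come from
# image retrieval plus local feature matching against the reconstructed map.
points_3d = (np.random.rand(20, 3) * 5.0).astype(np.float32)
points_2d = (np.random.rand(20, 2) * 400.0).astype(np.float32)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

ok, rvec, tvec, inliers = cv2.solvePnPRansac(points_3d, points_2d, K, None)
if ok:
    R, _ = cv2.Rodrigues(rvec)   # rotation matrix of the coarse initial pose
    print(R, tvec)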
|
|
15:20-15:25, Paper ThDT11.2 | |
Unified Human Localization and Trajectory Prediction with Monocular Vision |
|
Luan, Po-Chien | EPFL |
Gao, Yang | EPFL |
Demonsant, Céline | EPFL |
Alahi, Alexandre | EPFL |
Keywords: Intelligent Transportation Systems, Localization, Computer Vision for Transportation
Abstract: Conventional human trajectory prediction models rely on clean curated data, requiring specialized equipment or manual labeling, which is often impractical for robotic applications. Existing predictors tend to overfit to clean observations, affecting their robustness when used with noisy inputs. In this work, we propose MonoTransmotion (MT), a Transformer-based framework that uses only a monocular camera to jointly solve localization and prediction tasks. Our framework has two main modules: Bird’s Eye View (BEV) localization and trajectory prediction. The BEV localization module estimates the position of a person using 2D human poses, enhanced by a novel directional loss for smoother sequential localizations. The trajectory prediction module predicts future motion from these estimates. We show that by jointly training both tasks with our unified framework, our method is more robust in real-world scenarios made of noisy inputs. We validate our MT network on both curated and non-curated datasets. On the curated dataset, MT achieves around 12% improvement over baseline models on BEV localization and trajectory prediction. On the real-world non-curated dataset, experimental results indicate that MT maintains similar performance levels, highlighting its robustness and generalization capability.
|
|
15:25-15:30, Paper ThDT11.3 | |
HGSLoc: 3DGS-Based Heuristic Camera Pose Refinement |
|
Niu, Zhongyan | National University of Defense Technology |
Tan, Zhen | National University of Defense Technology |
Zhang, Jinpu | National University of Defense Technology |
Yang, Xueliang | National University of Defense Technology |
Hu, Dewen | National University of Defense Technology |
Keywords: Localization, Visual Learning, Computer Vision for Automation
Abstract: Visual localization refers to the process of determining the camera position and orientation within a known scene representation. This task is often complicated by factors such as changes in illumination and variations in viewing angle. In this paper, we propose HGSLoc, a novel lightweight plug-and-play pose optimization framework, which integrates 3D reconstruction with a heuristic refinement strategy to achieve higher pose estimation accuracy. Specifically, we introduce an explicit geometric map for 3D representation and high-fidelity rendering, allowing the generation of high-quality synthesized views to support accurate visual localization. Our method demonstrates higher localization accuracy compared to NeRF-based neural rendering localization approaches. We introduce a heuristic refinement strategy whose efficient optimization quickly locates the target node, and we add a step-level optimization stage to enhance pose accuracy in scenarios with small errors. With carefully designed heuristic functions, the strategy enables rapid error reduction in rough localization estimates. Our method mitigates the dependence on complex neural network models while demonstrating improved robustness against noise and higher localization accuracy in challenging environments, as compared to neural network joint optimization strategies. The proposed optimization framework introduces novel approaches to visual localization by integrating the advantages of 3D reconstruction and the heuristic refinement strategy, and demonstrates strong performance across multiple benchmark datasets, including 7Scenes and the Deep Blending dataset. The implementation of our method has been released at https://github.com/anchang699/HGSLoc.
|
|
15:30-15:35, Paper ThDT11.4 | |
Depth Estimation Based on 3D Gaussian Splatting Siamese Defocus |
|
Zhang, Jinchang | University of Georgia |
Xu, Ningning | University of Georgia |
Zhang, Hao | University of Massachusetts Amherst |
Lu, Guoyu | University of Georgia |
Keywords: Range Sensing, Mapping, RGB-D Perception
Abstract: Depth estimation is a fundamental task in 3D geometry. While stereo depth estimation can be achieved through triangulation methods, it is not as straightforward for monocular methods, which require the integration of global and local information. The Depth from Defocus (DFD) method utilizes camera lens models and parameters to recover depth information from blurred images and has been proven to perform well. However, these methods rely on All-In-Focus (AIF) images for depth estimation, which is nearly impossible to obtain in real-world applications. To address this issue, we propose a self-supervised framework based on 3D Gaussian splatting and Siamese networks. By learning the blur levels at different focal distances of the same scene in the focal stack, the framework predicts the defocus map and Circle of Confusion (CoC) from a single defocused image, using the defocus map as input to DepthNet for monocular depth estimation. The 3D Gaussian splatting model renders defocused images using the predicted CoC, and the differences between these and the real defocused images provide additional supervision signals for the Siamese Defocus self-supervised network. This framework has been validated on both artificially synthesized and real blurred datasets. Subsequent quantitative and visualization experiments demonstrate that our proposed framework is highly effective as a DFD method.
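For reference, the thin-lens Circle of Confusion relation that Depth-from-Defocus methods build on can be written as a one-line function; this standard optics formula is assumed here and is not taken from the paper's code.

def circle_of_confusion(depth, focus_dist, focal_len, f_number):
    """Thin-lens CoC diameter on the sensor; all quantities in metres."""
    aperture = focal_len / f_number
    return aperture * abs(depth - focus_dist) / depth * focal_len / (focus_dist - focal_len)

# A point 3 m away seen by a 50 mm f/1.8 lens focused at 1.5 m:
print(circle_of_confusion(3.0, 1.5, 0.05, 1.8))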
|
|
15:35-15:40, Paper ThDT11.5 | |
GSFusion: Online RGB-D Mapping Where Gaussian Splatting Meets TSDF Fusion |
|
Wei, Jiaxin | Technical University of Munich |
Leutenegger, Stefan | Technical University of Munich |
Keywords: Mapping, RGB-D Perception
Abstract: Traditional volumetric fusion algorithms preserve the spatial structure of 3D scenes, which is beneficial for many tasks in computer vision and robotics. However, they often lack realism in terms of visualization. Emerging 3D Gaussian splatting bridges this gap, but existing Gaussian-based reconstruction methods often suffer from artifacts and inconsistencies with the underlying 3D structure, and struggle with real-time optimization, making them unable to provide users with immediate, high-quality feedback. One of the bottlenecks arises from the massive number of Gaussian parameters that need to be updated during optimization. Instead of using 3D Gaussians as a standalone map representation, we incorporate them into a volumetric mapping system to take advantage of geometric information and propose to use a quadtree data structure on images to drastically reduce the number of splats initialized. In this way, we simultaneously generate a compact 3D Gaussian map with fewer artifacts and a volumetric map on the fly. Our method, GSFusion, significantly enhances computational efficiency without sacrificing rendering quality, as demonstrated on both synthetic and real datasets. Code is available at https://github.com/goldoak/GSFusion.
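An illustrative sketch of quadtree-driven splat seeding: an image tile is subdivided only where it is detailed, so uniform regions receive a single Gaussian. The variance criterion and thresholds are assumptions, not GSFusion's exact rule.

import numpy as np

def seed_splats(img, x0, y0, size, var_thresh=50.0, min_size=4):
    """Return (cx, cy, tile_size) seeds: subdivide only where the tile is detailed."""
    patch = img[y0:y0 + size, x0:x0 + size]
    if size <= min_size or patch.var() < var_thresh:
        return [(x0 + size / 2.0, y0 + size / 2.0, size)]   # one splat for the tile
    half = size // 2
    seeds = []
    for dx, dy in [(0, 0), (half, 0), (0, half), (half, half)]:
        seeds += seed_splats(img, x0 + dx, y0 + dy, half, var_thresh, min_size)
    return seeds

img = np.zeros((64, 64))
img[20:40, 20:40] = 255.0                     # a single bright square
print(len(seed_splats(img, 0, 0, 64)))        # far fewer seeds than 64*64 pixels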
|
|
15:40-15:45, Paper ThDT11.6 | |
San Francisco World: Leveraging Structural Regularities of Slope for 3-DoF Visual Compass |
|
Ham, Jungil | Gwangju Institute of Science and Technology |
Kim, Minji | Gwangju Institute of Science and Technology |
Kang, Suyoung | University of Massachusetts Amherst |
Joo, Kyungdon | UNIST |
Li, Haoang | Hong Kong University of Science and Technology (Guangzhou) |
Kim, Pyojin | Gwangju Institute of Science and Technology (GIST) |
Keywords: Mapping, Vision-Based Navigation, RGB-D Perception
Abstract: We propose the San Francisco world (SFW) model, a novel structural model inspired by San Francisco's hilly terrain, enabling 3D inter-floor navigation in urban areas rather than being limited to 2D intra-floor navigation of various robotics platforms. Our SFW consists of a single vertical dominant direction (VDD), two horizontal dominant directions (HDDs), and four sloping dominant directions (SDDs) sharing a common inclination angle. Although SFW is a more general model than the Manhattan world (MW), it is a more compact model than the mixture of Manhattan world (MMW). Leveraging the structural regularities of SFW, such as uniform inclination angle and geometric patterns of the four SDDs, we design an efficient and robust DD/vanishing point estimation method by aggregating sloping line normals on the Gaussian sphere. We further utilize the structural patterns of SFW for the 3-DoF visual compass, the rotational motion tracking from a single line and plane, which corresponds to the theoretical minimal sampling for 3-DoF rotation estimation. Our method demonstrates enhanced adaptability in more challenging inter-floor scenes in urban areas and the highest rotational tracking accuracy compared to state-of-the-art methods. We release the first dataset of sequential RGB-D images captured in San Francisco world (SFW) and open source codes at: https://SanFranciscoWorld.github.io/.
|
|
15:45-15:50, Paper ThDT11.7 | |
Monocular 360 Depth Estimation Via Spherical Fully-Connected CRFs |
|
Cao, Zidong | HKUST |
Wang, Lin | Nanyang Technological University (NTU) |
Keywords: Omnidirectional Vision, Deep Learning for Visual Perception
Abstract: Monocular 360 depth estimation poses significant challenges due to the inherent distortion of the equirectangular projection (ERP). This distortion separates adjacent spherical points after their projection onto the ERP plane, especially in the polar regions, resulting in insufficient spherical relationships. To address this issue, recent methods calculate spherical neighbors within the tangent domain. However, since the tangent patch and the sphere share only one common point, spherical relationships are established only among neighbors around this common point. In this paper, we propose Spherical Fully-Connected CRFs (SF-CRFs). We start by evenly partitioning an ERP image into regular windows, where windows at the equator have broader spherical neighbors than those at the poles. To enhance spherical relationships, our SF-CRFs feature two key components. Firstly, to include sufficient spherical neighbors, we introduce a Spherical Window Transform (SWT) module. This module replicates the equator window’s spherical relationships across all other windows, leveraging the rotational invariance of the sphere. Remarkably, the transformation process is efficient, transforming all windows in a 512x1024 ERP image in just 0.038 seconds on a CPU. Secondly, we introduce a Planar-Spherical Interaction (PSI) module to calculate the SF-CRFs, which facilitates the relationships between regular and transformed windows. By integrating SF-CRFs blocks into a decoder, we propose CRF360D, a novel 360 depth estimation framework that achieves state-of-the-art performance across diverse datasets. Our CRF360D is compatible with different perspective image-trained backbones, serving as the encoder.
|
|
ThDT12 |
315 |
Motion Control 1 |
Regular Session |
Chair: Zhang, Cheng | Texas A&M University |
Co-Chair: Roncone, Alessandro | University of Colorado Boulder |
|
15:15-15:20, Paper ThDT12.1 | |
Bidirectional Energy Flow Modulation for Passive Admittance Control |
|
Lee, Donghyeon | Pohang University of Science and Technology(POSTECH) |
Ko, Dongwoo | POSTECH |
Kim, Min Jun | KAIST |
Chung, Wan Kyun | POSTECH |
Keywords: Compliance and Impedance Control, Force Control, Physical Human-Robot Interaction, Passivity-based Control
Abstract: Admittance control is a control scheme to enable physical interactions of a robot, but it easily induces instability when the robot contacts a rigid surface. In this study, passivity analysis was conducted on a robotic system with admittance control. The results showed that coupled stability with the environment can be ensured when the velocity error between the proxy and the real robot is eliminated. Thus, an adaptive structure modification method is proposed to suppress the possible source of instability. In addition, the energy tank method is combined with the proposed method to ensure system passivity. As a proof of concept, three robot experiments were performed, and the results of the proposed method were compared with those of the conventional admittance control and impedance control (with friction compensation). The comparison showed that the proposed method could make the system passive while it realized the desired dynamics during the interaction.
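A minimal discrete-time admittance law, for readers who want the basic scheme the paper starts from: the measured wrench drives a virtual mass-damper-spring proxy whose motion is sent to the robot's position controller. The parameters and explicit-Euler integration are illustrative assumptions; the paper's passivity-preserving structure modification and energy tank are not shown.

import numpy as np

class AdmittanceController:
    """Virtual mass-damper-spring proxy driven by the measured wrench; its motion
    becomes the position command. Parameters are illustrative."""
    def __init__(self, M=2.0, D=20.0, K=100.0, dt=0.001):
        self.M, self.D, self.K, self.dt = M, D, K, dt
        self.x = np.zeros(3)     # proxy deviation from the nominal pose
        self.v = np.zeros(3)

    def step(self, f_ext):
        a = (f_ext - self.D * self.v - self.K * self.x) / self.M
        self.v += a * self.dt
        self.x += self.v * self.dt
        return self.x            # commanded position offset

ctrl = AdmittanceController()
for _ in range(1000):            # 1 s of a constant 5 N push along x
    offset = ctrl.step(np.array([5.0, 0.0, 0.0]))
print(offset)                    # settles near f/K = 0.05 m along x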
|
|
15:20-15:25, Paper ThDT12.2 | |
A Minimum-Jerk Approach to Handle Singularities in Virtual Fixtures |
|
Braglia, Giovanni | University of Modena and Reggio Emilia |
Calinon, Sylvain | Idiap Research Institute |
Biagiotti, Luigi | University of Modena and Reggio Emilia |
Keywords: Human-Robot Collaboration, Motion and Path Planning, Optimization and Optimal Control
Abstract: Implementing virtual fixtures in guiding tasks constrains the movement of the robot's end effector to specific curves within its workspace. However, incorporating guiding frameworks may encounter discontinuities when optimizing the reference target position to the nearest point relative to the current robot position. This article aims to give a geometric interpretation of such discontinuities, with specific reference to the commonly adopted Gauss-Newton algorithm. The effect of such discontinuities, defined as Euclidean Distance Singularities, is experimentally proved. We then propose a solution based on a linear quadratic tracking problem with a minimum-jerk command, and compare and validate the performance of the proposed framework in two different human-robot interaction scenarios.
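A sketch of the nearest-point computation whose discontinuities the paper analyzes: Gauss-Newton iteration on the curve parameter s that minimizes ||c(s) - p||^2. The circular test curve and iteration count are illustrative assumptions.

import numpy as np

def curve(s):                    # unit circle as a simple test curve
    return np.array([np.cos(s), np.sin(s)])

def curve_d(s):                  # derivative dc/ds
    return np.array([-np.sin(s), np.cos(s)])

def project_gauss_newton(p, s0, iters=30):
    """Gauss-Newton on the curve parameter minimizing ||c(s) - p||^2."""
    s = s0
    for _ in range(iters):
        r = curve(s) - p         # residual
        J = curve_d(s)           # 2x1 Jacobian, handled as a vector
        s -= float(J @ r) / float(J @ J)
    return s

# Converges to s* = atan2(0.3, 1.2) for this query point:
print(project_gauss_newton(np.array([1.2, 0.3]), s0=0.0))
# For query points near the circle's centre (its medial axis), the closest-point
# map is discontinuous and iterations like this can jump between branches; this is
# the kind of singularity the paper formalizes.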
|
|
15:25-15:30, Paper ThDT12.3 | |
Continuous Wrist Control on the Hannes Prosthesis: A Vision-Based Shared Autonomy Framework |
|
Vasile, Federico | Istituto Italiano Di Tecnologia |
Maiettini, Elisa | Humanoid Sensing and Perception, Istituto Italiano Di Tecnologia |
Pasquale, Giulia | Istituto Italiano Di Tecnologia |
Boccardo, Nicolò | IIT - Istituto Italiano Di Tecnologia |
Natale, Lorenzo | Istituto Italiano Di Tecnologia |
Keywords: Sensor-based Control, Deep Learning for Visual Perception, Prosthetics and Exoskeletons
Abstract: Most control techniques for prosthetic grasping focus on dexterous fingers control, but overlook the wrist motion. This forces the user to perform compensatory movements with the elbow, shoulder and hip to adapt the wrist for grasping. We propose a computer vision-based system that leverages the collaboration between the user and an automatic system in a shared autonomy framework, to perform continuous control of the wrist degrees of freedom in a prosthetic arm, promoting a more natural approach-to-grasp motion. Our pipeline allows to seamlessly control the prosthetic wrist to follow the target object and finally orient it for grasping according to the user intent. We assess the effectiveness of each system component through quantitative analysis and finally deploy our method on the Hannes prosthetic arm. Code and videos: https://hsp-iit.github.io/hannes-wrist-control
|
|
15:30-15:35, Paper ThDT12.4 | |
Integrating Learning-Based Manipulation and Physics-Based Locomotion for Whole-Body Badminton Robot Control |
|
Wang, Haochen | Shandong University |
Zhiwei, Shi | Shandong University |
Zhu, Chengxi | Shandong University |
Qiao, Yafei | Shandong University |
Zhang, Cheng | Texas A&M University |
Yang, Fan | Deepcode Robotics Co., Ltd |
Ren, Pengjie | Shandong University |
Lu, Lan | Shanghai Jiao Tong University |
Xuan, Dong | Shandong University |
Keywords: Product Design, Development and Prototyping, AI-Enabled Robotics, Autonomous Agents
Abstract: Learning-based methods, such as imitation learning (IL) and reinforcement learning (RL), can produce excellent control policies for challenging agile robot tasks, such as sports robots. However, no existing work has harmonized learning-based policies with model-based methods to reduce training complexity and ensure safety and stability for agile badminton robot control. In this paper, we introduce Hamlet, a novel hybrid control system for agile badminton robots. Specifically, we propose a model-based strategy for chassis locomotion which provides a base for the arm policy. We introduce a physics-informed “IL+RL” training framework for the learning-based arm policy. In this training framework, a model-based strategy with privileged information is used to guide arm policy training during both the IL and RL phases. In addition, we train the critic model during the IL phase to alleviate the performance drop when transitioning from IL to RL. We present results on our self-engineered badminton robot, achieving a 94.5% success rate against the serving machine and a 90.7% success rate against human players. Our system can be easily generalized to other agile mobile manipulation tasks, e.g., agile catching and table tennis. A video demonstrating our system can be viewed at https://youtu.be/8-ixKAD18Mk.
|
|
15:35-15:40, Paper ThDT12.5 | |
Leveraging Symmetry to Accelerate Learning of Trajectory Tracking Controllers for Free-Flying Robotic Systems |
|
Welde, Jake | University of Pennsylvania |
Rao, Nishanth Arun | University of Pennsylvania |
Kunapuli, Pratik | University of Pennsylvania |
Jayaraman, Dinesh | University of Pennsylvania |
Kumar, Vijay | University of Pennsylvania |
Keywords: Dynamics, Reinforcement Learning, Aerial Systems: Mechanics and Control
Abstract: Tracking controllers enable robotic systems to accurately follow planned reference trajectories. In particular, reinforcement learning (RL) has shown promise in the synthesis of controllers for systems with complex dynamics and modest online compute budgets. However, the poor sample efficiency of RL and the challenges of reward design make training slow and sometimes unstable, especially for high-dimensional systems. In this work, we leverage the inherent Lie group symmetries of robotic systems with a floating base to mitigate these challenges when learning tracking controllers. We model a general tracking problem as a Markov decision process (MDP) that captures the evolution of both the physical and reference states. Next, we prove that symmetry in the underlying dynamics and running costs leads to an MDP homomorphism, a mapping that allows a policy trained on a lower-dimensional "quotient" MDP to be lifted to an optimal tracking controller for the original system. We compare this symmetry-informed approach to an unstructured baseline, using Proximal Policy Optimization (PPO) to learn tracking controllers for three systems: the Particle (a forced point mass), the Astrobee (a fully-actuated space robot), and the Quadrotor (an underactuated system). Results show that a symmetry-aware approach both accelerates training and reduces tracking error at convergence.
|
|
15:40-15:45, Paper ThDT12.6 | |
Quadratic Programming-Based Reference Spreading Control for Dual-Arm Robotic Manipulation with Planned Simultaneous Impacts |
|
van Steen, Jari J. | Eindhoven University of Technology |
van den Brandt, Gijs | Eindhoven University of Technology |
van de Wouw, Nathan | Eindhoven University of Technology |
Kober, Jens | TU Delft |
Saccon, Alessandro | Eindhoven University of Technology - TU/e |
Keywords: Impact-aware manipulation, Motion Control of Manipulators, Dual Arm Manipulation, Learning from Demonstration
Abstract: With the aim of further enabling the exploitation of intentional impacts in robotic manipulation, a control framework is presented that directly tackles the challenges posed by tracking control of robotic manipulators that are tasked to perform nominally simultaneous impacts. This framework is an extension of the reference spreading (RS) control framework, in which overlapping ante- and post-impact references that are consistent with impact dynamics are defined. In this work, such a reference is constructed starting from a teleoperation-based approach. By using the corresponding ante- and post-impact control modes in the scope of a quadratic programming control approach, peaking of the velocity error and control inputs due to impacts is avoided while maintaining high tracking performance. With the inclusion of a novel interim mode, we aim to also avoid input peaks and steps when uncertainty in the environment causes a series of unplanned single impacts to occur rather than the planned simultaneous impact. This work in particular presents for the first time an experimental evaluation of RS control on a robotic setup, showcasing its robustness against uncertainty in the environment compared to three baseline control approaches.
|
|
15:45-15:50, Paper ThDT12.7 | |
HARMONIOUS - Human-Like Reactive Motion Control and Multimodal Perception for Humanoid Robots |
|
Rozlivek, Jakub | Czech Technical University in Prague, Faculty of Electrical Engi |
Roncone, Alessandro | University of Colorado Boulder |
Pattacini, Ugo | Istituto Italiano Di Tecnologia |
Hoffmann, Matej | Czech Technical University in Prague, Faculty of Electrical Engi |
Keywords: Humanoid Robots, Physical Human-Robot Interaction, Collision Avoidance, Biologically-Inspired Robots
Abstract: For safe and effective operation of humanoid robots in human-populated environments, the problem of commanding a large number of Degrees of Freedom (DoF) while simultaneously considering dynamic obstacles and human proximity has still not been solved. We present a new reactive motion controller that commands two arms of a humanoid robot and three torso joints (17 DoF in total). We formulate a quadratic program that seeks joint velocity commands respecting multiple constraints while minimizing the magnitude of the velocities. We introduce a new unified treatment of obstacles that dynamically maps visual and proximity (pre-collision) and tactile (post-collision) obstacles as additional constraints to the motion controller, in a distributed fashion over the surface of the upper body of the iCub robot (with 2000 pressure-sensitive receptors). This results in a bio-inspired controller that: (i) gives rise to a robot with whole-body visuo-tactile awareness, resembling peripersonal space representations, and (ii) produces human-like minimum jerk movement profiles. The controller was extensively experimentally validated, including a physical human-robot interaction scenario.
|
|
ThDT13 |
316 |
Resiliency and Security 1 |
Regular Session |
Chair: Kaur, Upinder | Purdue University |
Co-Chair: Gu, Jason | Dalhousie University |
|
15:15-15:20, Paper ThDT13.1 | |
FedDet: Data Poisoning Attack Detection for Federated Skeleton-Based Action Recognition |
|
Kim, Min Hyuk | Chonnam National University |
Lee, Eungi | Chonnam National University |
Yoo, Seok Bong | Chonnam National University |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception, Recognition
Abstract: Skeleton-based action recognition (SAR) models often centralize skeleton data, raising significant privacy concerns. To address this, decentralized training models for SAR have been advanced, particularly using federated learning (FL), a research area of considerable value with wide-ranging applications, including human-robot interaction, camera-enabled devices, and security surveillance. However, FL-based SAR faces the challenge of substantial accuracy degradation due to data poisoning attacks; thus, it requires the identification of malicious clients. This paper introduces a novel approach for detecting data poisoning attacks in federated SAR, called FedDet. The method involves creating prototypes of perspective transforms and exchanging these matrices between the clients and the server to identify the malicious client. Additionally, a prototype-guided attack detector is developed, incorporating spatiotemporal matching to analyze the correlation between prototype skeleton data. Experimental results on FL frameworks and SAR models demonstrate that the proposed approach outperforms existing models. Our code is publicly available at https://github.com/alsgur0720/federated-detection.
|
|
15:20-15:25, Paper ThDT13.2 | |
ROS2WASM: Bringing the Robot Operating System to the Web |
|
Fischer, Tobias | Queensland University of Technology |
Paredes, Isabel | QuantStack, RWTH Aachen |
Batchelor, Michael | Queensland University of Technology |
Beier, Thorsten | QuantStack |
Haviland, Jesse | Queensland University of Technology |
Traversaro, Silvio | Istituto Italiano Di Tecnologia |
Vollprecht, Wolf Kristian | QuantStack |
Schmitz, Markus | RWTH Aachen University |
Milford, Michael J | Queensland University of Technology |
Keywords: Software Tools for Robot Programming, Software Tools for Benchmarking and Reproducibility, Engineering for Robotic Systems
Abstract: The Robot Operating System (ROS) has become the de facto standard middleware in robotics, widely adopted across domains ranging from education to industrial applications. The RoboStack distribution, a conda-based packaging system for ROS, has extended ROS's accessibility by facilitating installation across all major operating systems and architectures, integrating seamlessly with scientific tools such as PyTorch and Open3D. This paper presents ROS2WASM, a novel integration of RoboStack with WebAssembly, enabling the execution of ROS 2 and its associated software directly within web browsers, without requiring local installations. ROS2WASM significantly enhances the reproducibility and shareability of research, lowers barriers to robotics education, and leverages WebAssembly's robust security framework to protect against malicious code. We detail our methodology for cross-compiling ROS 2 packages into WebAssembly, the development of a specialized middleware for ROS 2 communication within browsers, and the implementation of www.ros2wasm.dev, a web platform enabling users to interact with ROS 2 environments. Additionally, we extend support to the Robotics Toolbox for Python and adapt its Swift simulator for browser compatibility. Our work paves the way for unprecedented accessibility in robotics, offering scalable, secure, and reproducible environments that have the potential to transform educational and research paradigms.
|
|
15:25-15:30, Paper ThDT13.3 | |
Prepared for the Worst: Resilience Analysis of the ICP Algorithm Via Learning-Based Worst-Case Adversarial Attacks |
|
Zhang, Ziyu | University of Toronto |
Laconte, Johann | French National Research Institute for Agriculture, Food and The |
Lisus, Daniil | University of Toronto |
Barfoot, Timothy | University of Toronto |
Keywords: Localization, Deep Learning Methods, Robot Safety
Abstract: This paper presents a novel method for assessing the resilience of the ICP algorithm via learning-based, worst-case attacks on lidar point clouds. For safety-critical applications such as autonomous navigation, ensuring the resilience of algorithms before deployments is crucial. The ICP algorithm is the standard for lidar-based localization, but its accuracy can be greatly affected by corrupted measurements from various sources, including occlusions, adverse weather, or mechanical sensor issues. Unfortunately, the complex and iterative nature of ICP makes assessing its resilience to corruption challenging. While there have been efforts to create challenging datasets and develop simulations to evaluate the resilience of ICP, our method focuses on finding the maximum possible ICP error that can arise from corrupted measurements at a location. We demonstrate that our perturbation-based adversarial attacks can be used pre-deployment to identify locations on a map where ICP is particularly vulnerable to corruptions in the measurements. With such information, autonomous robots can take safer paths when deployed, to mitigate against their measurements being corrupted. The proposed attack outperforms baselines more than 88% of the time across a wide range of scenarios.
|
|
15:30-15:35, Paper ThDT13.4 | |
SLAMSpoof: Practical LiDAR Spoofing Attacks on Localization Systems Guided by Scan Matching Vulnerability Analysis |
|
Nagata, Rokuto | Keio University |
Koide, Kenji | National Institute of Advanced Industrial Science and Technology |
Hayakawa, Yuki | Keio University |
Suzuki, Ryo | Keio University |
Ikeda, Kazuma | Keio University |
Sako, Ozora | Keio University |
Chen, Qi Alfred | University of California, Irvine |
Sato, Takami | University of California, Irvine |
Yoshioka, Kentaro | Keio University |
Keywords: Localization, SLAM, Intelligent Transportation Systems
Abstract: Accurate localization is essential for enabling modern full self-driving services. These services heavily rely on map-based traffic information to reduce uncertainties in recognizing lane shapes, traffic light locations, and traffic signs. Achieving this level of reliance on map information requires centimeter-level localization accuracy, which is currently only achievable with LiDAR sensors. However, LiDAR is known to be vulnerable to spoofing attacks that emit malicious lasers against LiDAR to overwrite its measurements. Once localization is compromised, the attack could lead the victim off roads or make them ignore traffic lights. Motivated by these serious safety implications, we design SLAMSpoof, the first practical LiDAR spoofing attack on localization systems for self-driving to assess the actual attack significance on autonomous vehicles. SLAMSpoof can effectively find the effective attack location based on our scan matching vulnerability score (SMVS), a point-wise metric representing the potential vulnerability to spoofing attacks. To evaluate the effectiveness of the attack, we conduct real-world experiments on ground vehicles and confirm its high capability in real-world scenarios, inducing position errors of ≥4.2 meters (more than typical lane width) for all 3 popular LiDAR-based localization algorithms. We finally discuss the potential countermeasures of this attack. Code is available at https://github.com/Keio-CSG/slamspoof
|
|
15:35-15:40, Paper ThDT13.5 | |
Gradient-Based Adversarial Attacks on Deep LiDAR Odometry |
|
Song, Zhenbo | Nanjing University of Science and Technology |
Chen, Xuanzhu | Nanjing University of Science and Technology |
Zhang, Zhenyuan | Nanjing University of Science and Technology |
Zhang, Kaihao | Harbin Institute of Technology, Shenzhen |
Lu, Jianfeng | Nanjing University of Science & Technology |
Li, Weiqing | Nanjing University of Sci.&Tech |
Keywords: Intelligent Transportation Systems, Robot Safety, Deep Learning Methods
Abstract: Adversarial attacks have been recently investigated in LiDAR perception problems for autonomous driving, where a small perturbation to the source inputs can result in incorrect predictions. However, most prior studies focus on attacks on single-frame perception modules, lacking explorations of attacks on consecutive-frame tasks, i.e. the LiDAR odometry. In this paper, we propose a gradient optimization-based adversarial attack towards deep LiDAR odometry networks. To generate point clouds consistent with real-world scenarios, we constrain adversarial points within the range of a small object, e.g. a traffic cone, and render new points to simulate real LiDAR measurements. By incorporating such adversarial points in consecutive frames, we demonstrate a significant decrease in pose estimation accuracy of current popular LiDAR odometry networks. In addition, we also evaluate traditional geometric odometry approaches and report their robustness over adversarial points. Extensive experiments on the KITTI and Waymo datasets illustrate the effectiveness of the proposed attack method and the vulnerability of deep LiDAR odometry methods against adversarial points.
|
|
15:40-15:45, Paper ThDT13.6 | |
Enhancing 3D Robotic Vision Robustness by Minimizing Adversarial Mutual Information through a Curriculum Training Approach |
|
Darabi, Nastaran | University of Illinois Chicago |
Jayasuriya, Dinithi | University of Illinois Chicago |
Naik, Devashri | University of Illinois Chicago |
Tulabandhula, Theja | University of Illinois Chicago |
Trivedi, Amit Ranjan | University of Illinois at Chicago (UIC), Chicago, USA |
Keywords: Deep Learning for Visual Perception, Deep Learning Methods, Robot Safety
Abstract: Adversarial attacks exploit vulnerabilities in a model's decision boundaries through small, carefully crafted perturbations that lead to significant mispredictions. In 3D vision, the high dimensionality and sparsity of data greatly expand the attack surface, making 3D vision particularly vulnerable for safety-critical robotics. To enhance 3D vision's adversarial robustness, we propose a training objective that simultaneously minimizes prediction loss and mutual information (MI) under adversarial perturbations to contain the upper bound of misprediction errors. This approach simplifies handling adversarial examples compared to conventional methods, which require explicit searching and training on adversarial samples. However, minimizing prediction loss conflicts with minimizing MI, leading to reduced robustness and catastrophic forgetting. To address this, we integrate curriculum advisors in the training setup that gradually introduce adversarial objectives to balance training and prevent models from being overwhelmed by difficult cases early in the process. The advisors also enhance robustness by encouraging training on diverse MI examples through entropy regularizers. We evaluated our method on ModelNet40 and KITTI using PointNet, DGCNN, SECOND, and PointTransformers, achieving 2--5% accuracy gains on ModelNet40 and a 5--10% mAP improvement in object detection. Our code is publicly available at https://github.com/nstrndrbi/Mine-N-Learn.
|
|
ThDT14 |
402 |
End-Effectors |
Regular Session |
Chair: Hughes, Josie | EPFL |
Co-Chair: Tadokoro, Satoshi | Tohoku University |
|
15:15-15:20, Paper ThDT14.1 | |
PaTS-Wheel: A Passively-Transformable Single-Part Wheel for Mobile Robot Navigation on Unstructured Terrain |
|
Godden, Thomas | Imperial College London |
Mulvey, Barry William | Imperial College London |
Redgrave, Ellen | Imperial College London |
Nanayakkara, Thrishantha | Imperial College London |
Keywords: Compliant Joints and Mechanisms, Underactuated Robots
Abstract: Most mobile robots use wheels that perform well on even and structured ground, like in factories and warehouses. However, they face challenges traversing unstructured terrain such as stepped obstacles. This paper presents the design and testing of the PaTS-Wheel: a Passively-Transformable Single-part Wheel that can transform to render hooks when presented with obstacles. The passive rendering of this useful morphological feature is guided purely by the geometry of the obstacle. The energy consumption and vibrational profile of the PaTS-Wheel on flat ground are comparable to those of a standard wheel of the same size. In addition, our novel wheel design (with a diameter of 120 mm) was tested traversing different terrains with stepped obstacles of incremental heights. The PaTS-Wheel achieved a 100% success rate at traversing stepped obstacles 83 mm high (~70% of its diameter), higher than the results obtained for an equivalent wheel at 30 mm (~25% of its diameter) and an equivalent wheg at 73 mm (~61% of its diameter). This achieves the design objectives of combining the energy efficiency and ride smoothness of wheels with the obstacle traversal capabilities of legged robots, all without requiring any sensors, actuators, or controllers.
|
|
15:20-15:25, Paper ThDT14.2 | |
Balloon Pin-Array Gripper: Two-Step Shape Adaptation Mechanism for Stable Grasping against Object Misalignment |
|
Kemmotsu, Yuto | Tohoku University |
Tadakuma, Kenjiro | Osaka University |
Abe, Kazuki | Osaka University |
Watanabe, Masahiro | Osaka University |
Tadokoro, Satoshi | Tohoku University |
Keywords: Compliant Joints and Mechanisms, Grasping, Soft Robot Materials and Design
Abstract: This study introduces a balloon pin-array gripper combining shape adaptability to various objects, stable holding by multipoint contact, and isotropic grasping performance. This is particularly useful when the shape or position of the objects cannot be accurately determined because of sensor limitations. This gripper has multiple pins whose tips are covered by flexible balloons. The gripper can adapt to the shapes of objects in two steps: axial sliding of the pins and radial inflation of the balloons. This study focuses on the effect of the layout of pins on grasping and proposes a simulation model to quantify the characteristics of each layout. Simulations showed that the concentric layout enables stable grasping by ensuring many pins contact the object, regardless of misalignment. Experiments using a prototype gripper demonstrated a trend consistent with the simulation results, proving the validity of the simulation model.
|
|
15:25-15:30, Paper ThDT14.3 | |
Adaptive Perching and Grasping by Aerial Robot with Light-Weight and High Grip-Force Tendon-Driven Three-Fingered Hand Using Single Actuator |
|
Iida, Hisaaki | The University of Tokyo |
Sugihara, Junichiro | The University of Tokyo |
Sugihara, Kazuki | The University of Tokyo |
Kozuka, Haruki | The University of Tokyo |
Li, Jinjie | The University of Tokyo |
Nagato, Keisuke | The University of Tokyo |
Zhao, Moju | The University of Tokyo |
Keywords: Aerial Systems: Applications, Multifingered Hands, Tendon/Wire Mechanism
Abstract: Aerial robots, especially multirotor types, have been utilized in various scenarios such as inspection, surveillance, and logistics. The most critical issue for multirotors is the limited flight time due to the large power consumption required for hovering against gravity. Inspired by nature, various research efforts have focused on perching and grasping by deploying a gripper on the multirotor to grasp arboreal environments for saving energy; however, most mechanical gripper designs restrict the approach path, which significantly limits the performance of perching and grasping. Besides, it is also challenging to design a light gripper that also offers sufficiently large grip force to hang itself. Therefore, in this work, we develop a single-actuator hand for an aerial robot that enables adaptive grasping of various objects, and thus can perch from various approach directions. First, we present the design of the lightweight three-fingered hand with a pair of special two-dimensional differential plates that enables adaptive grasping with a single actuator. In addition, we develop a unique control method for the over-actuated aerial robot equipped with this hand to perform both adaptive pendulum-like perching and detachment. Finally, we demonstrate the feasibility of the prototype hand via a load-bearing test and various object-grasping tests, along with in-flight perching experiments.
|
|
15:30-15:35, Paper ThDT14.4 | |
CAFEs: Cable-Driven Collaborative Floating End-Effectors for Agriculture Applications |
|
Cheng, Hung Hon | EPFL |
Hughes, Josie | EPFL |
Keywords: Robotics and Automation in Agriculture and Forestry, Tendon/Wire Mechanism, Actuation and Joint Mechanisms
Abstract: CAFEs (Collaborative Agricultural Floating End-effectors) is a new robot design and control approach to automating large-scale agricultural tasks. Based upon a cable-driven robot architecture in which modular robotic arms share the same roller-driven cable set, a fast-switching clamping mechanism allows each CAFE to clamp onto or release from the moving cables, enabling both independent and synchronized movement across the workspace. The methods developed to enable this system include the mechanical design, precise position control, and a dynamic model for the spring-mass-like system, ensuring accurate and stable movement of the robotic arms. The system's scalability is further explored by studying the tension and sag in the cables to maintain performance as more robotic arms are deployed. Experimental and simulation results demonstrate the system's effectiveness in tasks including pick-and-place, showing its potential to contribute to agricultural automation.
|
|
15:35-15:40, Paper ThDT14.5 | |
A Robotic Finger with a 4-Bar Linkage-Based Compact and Continuously Variable Active Transmission |
|
Chung, Sungho | Sogang University |
Sohn, Eugene | Sogang University |
Jeong, Seokhwan | Mechanical Eng., Sogang University |
Keywords: Mechanism Design, Actuation and Joint Mechanisms, Compliant Joints and Mechanisms
Abstract: This paper presents a practical design implementation of a 4-bar linkage-based compact and continuously variable active transmission (CCVAT) specifically tailored to the form factor of a robotic finger. The proposed CCVAT aims to solve the two major limitations of conventional linkage-based continuously variable transmissions: increased inertia and the complexities associated with miniaturization. To counter these limitations, our design incorporates a custom flexible shaft within the joint of the robotic finger, enhancing its adaptability and operational efficiency. In addition, we propose a cascaded control architecture, combining a disturbance-observer-based low-level controller and a mid-level controller responsible for managing both the transmission ratio and flexion angle of the system. Finally, the feasibility of the prototype was evaluated by conducting several experiments.
|
|
15:40-15:45, Paper ThDT14.6 | |
A Dexterous and Compliant (DexCo) Hand Based on Soft Hydraulic Actuation for Human Inspired Fine In-Hand Manipulation |
|
Zhou, Jianshu | University of California, Berkeley |
Junda, Huang | Chinese University of Hong Kong |
Dou, Qi | The Chinese University of Hong Kong |
Abbeel, Pieter | UC Berkeley |
Liu, Yunhui | Chinese University of Hong Kong |
Keywords: Dexterous Manipulation, Soft Robot Applications, Multifingered Hands, Grippers and Other End-Effectors
Abstract: Human beings possess a remarkable skill for fine in-hand manipulation, utilizing both intra-finger interactions (in-finger) and finger-environment interactions across a wide range of daily tasks. These tasks range from skilled activities like screwing light bulbs, picking and sorting pills, and in-hand rotation, to more complex tasks such as opening plastic bags, cluttered bin picking, and counting cards. Despite its prevalence in human activities, replicating these fine motor skills in robotics remains a substantial challenge. This study tackles the challenge of fine in-hand manipulation by introducing the dexterous and compliant (DexCo) hand system. The DexCo hand mimics human dexterity, replicating the intricate interaction between the thumb, index, and middle fingers, with a contractable palm. The key to maneuverable fine in-hand manipulation lies in its innovative soft hydraulic actuation, which strikes a balance between control complexity, dexterity, compliance, and motion accuracy within a compact structure, enhancing the overall performance of the system. The model of soft hydraulic actuation, based on hydrostatic force analysis, reveals the compliance of hand joints, whic
|
|
ThDT15 |
403 |
Robot Applications |
Regular Session |
Chair: Yim, Sehyuk | KIST |
Co-Chair: Cramariuc, Andrei | ETH Zurich |
|
15:15-15:20, Paper ThDT15.1 | |
A Minimally Designed Audio-Animatronic Robot |
|
Park, Kyu Min | Sejong University |
Cheon, Jeongah | Korea Institute of Science and Technology |
Yim, Sehyuk | KIST |
Keywords: Mechanism Design, Additive Manufacturing, Tendon/Wire Mechanism, Audio-Driven Motion Generation
Abstract: Animatronic robots that simulate lively and realistic motions of creatures can be excellent robotic platforms for social interaction with people. In particular, a robot head is a very important part for expressing various emotions and generating human-friendly and aesthetic impressions. This article presents Ray, a new type of audio-animatronic robot head. The entire mechanical structure of the robot is built in a single 3D-printing step and has multiple layers expressing the overall shape of a human head and important features such as eyes, nose, mouth, and chin. This simple, lightweight structure and the separate tendon-based actuation system underneath allow for smooth, fast motions of the robot. We also develop an audio-driven motion generation module that automatically synthesizes natural and rhythmic motions of the head and mouth. The developed robot platform is used for various applications, for example as a talking robot, robot singer, and robot MC. We expect this research to open up a new paradigm and application possibilities for minimally designed audio-animatronic robots.
|
|
15:20-15:25, Paper ThDT15.2 | |
High Speed Robotic Table Tennis Swinging Using Lightweight Hardware with Model Predictive Control |
|
Nguyen, David | Massachusetts Institute of Technology |
Cancio, Kendrick | Massachusetts Institute of Technology |
Kim, Sangbae | Massachusetts Institute of Technology |
Keywords: Hardware-Software Integration in Robotics, Optimization and Optimal Control, Humanoid Robot Systems
Abstract: We present a robotic table tennis platform that achieves a variety of hit styles and ball-spins with high precision, power, and consistency. This is enabled by a custom lightweight, high-torque, low rotor inertia, five degree-of-freedom arm capable of high acceleration. To generate swing trajectories, we formulate an optimal control problem (OCP) that constrains the state of the paddle at the time of the strike. The terminal position is given by a predicted ball trajectory, and the terminal orientation and velocity of the paddle are chosen to match various possible styles of hits: loops (topspin), drives (flat), and chops (backspin). Finally, we construct a fixed-horizon model predictive controller (MPC) around this OCP to allow the hardware to quickly react to changes in the predicted ball trajectory. We validate on hardware that the system is capable of hitting balls with an average exit velocity of 11 m/s at an 88% success rate across the three swing types.
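To make the terminal-constraint idea concrete, here is a minimal Python sketch (not the authors' OCP/MPC formulation): a quintic polynomial segment whose boundary conditions pin the paddle position and velocity at the strike time; all numbers are purely illustrative.

```python
# Minimal sketch (not the paper's OCP/MPC): a quintic polynomial trajectory whose
# boundary conditions fix the paddle state at the moment of the strike.
import numpy as np

def quintic_coeffs(x0, v0, a0, xf, vf, af, T):
    """Solve for quintic polynomial coefficients matching the boundary conditions."""
    A = np.array([
        [1, 0,   0,      0,       0,        0],
        [0, 1,   0,      0,       0,        0],
        [0, 0,   2,      0,       0,        0],
        [1, T,   T**2,   T**3,    T**4,     T**5],
        [0, 1, 2*T,    3*T**2,  4*T**3,   5*T**4],
        [0, 0,   2,    6*T,    12*T**2,  20*T**3],
    ])
    b = np.array([x0, v0, a0, xf, vf, af])
    return np.linalg.solve(A, b)

# Example: paddle x-position must reach the predicted ball position (0.40 m)
# with a chosen strike velocity (3.0 m/s, e.g. a flat "drive") at T = 0.25 s.
c = quintic_coeffs(x0=0.0, v0=0.0, a0=0.0, xf=0.40, vf=3.0, af=0.0, T=0.25)
t = np.linspace(0.0, 0.25, 6)
pos = sum(c[i] * t**i for i in range(6))
print(np.round(pos, 3))  # samples along the swing, ending at 0.40 m
```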
|
|
15:25-15:30, Paper ThDT15.3 | |
Learning Quiet Walking for a Small Home Robot |
|
Watanabe, Ryo | SONY Group |
Miki, Takahiro | ETH Zurich |
Shi, Fan | National University of Singapore |
Kadokawa, Yuki | Nara Institute of Science and Technology |
Bjelonic, Filip | ETH Zürich, Switzerland |
Kawaharazuka, Kento | The University of Tokyo |
Cramariuc, Andrei | ETHZ |
Hutter, Marco | ETH Zurich |
Keywords: Domestic Robotics, Legged Robots, Reinforcement Learning
Abstract: As home robotics gains traction, robots are increasingly integrated into households, offering companionship and assistance. Quadruped robots, particularly those resembling dogs, have emerged as popular alternatives to traditional pets. However, user feedback highlights concerns about the noise these robots generate while walking at home, particularly the loud footstep impact sound. To address this issue, we propose a reinforcement learning (RL) based approach to minimize the foot contact velocity, which is closely related to the footstep sound. Our framework incorporates three key elements: learning varying PD gains to actively dampen and stiffen each joint, utilizing foot contact sensors, and employing curriculum learning to gradually enforce penalties on foot contact velocity. Experiments demonstrate that our learned policy achieves superior quietness compared to an RL baseline and the carefully handcrafted Sony commercial controller baselines. Furthermore, the trade-off between robustness and quietness is shown. This research contributes to developing quieter and more user-friendly robotic companions for home environments.
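A hedged sketch of how a foot-contact-velocity penalty with a curriculum weight could look in code; the reward terms, scale factors, and ramp schedule below are illustrative assumptions, not the paper's actual reward.

```python
# Illustrative foot-contact-velocity penalty with a curriculum weight ramped over
# training steps; not the paper's reward function.
import numpy as np

def quietness_penalty(foot_vertical_vel, in_contact, curriculum_weight):
    """Penalize downward foot velocity at the moment of contact.

    foot_vertical_vel: (n_feet,) vertical velocities [m/s], negative = downward.
    in_contact: (n_feet,) booleans/ints from foot contact sensors.
    curriculum_weight: scalar in [0, 1], ramped up over training.
    """
    impact_speed = np.clip(-foot_vertical_vel, 0.0, None) * in_contact
    return -curriculum_weight * np.sum(impact_speed ** 2)

def total_reward(task_reward, foot_vertical_vel, in_contact, step, ramp_steps=2_000_000):
    w = min(1.0, step / ramp_steps)          # gradually enforce the penalty
    return task_reward + quietness_penalty(foot_vertical_vel, in_contact, w)

print(total_reward(1.0, np.array([-0.6, 0.1]), np.array([1, 0]), step=1_000_000))
```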
|
|
15:30-15:35, Paper ThDT15.4 | |
Evaluating Human-Robot Skill Gaps in Electrical Circuit Inspection: A New Electronic Task Board for Benchmarking Manipulation |
|
So, Peter | Technical University of Munich |
Swikir, Abdalla | Mohamed Bin Zayed University of Artificial Intelligence |
Abu-Dakka, Fares | New York University Abu Dhabi |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Performance Evaluation and Benchmarking, Industrial Robots, Software Tools for Benchmarking and Reproducibility
Abstract: Robot manipulation researchers reference human performance as a goal for their work; however, human data is seldom present in robotics benchmarks. We introduce a real-world benchmark targeting manipulation skills for performing electrical circuit inspection with a multimeter using an Internet-connected electronic task board. We present timing study results and an exemplary robot solution across six different tasks from the Robothon Grand Challenge at the automatica conference in 2023. Contributions from 16 robot teams were collected using task boards we manufactured and distributed as part of the 30-day international competition as an initial performance database. Our work systematically highlights the skill gap between the winning robot solution and the best human performance from a group of 30 subjects. Our goal is to chronicle progress over time in robot manipulation skills and provide a standardized, physical benchmark across the global community. Videos of the team submissions, the exemplary robot solution, as well as the project reproduction code are provided in the included repository.
|
|
15:35-15:40, Paper ThDT15.5 | |
RaccoonBot: An Autonomous Wire-Traversing Solar-Tracking Robot for Persistent Environmental Monitoring |
|
Mendez-Flores, Efrain | University of California, Irvine |
Pourshahidi, Agaton | University of California, Irvine |
Egerstedt, Magnus | University of California, Irvine |
Keywords: Hardware-Software Integration in Robotics, Environment Monitoring and Management, Energy and Environment-Aware Automation
Abstract: Environmental monitoring is used to characterize the health of organisms and their relationship with their environments. In forest ecosystems, robots can serve as platforms to acquire such data, even in hard-to-reach places, where wire-traversing platforms are particularly promising due to their efficient displacement. This paper presents the RaccoonBot, a novel autonomous wire-traversing robot for persistent environmental monitoring, featuring a fail-safe mechanical design with a self-locking mechanism in case of electrical shortage. The robot also features energy-aware mobility through a novel solar-tracking algorithm that allows the robot to find a position on the wire with direct exposure to sunlight, increasing the energy harvested. Experimental results validate the electro-mechanical features of the RaccoonBot, showing that it is able to handle wire perturbations and different inclinations, and to achieve energy autonomy.
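As a rough illustration of the energy-aware positioning idea, the sketch below samples candidate positions along the wire and moves to the one with the highest measured solar power; the measurement model, step size, and function names are hypothetical stand-ins, not the paper's algorithm.

```python
# Illustrative-only sketch of energy-aware positioning on a wire: sample candidate
# positions, read (here: simulate) harvested solar power, and move to the best one.
import numpy as np

def measured_solar_power(position_m):
    # Hypothetical stand-in for a photovoltaic power measurement; a canopy shadow
    # is modeled as a dip in available irradiance along the wire.
    shade = np.exp(-((position_m - 3.0) ** 2) / 0.5)
    return 5.0 * (1.0 - 0.8 * shade)  # watts

def best_position_on_wire(wire_length_m, step_m=0.25):
    candidates = np.arange(0.0, wire_length_m + 1e-9, step_m)
    powers = [measured_solar_power(p) for p in candidates]
    return candidates[int(np.argmax(powers))], max(powers)

pos, power = best_position_on_wire(wire_length_m=6.0)
print(f"move to {pos:.2f} m on the wire (~{power:.1f} W available)")
```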
|
|
15:40-15:45, Paper ThDT15.6 | |
Fast and Accurate Relative Motion Tracking for Dual Industrial Robots |
|
He, Honglu | Rensselaer Polytechnic Institute |
Lu, Chen-Lung | Rensselaer Polytechnic Institute |
Saunders, Glenn | Rensselaer Polytechnic Institute |
Wason, John | Wason Technology, LLC |
Yang, Pinghai | GE Research |
Schoonover, Jeffrey | GE Research |
Ajdelsztajn, Leo | GE |
Paternain, Santiago | Rensselaer Polytechnic Institute |
Julius, Agung | Rensselaer Polytechnic Institute |
Wen, John | Rensselaer Polytechnic Institute |
Keywords: Motion and Path Planning, Optimization and Optimal Control, Industrial Robots
Abstract: Industrial robotic applications such as spraying, welding, and additive manufacturing frequently require fast, accurate, and uniform motion along a 3D spatial curve. To increase process throughput, some manufacturers propose a dual-arm setup to overcome the speed limitation of a single robot. Industrial robot motion is programmed through waypoints connected by motion primitives (Cartesian linear and circular paths and linear joint paths at constant Cartesian speed). The actual robot motion is affected by the blending between these motion primitives and the pose of the robot (an outstretched/near-singularity pose tends to have larger path-tracking errors). Choosing the waypoints and the speed along each motion segment to achieve the performance requirement is challenging. At present, there is no automated solution, and laborious manual tuning by robot experts is needed to approach the desired performance. In this letter, we present a systematic three-step approach to designing and programming a dual-arm system to optimize system performance. The first step is to select the relative placement between the two robots based on the specified relative motion path. The second step is to select the relative waypoints and the motion primitives. The final step is to update the waypoints iteratively based on the actual measured relative motion. Waypoint iteration is first executed in simulation and then completed using the actual robots. For performance assessment, we use the mean path speed subject to the relative position and orientation constraints and the path speed uniformity constraint. We have demonstrated the effectiveness of this method on two systems, a physical testbed of two ABB robots and a simulation testbed of two FANUC robots, for two challenging test curves.
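The waypoint-iteration step described above can be illustrated with a small sketch: each commanded waypoint is shifted against the measured relative-path deviation until the error falls below a tolerance. The gain, tolerance, and toy measurement model are assumptions for illustration only, not the paper's update rule.

```python
# Minimal sketch of iterative waypoint correction from measured relative motion.
import numpy as np

def iterate_waypoints(waypoints, measure_error, gain=0.7, tol=1e-3, max_iters=20):
    """waypoints: (N, 3) commanded relative positions along the curve.
    measure_error: callable returning (N, 3) measured-minus-desired deviations."""
    wp = waypoints.copy()
    for _ in range(max_iters):
        err = measure_error(wp)
        if np.max(np.linalg.norm(err, axis=1)) < tol:
            break
        wp -= gain * err   # move commands opposite to the observed deviation
    return wp

# Toy stand-in: the "system" adds a constant bias, so iteration cancels it.
desired = np.linspace([0.0, 0.0, 0.0], [0.1, 0.0, 0.0], 5)
bias = np.array([0.002, -0.001, 0.0])
corrected = iterate_waypoints(desired, lambda wp: (wp + bias) - desired)
print(np.round(corrected - desired, 4))  # approximately -bias at every waypoint
```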
|
|
ThDT16 |
404 |
Soft Robotics 2 |
Regular Session |
Chair: Chin, Lillian | UT Austin |
Co-Chair: Han, Amy Kyungwon | Seoul National University |
|
15:15-15:20, Paper ThDT16.1 | |
Inflatable-Structure-Based Working-Channel Securing Mechanism for Soft Growing Robots |
|
Seo, Dongoh | Korea Advanced Institute of Science and Technology |
Kim, Nam Gyun | Korea Advanced Institute of Science and Technology |
Ryu, Jee-Hwan | Korea Advanced Institute of Science and Technology |
Keywords: Soft Robot Materials and Design, Soft Robot Applications, Soft Sensors and Actuators
Abstract: Soft growing robots are being used in various fields owing to their distinct advantages. However, their ability to manipulate tools in different applications is still challenging. In this paper, we propose an inflatable-structure-based working-channel securing mechanism for soft growing robots. The proposed mechanism provides a solution for securing a stable and accessible working channel with pressure equal to the atmospheric pressure, while maintaining the unique advantages of soft growing robots. The proposed soft growing robot can freely transfer materials and tools through its interior channel; therefore, it can adapt and replace equipment based on specific work requirements. This capability enhances the versatility and efficiency of the robot in various applications. Prototyping and experimental validation were conducted to show the performance and capabilities of the robot. The results of the experiments demonstrated that the soft growing robot effectively secured the working channel, enabling the transfer of materials and tools without interference from the inflation pressure. The accessibility of the secured channel was validated through slide-plate and pipe-pulling experiments. The demonstration of the growing mechanism confirmed the ability of the robot to secure a working channel during its growth, whereas the steering demonstration showcased its inherent steering function.
|
|
15:20-15:25, Paper ThDT16.2 | |
Tendon Locking for Antagonistic Configuration and Stiffness-Control in Soft Robots |
|
Licher, Johann | Leibniz University Hannover |
Peters, Jan | Leibniz Universität Hannover |
Raatz, Annika | Leibniz Universität Hannover |
Wurdemann, Helge Arne | University College London |
Keywords: Soft Sensors and Actuators, Soft Robot Applications, Soft Robot Materials and Design
Abstract: Some applications, such as surgical interventions, require that potential soft robots have the capability to alter their shape and enhance their force output on demand. This paper presents an antagonistic stiffening mechanism combining pneumatic actuation with tendon locking to achieve configuration- and stiffness control. Elongation of a soft pneumatic section, resulting from air actuation, is opposed by constraining the length of integrated tendons. These tendons can be locked in length by pneumatically activated levers at the base of each segment. Hence, tendon locking will not affect the configuration of other segments of a multi-segment manipulator. Our concept achieves a stiffness increase of up to 201.7% and a larger, more uniform radial workspace compared to the widely used pneumatic actuation concept while maintaining the low technical effort required for actuation. We also demonstrate how our actuation concept enables independent control of stiffness levels for individual segments of a multi-segment manipulator and their MR compatibility.
|
|
15:25-15:30, Paper ThDT16.3 | |
Large-Expansion Bi-Layer Auxetics Create Compliant Cellular Motion |
|
Chin, Lillian | UT Austin |
Xie, Gregory | MIT |
Lipton, Jeffrey | Northeastern University |
Rus, Daniela | MIT |
Keywords: Actuation and Joint Mechanisms, Swarm Robotics, Compliant Joints and Mechanisms
Abstract: There is significant interest in creating compliant modular robots that can change their volume. Inspired by how biological cells move, these systems can potentially combine the resilience of modular robotics with the increased environmental interactions of soft robotics. However, current versions have limited speed, expansion, and portability. In this paper, we address these concerns through AuxSwarm, a compliant system composed of auxetic-based robotic voxels. These voxels control their volume through a scissor-like bi-layer auxetic design, growing up to 1.57 times their original size in 0.2 seconds. This combination of speed and expansion is unique across modular soft robots, enabling dynamic locomotion capabilities. We characterize the voxels and demonstrate the versatility of this approach through case studies of 2D bending and 3D cube flipping. AuxSwarm provides a first step towards addressable voxel-based smart materials, while simultaneously addressing the robustness and actuation challenges faced by soft robots.
|
|
15:30-15:35, Paper ThDT16.4 | |
EViper-2D: A Thin Large-Area Soft Robotics Platform |
|
Cheng, Hsin | Princeton University |
Veilleux, Elias | Princeton University |
Zheng, Zhiwu | Princeton University |
Wagner, Sigurd | Princeton University |
Verma, Naveen | Princeton University |
Sturm, James | Princeton University |
Chen, Minjie | Princeton University |
Keywords: Modeling, Control, and Learning for Soft Robots
Abstract: This paper presents the key principles of eViper-2D -- a thin large-area soft robotics platform -- as a new development of the previous extendable Vibrating Intelligent Piezo-Electric Robot (eViper) platform. We first introduce the mechanical, electrical, and control framework of eViper-2D, and then develop systematic and scalable methods to study the impact of diverse actuation patterns on robotic motion dynamics and energy efficiency. By integrating power electronics, communication circuits, piezoelectric actuators, and batteries onboard, the eViper-2D platform enables rapid design iteration and quick evaluation of different control strategies for the multi-actuator soft robot. The platform supports data-driven modeling via automated data acquisition. We show that eViper-2D can provide rich insights into optimizing actuation patterns to achieve agile motion and minimal cost of transport (COT).
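For readers unfamiliar with the cost-of-transport metric mentioned above, a minimal sketch with illustrative numbers follows; the robot mass, energy, and distance values are assumptions, not measurements from the paper.

```python
# Minimal sketch of the cost-of-transport (COT) metric:
# COT = energy consumed / (weight * distance traveled).
def cost_of_transport(energy_j, mass_kg, distance_m, g=9.81):
    return energy_j / (mass_kg * g * distance_m)

# e.g. a 20 g thin robot using 1.2 J to travel 0.5 m (illustrative values)
print(f"COT = {cost_of_transport(1.2, 0.020, 0.5):.1f}")
```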
|
|
15:35-15:40, Paper ThDT16.5 | |
Bio-Inspired Soft Magnetic Swimming Robot for Flexible Motions |
|
Li, Xiaosa | Tsinghua University |
Lin, Zenan | Tsinghua University |
Ding, Wenbo | Tsinghua University |
Keywords: Soft Robot Materials and Design, Modeling, Control, and Learning for Soft Robots, Software-Hardware Integration for Robot Systems
Abstract: Bio-inspired soft robots have gained significant attention for their flexible design and adaptability to various environments, making them suitable for exploration and task execution in confined or hazardous areas. However, the deformation and motion of soft magnetic robots rely on both their structural design and magnetization, which complicates guided movement and balance maintenance in aquatic environments. In this work, inspired by the flat and symmetrical body of rays, we design a soft magnetic fish-shaped robot capable of flexible motions and trajectory swimming on the water surface. This robot features a muscle made of magnetic elastomer, which connects with the acrylic skeleton and silicone-film fins to form a soft body. Under an external magnetic field, the robot achieves hovering by flapping its fins, driven by the magnetically actuated deformation of its magnetic muscle. In addition, the robot's axial magnetization enables rapid steering guided by a horizontal field. In experiments, the soft magnetic robot was tasked with performing a looping figure-eight trajectory on the water surface, guided by the field gradient generated by a dense planar array of electromagnetic coils. While moving, the onboard circuit board collected the robot's inertial and temperature data and sent them to the host computer via Bluetooth in real time for motion monitoring. The received data demonstrated that our robot performed the specified afloat swimming trajectory, exhibiting good yaw-angle stability during continuous motion. The soft magnetic swimming robot shows integrated functionalities in untethered actuation, on-robot sensing, and wireless communication, indicating significant prospects for applications in inspection and cleaning within narrow pipelines and enclosed mechanical interior spaces.
|
|
15:40-15:45, Paper ThDT16.6 | |
Magnetic Programming of Soft Materials Using Digitally Processed Laser Heating |
|
Kocabas, Fatih | University of Wisconsin-Madison |
Oguztuzun, Ozan | University of Wisconsin-Madison |
Zhou, Youyi | University of Wisconsin-Madison |
Alapan, Yunus | University of Wisconsin-Madison |
Keywords: Soft Robot Materials and Design
Abstract: Spatial programming of magnetic soft materials holds immense potential for wide-ranging applications in soft robotics, minimally invasive medicine, and haptic interfaces. Despite tremendous and rapid progress in encoding spatially resolved magnetization directions over soft structures, the currently available approaches employ sequential encoding, resulting in slow and tedious processes with limited throughput. In this paper, we present a rapid and parallel magnetic programming strategy based on digitally processed laser heating. Heating above the Curie temperature of the magnetic microparticles embedded within the soft material allows their facile magnetization in desired directions via small external magnetic fields. To achieve parallel and rapid magnetic programming, we developed an integrated digital laser processing and magnetic field generation system, facilitating the generation of desired shapes and patterns at high resolution. The performance of the pattern generation and of the magnetic soft material is experimentally evaluated. Employing the described magnetic programming framework, shape-morphing of magnetic soft structures with varying magnetic profiles is demonstrated. The proposed approach establishes a rapid and facile encoding procedure with high-throughput magnetic programming potential.
|
|
15:45-15:50, Paper ThDT16.7 | |
Proprioceptive State Estimation for Amphibious Tactile Sensing |
|
Han, Xudong | Southern University of Science and Technology |
Guo, Ning | Southern University of Science and Technology |
Zhong, Shuqiao | Southern University of Science and Technology |
Zhou, Zhiyuan | Southern University of Science and Technology |
Lin, Jian | Southern University of Science and Technology |
Song, Chaoyang | Southern University of Science and Technology |
Wan, Fang | Southern University of Science and Technology |
Keywords: Modeling, Control, and Learning for Soft Robots, Computer Vision for Other Robotic Applications, Grasping, Proprioceptive State Estimation
Abstract: This paper presents a novel vision-based proprioception approach for a soft robotic finger that can estimate and reconstruct tactile interactions in terrestrial and aquatic environments. The key to this system lies in the finger's unique metamaterial structure, which facilitates omni-directional passive adaptation during grasping, protecting delicate objects across diverse scenarios. A compact in-finger camera captures high-framerate images of the finger's deformation during contact, extracting crucial tactile data in real time. We present a volumetric discretized model of the soft finger and use the geometry constraints captured by the camera to find the optimal estimation of the deformed shape. The approach is benchmarked using a motion capture system with sparse markers and a haptic device with dense measurements. Both results show state-of-the-art accuracies, with a median error of 1.96 mm for overall body deformation, corresponding to 2.1% of the finger's length. More importantly, the state estimation is robust in both on-land and underwater environments, as we demonstrate its usage for underwater object shape sensing. This combination of passive adaptation and real-time tactile sensing paves the way for amphibious robotic grasping applications. All codes are shared on GitHub: https://github.com/ancorasir/PropSE.
|
|
ThDT17 |
405 |
Planning with Contact |
Regular Session |
Chair: Lozano-Perez, Tomas | MIT |
Co-Chair: Stueckler, Joerg | University of Augsburg |
|
15:15-15:20, Paper ThDT17.1 | |
Fast Contact-Implicit Model Predictive Control |
|
Le Cleac'h, Simon | Stanford University |
Howell, Taylor | Stanford University |
Yang, Shuo | Carnegie Mellon University |
Lee, Chi-Yen | Carnegie Mellon University |
Zhang, John | Carnegie Mellon University |
Bishop, Arun | Carnegie Mellon University |
Schwager, Mac | Stanford University |
Manchester, Zachary | Carnegie Mellon University |
Keywords: Optimization and Optimal Control, Model Predictive Control, Legged Robots, Motion Control
Abstract: We present a general approach for controlling robotic systems that make and break contact with their environments. Contact-implicit model predictive control (CI-MPC) generalizes linear MPC to contact-rich settings by utilizing a bi-level planning formulation with lower-level contact dynamics formulated as time-varying linear complementarity problems (LCPs) computed using strategic Taylor approximations about a reference trajectory. These dynamics enable the upper-level planning problem to reason about contact timing and forces, and generate entirely new contact-mode sequences online. To achieve reliable and fast numerical convergence, we devise a structure-exploiting interior-point solver for these LCP contact dynamics and a custom trajectory optimizer for the tracking problem. We demonstrate real-time solution rates for CI-MPC and the ability to generate and track non-periodic behaviours in hardware experiments on a quadrupedal robot. We also show that the controller is robust to model mismatch and can respond to disturbances by discovering and exploiting new contact modes across a variety of robotic systems in simulation, including a pushbot, planar hopper, planar quadruped, and planar biped.
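To give a flavor of the LCP view of contact dynamics used above, here is a deliberately tiny Python sketch: one point mass and one ground contact, solved with projected Gauss-Seidel. It is not the paper's structure-exploiting interior-point solver, and the numbers are illustrative.

```python
# Minimal LCP contact example: find lam >= 0 with w = A lam + b >= 0 and
# lam * w = 0 per contact, solved by projected Gauss-Seidel.
import numpy as np

def solve_lcp_pgs(A, b, iters=50):
    lam = np.zeros_like(b)
    for _ in range(iters):
        for i in range(len(b)):
            r = A[i] @ lam + b[i]
            lam[i] = max(0.0, lam[i] - r / A[i, i])
    return lam

# One contact: mass m falling with velocity v under gravity for one time step h.
m, g, h, v = 1.0, 9.81, 0.01, -0.3
A = np.array([[1.0 / m]])          # effective inverse mass along the contact normal
b = np.array([v - h * g])          # predicted normal velocity without any impulse
lam = solve_lcp_pgs(A, b)          # non-negative contact impulse
v_next = v - h * g + lam[0] / m    # post-step normal velocity (complementary, >= 0)
print(lam, v_next)
```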
|
|
15:20-15:25, Paper ThDT17.2 | |
Robo-GS: A Physics Consistent Spatial-Temporal Model for Robotic Arm with Hybrid Representation |
|
Lou, Haozhe | University of Southern California |
Liu, Yurong | Beijing Institute of Technology |
Pan, Yike | University of Michigan |
Geng, Yiran | Peking University |
Chen, Jianteng | Hong Kong University of Science and Technology |
Ma, Wenlong | Beijing Institute of Technology |
Li, Chenglong | Beijing Institute of Technology |
Wang, Lin | Beijing Institute of Technology |
Feng, Hengzhen | Beijing Institute of Technology |
Shi, Lu | Tsinghua University |
Shi, Yongliang | Tsinghua University |
Keywords: Simulation and Animation, Methods and Tools for Robot System Design, Software Architecture for Robotics and Automation
Abstract: Real2Sim2Real plays a critical role in robotic arm control and reinforcement learning, yet bridging this gap remains a significant challenge due to the complex physical properties of robots and the objects they manipulate. Existing methods lack a comprehensive solution to accurately reconstruct real-world objects with spatial representations and their associated physics attributes. We propose a Real2Sim pipeline with a hybrid representation model that integrates mesh geometry, 3D Gaussian kernels, and physics attributes to enhance the digital asset representation of robotic arms. This hybrid representation is implemented through a Gaussian-Mesh-Pixel binding technique, which establishes an isomorphic mapping between mesh vertices and Gaussian models. This enables a fully differentiable rendering pipeline that can be optimized through numerical solvers, achieves high-fidelity rendering via Gaussian Splatting, and facilitates physically plausible simulation of the robotic arm's interaction with its environment using mesh-based methods. Given the digital assets, we propose a manipulable Real2Sim pipeline that standardizes coordinate systems and scales, ensuring the seamless integration of multiple components. In addition to reconstructing the robotic arm, the surrounding static background and objects can be holistically reconstructed, enabling seamless interactions between the robotic arm and its environment. We also provide datasets covering various robotic manipulation tasks and robotic arm mesh reconstructions. These datasets include real-world motion captured in digital assets, ensuring precise representation of mass and friction, which are crucial for robotic manipulation. Our model achieves state-of-the-art results in realistic rendering and mesh reconstruction quality for robotic applications.
|
|
15:25-15:30, Paper ThDT17.3 | |
One-Shot Manipulation Strategy Learning by Making Contact Analogies |
|
Liu, Yuyao | Tsinghua University |
Mao, Jiayuan | MIT |
Tenenbaum, Joshua | Massachusetts Institute of Technology |
Lozano-Perez, Tomas | MIT |
Kaelbling, Leslie | MIT |
Keywords: Integrated Planning and Learning, Deep Learning in Grasping and Manipulation, Learning from Demonstration
Abstract: We present a novel approach, MAGIC (manipulation analogies for generalizable intelligent contacts), for one-shot learning of manipulation strategies with fast and extensive generalization to novel objects. By leveraging a reference action trajectory, MAGIC effectively identifies similar contact points and sequences of actions on novel objects to replicate a demonstrated strategy, such as using different hooks to retrieve distant objects of different shapes and sizes. Our method is based on a two-stage contact-point matching process that combines global shape matching using pretrained neural features with local curvature analysis to ensure precise and physically plausible contact points. We experiment with three tasks including scooping, hanging, and hooking objects. MAGIC demonstrates superior performance over existing methods, achieving significant improvements in runtime speed and generalization to different object categories. Website: https://magic-2024.github.io/.
|
|
15:30-15:35, Paper ThDT17.4 | |
Incremental Few-Shot Adaptation for Non-Prehensile Object Manipulation Using Parallelizable Physics Simulators |
|
Baumeister, Fabian | Max Planck Institute for Intelligent Systems |
Mack, Lukas | University of Augsburg |
Stueckler, Joerg | University of Augsburg |
Keywords: Incremental Learning, Integrated Planning and Learning, Learning from Experience
Abstract: Few-shot adaptation is an important capability for intelligent robots that perform tasks in open-world settings such as everyday environments or flexible production. In this paper, we propose a novel approach for non-prehensile manipulation which incrementally adapts a physics-based dynamics model for model-predictive control (MPC). The model prediction is aligned with a few examples of robot-object interactions collected with the MPC. This is achieved by using a parallelizable rigid-body physics simulation as a dynamic world model and sampling-based optimization of the model parameters. In turn, the optimized dynamics model can be used for MPC using efficient sampling-based optimization. We evaluate our few-shot adaptation approach in object pushing experiments in simulation and with a real robot.
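A hedged sketch of the sampling-based parameter-alignment idea: a cross-entropy-style loop fits simulator parameters to a few recorded interactions. The toy push model, parameter names, and recorded data are hypothetical stand-ins for the paper's parallelized rigid-body simulation.

```python
# Cross-entropy-style fitting of simulator parameters to a few real interactions.
import numpy as np

def simulate_push(params, action):
    friction, mass = params
    return action / (mass * (1.0 + friction))   # toy "object displacement" model

def fit_params(real_actions, real_displacements, iters=20, pop=64, elite=8):
    mean, std = np.array([0.5, 1.0]), np.array([0.3, 0.5])
    for _ in range(iters):
        samples = np.abs(np.random.normal(mean, std, size=(pop, 2)))
        errs = [np.mean((simulate_push(s, real_actions) - real_displacements) ** 2)
                for s in samples]
        elites = samples[np.argsort(errs)[:elite]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean

# A few "real" examples generated from hidden parameters (friction=0.2, mass=2.0).
actions = np.array([0.5, 1.0, 1.5])
displacements = simulate_push(np.array([0.2, 2.0]), actions)
# Note: this toy model only constrains the product mass*(1+friction), so the fit
# returns parameters consistent with the data rather than the exact hidden pair.
print(np.round(fit_params(actions, displacements), 2))
```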
|
|
15:35-15:40, Paper ThDT17.5 | |
Efficient Gradient-Based Inference for Manipulation Planning in Contact Factor Graphs |
|
Lee, Jeongmin | Seoul National University |
Park, Sunkyung | Seoul National University |
Lee, Minji | Seoul National University |
Lee, Dongjun | Seoul National University |
Keywords: Manipulation Planning, Contact Modeling, Dexterous Manipulation
Abstract: This paper presents a framework designed to tackle a range of planning problems arising in manipulation, which typically involve complex geometric-physical reasoning related to contact and dynamic constraints. We introduce the Contact Factor Graph (CFG) to graphically model these diverse factors, enabling us to perform inference on the graphs to approximate the distribution and sample appropriate solutions. We propose a novel approach that can incorporate various phenomena of contact manipulation as differentiable factors, and we develop an efficient inference algorithm for CFG that leverages this differentiability along with the conditional probabilities arising from the structured nature of contact. Our results demonstrate the capability of our framework in generating viable samples and approximating posterior distributions for various manipulation scenarios.
|
|
15:40-15:45, Paper ThDT17.6 | |
Polyhedral Collision Detection Via Vertex Enumeration |
|
Cinar, Andrew | Vanderbilt University |
Zhao, Yue | Vanderbilt University |
Laine, Forrest | Vanderbilt University |
Keywords: Collision Avoidance, Constrained Motion Planning
Abstract: Collision detection is a critical functionality for robotics. The degree to which objects collide cannot be represented as a continuously differentiable function for any shapes other than spheres. This paper proposes a framework for handling collision detection between polyhedral shapes. We frame the signed distance between two polyhedral bodies as the optimal value of a convex optimization, and consider constraining the signed distance in a bilevel optimization problem. To avoid relying on specialized bilevel solvers, our method exploits the fact that the signed distance is the minimal point of a convex region related to the two bodies. Our method enumerates the values obtained at all extreme points of this region and lists them as constraints in the higher-level problem. We compare our formulation to existing methods in terms of accuracy and speed when solved using the same mixed complementarity problem solver. We demonstrate that our approach more reliably solves difficult collision detection problems with multiple obstacles than other methods, and is faster than existing methods in some cases.
|
|
15:45-15:50, Paper ThDT17.7 | |
Flying Calligrapher: Contact-Aware Motion and Force Planning and Control for Aerial Manipulation |
|
Guo, Xiaofeng | Carnegie Mellon Univeristy |
He, Guanqi | Carnegie Mellon University |
Xu, Jiahe | Carnegie Mellon University |
Mousaei, Mohammadreza | Carnegie Mellon University |
Geng, Junyi | Pennsylvania State University |
Scherer, Sebastian | Carnegie Mellon University |
Shi, Guanya | Carnegie Mellon University |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control, Integrated Planning and Control
Abstract: Aerial manipulation has gained interest for completing high-altitude tasks that are challenging for human workers, such as contact inspection and defect detection. Previous research has focused on maintaining static contact points or forces. This letter addresses a more general and dynamic task: simultaneously tracking time-varying contact force in the surface normal direction and motion trajectories on tangential surfaces. We propose a pipeline that includes a contact-aware trajectory planner to generate dynamically feasible trajectories, and a hybrid motion-force controller to track such trajectories. We demonstrate the approach in an aerial calligraphy task using a novel sponge pen design as the end-effector, whose stroke width is positively related to the contact force. Additionally, we develop a touchscreen interface for flexible user input. Experiments show our method can effectively draw diverse letters, achieving an IoU of 0.59 and an end-effector position (force) tracking RMSE of 2.9 cm (0.7 N). Website: https://xiaofeng-guo.github.io/flying-calligrapher/.
|
|
ThDT18 |
406 |
Imaging, Scanning, Localization |
Regular Session |
Chair: Jiang, Zhongliang | Technical University of Munich |
Co-Chair: Huang, Baoru | Imperial College London |
|
15:15-15:20, Paper ThDT18.1 | |
Autonomous Robotic Ultrasound Approach for Fetoscope Tracking by Fusing Optical and 2D Ultrasound Data |
|
Cai, Yuyu | KU Leuven |
Li, Ruixuan | KU Leuven |
Davoodi, Ayoob | Katholieke Universiteit Leuven(KU Leuven) |
Ourak, Mouloud | University of Leuven |
Deprest, Jan | University Hospital Leuven |
Vander Poorten, Emmanuel B | KU Leuven |
Keywords: Medical Robots and Systems, Sensor Fusion
Abstract: 2D ultrasound (US) guidance is an essential tool in fetoscopic laser photocoagulation (FLP) to treat twin-to-twin transfusion syndrome (TTTS). During the procedure, the sonographer and endoscopic surgeon manage different image modalities, each with its own field of view. Tacit collaboration is needed between them to visualize the right information and ensure the smooth operation of the procedure. Robotic approaches could simplify this interaction but would require robust localization tools to cope with the complex fetoscopic motion patterns. This study proposes a method for robotic ultrasound (rUS) fetoscope tracking, fusing an optical tracking system (OTS) and 2D US imaging. A Kalman filter is designed to guarantee robust online registration and enhance fetoscope tracking. Real-time detection of the fetoscope tip is achieved using the You Only Look Once (YOLO v7) algorithm. Additionally, a US image-based searching strategy is proposed for situations where the optical camera is obstructed. Hybrid position-force control is employed to manipulate the US probe safely against the pregnant abdomen. Validation on a silicone phantom demonstrates accurate tracking results, with a mean error below 2.59 mm and tip visibility exceeding 90% in most experiments. The proposed system could potentially reduce surgeon workload and training costs for FLP surgery.
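As a minimal illustration of fusing two position sources in a Kalman filter, the sketch below runs a constant-velocity filter with separate measurement updates for an optical tracker and an image-based tip detection; the dimensions, noise covariances, and measurement values are illustrative assumptions, not the paper's filter design.

```python
# Constant-velocity Kalman filter fusing two position measurement sources.
import numpy as np

dt = 0.05
F = np.block([[np.eye(3), dt * np.eye(3)], [np.zeros((3, 3)), np.eye(3)]])
H = np.hstack([np.eye(3), np.zeros((3, 3))])        # both sensors observe position
Q = 1e-4 * np.eye(6)
R_optical, R_us = 0.5e-3 * np.eye(3), 2.0e-3 * np.eye(3)

x, P = np.zeros(6), np.eye(6)

def kf_step(x, P, z, R):
    x = F @ x                                  # predict
    P = F @ P @ F.T + Q
    S = H @ P @ H.T + R                        # update with measurement z
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(6) - K @ H) @ P
    return x, P

x, P = kf_step(x, P, np.array([0.010, 0.002, 0.050]), R_optical)  # optical update
x, P = kf_step(x, P, np.array([0.012, 0.001, 0.049]), R_us)       # ultrasound update
print(np.round(x[:3], 4))
```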
|
|
15:20-15:25, Paper ThDT18.2 | |
Guiding the Last Centimeter: Novel Anatomy-Aware Probe Servoing for Standardized Imaging Plane Navigation in Robotic Lung Ultrasound (I) |
|
Ma, Xihan | Worcester Polytechnic Institute |
Zeng, Mingjie | Worcester Polytechnic Institute |
Hill, Jeffrey C. | MCPHS University |
Hoffmann, Beatrice | Beth Israel Deaconess Medical Center |
Zhang, Ziming | Worcester Polytechnic Institute |
Zhang, Haichong | Worcester Polytechnic Institute |
Keywords: Medical Robots and Systems, Visual Servoing, Object Detection, Segmentation and Categorization
Abstract: Navigating the ultrasound (US) probe to the standardized imaging plane (SIP) for image acquisition is a critical but operator-dependent task in conventional freehand diagnostic US. Robotic US systems (RUSS) offer the potential to enhance imaging consistency by leveraging real-time US image feedback to optimize the probe pose, thereby reducing reliance on operator expertise. However, determining the proper approach to extracting generalizable features from the US images for probe pose adjustment remains challenging. In this work, we propose a SIP navigation framework for RUSS, exemplified in the context of robotic lung ultrasound (LUS). This framework facilitates automatic probe adjustment when in proximity to the SIP. This is achieved by explicitly extracting multiple anatomical features presented in real-time LUS images and performing non-patient-specific template matching to generate probe motion towards the SIP using image-based visual servoing (IBVS). The framework is further integrated with the active-sensing end-effector (A-SEE), a customized robot end-effector that leverages patient external body geometry to maintain optimal probe alignment with the contact surface, thus preserving US signal quality throughout the navigation. The proposed approach ensures procedural interpretability and inter-patient adaptability. Validation is conducted through anatomy-mimicking phantom and in-vivo evaluations involving five human subjects.
|
|
15:25-15:30, Paper ThDT18.3 | |
Automatic Robotic-Assisted Diffuse Reflectance Spectroscopy Scanning System |
|
Deng, Kaizhong | Imperial College London |
Peters, Christopher | Imperial College London |
Mylonas, George | Imperial College London |
Elson, Daniel | Imperial College London |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics, Visual Servoing
Abstract: Diffuse Reflectance Spectroscopy (DRS) is a well-established optical technique for tissue composition assessment which has been validated for tumour detection to ensure the complete removal of cancerous tissue. While point-wise assessment has many potential applications, incorporating automated large-area scanning would enable holistic tissue sampling with higher consistency. We propose a robotic system to facilitate autonomous DRS scanning with hybrid visual servoing control. A specially designed height compensation module enables precise contact condition control. The evaluation results show that the system can accurately execute the scanning command and acquire consistent DRS spectra with comparable results to the manual collection, which is the current gold standard protocol. Integrating the proposed system into surgery lays the groundwork for autonomous intra-operative DRS tissue assessment with high reliability and repeatability. This could reduce the need for manual scanning by the surgeon while ensuring complete tumor removal in clinical practice.
|
|
15:30-15:35, Paper ThDT18.4 | |
Robust and Accurate Multi-View 2D/3D Image Registration with Differentiable X-Ray Rendering and Dual Cross-View Constraints |
|
Cui, Yuxin | Shandong University |
Min, Zhe | University College London |
Song, Rui | Shandong University |
Li, Yibin | Shandong University |
Meng, Max Q.-H. | The Chinese University of Hong Kong |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: Robust and accurate 2D/3D registration, which aligns the preoperative model and the intraoperative image of the same anatomy, plays an important role in enabling successful interventional navigation. To alleviate the challenge of the limited field of view associated with a single intraoperative image, more than one intraoperative image can be leveraged, and multi-view 2D/3D registration is thus needed. In this paper, we propose a novel multi-view 2D/3D rigid registration approach which consists of two stages. In the first stage, the network is trained with a combined loss function consisting of the differences between the predicted and ground-truth poses and the dissimilarities (e.g., normalized cross-correlation) between the simulated and observed intraoperative images. More importantly, additional cross-view training loss terms are formulated for both the pose and image losses to explicitly consider the cross-view constraints. In the second stage, test-time optimization is conducted to refine the poses estimated in the coarse stage. Our method leverages the mutual constraints of multi-frame view projection poses to enhance the robustness of the multi-view 2D/3D registration approach. The proposed framework achieves an mTRE of 0.79±2.17 mm on six datasets from DeepFluoro, further advancing beyond the state-of-the-art registration algorithms on this dataset.
|
|
15:35-15:40, Paper ThDT18.5 | |
Robust Robotic Breast Ultrasound Scanning and Real-Time Lesion Localization |
|
Cao, Zhiyan | Huazhong University of Science and Technology |
Wang, Yiwei | Huazhong University of Science and Technology |
Zhao, Huan | Huazhong University of Science and Technology |
Ding, Han | Huazhong University of Science and Technology |
Zhang, Shaohua | Huazhong University of Science and Technology |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: The inherent flexibility and real-time deformation of breast tissue pose significant challenges for achieving full coverage and accurate lesion localization in autonomous breast ultrasound scanning. This paper introduces a robust finite state machine-based framework that mimics the decision-making process of an experienced physician, dynamically transitioning between a global breast scan and a fine lesion scan. An autonomous radial and anti-radial global scan pattern ensures comprehensive breast coverage. To avoid lesion misidentification caused by soft tissue movement, a real-time lesion fine scan method is proposed for lesion detection and localization. Experimental results demonstrate that, in full-coverage tests, our system identifies all 7 of 7 existing lesions and maintains a robust localization accuracy of 3.23 mm across phantoms with varying stiffnesses.
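A bare-bones finite-state-machine skeleton in the spirit of the described framework is sketched below; the state names and transition triggers are assumptions, not the paper's exact logic.

```python
# Illustrative FSM: alternate between a global scan and a fine lesion scan,
# resuming the global scan once a lesion is localized.
from enum import Enum, auto

class ScanState(Enum):
    GLOBAL_SCAN = auto()
    FINE_SCAN = auto()
    DONE = auto()

def next_state(state, lesion_detected, lesion_localized, coverage_complete):
    if state is ScanState.GLOBAL_SCAN:
        if lesion_detected:
            return ScanState.FINE_SCAN
        if coverage_complete:
            return ScanState.DONE
    elif state is ScanState.FINE_SCAN and lesion_localized:
        return ScanState.GLOBAL_SCAN       # resume coverage after localization
    return state

s = ScanState.GLOBAL_SCAN
s = next_state(s, lesion_detected=True, lesion_localized=False, coverage_complete=False)
s = next_state(s, lesion_detected=False, lesion_localized=True, coverage_complete=False)
print(s)  # back to GLOBAL_SCAN
```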
|
|
15:40-15:45, Paper ThDT18.6 | |
Hybrid Deep Reinforcement Learning for Radio Tracer Localisation in Robotic-Assisted Radioguided Surgery |
|
Zhang, Hanyi | Imperial College London |
Deng, Kaizhong | Imperial College London |
Hu, Zhaoyang Jacopo | Imperial College London |
Huang, Baoru | Imperial College London |
Elson, Daniel | Imperial College London |
Keywords: Medical Robots and Systems, Surgical Robotics: Laparoscopy, Reinforcement Learning
Abstract: Radioguided surgery, such as sentinel lymph node biopsy, relies on the precise localization of radioactive targets by non-imaging gamma/beta detectors. Manual radioactive target detection based on visual display or audible indication of gamma level is highly dependent on the ability of the surgeon to track and interpret the spatial information. This paper presents a learning-based method to realize autonomous radiotracer detection in robot-assisted surgeries by navigating the probe to the radioactive target. We propose a novel hybrid approach that combines deep reinforcement learning (DRL) with adaptive robotic scanning. The adaptive grid-based scanning provides an initial direction estimate, while the DRL-based agent efficiently navigates to the target utilising historical data. Simulation experiments demonstrate a 95% success rate, and improved efficiency and robustness compared to conventional techniques. Real-world evaluation on the da Vinci Research Kit (dVRK) further confirms the feasibility of the approach, achieving an 80% success rate in radiotracer detection. This method has the potential to enhance consistency, reduce operator dependency, and improve procedural accuracy in radioguided surgeries.
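As a simplified stand-in for the described pipeline, the sketch below replaces the DRL agent with a plain greedy hill-climb on simulated gamma counts after a coarse grid scan; it is for illustration only and does not reproduce the paper's learned policy. The count model and source position are invented.

```python
# Coarse grid scan followed by greedy refinement on simulated gamma counts.
import numpy as np

rng = np.random.default_rng(0)
source = np.array([0.7, 0.3])              # hidden radiotracer position (toy)

def counts(p):                             # inverse-square-style rate + Poisson noise
    return rng.poisson(200.0 / (0.01 + np.sum((p - source) ** 2)))

grid = [np.array([x, y]) for x in np.linspace(0, 1, 4) for y in np.linspace(0, 1, 4)]
probe = max(grid, key=counts)              # coarse scan: start at the hottest cell

step = 0.08
for _ in range(40):                        # greedy refinement toward the source
    moves = [probe + step * d for d in
             (np.array([1, 0]), np.array([-1, 0]), np.array([0, 1]), np.array([0, -1]))]
    best = max(moves, key=counts)
    if counts(best) <= counts(probe):
        step *= 0.5                        # shrink steps near the peak
    else:
        probe = best
print(np.round(probe, 2), "target:", source)
```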
|
|
15:45-15:50, Paper ThDT18.7 | |
Improving Probe Localization for Freehand 3D Ultrasound Using Lightweight Cameras |
|
Huang, Dianye | Technical University of Munich |
Navab, Nassir | TU Munich |
Jiang, Zhongliang | Technical University of Munich |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: Ultrasound (US) probe localization relative to the examined subject is essential for freehand 3D US imaging, which offers significant clinical value due to its affordability and unrestricted field of view. However, existing methods often rely on expensive tracking systems or bulky probes, while recent US image-based deep learning methods suffer from accumulated errors during probe maneuvering. To address these challenges, this study proposes a versatile, cost-effective probe pose localization method for freehand 3D US imaging, utilizing two lightweight cameras. To eliminate accumulated errors during US scans, we introduce PoseNet, which directly predicts the probe's 6D pose relative to a preset world coordinate system based on camera observations. We first jointly train pose and camera image encoders based on pairs of 6D pose and camera observations densely sampled in simulation. This will encourage each pair of probe pose and its corresponding camera observation to share the same representation in latent space. To ensure the two encoders handle unseen images and poses effectively, we incorporate a triplet loss that enforces smaller differences in latent features between nearby poses compared to distant ones. Then, the pose decoder uses the latent representation of the camera images to predict the probe's 6D pose. To bridge the sim-to-real gap, in the real world, we use the trained image encoder and pose decoder for initial predictions, followed by an additional MLP layer to refine the estimated pose, improving accuracy. The results obtained from an arm phantom demonstrate the effectiveness of the proposed method, which notably surpasses state-of-the-art techniques, achieving average positional and rotational errors of 2.03 mm and 0.37 deg, respectively.
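The triplet constraint mentioned above can be written compactly; the sketch below uses random vectors as stand-ins for latent features of camera images taken from nearby and distant probe poses, and the margin value is an assumption.

```python
# Triplet margin constraint: features from nearby poses should be closer than
# features from distant poses by at least a margin.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(1)
z_anchor = rng.normal(size=32)                     # latent of image at pose p
z_near = z_anchor + 0.05 * rng.normal(size=32)     # image from a nearby pose
z_far = rng.normal(size=32)                        # image from a distant pose
print(round(triplet_loss(z_anchor, z_near, z_far), 3))  # ~0 when already satisfied
```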
|
|
ThDT19 |
407 |
Manufacturing and Assembly Processes |
Regular Session |
Chair: Fox, Dieter | University of Washington |
Co-Chair: Fang, Kuan | Cornell University |
|
15:15-15:20, Paper ThDT19.1 | |
Robot-Based Automatic Charging for Electric Vehicles Using Incremental Learning and Biomimetic Control |
|
Zeng, Chao | University of Liverpool |
Ye, Dexi | South China University of Technology |
Wang, Ning | Sheffield Hallam University |
Feng, Chen | Zhejiang VIE Science & Technology Co., Ltd |
Yang, Chenguang | University of Liverpool |
Keywords: Compliant Assembly, Incremental Learning, Compliance and Impedance Control
Abstract: With the growing popularity of electric vehicles, the demand for robot-based unmanned automatic charging has become both urgent and challenging. Two key challenges need to be addressed: how to efficiently locate the charging port, and how to compliantly insert the connector into the port. In this paper, we propose an incremental learning method based on the broad learning system to address the visual positioning error of the charging port. This method allows the robot to transfer and generalize the search skills learned in simulation to real-world scenarios. As a result, the robot can rapidly locate the charging port in real world environments without the need for complex contact state modeling, time-consuming data collection, or model retraining. Subsequently, a biomimetic admittance controller is designed to enable the robot to adapt its compliant behavior online during the plugging process. Finally, experiments are performed on a UR robot to verify the effectiveness of our method.
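For context, a standard discrete admittance-control update (M a + D v + K x = f_ext) is sketched below; the paper's biomimetic controller adapts its compliant behavior online, which this generic version does not model, and the gains are illustrative.

```python
# Generic discrete admittance control step for one Cartesian axis.
import numpy as np

def admittance_step(x, v, f_ext, M=2.0, D=40.0, K=200.0, dt=0.002):
    """Return updated deviation x and velocity v of the compliant frame."""
    a = (f_ext - D * v - K * x) / M
    v = v + a * dt
    x = x + v * dt
    return x, v

x, v = 0.0, 0.0
for _ in range(500):                 # 1 s of a constant 5 N contact force
    x, v = admittance_step(x, v, f_ext=5.0)
print(round(x, 4))                   # settles near f/K = 0.025 m
```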
|
|
15:20-15:25, Paper ThDT19.2 | |
CC-STAR: An Estimation for Contact State Transition Using Reconstruction-Based Anomaly Detection for Peg-In-Hole Assembly |
|
Lee, Haeseong | Graduate School of Convergence Science and Technology, Seoul Nat |
Sung, Eunho | Seoul National University |
You, Seungbin | Seoul National University |
Park, Jaeheung | Seoul National University |
Keywords: Assembly, Intelligent and Flexible Manufacturing, AI-Enabled Robotics
Abstract: For successful peg-in-hole assembly, predefined sub-tasks should be executed sequentially according to the current contact state. Therefore, recognizing contact state transitions is essential in order to determine whether to continue the current task or proceed to the next. In that context, learning-based solutions have shown outstanding results. However, these methods heavily rely on balanced datasets, which are challenging to obtain due to the short duration of certain contact states and rare failure cases. To address this issue, this paper proposes a framework for estimating contact state transitions using anomaly detection through input data reconstruction. The proposed framework operates in a semi-supervised manner, eliminating the need for balanced datasets during training. For input data reconstruction, a convolutional neural network is combined with a variational autoencoder to process various sensor measurements as a multivariate time series. Unlike traditional binary anomaly detection, the proposed anomaly detector scores reconstruction errors and leverages domain knowledge to identify various contact state transitions in the peg-in-hole assembly. The effectiveness of the proposed framework is validated through experiments using a torque-controlled dual manipulator system.
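As a hedged illustration of reconstruction-based anomaly scoring, the sketch below uses a PCA reconstruction (in place of the paper's CNN + variational autoencoder) fit on nominal signal windows, flagging samples whose reconstruction error exceeds a percentile threshold; the data and dimensions are synthetic stand-ins.

```python
# Reconstruction-error anomaly scoring with a PCA model fit on nominal data only.
import numpy as np

rng = np.random.default_rng(0)
nominal = rng.normal(0.0, 0.1, size=(500, 12))          # nominal F/T + pose windows

mean = nominal.mean(axis=0)
_, _, Vt = np.linalg.svd(nominal - mean, full_matrices=False)
basis = Vt[:4]                                           # keep 4 principal components

def anomaly_score(x):
    recon = mean + (x - mean) @ basis.T @ basis          # project and reconstruct
    return np.linalg.norm(x - recon)

threshold = np.percentile([anomaly_score(x) for x in nominal], 99)
new_sample = rng.normal(0.5, 0.3, size=12)               # e.g. a sudden contact change
print(anomaly_score(new_sample) > threshold)             # True -> flag a transition
```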
|
|
15:25-15:30, Paper ThDT19.3 | |
Blox-Net: Generative Design-For-Robot-Assembly Using VLM Supervision, Physics Simulation, and a Robot with Reset |
|
Goldberg, Andrew | University of California Berkeley |
Kondap, Kavish | University of California, Berkeley |
Qiu, Tianshuang | University of California, Berkeley |
Ma, Zehan | University of California, Berkeley |
Fu, Letian | UC Berkeley |
Kerr, Justin | University of California, Berkeley |
Huang, Huang | University of California at Berkeley |
Chen, Kaiyuan | University of California, Berkeley |
Fang, Kuan | Cornell University |
Goldberg, Ken | UC Berkeley |
Keywords: Assembly, Robotics and Automation in Construction, AI-Based Methods
Abstract: Generative AI systems have shown impressive capabilities in creating text, code, and images. Inspired by the importance of research in industrial Design for Assembly, we introduce a novel problem: Generative Design-for-Robot-Assembly (GDfRA). The task is to generate an assembly based on a natural language prompt (e.g., “giraffe”) and an image of available physical components, such as 3D-printed blocks. The output is an assembly, a spatial arrangement of these components, accompanied by instructions for a robot to build it. The output geometry must 1) resemble the requested object and 2) be reliably assembled by a 6 DoF robot arm with a suction gripper. We then present Blox-Net, a GDfRA system that combines generative vision language models with well-established methods in computer vision, simulation, perturbation analysis, motion planning, and physical robot experimentation to solve a class of GDfRA problems without human supervision. Blox-Net achieved a Top-1 accuracy of 63.5% in the semantic accuracy of its designed assemblies. Six designs, after Blox-Net’s automated perturbation redesign, were reliably assembled by a robot, achieving near-perfect success across 10 consecutive assembly iterations with human intervention only during reset prior to assembly. The entire pipeline from the textual word to reliable physical assembly is performed without human intervention.
|
|
15:30-15:35, Paper ThDT19.4 | |
Geometry and Force-Informed Robotic Assembly with Small Relative Initial Deviations for Circular Electrical Connectors |
|
Wang, Zhenyu | Huazhong University of Science and Technology |
Li, Xiangfei | Huazhong University of Science and Technology |
Zhao, Huan | Huazhong University of Science and Technology |
Shao, Lingjun | Huazhong University of Science and Technology |
Zhang, Hao | Huazhong University of Science and Technology |
Ding, Han | Huazhong University of Science and Technology |
Keywords: Assembly, Compliant Assembly
Abstract: Circular electrical connectors (CECs) have a wide range of applications in scenarios that require reliable connections. However, sockets are often located in narrow scenes with random spatial orientations, complex lighting conditions, and obstructions from cables, making it difficult to accurately locate them through cameras. In addition, due to the complex geometric structure of CECs and the presence of electrode protection slots, existing research on the assembly of cylindrical or polygonal pegs and holes may not be applicable to such components. To this end, this article proposes a novel robotic assembly strategy for CECs with small relative initial deviations, the core of which is a search trajectory and a heuristic force strategy designed to perceive force/pose (F/P) discontinuity characteristics under different geometric constraints. This assembly strategy is independent of the CEC's size and is not affected by the socket's spatial orientation. Experiments with two different sizes of CECs are conducted on a robot equipped with a 6-dimensional force/torque (F/T) sensor, demonstrating the effectiveness and robustness of the proposed assembly strategy for CECs.
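The F/P discontinuity cue can be illustrated with a simple sketch that flags large sample-to-sample jumps in the measured axial force along the search trajectory; the threshold value and the use of only the axial force component are assumptions, and the paper's full search and heuristic force strategy is not reproduced.

```python
import numpy as np

def force_discontinuities(force_z, jump_threshold=1.5):
    """force_z: 1D array of axial force samples (in newtons) logged along the
    search trajectory. Returns sample indices where the jump between
    consecutive readings exceeds the threshold, i.e. candidate changes of the
    geometric constraint (e.g. the connector reaching a protection slot)."""
    jumps = np.abs(np.diff(force_z))
    return np.where(jumps > jump_threshold)[0] + 1
```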
|
|
15:35-15:40, Paper ThDT19.5 | |
MatchMaker: Automated Asset Generation for Robotic Assembly |
|
Wang, Yian | Umass Amherst |
Tang, Bingjie | University of Southern California |
Gan, Chuang | IBM |
Fox, Dieter | University of Washington |
Mo, Kaichun | NVIDIA |
Narang, Yashraj | NVIDIA |
Akinola, Iretiayo | Columbia University |
Keywords: Assembly, AI-Enabled Robotics, Computer Vision for Manufacturing
Abstract: Robotic assembly remains a significant challenge due to complexities in visual perception, functional grasping, contact-rich manipulation, and performing high-precision tasks. Simulation-based learning and sim-to-real transfer have led to recent success in solving assembly tasks in the presence of object pose variation, perception noise, and control error; however, the development of a generalist (i.e., multi-task) agent for a broad range of assembly tasks has been limited by the need to manually curate assembly assets, which greatly constrains the number and diversity of assembly problems that can be used for policy learning. Inspired by the recent success of using Generative AI to scale up robot learning, we propose MatchMaker, a pipeline to automatically generate diverse, simulation-compatible assembly asset pairs to facilitate learning assembly skills. Specifically, MatchMaker can 1) take a simulation-incompatible, interpenetrating asset pair as input and automatically convert it into a simulation-compatible, interpenetration-free pair, 2) take an arbitrary single asset as input and generate a geometrically-mating asset to create an asset pair, and 3) automatically erode contact surfaces from (1) or (2) according to a user-specified clearance parameter to generate realistic parts.
|
|
15:40-15:45, Paper ThDT19.6 | |
CNSv2: Probabilistic Correspondence Encoded Neural Image Servo |
|
Chen, Anzhe | Zhejiang University |
Yu, Hongxiang | Zhejiang University |
Li, Shuxin | Zhejiang University |
Chen, Yuxi | Zhejiang University |
Zhou, Zhongxiang | Zhejiang University |
Sun, WenTao | Beijing Institute of Technology |
Xiong, Rong | Zhejiang University |
Wang, Yue | Zhejiang University |
Keywords: Assembly, Intelligent and Flexible Manufacturing, Visual Servoing
Abstract: Visual servo approaches based on traditional image matching methods often require accurate keypoint correspondence for high-precision control. However, keypoint detection or matching tends to fail in challenging scenarios with inconsistent illumination or textureless objects, resulting in significant performance degradation. Previous approaches, including our proposed Correspondence encoded Neural image Servo policy (CNS), attempted to alleviate these issues by integrating neural control strategies. While CNS shows some improvement over conventional image-based controllers in handling erroneous correspondences, it could not fully resolve the limitations arising from poor keypoint detection and matching. In this paper, we continue to address this problem and propose a new solution: Probabilistic Correspondence Encoded Neural Image Servo (CNSv2). CNSv2 leverages probabilistic feature matching to improve robustness in challenging scenarios. By redesigning the architecture to condition on multimodal feature matching, CNSv2 achieves high precision, improved robustness across diverse scenes, and real-time operation. We validate CNSv2 with simulations and real-world experiments, demonstrating its effectiveness in overcoming the limitations of detector-based methods in visual servo tasks.
|
|
15:45-15:50, Paper ThDT19.7 | |
Supervised Representation Learning towards Generalizable Assembly State Recognition |
|
Schoonbeek, Tim Jeroen | Eindhoven University of Technology |
Balachandran, Goutham | ASML |
Onvlee, Hans | ASML |
Houben, Tim | Eindhoven University of Technology |
Hung, Shao-Hsuan | Eindhoven University of Technology |
Kustra, Jacek | ASML |
de With, Peter H.N. | Eindhoven University of Technology |
van der Sommen, Fons | Eindhoven University of Technology |
Keywords: Representation Learning, Computer Vision for Manufacturing, Deep Learning Methods
Abstract: Assembly state recognition facilitates the execution of assembly procedures, offering feedback to enhance efficiency and minimize errors. However, recognizing assembly states poses challenges in scalability, since parts are frequently updated, and the robustness to execution errors remains underexplored. To address these challenges, this paper proposes an approach based on representation learning and the novel intermediate-state informed loss function modification (ISIL). ISIL leverages unlabeled transitions between states and demonstrates significant improvements in clustering and classification performance for all tested architectures and losses. Although the model is trained exclusively on images without execution errors, a thorough analysis of error states demonstrates that our approach accurately distinguishes between correct states and states with various types of execution errors. The integration of the proposed algorithm can offer meaningful assistance to workers and mitigate unexpected losses due to procedural mishaps in industrial settings. The code and data are publicly available.
|
|
ThDT20 |
408 |
Agricultural Automation 3 |
Regular Session |
Chair: Papageorgiou, Dimitrios | Hellenic Mediterranean University |
Co-Chair: Berenson, Dmitry | University of Michigan |
|
15:15-15:20, Paper ThDT20.1 | |
Panoptic Segmentation with Partial Annotations for Agricultural Robots |
|
Weyler, Jan | University of Bonn |
Läbe, Thomas | University of Bonn |
Behley, Jens | University of Bonn |
Stachniss, Cyrill | University of Bonn |
Keywords: Robotics and Automation in Agriculture and Forestry, Semantic Scene Understanding, Deep Learning for Visual Perception
Abstract: A detailed analysis of agricultural fields is key to reducing the use of agrochemicals and achieving a more sustainable crop production. To this end, agricultural robots equipped with vision-based systems offer the potential to detect individual plants in the field automatically. This capability enables targeted management actions in the field, effectively reducing the amount of agrochemicals. A primary target of such vision systems is to perform a panoptic segmentation, combining the tasks of semantic and instance segmentation. Recent methods use neural networks for this task, which typically have to be trained on densely annotated images containing the required ground truth information for each pixel. Gathering these dense annotations is generally laborious and requires expert knowledge of the agricultural domain. In this paper, we propose a method to effectively reduce the annotation bottleneck and yet achieve high performance using partial annotations. These partial annotations contain ground truth information only for a subset of pixels per image and are thus much faster to obtain than dense annotations. We propose a novel set of losses that exploit measures from vector fields used in physics, i.e., divergence and curl, to effectively supervise predictions without ground truth annotations. The experimental evaluation shows that our approach outperforms several state-of-the-art methods that also aim to reduce the amount of annotation required.
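The divergence and curl measures used for supervision can be computed with finite differences on a dense 2D vector-field prediction, as in the hedged sketch below; the forward-difference scheme and how the two quantities enter the loss are assumptions.

```python
import torch

def divergence_and_curl(field):
    """field: (B, 2, H, W) tensor; channel 0 is the x-component and channel 1
    the y-component of the predicted vector field. Forward differences give
    per-pixel divergence and (scalar, out-of-plane) curl on an (H-1, W-1) grid."""
    fx, fy = field[:, 0], field[:, 1]
    dfx_dx = fx[:, :, 1:] - fx[:, :, :-1]
    dfy_dy = fy[:, 1:, :] - fy[:, :-1, :]
    dfy_dx = fy[:, :, 1:] - fy[:, :, :-1]
    dfx_dy = fx[:, 1:, :] - fx[:, :-1, :]
    div = dfx_dx[:, :-1, :] + dfy_dy[:, :, :-1]
    curl = dfy_dx[:, :-1, :] - dfx_dy[:, :, :-1]
    return div, curl
```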
|
|
15:20-15:25, Paper ThDT20.2 | |
Robotic 3D Flower Pose Estimation for Small-Scale Urban Farms |
|
Muriki, Venkata Harsh Suhith | Georgia Institute of Technology |
Teo, Hong Ray | Georgia Institute of Technology |
Sengupta, Ved | Georgia Tech Research Institute |
Hu, Ai-Ping | Georgia Tech Research Institute |
Keywords: Robotics and Automation in Agriculture and Forestry, Computer Vision for Automation
Abstract: The small scale of urban farms and the commercial availability of low-cost robots (such as the FarmBot) that automate simple tending tasks enable an accessible platform for plant phenotyping. We have used a FarmBot with a custom camera end-effector to estimate strawberry plant flower pose (for robotic pollination) from acquired 3D point cloud models. We describe a novel algorithm that translates individual occupancy grids along orthogonal axes of a point cloud to obtain 2D images corresponding to the six viewpoints. For each image, 2D object detection models for flowers are used to identify 2D bounding boxes which can be converted into the 3D space to extract flower point clouds. Pose estimation is performed by fitting three shapes (superellipsoids, paraboloids and planes) to the flower point clouds and compared with manually labeled ground truth. Our method successfully finds approximately 80% of flowers scanned using our customized FarmBot platform and has a mean flower pose error of 7.7 degrees, which is sufficient for robotic pollination and rivals previous results. All code will be made available at https://github.com/harshmuriki/flowerPose.git.
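The six-viewpoint projection can be sketched as a voxelization followed by a first-hit depth image along each axis direction; the voxel size, grid resolution, and depth-image formulation are assumptions, and the paper's occupancy-grid translation procedure is not reproduced exactly.

```python
import numpy as np

def six_view_depth_images(points, voxel=0.002, grid=256):
    """points: (N, 3) array in metres. Voxelize the cloud, then for each of
    the six axis-aligned viewpoints store the depth (voxel index) of the
    first occupied cell; 2D flower detectors are then run on these images."""
    p = points - points.min(axis=0)
    idx = np.clip((p / voxel).astype(int), 0, grid - 1)
    occ = np.zeros((grid, grid, grid), dtype=bool)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = True

    views = {}
    for axis in range(3):
        for reverse in (False, True):
            vol = np.flip(occ, axis=axis) if reverse else occ
            hit = vol.any(axis=axis)
            depth = vol.argmax(axis=axis).astype(float)  # first occupied index
            depth[~hit] = np.nan                         # rays that hit nothing
            views[(axis, reverse)] = depth
    return views
```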
|
|
15:25-15:30, Paper ThDT20.3 | |
Fault Management System for the Safety of Perception Systems in Highly Automated Agricultural Machines |
|
Lee, Changjoo | Technical University of Munich |
Schätzle, Simon | STW (Sensor-Technik Wiedemann GmbH) |
Lang, Stefan Andreas | Sensor-Technik Wiedemann |
Maier, Michael | Technical University of Munich |
Oksanen, Timo | Technical University of Munich |
Keywords: Robotics and Automation in Agriculture and Forestry, Robot Safety, Deep Learning for Visual Perception
Abstract: Safe and reliable environmental perception is crucial for the highly automated or even autonomous operation of agricultural machines. However, developing such a system is challenging due to imperfect perception sensors. This article proposes a fault management system (FMS) for detecting, diagnosing, and mitigating risks that compromise the safety and reliability of perception systems. In particular, we develop an improved image quality safety model (IQSM) within the FMS to detect and diagnose the causes of performance insufficiencies in object detection. The IQSM achieves an accuracy of about 98%, demonstrating its ability to effectively identify performance insufficiencies under pre-defined hazardous scenarios.
|
|
15:30-15:35, Paper ThDT20.4 | |
Learning to Prune Branches in Modern Tree-Fruit Orchards |
|
Jain, Abhinav | Oregon State University |
Grimm, Cindy | Oregon State University |
Lee, Stefan | Oregon State University |
Keywords: Robotics and Automation in Agriculture and Forestry, Visual Servoing, Field Robots
Abstract: Dormant tree pruning is labor-intensive but essential to maintaining modern highly-productive fruit orchards. In this work we present a closed-loop visuomotor controller for robotic pruning. The controller guides the cutter through a cluttered tree environment to reach a specified cut point and ensures the cutters are perpendicular to the branch. We train the controller using a novel orchard simulation that captures the geometric distribution of branches in a target apple orchard configuration. Unlike traditional methods requiring full 3D reconstruction, our controller uses just optical flow images from a wrist-mounted camera. We deploy our learned policy in simulation and the real world for an example V-Trellis envy tree with zero-shot transfer, achieving a ~30% success rate -- approximately half the performance of an oracle planner.
|
|
15:35-15:40, Paper ThDT20.5 | |
Towards Safe and Efficient Through-The-Canopy Autonomous Fruit Counting with UAVs |
|
Yang, Teaya | UC Berkeley |
Ibrahimov, Roman | UC Berkeley |
Mueller, Mark Wilfried | University of California, Berkeley |
Keywords: Aerial Systems: Applications, Aerial Systems: Perception and Autonomy, Agricultural Automation
Abstract: We present an autonomous aerial system for safe and efficient through-the-canopy fruit counting. Aerial robot applications in large-scale orchards face significant challenges due to the complexity of fine-tuning flight paths based on orchard layouts, canopy density, and plant variability. Through-the-canopy navigation is crucial for minimizing occlusion by leaves and branches but is more challenging due to the complex and dense environment compared to traditional over-the-canopy flights. Our system addresses these challenges by integrating: i) a high-fidelity simulation framework for global path planning, ii) a low-cost autonomy stack for canopy-level navigation and data collection, and iii) a robust workflow for fruit detection and counting using RGB images. We validate our approach through fruit counting with canopy-level aerial images and by demonstrating the autonomous navigation capabilities of our experimental vehicle.
|
|
15:40-15:45, Paper ThDT20.6 | |
Language-Guided Object Search in Agricultural Environments |
|
Balaji, Advaith | University of Michigan |
Pradhan, Saket | University of Michigan |
Berenson, Dmitry | University of Michigan |
Keywords: Robotics and Automation in Agriculture and Forestry, Deep Learning Methods
Abstract: Creating robots that can assist in farms and gardens can help reduce the mental and physical workload experienced by farm workers. We tackle the problem of object search in a farm environment, providing a method that allows a robot to semantically reason about the location of an unseen target object among a set of previously seen objects in the environment using a Large Language Model (LLM). We leverage object-to-object semantic relationships to plan a path through the environment that will allow us to accurately and efficiently locate our target object while also reducing the overall distance traveled, without needing high-level room or area-level semantic relationships. During our evaluations, we found that our method outperformed a current state-of-the-art baseline and our ablations. Our offline testing yielded an average path efficiency of 84%, reflecting how closely the predicted path aligns with the ideal path. Upon deploying our system on the Boston Dynamics Spot robot in a real-world farm environment, we found that our system had a success rate of 80%, with a success weighted by path length of 0.67, which demonstrates a reasonable trade-off between task success and path efficiency under real-world conditions. The project website can be viewed at: adi-balaji.github.io/losae
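The "success weighted by path length" (SPL) figure reported above follows the standard definition SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is binary success, l_i the shortest-path length, and p_i the length of the path actually traveled. A minimal computation is sketched below, assuming this standard definition is the one used.

```python
def spl(successes, shortest_lengths, actual_lengths):
    """Success weighted by Path Length over N episodes."""
    terms = [s * l / max(p, l)
             for s, l, p in zip(successes, shortest_lengths, actual_lengths)]
    return sum(terms) / len(terms)
```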
|
|
15:45-15:50, Paper ThDT20.7 | |
Robotic Grape Inspection and Selective Harvesting in Vineyards |
|
Stavridis, Sotiris | Aristotle University of Thessaloniki |
Droukas, Leonidas | Aristotle University of Thessaloniki |
Doulgeri, Zoe | Aristotle University of Thessaloniki |
Papageorgiou, Dimitrios | Hellenic Mediterranean University |
Dimeas, Fotios | Aristotle University of Thessaloniki |
Soriano, Angel | Robotnik Automation SL |
Molina, Sergi | University of Lincoln |
Deiri, Ahmed Sami | SAGA Robotics |
Hutchinson, Michael | Saga Robotics |
Pulido Fentanes, Jaime | Saga Robotics |
Hroob, Ibrahim | University of Lincoln |
Polvara, Riccardo | University of Lincoln |
Hanheide, Marc | University of Lincoln |
Cielniak, Grzegorz | University of Lincoln |
Samarinas, Nikiforos | Laboratory of Remote Sensing, Spectroscopy, and GIS, School of A |
Kateris, Dimitrios | Centre for Research and Technology Hellas (CERTH) |
Bochtis, Dionysis | CERTH |
Peleka, Georgia | CERTH, Thessaloniki Greece |
Papadam, Stefanos | Certh / Iti |
Triantafyllou, Dimitra | CERTH |
Papadimitriou, Alexios | Certh / Iti |
Papadopoulos, Christos | ITI/CERTH |
Mariolis, Ioannis | CERTH |
Giakoumis, Dimitris | Centre for Research and Technology Hellas |
Tzovaras, Dimitrios | Centre for Research and Technology Hellas |
Keywords: Robotics and Automation in Agriculture and Forestry, Bimanual Manipulation, Computer Vision for Automation
Abstract: Driven by increasing food demand and the need for higher-quality cultivation, precision agriculture has grown steadily over the last decade. It involves the application of mobile robots and intelligent robotic technologies in various agricultural field tasks, across a variety of crop types. To compensate for the lack of selective robotic harvesting solutions for the high-value crop of grapes, the EU-funded project BACCHUS develops an intelligent mobile robotic system comprising two independent and cooperative robots: one for grape inspection and the collection of valuable data on maturity level, and one for the bimanual harvesting of grapes in a human-inspired manner. Validated via real-field trials, the proposed autonomous system advances precision agriculture for a particularly sensitive crop type in the challenging and heavily cluttered environment of vineyards, facilitating the selective harvesting of high-quality grapes.
|
|
ThDT21 |
410 |
Diffusion for Manipulation |
Regular Session |
Chair: Duong, Thai | Rice University |
Co-Chair: Pérez-D'Arpino, Claudia | NVIDIA |
|
15:15-15:20, Paper ThDT21.1 | |
ProDapt: Proprioceptive Adaptation Using Long-Term Memory Diffusion |
|
Pizarro Bejarano, Federico | University of Toronto |
Jones, Bryson | NASA Jet Propulsion Laboratory |
Pastor, Daniel | Caltech |
Bowkett, Joseph | NASA Jet Propulsion Laboratory |
Schoellig, Angela P. | TU Munich |
Backes, Paul | Jet Propulsion Laboratory |
Keywords: Machine Learning for Robot Control, Imitation Learning, Space Robotics and Automation
Abstract: Diffusion models have revolutionized imitation learning, allowing robots to replicate complex behaviours. However, diffusion often relies on cameras and other exteroceptive sensors to observe the environment and lacks long-term memory. In space, military, and underwater applications, robots must be highly robust to failures in exteroceptive sensors, operating using only proprioceptive information. In this paper, we propose ProDapt, a method of incorporating long-term memory of previous contacts between the robot and the environment in the diffusion process, allowing it to complete tasks using only proprioceptive data. This is achieved by identifying "keypoints", essential past observations maintained as inputs to the policy. We test our approach using a UR10e robotic arm in both simulation and real experiments and demonstrate the necessity of this long-term memory for task completion.
|
|
15:20-15:25, Paper ThDT21.2 | |
Latent Embedding Adaptation for Human Preference Alignment in Diffusion Planners |
|
Ng, Wen Zheng Terence | Nanyang Technological University |
Chen, Jianda | Nanyang Technological University |
Xu, Yuan | Nanyang Technological University |
Zhang, Tianwei | Nanyang Technological University |
Keywords: Deep Learning Methods, Reinforcement Learning, Representation Learning
Abstract: This work addresses the challenge of personalizing automated decision-making systems by introducing a resource-efficient approach that enables rapid adaptation to individual users' preferences. Our method leverages a pretrained conditional diffusion model with Preference Latent Embeddings (PLE), trained on a large, reward-free offline dataset. The PLE serves as a compact representation for capturing specific user preferences. By adapting the pretrained model using our proposed preference inversion method, which directly optimizes the learnable PLE, we achieve superior alignment with human preferences compared to existing solutions like Reinforcement Learning from Human Feedback (RLHF) and Low-Rank Adaptation (LoRA). To better reflect practical applications, we create a benchmark experiment using real human preferences on diverse, optimal trajectories.
|
|
15:25-15:30, Paper ThDT21.3 | |
Joint Localization and Planning Using Diffusion |
|
Lao Beyer, Lukas | Massachusetts Institute of Technology |
Karaman, Sertac | Massachusetts Institute of Technology |
Keywords: Deep Learning Methods, Localization, Autonomous Vehicle Navigation
Abstract: Diffusion models have been successfully applied to robotics problems such as manipulation and vehicle path planning. In this work, we explore their application to end-to-end navigation -- including both perception and planning -- by considering the problem of jointly performing global localization and path planning in known but arbitrary 2D environments. In particular, we introduce a diffusion model which produces collision-free paths in a global reference frame given an egocentric LIDAR scan, an arbitrary map, and a desired goal position. To this end, we implement diffusion in the space of paths in SE(2), and describe how to condition the denoising process on both obstacles and sensor observations. In our evaluation, we show that the proposed conditioning techniques enable generalization to realistic maps of considerably different appearance than the training environment, demonstrate our model's ability to accurately describe ambiguous solutions, and run extensive simulation experiments showcasing our model's use as a real-time, end-to-end localization and planning stack.
|
|
15:30-15:35, Paper ThDT21.4 | |
Diverse Motion Planning with Stein Diffusion Trajectory Inference |
|
Zeya, Yin | University of Sydney |
Lai, Tin | University of Sydney |
Barcelos, Lucas | University of Sydney |
Jacob, Jayadeep | University of Sydney |
Li, Yong Hui | University of Sydney |
Ramos, Fabio | University of Sydney, NVIDIA |
Keywords: Probabilistic Inference, Integrated Planning and Learning
Abstract: Acquiring prior knowledge of trajectory distributions in specific environments can significantly expedite the optimisation process in robot motion planning. Leveraging successful past plans and utilising trajectory generative models as priors offers a clear advantage. Previous studies have proposed various methods to harness these priors, such as using prior samples for initialisation or incorporating the prior distribution into trajectory optimisation through inference. Recently, diffusion models have demonstrated effectiveness in encoding multimodal data in high-dimensional settings. In this study, we propose a method that uses diffusion models as priors and employs Stein variational inference with Gaussian Process trajectories to integrate them into a batch inverse denoising process. This approach reduces the computation time required to approximate the posterior distribution of trajectories, particularly when adapting to new, unseen environments. Additionally, we incorporate path signatures into our method to enhance the diversity of the posterior distribution. To validate our approach, we conduct comparative assessments against multiple baseline methods across various scenarios, including 2D planar robots and robotic manipulators. Our experiments demonstrate that our method identifies the optimal solution with significantly reduced computational time.
|
|
15:35-15:40, Paper ThDT21.5 | |
The Ingredients for Robotic Diffusion Transformers |
|
Dasari, Sudeep | Carnegie Mellon University |
Mees, Oier | University of California, Berkeley |
Zhao, Sebastian | University of California, Berkeley |
Srirama, Mohan Kumar | Carnegie Mellon University |
Levine, Sergey | UC Berkeley |
Keywords: Machine Learning for Robot Control, Learning from Demonstration, Deep Learning in Grasping and Manipulation
Abstract: In recent years roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high capacity Transformer network architectures and generative diffusion models. Unfortunately, combining these two orthogonal improvements has proven surprisingly difficult, since there is no clear and well understood process for making important design choices. In this paper, we identify, study and improve key architectural design decisions for high-capacity diffusion transformer policies. The resulting models can efficiently solve diverse tasks on multiple robot embodiments, without the excruciating pain of per-setup hyper-parameter tuning. By combining the results of our investigation with our improved model components, we are able to present a novel architecture, named DiT-Block Policy, that significantly outperforms the state of the art in solving long-horizon (1500+ time-steps) dexterous tasks on a bi-manual ALOHA robot. In addition, we find that our policies show improved scaling performance when trained on 10 hours of highly multi-modal, language annotated ALOHA demonstration data. We hope this work will open the door for future robot learning techniques that leverage the efficiency of generative diffusion modeling with the scalability of large scale transformer architectures. Code, robot dataset, and videos are available at: https://dit-policy.github.io
|
|
15:40-15:45, Paper ThDT21.6 | |
Inference-Time Policy Steering through Human Interactions |
|
Wang, Yanwei | MIT |
Wang, Lirui | Massachusetts Institute of Technology |
Du, Yilun | MIT |
Sundaralingam, Balakumar | NVIDIA Corporation |
Yang, Xuning | NVIDIA |
Chao, Yu-Wei | NVIDIA |
Pérez-D'Arpino, Claudia | NVIDIA |
Fox, Dieter | University of Washington |
Shah, Julie A. | MIT |
Keywords: Imitation Learning, Human-Robot Collaboration, Deep Learning Methods
Abstract: Generative policies trained with human demonstrations can autonomously accomplish multimodal, long-horizon tasks. However, during inference, humans are often removed from the policy execution loop, limiting the ability to guide a pre-trained policy towards a specific sub-goal or trajectory shape among multiple predictions. Naive human intervention may inadvertently exacerbate distribution shift, leading to constraint violations or execution failures. To better align policy output with human intent without inducing out-of-distribution errors, we propose an Inference-Time Policy Steering (ITPS) framework that leverages human interactions to bias the generative sampling process, rather than fine-tuning the policy on interaction data. We evaluate ITPS across three simulated and real-world benchmarks, testing three forms of human interaction and associated alignment distance metrics. Among six sampling strategies, our proposed stochastic sampling with diffusion policy achieves the best trade-off between alignment and distribution shift. Videos are available at https://yanweiw.github.io/itps/.
|
|
15:45-15:50, Paper ThDT21.7 | |
Legibility Diffuser: Offline Imitation for Intent Expressive Motion |
|
Bronars, Matthew | Carnegie Mellon University |
Cheng, Shuo | Gatech |
Xu, Danfei | Georgia Institute of Technology |
Keywords: Imitation Learning, Human-Robot Collaboration, Deep Learning Methods
Abstract: In human-robot collaboration, legible motion that conveys a robot's intentions and goals is known to improve safety, task efficiency, and user experience. Legible robot motion is typically generated using hand-designed cost functions and classical motion planners. However, with the rise of deep learning and data-driven robot policies, we need methods for training end-to-end on offline demonstration data. In this paper, we propose Legibility Diffuser, a diffusion-based policy that learns intent expressive motion directly from human demonstrations. By variably combining the noise predictions from a goal-conditioned diffusion model, we guide the robot's motion toward the most legible trajectory in the training dataset. We find that decaying the guidance weight over the course of the trajectory is critical for maintaining a high success rate while maximizing legibility.
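The variable combination of noise predictions can be sketched as a classifier-free-guidance-style blend whose weight decays over the trajectory horizon; the linear decay schedule, the maximum weight, and the (B, T, D) tensor layout are assumptions rather than the authors' exact formulation.

```python
import torch

def legibility_guided_noise(eps_uncond, eps_goal, w_max=3.0):
    """eps_*: (B, T, D) noise predictions from a goal-conditioned diffusion
    policy with and without goal conditioning. The guidance weight decays
    linearly over the trajectory horizon T, so early waypoints are pushed
    toward the goal-revealing mode while later waypoints stay close to the
    unguided prediction to preserve task success."""
    B, T, D = eps_uncond.shape
    w = w_max * torch.linspace(1.0, 0.0, T, device=eps_uncond.device)
    return eps_uncond + w.view(1, T, 1) * (eps_goal - eps_uncond)
```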
|
|
ThDT22 |
411 |
Imitation Learning 3 |
Regular Session |
Chair: Kober, Jens | TU Delft |
Co-Chair: Bıyık, Erdem | University of Southern California |
|
15:15-15:20, Paper ThDT22.1 | |
Visually Robust Adversarial Imitation Learning from Videos with Contrastive Learning |
|
Giammarino, Vittorio | Boston University |
Queeney, James | Mitsubishi Electric Research Laboratories |
Paschalidis, Ioannis | Boston University |
Keywords: Imitation Learning, Reinforcement Learning, Visual Learning
Abstract: We propose C-LAIfO, a computationally efficient algorithm designed for imitation learning from videos in the presence of visual mismatch between agent and expert domains. We analyze the problem of imitation from expert videos with visual discrepancies, and introduce a solution for robust latent space estimation using contrastive learning and data augmentation. Provided a visually robust latent space, our algorithm performs imitation entirely within this space using off-policy adversarial imitation learning. We conduct a thorough ablation study to justify our design and test C-LAIfO on high-dimensional continuous robotic tasks. Additionally, we demonstrate how C-LAIfO can be combined with other reward signals to facilitate learning on a set of challenging hand manipulation tasks with sparse rewards. Our experiments show improved performance compared to baseline methods, highlighting the effectiveness of C-LAIfO. To ensure reproducibility, we open source our code.
|
|
15:20-15:25, Paper ThDT22.2 | |
One-Shot Imitation under Mismatched Execution |
|
Kedia, Kushal | Cornell University |
Dan, Prithwish | Cornell University |
Chao, Angela | Cornell University |
Pace, Maximus | Cornell University |
Choudhury, Sanjiban | Cornell University |
Keywords: Learning from Demonstration, Representation Learning, Transfer Learning
Abstract: Human demonstrations as prompts are a powerful way to program robots to do long-horizon manipulation tasks. However, translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement styles and physical capabilities. Existing methods for human-robot translation either depend on paired data, which is infeasible to scale, or rely heavily on frame-level visual similarities that often break down in practice. To address these challenges, we propose RHyME, a novel framework that automatically pairs human and robot trajectories using sequence-level optimal transport cost functions. Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent human videos by retrieving and composing short-horizon human clips. This approach facilitates effective policy training without the need for paired data. RHyME successfully imitates a range of cross-embodiment demonstrators, both in simulation and with a real human hand, achieving over 50% increase in task success compared to previous methods. We release our code and datasets at https://portal-cornell.github.io/rhyme/.
|
|
15:25-15:30, Paper ThDT22.3 | |
RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning |
|
Dai, Yinpei | University of Michigan |
Lee, Jayjun | University of Michigan |
Fazeli, Nima | University of Michigan |
Chai, Joyce | University of Michigan |
Keywords: Imitation Learning, Data Sets for Robot Learning, Deep Learning in Grasping and Manipulation
Abstract: Developing robust and correctable visuomotor policies for robotic manipulation is challenging due to the lack of self-recovery mechanisms from failures and the limitations of simple language instructions in guiding robot actions. To address these issues, we propose a scalable data generation pipeline that automatically augments expert demonstrations with failure recovery trajectories and fine-grained language annotations for training. We then introduce Rich languAge-guided failure reCovERy (RACER), a supervisor-actor framework, which combines failure recovery data with rich language descriptions to enhance robot control. RACER features a vision-language model (VLM) that acts as an online supervisor, providing detailed language guidance for error correction and task execution, and a language-conditioned visuomotor policy as an actor to predict the next actions. Our experimental results show that RACER outperforms the state-of-the-art Robotic View Transformer (RVT) on RLBench across various evaluation settings, including standard long-horizon tasks, dynamic goal-change tasks and zero-shot unseen tasks, achieving superior performance in both simulated and real-world environments. Videos and code are available at: https://rich-language-failure-recovery.github.io
|
|
15:30-15:35, Paper ThDT22.4 | |
Improving Vision-Language-Action Model with Online Reinforcement Learning |
|
Guo, Yanjiang | Tsinghua University |
Zhang, Jianke | Tsinghua University |
Chen, Xiaoyu | Tsinghua University |
Ji, Xiang | Tsinghua University |
Wang, Yen-Jen | University of California, Berkeley |
Hu, Yucheng | Tsinghua |
Chen, Jianyu | Tsinghua University |
Keywords: Imitation Learning, Continual Learning, Reinforcement Learning
Abstract: Recent studies have successfully integrated large vision-language models (VLMs) into low-level robotic control by supervised fine-tuning (SFT) with expert robotic datasets, resulting in what we term vision-language-action (VLA) models. Although the VLA models are powerful, how to improve these large models during interaction with environments remains an open question. In this paper, we explore how to further improve these VLA models via Reinforcement Learning (RL), a commonly used fine-tuning technique for large models. However, we find that directly applying online RL to large VLA models presents significant challenges, including training instability that severely impacts the performance of large models, and computing demands that exceed the capabilities of most local machines. To address these problems, we propose iRe-VLA framework, which iterates between Reinforcement Learning and supervised learning to effectively improve VLA models, leveraging the exploratory benefits of RL while maintaining the stability of supervised learning. Experiments in two simulated benchmarks and a real-world manipulation suite validate the effectiveness of our method.
|
|
15:35-15:40, Paper ThDT22.5 | |
MILE: Model-Based Intervention Learning |
|
Korkmaz, Yigit | University of Southern California |
Bıyık, Erdem | University of Southern California |
Keywords: Imitation Learning, AI-Based Methods, Human Factors and Human-in-the-Loop
Abstract: Imitation learning techniques have been shown to be highly effective in real-world control scenarios, such as robotics. However, these approaches not only suffer from compounding error issues but also require human experts to provide complete trajectories. Although there exist interactive methods where an expert oversees the robot and intervenes if needed, these extensions usually only utilize the data collected during intervention periods and ignore the feedback signal hidden in non-intervention timesteps. In this work, we create a model to formulate how the interventions occur in such cases, and show that it is possible to learn a policy with just a handful of expert interventions. Our key insight is that it is possible to get crucial information about the quality of the current state and the optimality of the chosen action from expert feedback, regardless of the presence or the absence of intervention. We evaluate our method on various discrete and continuous simulation environments, a real-world robotic manipulation task, as well as a human subject study. Videos and the code can be found at https://liralab.usc.edu/mile.
|
|
15:40-15:45, Paper ThDT22.6 | |
Validity Learning on Failures: Mitigating the Distribution Shift in Autonomous Vehicle Planning |
|
Arasteh, Fazel | Noah's Ark Lab, Huawei |
Elmahgiubi, Mohammed | Huawei Technologies Inc |
Khamidehi, Behzad | University of Toronto |
Mirkhani, Hamidreza | Huawei Technologies Canada |
Zhang, Weize | Huawei |
Cao, Tongtong | Noah's Ark Lab, Huawei Technologies |
Rezaee, Kasra | Huawei Technologies |
Keywords: Imitation Learning, Learning from Demonstration, Reinforcement Learning
Abstract: The planning problem constitutes a fundamental aspect of the autonomous driving framework. Recent strides in representation learning have empowered vehicles to comprehend their surrounding environments, thereby facilitating the integration of learning-based planning strategies. Among these approaches, Imitation Learning stands out due to its notable training efficiency. However, traditional Imitation Learning methodologies encounter challenges associated with the covariate shift phenomenon. We propose Validity Learning on Failures, VL(on failure), as a remedy to address this issue. The essence of our method lies in deploying a pre-trained planner across diverse scenarios. Instances where the planner deviates from its immediate objectives, such as maintaining a safe distance from obstacles or adhering to traffic rules, are flagged as failures. The states corresponding to these failures are compiled into a new dataset, termed the failure dataset. Notably, the absence of expert annotations for this data precludes the applicability of standard imitation learning approaches. To facilitate learning from the closed-loop mistakes, we introduce the VL objective which aims to discern valid trajectories within the current environmental context. Experimental evaluations conducted on both reactive CARLA simulation and non-reactive log-replay simulations reveal substantial enhancements in closed-loop metrics such as Score, Progress, and Success Rate, underscoring the effectiveness of the proposed methodology. Further evaluations against the Bench2Drive benchmark demonstrate that VL(on failure) outperforms the state-of-the-art methods by a large margin.
|
|
15:45-15:50, Paper ThDT22.7 | |
Iteratively Adding Latent Human Knowledge within Trajectory Optimization Specifications Improves Learning and Task Outcomes |
|
Chang, Christine T | University of Colorado Boulder |
Stull, Maria P | University of Colorado Boulder |
Crockett, Breanne | University of Colorado Boulder |
Jensen, Emily | University of Colorado Boulder |
Lohrmann, Clare | University of Colorado Boulder |
Hebert, Mitchell | Draper |
Hayes, Bradley | University of Colorado Boulder |
Keywords: Human Factors and Human-in-the-Loop, Human-Robot Teaming, Aerial Systems: Applications
Abstract: Frictionless and understandable tasking is essential for leveraging human-autonomy teaming in commercial, military, and public safety applications. Existing technology for facilitating human teaming with uncrewed aerial vehicles (UAVs), utilizing planners or trajectory optimizers that incorporate human input, introduces a usability and operator-capability gap because it does not explicitly support user upskilling by promoting system understanding or predictability. Supplementing annotated waypoints with natural language guidance affords an opportunity for both. In this work we investigate one-shot versus iterative input, introducing a testbed system based on government and industry UAV planning tools that accepts inputs in the form of both natural language text and drawn annotations on a terrain map. The testbed uses an LLM-based subsystem to map user inputs into additional terms for the trajectory optimization objective function. We demonstrate through a human subjects study that prompting a human teammate to iteratively add latent knowledge to a trajectory optimization aids the user in learning how the system functions, elicits more desirable robot behaviors, and ultimately achieves better task outcomes.
|
|
ThDT23 |
412 |
Autonomous Vehicle Perception 6 |
Regular Session |
Chair: Dam, Tanmoy | Emory University |
Co-Chair: Ding, Zhengming | Tulane University |
|
15:15-15:20, Paper ThDT23.1 | |
HybridOcc: NeRF Enhanced Transformer-Based Multi-Camera 3D Occupancy Prediction |
|
Zhao, Xiao | Fudan University |
Chen, Bo | FAW Group |
Sun, Mingyang | Fudan University |
Yang, Dingkang | Fudan University |
Wang, Youxing | Fudan University |
Zhang, Xukun | Fudan University |
Li, Mingcheng | Fudan University |
Kou, Dongliang | Fudan University |
Wei, Xiaoyi | Fudan University |
Zhang, Lihua | Fudan University |
Keywords: Computer Vision for Transportation, Deep Learning for Visual Perception, Recognition
Abstract: Vision-based 3D semantic scene completion (SSC) describes autonomous driving scenes through 3D volume representations. However, the occlusion of invisible voxels by scene surfaces poses challenges to current SSC methods in hallucinating refined 3D geometry. This paper proposes HybridOcc, a hybrid 3D volume query proposal method that combines a Transformer framework with a NeRF representation, refined in a coarse-to-fine SSC prediction framework. HybridOcc aggregates contextual features through the Transformer paradigm based on hybrid query proposals, while combining them with the NeRF representation to obtain depth supervision. The Transformer branch contains multiple scales and uses spatial cross-attention for 2D to 3D transformation. The newly designed NeRF branch implicitly infers scene occupancy through volume rendering, including visible and invisible voxels, and explicitly captures scene depth rather than generating RGB color. Furthermore, we present an innovative occupancy-aware ray sampling method that orients sampling toward the SSC task instead of focusing on the scene surface, further improving overall performance. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate the effectiveness of HybridOcc on the SSC task.
|
|
15:20-15:25, Paper ThDT23.2 | |
Temporal Consistency for RGB-Thermal Data-Based Semantic Scene Understanding |
|
Li, Haotian | The Hong Kong Polytechnic University |
Chu, Henry | The Hong Kong Polytechnic University |
Sun, Yuxiang | City University of Hong Kong |
Keywords: Automation Technologies for Smart Cities, Intelligent Transportation Systems
Abstract: Semantic scene understanding is a fundamental capability for autonomous vehicles. Under challenging lighting conditions, such as nighttime and on-coming headlights, semantic scene understanding performance using only RGB images is usually degraded. Thermal images can provide complementary information to RGB images, so many recent semantic segmentation networks have been proposed using RGB-Thermal (RGB-T) images. However, most existing networks focus only on improving segmentation accuracy for single image frames, neglecting the information consistency between consecutive frames. To address this issue, we propose a temporally consistent framework for RGB-T semantic segmentation, which introduces a virtual view image generation module to synthesize a virtual image for the next moment, and a consistency loss function to ensure segmentation consistency. We also propose an evaluation metric that measures both the accuracy and the consistency of semantic segmentation. Experimental results show that our framework outperforms state-of-the-art methods.
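The consistency loss can be sketched as a divergence between the segmentation predicted on the real next frame and the segmentation predicted on the virtual next-frame image synthesized from the current frame; the KL form and the direction of the comparison are assumptions.

```python
import torch.nn.functional as F

def temporal_consistency_loss(logits_next_real, logits_next_virtual):
    """Both inputs: (B, num_classes, H, W) segmentation logits. Penalizes
    disagreement between predictions on the real frame at t+1 and on the
    virtual view synthesized from frame t."""
    log_p = F.log_softmax(logits_next_real, dim=1)
    q = F.softmax(logits_next_virtual, dim=1)
    return F.kl_div(log_p, q, reduction="batchmean")
```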
|
|
15:25-15:30, Paper ThDT23.3 | |
SaViD: Spectravista Aesthetic Vision Integration for Robust and Discerning 3D Object Detection in Challenging Environments |
|
Dam, Tanmoy | Emory University |
Dharavath, Sanjay Bhargav | Indian Institute of Technology, Kharagpur, India |
Alam, Sameer | Saab-NTU Joint Lab, Nanyang Technological University, Singapore |
Lilith, Nimrod | Saab-NTU Joint Lab, Nanyang Technological University, Singapore |
Maiti, Aniruddha | ADP |
Chakraborty, Supriyo | Indian Institute of Technology, Kharagpur, India |
Feroskhan, Mir | Nanyang Technological University |
Keywords: Object Detection, Segmentation and Categorization, Autonomous Vehicle Navigation, Sensor Fusion
Abstract: The fusion of LiDAR and camera sensors has demonstrated significant effectiveness in achieving accurate detection for short-range tasks in autonomous driving. However, this fusion approach can face challenges in long-range detection scenarios due to the disparity between sparse LiDAR data and high-resolution camera data. Moreover, sensor corruption introduces complexities that affect the ability to maintain robustness, despite the growing adoption of sensor fusion in this domain. We present SaViD, a novel framework comprising a three-stage fusion alignment mechanism designed to address long-range detection challenges in the presence of natural corruption. The SaViD framework consists of three key elements: the Global Memory Attention Network (GMAN), which enhances the extraction of image features by offering a deeper understanding of global patterns; the Attentional Sparse Memory Network (ASMN), which enhances the integration of LiDAR and image features; and the KNNnectivity Graph Fusion (KGF), which enables the full fusion of spatial information. SaViD achieves superior performance on the long-range detection Argoverse-2 (AV2) dataset with a performance improvement of 9.87% in AP value and an improvement of 2.39% in mAPH for L2 difficulties on the Waymo Open dataset (WOD). Comprehensive experiments are carried out to showcase its robustness against 14 natural sensor corruptions. SaViD exhibits a robust performance improvement of 31.43% for AV2 and 16.13% for WOD in RCE value compared to other existing fusion-based methods while considering all the corruptions for both datasets. Our code is available at https://anonymous.4open.science/r/SAVID-2A0D/README.md.
|
|
15:30-15:35, Paper ThDT23.4 | |
CRAB: Camera-Radar Fusion for Reducing Depth Ambiguity in Backward Projection Based View Transformation |
|
Lee, In-Jae | Seoul National University |
Hwang, Sihwan | Korea Advanced Institute of Science and Technology |
Kim, Youngseok | Korea Advanced Institute of Science and Technology |
Kim, Wonjune | Korea Advanced Institute of Science and Technology |
Kim, Sanmin | Kookmin University |
Kum, Dongsuk | KAIST |
Keywords: Object Detection, Segmentation and Categorization, Computer Vision for Transportation, Deep Learning for Visual Perception
Abstract: Recently, camera-radar fusion-based 3D object detection methods in bird's eye view (BEV) have gained attention due to the complementary characteristics and cost-effectiveness of these sensors. Previous approaches using forward projection struggle with sparse BEV feature generation, while those employing backward projection overlook depth ambiguity, leading to false positives. In this paper, to address the aforementioned limitations, we propose a novel camera-radar fusion-based 3D object detection and segmentation model named CRAB (Camera-Radar fusion for reducing depth Ambiguity in Backward projection-based view transformation), using a backward projection that leverages radar to mitigate depth ambiguity. During the view transformation, CRAB aggregates perspective view image context features into BEV queries. It improves depth distinction among queries along the same ray by combining the dense but unreliable depth distribution from images with the sparse yet precise depth information from radar occupancy. We further introduce spatial cross-attention with a feature map containing radar context information to enhance the comprehension of the 3D scene. When evaluated on the nuScenes open dataset, our proposed approach achieves a state-of-the-art performance among backward projection-based camera-radar fusion methods with 62.4% NDS and 54.0% mAP in 3D object detection.
|
|
15:35-15:40, Paper ThDT23.5 | |
Efficient 3D Perception on Multi-Sweep Point Cloud with Gumbel Spatial Pruning |
|
Li, Jianhao | Beihang University |
Sun, Tianyu | Tsinghua University |
Zhang, Xueqian | Tsinghua University |
Wang, Zhongdao | Noah's Ark Laboratory |
Feng, Bailan | Noah's Ark Laboratory |
Xu, Ke | Beihang University |
Keywords: Object Detection, Segmentation and Categorization, Computer Vision for Transportation
Abstract: This paper studies point cloud perception within outdoor environments. Existing methods face limitations in recognizing objects located at a distance or occluded, due to the sparse nature of outdoor point clouds. In this work, we observe a significant mitigation of this problem by accumulating multiple temporally consecutive LiDAR sweeps, resulting in a remarkable improvement in perception accuracy. However, the computation cost also increases, hindering previous approaches from utilizing a large number of LiDAR sweeps. To tackle this challenge, we find that a considerable portion of points in the accumulated point cloud is redundant, and discarding these points has minimal impact on perception accuracy. We introduce a simple yet effective Gumbel Spatial Pruning (GSP) layer that dynamically prunes points based on a learned end-to-end sampling. The GSP layer is decoupled from other network components and thus can be seamlessly integrated into existing point cloud network architectures. Extensive experiments show that our pruning strategy improves several perception algorithms in multiple tasks.
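A per-point keep/drop gate trained with straight-through Gumbel-softmax is sketched below; the two-logit gating head, the temperature, and the placement in the backbone are assumptions rather than the paper's exact GSP layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelPointGate(nn.Module):
    """Learns, end to end, which points of the accumulated multi-sweep cloud
    to keep. At inference, points with a zero gate are simply discarded
    before the more expensive downstream stages."""
    def __init__(self, feat_dim, tau=1.0):
        super().__init__()
        self.score = nn.Linear(feat_dim, 2)   # logits for [drop, keep]
        self.tau = tau

    def forward(self, point_feats):           # point_feats: (N, feat_dim)
        logits = self.score(point_feats)
        gate = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        return gate[:, 1]                     # differentiable 0/1 keep mask
```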
|
|
15:40-15:45, Paper ThDT23.6 | |
RoBiFusion: A Robust and Bidirectional Interaction Camera-LiDAR 3D Object Detection Framework |
|
Wen, Xubin | Southeast University |
Xia, Haifeng | Southeast University |
Ding, Zhengming | Tulane University |
Xia, Siyu | Southeast University |
Keywords: Object Detection, Segmentation and Categorization, Intelligent Transportation Systems, Sensor Fusion
Abstract: Camera-LiDAR 3D object detection is currently becoming a crucial component in the field of autonomous driving perception. However, previous models perform camera-LiDAR feature fusion only at the deep BEV level. This approach lacks interaction between the shallow-level sensor features, which would be beneficial for constructing the corresponding BEV features. At the same time, naive shallow-level feature interaction can introduce sensor noise caused by intrinsic and extrinsic camera calibration errors. To address this, we propose RoBiFusion, a novel camera-LiDAR 3D object detection framework designed for effective sensor feature interaction and mitigation of sensor noise interference. This framework consists of three submodules: the Camera-LiDAR Feature Matching module, the LiDAR-to-Camera module, and the Camera-to-LiDAR module. First, in the Camera-LiDAR Feature Matching module, we use cross-attention to dynamically match the camera features and the LiDAR features, which resolves the feature inconsistency caused by noise in the camera's intrinsic and extrinsic parameters. Second, in the LiDAR-to-Camera module, we propose a novel depth representation that effectively mitigates LiDAR noise interference. Third, in the Camera-to-LiDAR module, we introduce deformable attention to help the LiDAR features capture instance-level semantic features. Additionally, we design a novel differentiable and efficient grid sample module to accelerate the process, since the bilinear grid sample module in deformable attention is time-consuming and not deployment-friendly. We compared RoBiFusion to the state-of-the-art BEVFusion on the nuScenes dataset and found that RoBiFusion surpasses BEVFusion by 1.5% mAP and 2.4% NDS. Furthermore, we designed a series of ablation experiments to verify the effectiveness of the aforementioned modules.
|
|
15:45-15:50, Paper ThDT23.7 | |
Towards Accurate Semi-Supervised BEV 3D Object Detection with Depth-Aware Refinement and Denoising-Aided Alignment |
|
Yang, Zhao | Xi'an Jiaotong University |
Shi, Yinan | Technical University Munich |
Zhu, Jiangtong | Xi'an Jiaotong University |
Xu, Weixiang | Institute of Automation, Chinese Academy of Sciences |
Liu, Longjun | Xi'an Jiaotong University |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning Methods, Deep Learning for Visual Perception
Abstract: Recently, camera-based Bird’s-Eye View (BEV) representation has gained significant traction in 3D object detection. However, training high-performance BEV 3D detectors typically requires a large number of annotated samples, which can be costly. Traditional semi-supervised methods for BEV 3D object detection face challenges including loss of rich depth information, inconsistent object representations across spaces, and unreliable pseudo label generation, leading to decreased accuracy and performance. Addressing this challenge, we pioneer the introduction of a semi-supervised BEV 3D object detection framework. Our approach leverages a small set of labeled data alongside a larger set of unlabeled data, significantly reducing annotation costs while maintaining robust detection performance. Firstly, we propose a depth-based self-refinement module to generate high-quality and stable pseudo labels, which can effectively regulate training with noisy labels. Secondly, we designed a denoising labels regression module that integrates denoising for both labeled and unlabeled data. Thirdly, in order to alleviate object inconsistency, we propose a consistent object-guided alignment method to ensure the consistency of objects in multi-spaces. Finally, our method can be easily plugged into various BEV 3D detection networks. Extensive experiments show that the proposed method achieves a new state-of-the-art compared to various camera-based 3D detectors tested on multiple public autonomous driving datasets.
|
|
ThET1 |
302 |
Visual Perception and Learning |
Regular Session |
Chair: Zhang, Hao | University of Massachusetts Amherst |
Co-Chair: Zhang, Jing | New York University |
|
16:35-16:40, Paper ThET1.1 | |
Open-RGBT: Open-Vocabulary RGB-T Zero-Shot Semantic Segmentation in Open-World Environments |
|
Yu, Meng | Beijing Institute of Technology |
Yue, Yufeng | Beijing Institute of Technology |
Yang, Luojie | Beijing Institute of Technology |
He, Xunjie | Beijing Institute of Technology |
Yang, Yi | Beijing Institute of Technology |
Fu, Mengyin | Beijing Institute of Technology |
Keywords: Semantic Scene Understanding, Deep Learning for Visual Perception
Abstract: Semantic segmentation is a critical technique for effective scene understanding. Traditional RGB-T semantic segmentation models often struggle to generalize across diverse scenarios due to their reliance on pretrained models and predefined categories. Recent advancements in Visual Language Models (VLMs) have facilitated a shift from closed-set to open-vocabulary semantic segmentation methods. However, these models face challenges in dealing with intricate scenes, primarily due to the heterogeneity between RGB and thermal modalities. To address this gap, we present Open-RGBT, a novel open-vocabulary RGB-T semantic segmentation model. Specifically, we obtain instance-level detection proposals by incorporating visual prompts to enhance category understanding. Additionally, we employ the CLIP model to assess image-text similarity, which helps enforce semantic consistency and mitigates ambiguities in category identification. Empirical evaluations demonstrate that Open-RGBT achieves superior performance in diverse and challenging real-world scenarios, even in the wild, significantly advancing the field of RGB-T semantic segmentation. The project page of Open-RGBT is available at https://OpenRGBT.github.io/.
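As a rough sketch of the CLIP-based category check mentioned in the abstract (the model choice, function name, and rescoring logic are illustrative assumptions, not the Open-RGBT implementation), one could rescore a cropped detection proposal against candidate category names via image-text similarity:

# Hedged illustration: rank candidate labels for one proposal crop with CLIP.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rescore_proposal(crop: Image.Image, candidate_labels: list) -> str:
    """Return the candidate label whose text embedding best matches the crop."""
    inputs = processor(text=candidate_labels, images=crop,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # shape: (1, num_labels)
    probs = logits.softmax(dim=-1).squeeze(0)
    return candidate_labels[int(probs.argmax())]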
|
|
16:40-16:45, Paper ThET1.2 | |
Positioning in Congested Space by Combining Vision-Based and Proximity-Based Control |
|
Thomas, John | Institut Pascal |
Chaumette, Francois | Inria Center at University of Rennes |
Keywords: Sensor-based Control, Visual Servoing
Abstract: In this paper, we consider positioning in congested space within the framework of Sensor-based Control (SBC) using vision and proximity sensors. Vision acts as the primary sensing modality for performing the positioning task, while proximity sensors complement it by ensuring that the robotic platform does not collide with objects in the workspace. Sensor information is combined in a shared manner using the QP formalism, where ideas from safety-critical control are used to express inequality constraints. The proposed method is validated through various real experiments.
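The abstract's shared-control formulation can be pictured with a small quadratic program: stay close to the visual-servoing command while satisfying CBF-style linear inequalities built from proximity readings. The variable names, constraint form, and numbers below are assumptions for the sketch, not the authors' exact QP.

# Illustrative QP sketch: track the vision-based command u_vision subject to
# proximity safety constraints A_prox @ u >= -gamma * h_prox.
import cvxpy as cp
import numpy as np

def safe_command(u_vision, A_prox, h_prox, gamma=1.0):
    u = cp.Variable(u_vision.shape[0])
    objective = cp.Minimize(cp.sum_squares(u - u_vision))
    constraints = [A_prox @ u >= -gamma * h_prox]
    cp.Problem(objective, constraints).solve()
    return u.value

# Example with a 6-DoF velocity command and two active proximity constraints.
u_cmd = safe_command(np.array([0.1, 0.0, 0.05, 0.0, 0.0, 0.02]),
                     A_prox=np.eye(6)[:2], h_prox=np.array([0.3, 0.1]))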
|
|
16:45-16:50, Paper ThET1.3 | |
SliceOcc: Indoor 3D Semantic Occupancy Prediction with Vertical Slice Representation |
|
Li, Jianing | Nanjing University |
Lu, Ming | Intel Labs |
Liu, Juntao | China Mobile Research Institute |
Wang, Hao | Peking University |
Gu, Chenyang | Peking University |
Zheng, Wenzhao | Tsinghua University |
Du, Li | Nanjing University |
Zhang, Shanghang | Peking University |
Keywords: Computer Vision for Manufacturing, Deep Learning for Visual Perception, Visual Learning
Abstract: 3D semantic occupancy prediction is a crucial task in visual perception, demanding a simultaneous understanding of both scene geometry and semantics. It plays a pivotal role in 3D scene comprehension and holds great potential for various applications, such as robotic vision perception and autonomous driving. Many previous works leverage planar-based representations like Bird’s Eye View (BEV) and Tri-Perspective View (TPV), which aim to simplify the complexity of 3D scenes while preserving essential object information, thereby facilitating efficient scene representation. However, in dense indoor environments where occlusions are prevalent, directly applying these planar-based methods often leads to difficulties in capturing global semantic occupancy, ultimately degrading model performance. In this paper, we introduce a novel vertical slice representation, which divides the scene along the vertical axis and projects spatial point features onto the nearest pair of parallel planes. To harness these slice features, we propose SliceOcc, a camera-based model specifically tailored for indoor 3D semantic occupancy prediction. SliceOcc utilizes pairs of slice queries and cross-attention mechanisms to extract planar features from input images. These local planar features are then combined to form a global scene representation, which is employed for indoor occupancy estimation. Experimental results on the EmbodiedScan dataset demonstrate that SliceOcc achieves a mIoU of 15.45% across 81 indoor categories, setting a new state-of-the-art performance among RGB-based models for indoor 3D semantic occupancy prediction.
|
|
16:50-16:55, Paper ThET1.4 | |
Bandwidth-Adaptive Spatiotemporal Correspondence Identification for Collaborative Perception |
|
Gao, Peng | North Carolina State University |
Jose, Williard Joshua | University of Massachusetts Amherst |
Zhang, Hao | University of Massachusetts Amherst |
Keywords: RGB-D Perception, Deep Learning Methods, Multi-Robot Systems
Abstract: Correspondence identification (CoID) is an essential capability for multi-robot collaborative perception, which allows a group of robots to consistently refer to the same objects in their own fields of view. In real-world applications, such as connected autonomous driving, connected vehicles cannot directly share their raw observations due to the limited communication bandwidth. To address this challenge, we propose a novel approach of bandwidth-adaptive spatiotemporal CoID for collaborative perception, where robots interactively select partial spatiotemporal observations to share with others, while adapting to the communication constraint that dynamically changes over time. We evaluate our approach over various scenarios in connected autonomous driving simulations. Experimental results have demonstrated that our approach enables CoID and adapts to the dynamic change of bandwidth constraints. In addition, our approach achieves 8%-56% overall improvements in terms of covisible object retrieval for CoID and data sharing efficiency, which outperforms the previous techniques and achieves the state-of-the-art performance. More information is available at: https://gaopeng5.github.io/acoid/.
|
|
16:55-17:00, Paper ThET1.5 | |
Polyp-Gen: Realistic and Diverse Polyp Image Generation for Endoscopic Dataset Expansion |
|
Liu, Shengyuan | The Chinese University of Hong Kong |
Chen, Zhen | Centre for Artificial Intelligence and Robotics (CAIR), Hong Kong |
Yang, Qiushi | City University of Hong Kong |
Yu, Weihao | Chinese University of Hong Kong |
Dong, Di | Institute of Automation, Chinese Academy of Sciences |
Hu, Jiancong | The Sixth Affiliated Hospital, Sun Yat-Sen University |
Yuan, Yixuan | Chinese University of Hong Kong |
Keywords: Computer Vision for Automation, Medical Robots and Systems, Deep Learning for Visual Perception
Abstract: Automated diagnostic systems (ADS) have shown significant potential in the early detection of polyps during endoscopic examinations, thereby reducing the incidence of colorectal cancer. However, due to high annotation costs and strict privacy concerns, acquiring high-quality endoscopic images poses a considerable challenge in the development of ADS. Despite recent advancements in generating synthetic images for dataset expansion, existing endoscopic image generation algorithms fail to accurately generate the details of polyp boundary regions and typically require medical priors to specify plausible locations and shapes of polyps, which limits the realism and diversity of the generated images. To address these limitations, we present Polyp-Gen, the first fully automatic diffusion-based endoscopic image generation framework. Specifically, we devise a spatial-aware diffusion training scheme with a lesion-guided loss to enhance the structural context of polyp boundary regions. Moreover, to capture medical priors for the localization of potential polyp areas, we introduce a hierarchical retrieval-based sampling strategy to match similar fine-grained spatial features. In this way, our Polyp-Gen can generate realistic and diverse endoscopic images for building reliable ADS. Extensive experiments demonstrate state-of-the-art generation quality, and the synthetic images improve the downstream polyp detection task. Additionally, our Polyp-Gen has shown remarkable zero-shot generalizability on other datasets. The source code is available at https://github.com/CUHK-AIM-Group/Polyp-Gen.
|
|
17:00-17:05, Paper ThET1.6 | |
DetailRefine: Towards Fine-Grained and Efficient Online Monocular 3D Reconstruction |
|
Chu, Fupeng | Chinese Academy of Sciences |
Cong, Yang | Chinese Academy of Sciences, China |
Chen, Ronghan | Shenyang Institute of Automation, Chinese Academy of Sciences |
Keywords: Computer Vision for Automation, Visual Learning, Deep Learning for Visual Perception
Abstract: Online monocular 3D reconstruction has attracted widespread attention as it promotes the application of robots in interactive scenarios. Most existing methods focus on 1) real-time reconstruction, 2) accurate voxel feature learning, and 3) effective voxel sparsification algorithms. To this end, 1) they adopt a coarse-to-fine pipeline, where all non-empty voxels are sent to the next level for refinement. However, this results in over-refinement of flat regions, leading to unnecessary computational overhead. Furthermore, 2) advanced methods focus on exploring view visibility but overlook the discriminability among visible views, which limits the representation of learned voxel features. Moreover, 3) existing sparsification algorithms struggle to distinguish detailed and empty voxels, resulting in either the loss of detailed voxels or the retention of empty voxels. To tackle these challenges, 1) we present Dynamic Detail Refinement (DDR) to allocate more voxels to detailed regions for refinement, which could alleviate the computational burden. Furthermore, 2) we propose Discriminability-Aware Fusion (DAF) to focus on discriminative views, which helps to capture accurate voxel features. In addition, 3) we propose Hierarchical Hybrid Sparsification (HHS) to balance global completeness and local refinement, which helps to preserve detailed voxels at hierarchical levels effectively. Extensive experiments conducted on the representative ScanNet (V2) and 7-Scenes datasets demonstrate the superiority of the proposed method.
|
|
17:05-17:10, Paper ThET1.7 | |
DAP-LED: Learning Degradation-Aware Priors with CLIP for Joint Low-Light Enhancement and Deblurring |
|
Wang, Ling | HKUST(GZ) |
Wu, Chen | University of Science and Technology of China |
Wang, Lin | Nanyang Technological University (NTU) |
Keywords: Visual Learning, Deep Learning for Visual Perception
Abstract: Autonomous vehicles and robots often struggle with reliable visual perception at night due to the low illumination and the motion blur caused by the long exposure time of RGB cameras. Existing methods address this challenge by sequentially connecting off-the-shelf pretrained low-light enhancement and deblurring models. Unfortunately, these methods often lead to noticeable artifacts (e.g., color distortions) in the over-exposed regions or make it hard to learn the motion cues of dark regions. In this paper, we find that vision-language models, e.g., Contrastive Language-Image Pretraining (CLIP), can comprehensively perceive diverse degradation levels at night. In light of this, we propose a novel transformer-based joint learning framework, named DAP-LED, which can jointly achieve low-light enhancement and deblurring, benefiting downstream tasks such as depth estimation, segmentation, and detection in the dark. The key insight is to leverage CLIP to adaptively learn the degradation levels from images at night. This subtly enables learning rich semantic information and visual representation for optimization of the joint tasks. To achieve this, we first introduce a CLIP-guided cross-fusion module to obtain multi-scale patch-wise degradation heatmaps from the image embeddings. Then, the heatmaps are fused via the designed CLIP-enhanced transformer blocks to retain useful degradation information for effective model optimization. Experimental results show that, compared to existing methods, our DAP-LED achieves state-of-the-art performance in the dark. Meanwhile, the enhanced results are demonstrated to be effective for three downstream tasks. For a demo and more results, please check the project page: https://vlislab22.github.io/dap-led/.
|
|
17:10-17:15, Paper ThET1.8 | |
FusionSense: Bridging Common Sense, Vision, and Touch for Robust Sparse-View Reconstruction |
|
Fang, Irving | New York University |
Shi, Kairui | New York University |
He, Xujin | New York University |
Tan, Siqi | New York University |
Wang, Yifan | New York University |
Zhao, Hanwen | New York University |
Huang, Hung-Jui | Carnegie Mellon University |
Yuan, Wenzhen | University of Illinois |
Feng, Chen | New York University |
Zhang, Jing | NYU |
Keywords: Deep Learning for Visual Perception, Force and Tactile Sensing, Object Detection, Segmentation and Categorization
Abstract: Humans effortlessly integrate common-sense knowledge with sensory input from vision and touch to understand their surroundings. Emulating this capability, we introduce FusionSense, a novel 3D reconstruction framework that enables robots to fuse priors from foundation models with highly sparse observations from vision and tactile sensors. FusionSense addresses three key challenges: (i) How can robots efficiently acquire robust global shape information about the surrounding scene and objects? (ii) How can robots strategically select touch points on the object using geometric and common-sense priors? (iii) How can partial observations such as tactile signals improve the overall representation of the object? Our framework employs 3D Gaussian Splatting as a core representation and incorporates a hierarchical optimization strategy involving global structure construction, object visual hull pruning and local geometric constraints. This advancement results in fast and robust perception in environments with traditionally challenging objects that are transparent, reflective, or dark, enabling more downstream manipulation or navigation tasks. Experiments on real-world data suggest that our framework outperforms previously state-of-the-art sparse-view methods. All code and data are open-sourced on the project website.
|
|
ThET2 |
301 |
Multi-Robot SLAM and Mapping |
Regular Session |
Chair: Chli, Margarita | ETH Zurich & University of Cyprus |
Co-Chair: Morbidi, Fabio | Université De Picardie Jules Verne |
|
16:35-16:40, Paper ThET2.1 | |
Multi-Robot Object SLAM Using Distributed Variational Inference |
|
Cao, Hanwen | University of California, San Diego |
Shreedharan, Sriram | University of California, San Diego |
Atanasov, Nikolay | University of California, San Diego |
Keywords: Multi-Robot SLAM, Distributed Robot Systems, Probability and Statistical Methods
Abstract: Multi-robot simultaneous localization and mapping (SLAM) enables a robot team to achieve coordinated tasks by relying on a common map of the environment. Constructing a map by centralized processing of the robot observations is undesirable because it creates a single point of failure and requires pre-existing infrastructure and significant communication throughput. This paper formulates multi-robot object SLAM as a variational inference problem over a communication graph subject to consensus constraints on the object estimates maintained by different robots. To solve the problem, we develop a distributed mirror descent algorithm with regularization enforcing consensus among the communicating robots. Using Gaussian distributions in the algorithm, we also derive a distributed multi-state constraint Kalman filter (MSCKF) for multi-robot object SLAM. Experiments on real and simulated data show that our method improves the trajectory and object estimates, compared to individual-robot SLAM, while achieving better scaling to large robot teams, compared to centralized multi-robot SLAM.
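A toy version of the consensus-regularized update described above (synchronous communication, Euclidean point estimates, and all step sizes are assumptions; the paper works with distributions and mirror descent) might look like:

# Each robot descends its local objective while being pulled toward the mean
# estimate of its communication neighbors.
import numpy as np

def consensus_step(estimates, adjacency, local_gradients, step=0.1, rho=0.5):
    """estimates: (num_robots, dim); adjacency: (num_robots, num_robots) 0/1 graph."""
    updated = np.empty_like(estimates)
    for i in range(len(estimates)):
        neighbors = np.nonzero(adjacency[i])[0]
        consensus = estimates[neighbors].mean(axis=0) if len(neighbors) else estimates[i]
        updated[i] = estimates[i] - step * local_gradients[i] + rho * (consensus - estimates[i])
    return updated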
|
|
16:40-16:45, Paper ThET2.2 | |
DVM-SLAM: Decentralized Visual Monocular Simultaneous Localization and Mapping for Multi-Agent Systems |
|
Bird, Joshua | University of Cambridge |
Blumenkamp, Jan | University of Cambridge |
Prorok, Amanda | University of Cambridge |
Keywords: Multi-Robot Systems, Multi-Robot SLAM, SLAM
Abstract: Cooperative Simultaneous Localization and Mapping (C-SLAM) enables multiple agents to work together in mapping unknown environments while simultaneously estimating their own positions. This approach enhances robustness, scalability, and accuracy by sharing information between agents, reducing drift, and enabling collective exploration of larger areas. In this paper, we present Decentralized Visual Monocular SLAM (DVM-SLAM), the first open-source decentralized monocular C-SLAM system. By only utilizing low-cost and light-weight monocular vision sensors, our system is well suited for small robots and micro aerial vehicles (MAVs). DVM-SLAM's real-world applicability is validated on physical robots with a custom collision avoidance framework, showcasing its potential in real-time multi-agent autonomous navigation scenarios. We also demonstrate comparable accuracy to state-of-the-art centralized monocular C-SLAM systems. We open-source our code and provide supplementary material online.
|
|
16:45-16:50, Paper ThET2.3 | |
TCAFF: Temporal Consistency for Robot Frame Alignment |
|
Peterson, Mason B. | Massachusetts Institute of Technology |
Lusk, Parker C. | Massachusetts Institute of Technology |
Avila, Antonio | Massachusetts Institute of Technology |
How, Jonathan | Massachusetts Institute of Technology |
Keywords: Localization, Multi-Robot SLAM
Abstract: In the field of collaborative robotics, the ability to communicate spatial information like planned trajectories and shared environment information is crucial. When no global position information is available (e.g., indoor or GPS-denied environments), agents must align their coordinate frames before shared spatial information can be properly expressed and interpreted. Coordinate frame alignment is particularly difficult when robots have no initial alignment and are affected by odometry drift. To this end, we develop a novel multiple hypothesis algorithm, called TCAFF, for aligning the coordinate frames of neighboring robots. TCAFF considers potential alignments from associating sparse open-set object maps and leverages temporal consistency to determine an initial alignment and correct for drift, all without any initial knowledge of neighboring robot poses. We demonstrate TCAFF being used for frame alignment in a collaborative object tracking application on a team of four robots tracking six pedestrians and show that TCAFF enables robots to achieve a tracking accuracy similar to that of a system with ground truth localization. The code and hardware dataset are available at https://github.com/mit-acl/tcaff.
|
|
16:50-16:55, Paper ThET2.4 | |
Effective Heterogeneous Point Cloud-Based Place Recognition and Relative Localization for Ground and Aerial Vehicles |
|
Mao, Rui | Sun Yat-Sen University |
Cheng, Hui | Sun Yat-Sen University |
Keywords: Range Sensing, Localization, Multi-Robot SLAM
Abstract: Place recognition and relative localization are crucial for realizing the potential of collaboration in ground and aerial robot teams. Many existing works focus only on ground robots and are not well-suited for heterogeneous robot systems in large-scale environments. In this paper, we propose a novel pipeline based on BEV density images, combined with an enhanced data structure, for place recognition in air-ground robotic collaboration systems. An efficient height alignment algorithm is proposed for relative localization. Extensive experiments on various types of public datasets validate the efficacy of our method compared to other SOTA works. We also show that our method is capable of detecting inter- and intra-robot loop closures in a ground and aerial multi-session SLAM system.
|
|
16:55-17:00, Paper ThET2.5 | |
Distributed Invariant Kalman Filter for Object-Level Multi-Robot Pose SLAM |
|
Li, Haoying | Chinese University of Hong Kong, Shenzhen |
Zeng, Qingcheng | The Hong Kong University of Science and Technology (Guangzhou) |
Li, Haoran | Chinese University of Hong Kong, Shenzhen |
Zhang, Yanglin | The Chinese University of Hong Kong, Shenzhen |
Wu, Junfeng | The Chinese University of Hong Kong, Shenzhen |
Keywords: Distributed Robot Systems, Multi-Robot SLAM, Autonomous Agents
Abstract: Cooperative localization and target tracking are essential for multi-robot systems to implement high-level tasks. To this end, we propose a distributed invariant Kalman filter (KF) based on covariance intersection (CI) for effective multi-robot pose estimation. The paper utilizes object-level measurement models, which condense information and further reduce the communication burden. Besides, by modeling states on special Lie groups and representing uncertainty in the corresponding Lie algebras, better linearity and consistency are obtained under the invariant KF framework. We also use a combination of CI and KF to avoid overly confident or conservative estimates in multi-robot systems with intricate and unknown correlations, and some level of robot degradation is acceptable through multi-robot collaboration. The simulation and real data experiments validate the practicability and superiority of the proposed algorithm.
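For reference, the textbook covariance intersection rule that underlies the CI component can be sketched as below; this is the standard Euclidean form with a simple grid search over the weight, not the paper's invariant, Lie-group formulation.

# Fuse two estimates with unknown cross-correlation via covariance intersection.
import numpy as np

def covariance_intersection(x1, P1, x2, P2, num_w=50):
    best = None
    for w in np.linspace(1e-3, 1 - 1e-3, num_w):
        info = w * np.linalg.inv(P1) + (1 - w) * np.linalg.inv(P2)
        P = np.linalg.inv(info)
        if best is None or np.trace(P) < np.trace(best[1]):
            x = P @ (w * np.linalg.inv(P1) @ x1 + (1 - w) * np.linalg.inv(P2) @ x2)
            best = (x, P)
    return best  # (fused mean, fused covariance)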
|
|
17:00-17:05, Paper ThET2.6 | |
MT-PCR: Leveraging Modality Transformation for Large-Scale Point Cloud Registration with Limited Overlap |
|
Wu, Yilong | University of Science and Technology of China |
Duan, Yifan | University of Science and Technology of China |
Chen, Yuxi | University of Science and Technology of China |
Zhang, Xinran | University of Science and Technology of China |
Shen, Yedong | University of Science & Technology of China |
Ji, Jianmin | University of Science and Technology of China |
Zhang, Yanyong | University of Science and Technology of China |
Zhang, Lu | Institute of Artificial Intelligence, Hefei Comprehensive National Science Center |
Keywords: Multi-Robot SLAM, Aerial Systems: Perception and Autonomy, Mapping
Abstract: Large-scale scene point cloud registration with limited overlap is a challenging task due to computational load and constrained data acquisition. To tackle these issues, we propose a point cloud registration method, MT-PCR, based on Modality Transformation. MT-PCR leverages a Bird’s Eye View (BEV) capturing the maximal overlap information to improve the accuracy and utilizes images to provide complementary spatial features. Specifically, MT-PCR converts 3D point clouds to BEV images and estimates correspondences by 2D image keypoint extraction and matching. Subsequently, the 2D correspondence estimates are transformed back to 3D point clouds using inverse mapping. We have applied MT-PCR to Terrestrial Laser Scanning (TLS) and Aerial Laser Scanning (ALS) point cloud registration on the GrAco dataset, involving 8 low-overlap, square-kilometer scale registration scenarios. Experiments and comparisons with commonly used methods demonstrate that MT-PCR can achieve superior accuracy and robustness in large-scale scenes with limited overlap.
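The modality-transformation step can be pictured with a short sketch that rasterizes a point cloud into a BEV density image, after which any standard 2D keypoint detector and matcher can be applied; the grid size, extent, and normalization are assumed values, not those used by MT-PCR.

import numpy as np

def bev_density_image(points, grid_size=0.5, extent=200.0):
    """points: (N, 3) array in metres; returns an 8-bit BEV density image."""
    bins = int(2 * extent / grid_size)
    hist, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=bins,
                                range=[[-extent, extent], [-extent, extent]])
    hist = np.log1p(hist)                           # compress dynamic range
    return (255 * hist / max(hist.max(), 1e-6)).astype(np.uint8)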
|
|
17:05-17:10, Paper ThET2.7 | |
Large-Scale Multi-Session Point-Cloud Map Merging |
|
Wei, Hairuo | The University of Hong Kong |
Li, Rundong | University of Hong Kong |
Cai, Yixi | KTH Royal Institute of Technology |
Yuan, Chongjian | The University of Hong Kong |
Ren, Yunfan | The University of Hong Kong |
Zou, Zuhao | The University of Hong Kong |
Wu, Huajie | Hong Kong University |
Zheng, Chunran | The University of Hong Kong |
Zhou, Shunbo | Huawei |
Xue, Kaiwen | The Chinese University of Hong Kong, Shenzhen |
Zhang, Fu | University of Hong Kong |
Keywords: Multi-Robot SLAM, Mapping, SLAM
Abstract: This paper introduces LAMM, an open-source framework for large-scale multi-session 3D LiDAR point cloud map merging. LAMM can automatically integrate sub-maps from multiple agents carrying LiDARs with different scanning patterns, facilitating place feature extraction, data association, and global optimization in various environments. Our framework incorporates two key novelties that enable robust, accurate, large-scale map merging. The first novelty is a temporal bidirectional filtering mechanism that removes dynamic objects from 3D LiDAR point cloud data. This eliminates the effect of dynamic objects on the 3D map model, providing higher-quality map merging results. The second novelty is a robust and efficient outlier removal algorithm for detected loop closures. This algorithm ensures a high recall rate and a low false alarm rate in position retrieval, significantly reducing outliers in repetitive environments during large-scale merging. We evaluate our framework using various datasets, including KITTI, H
|
|
ThET3 |
303 |
Robotics and Automation in Life Science and Rescue Applications |
Regular Session |
Chair: Kaiser, Tanja Katharina | University of Technology Nuremberg |
Co-Chair: Alterovitz, Ron | University of North Carolina at Chapel Hill |
|
16:35-16:40, Paper ThET3.1 | |
The qPCRBot: Combining Automated Data Handling, Standardization, and Robotic Labware Transport for Better qPCR Measurements |
|
Zwirnmann, Henning | Technical University of Munich |
Eckhoff, Moritz | Technical University of Munich |
Knobbe, Dennis | Technical University of Munich |
Fülöp, Dorian | Technical University of Munich (TUM) |
Gabrielli, Andrea | Technical University of Munich |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Robotics and Automation in Life Sciences, Software Architecture for Robotic and Automation, Biological Cell Manipulation
Abstract: Laboratory automation is a key driver for higher efficiency and reproducibility of experiments and measurements in natural science laboratories. One process that is particularly susceptible to manual errors in the physical handling of labware, faulty data analyses, and incomplete reporting is the quantitative Polymerase Chain Reaction (qPCR). It is a ubiquitous analysis method in biolaboratories to amplify and measure the amount of a specific DNA sequence in a sample. Our system, which we call the qPCRBot, addresses these issues through three key pillars: automating data analysis and handling processes, standardizing data management and system communication protocols, and utilizing a robotic manipulator for labware transport. To achieve this, we developed a SiLA 2-based client-server architecture for unified and standardized access to both the qPCR device and the robot. For the manipulator, we implemented a Cartesian motion generator to ensure proper labware transport. We transform all experiment data to a standardized, XML-based format and integrate a widely-used Laboratory Information Management System for its storage. These developments collectively enable streamlined qPCR measurements without human interaction, thus enhancing both efficiency and reproducibility.
|
|
16:40-16:45, Paper ThET3.2 | |
Distributed Pursuit of an Evader with Adaptive Robust Path Control under State Measurement Uncertainty |
|
Rao, Kai | East China University of Science and Technology |
Yan, Huaicheng | East China University of Science and Technology |
Huang, Zhihao | East China University of Science and Technology |
Yang, Penghui | East China University of Science and Technology |
Lv, Yunkai | East China University of Science and Technology |
Keywords: Surveillance Robotic Systems, Search and Rescue Robots, Multi-Robot Systems
Abstract: This paper presents a distributed pursuit framework for environments with obstacles considering state measurement uncertainty. Our framework consists of two primary components: the computation of safe pursuit regions based on Voronoi cell (VC) and the solution of an adaptive robust path controller based on Control Barrier Function (CBF). Initially, the chance constrained obstacle-aware Voronoi cell (CCOVC) for each pursuer is constructed by calculating separation hyperplane and buffer terms. Subsequently, we formulate chance CBF and chance Control Lyapunov Function (CLF) constraints, using convex approximation to determine their upper bounds. We then find the adaptive robust path controller by solving a Quadratically Constrained Quadratic Program (QCQP). The advantage of this framework lies in its capability to adaptively compute the path controller and ensure robust collision avoidance among pursuers and with obstacles. Simulation and experimental results demonstrate the effectiveness and robustness of the proposed framework.
|
|
16:45-16:50, Paper ThET3.3 | |
Multimodal Behaviour Trees for Robotic Laboratory Task Automation |
|
Fakhruldeen, Hatem | University of Liverpool |
Raveendran Nambiar, Arvind | University of Liverpool |
Veeramani, Satheeshkumar | University of Liverpool |
Tailor, Bonilkumar Vijaykumar | University of Liverpool |
Beyzaee Juneghani, Hadi | University of Liverpool |
Pizzuto, Gabriella | University of Liverpool |
Cooper, Andrew Ian | University of Liverpool |
Keywords: Robotics and Automation in Life Sciences
Abstract: Laboratory robotics offer the capability to conduct experiments with a high degree of precision and reproducibility, with the potential to transform scientific research. Trivial and repeatable tasks, e.g., sample transportation for analysis and vial capping, are well-suited for robots; if these are done successfully and reliably, chemists can devote their efforts to more critical research activities. Currently, robots can perform these tasks faster than chemists, but how reliable are they? Improper capping could result in human exposure to toxic chemicals, which could be fatal. To ensure that robots perform these tasks as accurately as humans, sensory feedback is required to assess the progress of task execution. To address this, we propose a novel methodology based on behaviour trees with multimodal perception. Along with automating robotic tasks, this methodology also verifies the successful execution of the task, a fundamental requirement in safety-critical environments. The experimental evaluation was conducted on two lab tasks: sample vial capping and laboratory rack insertion. The results show a high success rate, i.e., 88% for capping and 92% for insertion, along with strong error detection capabilities. This ultimately demonstrates the robustness and reliability of our approach and suggests that multimodal behaviour trees should pave the way towards the next generation of robotic chemists.
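The core pattern, a behaviour tree that pairs an action with a multimodal success check, can be illustrated with the hand-rolled sketch below; the node classes and sensor cues are hypothetical and are not the authors' implementation.

SUCCESS, FAILURE = "SUCCESS", "FAILURE"

class Sequence:
    """Ticks children in order; fails as soon as one child fails."""
    def __init__(self, children): self.children = children
    def tick(self):
        for child in self.children:
            if child.tick() == FAILURE:
                return FAILURE
        return SUCCESS

class CapVial:
    def tick(self):
        print("executing capping motion")
        return SUCCESS

class VerifyCapSeated:
    """Multimodal check: force/torque and camera cues must both agree."""
    def __init__(self, torque_ok, vision_ok):
        self.torque_ok, self.vision_ok = torque_ok, vision_ok
    def tick(self):
        return SUCCESS if (self.torque_ok and self.vision_ok) else FAILURE

tree = Sequence([CapVial(), VerifyCapSeated(torque_ok=True, vision_ok=True)])
print(tree.tick())   # SUCCESS only if the action ran and both modalities confirm it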
|
|
16:50-16:55, Paper ThET3.4 | |
A Hierarchical Graph-Based Terrain-Aware Autonomous Navigation Approach for Complementary Multimodal Ground-Aerial Exploration |
|
Patel, Akash | Luleå University of Technology |
Valdes Saucedo, Mario Alberto | Lulea University of Technology |
Stathoulopoulos, Nikolaos | Luleå University of Technology |
Sankaranarayanan, Viswa Narayanan | Lulea University of Technology |
Tevetzidis, Ilias | Luleå University of Technology |
Kanellakis, Christoforos | LTU |
Nikolakopoulos, George | Luleå University of Technology |
Keywords: Search and Rescue Robots, Field Robots, Cooperating Robots
Abstract: Autonomous navigation in unknown environments is a fundamental challenge in robotics, particularly in coordinating ground and aerial robots to maximize exploration efficiency. This paper presents a novel approach that utilizes a hierarchical graph to represent the environment, encoding both geometric and semantic traversability. The framework enables the robots to compute a shared confidence metric, which helps the ground robot assess terrain and determine when deploying the aerial robot will extend exploration. The robot's confidence in traversing a path is based on factors such as predicted volumetric gain, path traversability, and collision risk. A hierarchy of graphs is used to maintain an efficient representation of traversability and frontier information through multi-resolution maps. Evaluated in a real subterranean exploration scenario, the approach allows the ground robot to autonomously identify zones that are no longer traversable but suitable for aerial deployment. By leveraging this hierarchical structure, the ground robot can selectively share graph information on confidence-assessed frontier targets from parts of the scene, enabling the aerial robot to navigate beyond obstacles and continue exploration.
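One plausible reading of the shared confidence metric is a weighted combination of predicted volumetric gain, terrain traversability, and collision risk; the weights, threshold, and normalization below are purely assumed for illustration and are not taken from the paper.

def path_confidence(volumetric_gain, traversability, collision_risk,
                    w_gain=0.4, w_trav=0.4, w_risk=0.2):
    """All inputs normalized to [0, 1]; higher values favour ground traversal."""
    return w_gain * volumetric_gain + w_trav * traversability - w_risk * collision_risk

# Deploy the aerial robot when the ground robot's confidence drops below a threshold.
deploy_aerial = path_confidence(0.8, 0.2, 0.7) < 0.3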
|
|
16:55-17:00, Paper ThET3.5 | |
Introducing Collaborative Robots As a First Step towards Autonomous Reprocessing of Medical Equipment |
|
Voigt, Florian | Technical University of Munich |
Naceri, Abdeldjallil | Technical University of Munich |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Robotics and Automation in Life Sciences, Bimanual Manipulation, Medical Robots and Systems
Abstract: Ensuring the sterility of medical equipment, particularly endoscopes used in environments teeming with diverse pathogens and drug-resistant bacteria, is crucial for safe medical procedures. However, the complexity of endoscope reprocessing, which involves numerous dexterous manual manipulations, poses significant challenges. Achieving certification for sterilization requires precise, repetitive execution with strict tolerances. In this study, we propose a framework that automates the handling and storage of endoscopes right after the sterilization process and employs compliant collaborative robots to address these dexterous manipulation challenges. In the first stage, we identified the key manipulation skills involved in the process through observations and feedback from medical personnel. In the second stage, we proposed a system that employs a high-level action planner to orchestrate the removal and storage of endoscopes, integrating two collaborative robots and a linear unit. Through real-time force measurements, compliant control, task knowledge, and safety protocols, we establish a system that ensures the safety of both medical equipment and personnel in proximity. In our first experiment, we conducted 50 trials with a 100% reliability rate. Each trial had an execution time of 102 seconds, with a variance of 1.2 seconds. In our second experiment, we performed 10 trials with a human obstructing the transfer path, facing away from the robot. In all cases, the system successfully and promptly detected the collision. This work pioneers the automation of medical reprocessing in sterile environments using tactile robots and addresses the associated challenges.
|
|
17:00-17:05, Paper ThET3.6 | |
CloudTrack: Scalable UAV Tracking with Cloud Semantics |
|
Blei, Yannik | University of Technology Nuremberg |
Krawez, Michael | University of Technology Nuremberg |
Nilavadi, Nisarga | University of Technology Nuremberg |
Kaiser, Tanja Katharina | University of Technology Nuremberg |
Burgard, Wolfram | University of Technology Nuremberg |
Keywords: Search and Rescue Robots, Aerial Systems: Applications, Human Detection and Tracking
Abstract: Nowadays, unmanned aerial vehicles (UAVs) are commonly used in search and rescue scenarios to gather information in the search area. The automatic identification of the person searched for in aerial footage could increase the autonomy of such systems, reduce the search time, and thus increase the missing person’s chances of survival. In this paper, we present a novel approach to perform semantically conditioned open vocabulary object tracking that is specifically designed to cope with the limitations of UAV hardware. Our approach has several advantages: It can run with verbal descriptions of the missing person, e.g., the color of the shirt, it does not require dedicated training to execute the mission, and can efficiently track a potentially moving person. Our experimental results demonstrate the versatility and efficacy of our approach. We publish the method's source code at https://github.com/utn-blei/CloudTrack.
|
|
17:05-17:10, Paper ThET3.7 | |
The Experiment Orchestration System (EOS): Comprehensive Foundation for Laboratory Automation |
|
Angelopoulos, Angelos | University of North Carolina at Chapel Hill |
Baykal, Cem | University of North Carolina at Chapel Hill |
Kandel, Jade | University of North Carolina at Chapel Hill |
Verber, Matthew | University of North Carolina at Chapel Hill |
Cahoon, James | University of North Carolina at Chapel Hill |
Alterovitz, Ron | University of North Carolina at Chapel Hill |
Keywords: Robotics and Automation in Life Sciences, Software Architecture for Robotic and Automation, Foundations of Automation
Abstract: As scientific research in chemistry, materials science, and applied sciences becomes increasingly complex and data-driven, there is a growing need for efficient, scalable, and flexible automation to accelerate discoveries and reduce human burden and error in laboratories. We introduce the Experiment Orchestration System (EOS), an open-source software framework and runtime offering a comprehensive foundation for laboratory automation. EOS offers an extensible framework allowing users to define labs, devices, tasks, experiments, and optimization criteria using YAML and Python plugins, and also offers a distributed runtime for managing and executing automation. EOS has a central orchestrator that communicates with and controls laboratory equipment to execute tasks. EOS implements autonomous experiment campaigns, parameter optimization, task scheduling, result aggregation, and more. By providing a common infrastructure for laboratory automation, EOS aims to reduce automation implementation barriers and accelerate discoveries in science laboratories.
|
|
ThET4 |
304 |
Bioinspiration and Biomimetics 3 |
Regular Session |
Chair: Floreano, Dario | Ecole Polytechnique Fédérale De Lausanne (EPFL) |
Co-Chair: Degani, Amir | Technion - Israel Institute of Technology |
|
16:35-16:40, Paper ThET4.1 | |
Design of a Bioinspired Jumping Mechanism for Self-Takeoff of Flapping Robot |
|
Pan, Erzhen | Harbin Institute of Technology, Shenzhen |
Sun, Wei | Harbin Institute of Technology Shenzhen |
Xu, Wenfu | Harbin Institute of Technology, Shenzhen |
Keywords: Biologically-Inspired Robots, Biomimetics
Abstract: Most birds in nature rely on jumping for take-off. Flapping-wing robots can flap and fly like birds but require an operator to take off, as they are unable to generate sufficient lift to maintain flight at low airspeed and must accelerate to take-off speed in a short time. This poses a challenge for the design of the jumping mechanism. Inspired by the jump-takeoff of birds, this study designs a simple and lightweight jumping leg that is capable of storing and releasing energy with only one degree of freedom. In addition, a prototype with a wingspan of 2 meters and a mass of 1.6 kilograms was developed and tested; it accelerates to 4 m/s in 52 milliseconds by jumping, achieving jumping take-off from the ground.
|
|
16:40-16:45, Paper ThET4.2 | |
Embodied Adaptive Sensing for Odor Concentration Maximization in Bio-Inspired Robotics |
|
Homchanthanakul, Jettanan | Vidyasirimedhi Institute of Science and Technology |
Shigaki, Shunsuke | National Institute of Informatics |
Manoonpong, Poramate | Vidyasirimedhi Institute of Science and Technology (VISTEC) |
Keywords: Biologically-Inspired Robots, Neural and Fuzzy Control, Legged Robots
Abstract: Animals exhibit remarkable adaptability in sensing their environments, employing strategies that optimize information gathering. For instance, silk moths adjust their wing-flapping frequency to detect pheromones, while dogs modify their sniffing behavior by altering sniff height and frequency based on proximity to an odor source. Despite the potential to enhance odor detection for olfactory navigation by drawing inspiration from these natural mechanisms, many existing approaches focus on computationally intensive methods like multi-sensory integration or rely on multiple robots for odor localization, rather than leveraging embodied sensing. In this study, we propose an embodied adaptive sensing strategy that enhances odor detection by implementing an active odor sensor on a legged robot and applying a bio-inspired adaptive robot height control system for dynamically adapting the robot's height based on real-time gas concentration feedback. The control system employs a simple artificial hormone mechanism to regulate the robot height by processing gas concentration derivatives, mimicking biological adaptability. By utilizing the interaction between the active odor sensor, adaptive control system, and the legged body, this approach allows the robot to optimize its height online to capture the maximum gas concentration, thereby reducing the need for complex algorithms and high computational resources. As a result, it offers a more efficient solution for odor-driven tasks, with potential applications in real-world environments.
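A toy version of the hormone-gated height adaptation could look as follows; the gains, decay rate, and height limits are illustrative assumptions rather than the values used on the robot.

def update_height(height, conc, prev_conc, hormone, dt=0.1,
                  gain=0.02, decay=0.95, h_min=0.10, h_max=0.35):
    """Adapt body height using the time-derivative of gas concentration,
    gated by a hormone-like internal variable that decays over time."""
    d_conc = (conc - prev_conc) / dt
    hormone = decay * hormone + d_conc      # accumulate recent concentration change
    height += gain * hormone                # move toward increasing concentration
    return min(max(height, h_min), h_max), hormone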
|
|
16:45-16:50, Paper ThET4.3 | |
SKOOTR: A SKating, Omni-Oriented, Tripedal Robot |
|
Hung, Adam Joshua | University of Michigan |
Enninful Adu, Challen | University of Michigan |
Moore, Talia | University of Michigan |
Keywords: Biologically-Inspired Robots, Biomimetics
Abstract: In both animals and robots, locomotion capabilities are determined by the physical structure of the system. The majority of legged animals and robots are bilaterally symmetric, which facilitates locomotion with consistent headings and obstacle traversal, but leads to constraints in their turning ability. On the other hand, radially symmetric animals have demonstrated rapid turning abilities enabled by their omni-directional body plans. Radially symmetric tripedal robots are able to turn instantaneously, but are commonly constrained by needing to change direction with every step, resulting in inefficient and less stable locomotion. Inspired by the radial symmetry and maneuverability of brittle stars and octopuses, we introduce a novel design for a tripedal robot that has both frictional and rolling contacts. Additionally, a freely rotating central sphere provides an added contact point so the robot can retain a stable tripod base of support while lifting and pushing with any one of its legs. The SKating, Omni-Oriented, Tripedal Robot (SKOOTR) is more versatile and stable than existing tripedal robots. It is capable of multiple forward gaits, multiple turning maneuvers, obstacle traversal, and stair climbing. SKOOTR has been designed to facilitate customization for diverse applications: it is fully open-source, is constructed with 3D printed or off-the-shelf parts, and costs approximately 500 USD to build. A project page with CAD files, assembly guide, and links to the github repository is posted at https://www.embirlab.com/skootr.
|
|
16:50-16:55, Paper ThET4.4 | |
AllGaits: Learning All Quadruped Gaits and Transitions |
|
Bellegarda, Guillaume | EPFL |
Shafiee, Milad | EPFL |
Ijspeert, Auke | EPFL |
Keywords: Biologically-Inspired Robots, Legged Robots
Abstract: We present a framework for learning a single policy capable of producing all quadruped gaits and transitions. The framework consists of a policy trained with deep reinforcement learning (DRL) to modulate the parameters of a system of abstract oscillators (i.e. Central Pattern Generator), whose output is mapped to joint commands through a pattern formation layer that sets the gait style, i.e. body height, swing foot ground clearance height, and foot offset. Different gaits are formed by changing the coupling between different oscillators, which can be instantaneously selected at any velocity by a user. With this framework, we systematically investigate which gait should be used at which velocity, and when gait transitions should occur from a Cost of Transport (COT), i.e. energy-efficiency, point of view. Additionally, we note how gait style changes as a function of locomotion speed for each gait to keep the most energy-efficient locomotion. While the currently most popular gait (trot) does not result in the lowest COT, we find that considering different co-dependent metrics such as mean base angular velocity and joint acceleration result in different 'optimal' gaits than those that minimize COT. We deploy our controller in various hardware experiments, focusing on 9 quadruped animal gaits, and demonstrate generalizability to novel and unseen gaits during training, and robustness to leg failures.
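The oscillator layer that the policy modulates can be sketched as coupled phase oscillators whose phase-offset vector selects the gait; the frequency, coupling gain, and trot offsets below are assumed values for illustration only, not the paper's trained parameters.

import numpy as np

TROT = np.array([0.0, np.pi, np.pi, 0.0])   # assumed per-leg phase offsets (FL, FR, HL, HR)

def cpg_step(phases, freq=2.0, coupling=1.0, offsets=TROT, dt=0.01):
    """phases: (4,) oscillator phases, one per leg."""
    dphi = 2 * np.pi * freq * np.ones(4)
    for i in range(4):
        for j in range(4):
            dphi[i] += coupling * np.sin(phases[j] - phases[i] - (offsets[j] - offsets[i]))
    return (phases + dt * dphi) % (2 * np.pi)

Swapping the offsets vector (e.g., to a walk or bound pattern) changes the gait without touching the rest of the pipeline, which is the kind of instantaneous gait selection the abstract describes.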
|
|
16:55-17:00, Paper ThET4.5 | |
Bird-Inspired Tendon Coupling Improves Paddling Efficiency by Shortening Phase Transition Times |
|
Lin, Jianfeng | Georgia Institute of Technology |
Guo, Zhao | Wuhan University |
Badri-Spröwitz, Alexander | Max Planck Institute for Intelligent Systems |
Keywords: Biologically-Inspired Robots, Biomimetics, Tendon/Wire Mechanism
Abstract: Drag-based swimming with rowing appendages, fins, and webbed feet is a widely adapted locomotion form in aquatic animals. To develop effective underwater and swimming vehicles, a wide range of bioinspired drag-based paddles have been proposed, often faced with a trade-off between propulsive efficiency and versatility. Webbed feet provide an effective propulsive force in the power phase, are lightweight and robust, and can even be partially folded away in the recovery phase. However, during the transition between recovery and power phase, much time is lost folding and unfolding, leading to drag and reducing efficiency. In this work, we took inspiration from the coupling tendons of aquatic birds and utilized tendon coupling mechanisms to shorten the transition time between recovery and power phase. Results from our hardware experiments show that the proposed mechanisms improve propulsive efficiency by 2.0 and 2.4 times compared to a design without extensor tendons or based on a passive paddle, respectively. We further report that distal leg joint clutching, which has been shown to improve efficiency in terrestrial walking, did not play a major role in swimming locomotion. In sum, we describe a new principle for an efficient, drag-based leg and paddle design, with potential relevance for the swimming mechanics in aquatic birds.
|
|
17:00-17:05, Paper ThET4.6 | |
A Bio-Inspired Sand-Rolling Robot: Effect of Body Shape on Sand Rolling Performance |
|
Liao, Xingjue | University of Southern California |
Liu, Wenhao | University of Southern California |
Wu, Hao | University of Southern California |
Qian, Feifei | University of Southern California |
Keywords: Biologically-Inspired Robots, Biomimetics, Passive Walking
Abstract: The capability of effectively moving on complex terrains such as sand and gravel can empower our robots to robustly operate in outdoor environments, and assist with critical tasks such as environment monitoring, search-and-rescue, and supply delivery. Inspired by the Mount Lyell salamander's ability to curl its body into a loop and effectively roll down hill slopes, in this study we develop a sand-rolling robot and investigate how its locomotion performance is governed by the shape of its body. We experimentally tested three different body shapes: Hexagon, Quadrilateral, and Triangle. We found that Hexagon and Triangle can achieve a faster rolling speed on sand, but exhibited more frequent failures of getting stuck. Analysis of the interaction between robot and sand revealed the failure mechanism: the deformation of the sand produced a local "sand incline" underneath robot contact segments, increasing the effective region of supporting polygon (ERSP) and preventing the robot from shifting its center of mass (CoM) outside the ERSP to produce sustainable rolling. Based on this mechanism, a highly-simplified model successfully captured the critical body pitch for each rolling shape to produce sustained rolling on sand, and informed design adaptations that mitigated the locomotion failures and improved robot speed by more than 200%. Our results provide insights into how locomotors can utilize different morphological features to achieve robust rolling motion across deformable substrates.
|
|
17:05-17:10, Paper ThET4.7 | |
A Programmable Substrate to Study Robots Jumping from Non-Rigid Surfaces |
|
Divi, Sathvik | Carnegie Mellon University |
Yim, Justin K. | University of Illinois Urbana-Champaign |
Bedillion, Mark | Carnegie Mellon University |
Bergbreiter, Sarah | Carnegie Mellon University |
Keywords: Biologically-Inspired Robots, Biomimetics, Compliance and Impedance Control
Abstract: This study presents the development, characterization, and demonstration of a tunable substrate for small jumping robots. Jumping robots in the literature are typically evaluated when jumping from rigid surfaces, in contrast to surfaces with more significant compliance or damping that are encountered in the natural world. The aim of this work is to create a physical substrate, or 'ground', for which the effective mass, compliance, and damping can be programmed. This system enables quick testing of various substrate conditions and also allows for the introduction of complex nonlinearities to analyze the interactions between latch-mediated spring actuation (LaMSA) systems and their environment. A mathematical model for the substrate is defined and the system is built with a fast brushless DC motor and controller running on a real-time target machine. The results illustrate the range of compliance and damping that can be achieved, as well as example jumps from the substrate using a 4 g jumper and a 108 g jumping robot.
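The programmable substrate idea can be summarized as rendering a virtual mass-spring-damper under the measured foot force; the parameter values in the sketch are assumptions, not the hardware's calibrated settings.

def substrate_step(x, v, force, m_eff=0.05, k=800.0, c=2.0, dt=0.001):
    """x, v: virtual substrate position/velocity; force: measured contact force (N).
    Returns the ground position/velocity the motor should render next."""
    a = (force - k * x - c * v) / m_eff
    v += a * dt
    x += v * dt
    return x, v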
|
|
ThET5 |
305 |
Learning for Legged Locomotion 2 |
Regular Session |
Chair: Havoutis, Ioannis | University of Oxford |
Co-Chair: Daniel, Mélodie | LaBRI - Université De Bordeaux |
|
16:35-16:40, Paper ThET5.1 | |
Fine-Tuning Hard-To-Simulate Objectives for Quadruped Locomotion: A Case Study on Total Power Saving |
|
Nai, Ruiqian | Tsinghua University |
You, Jiacheng | Tsinghua University |
Cao, Liu | Tsinghua University |
Cui, Hanchen | University of Minnesota Twin Cities |
Zhang, Shiyuan | Tsinghua University |
Xu, Huazhe | Tsinghua University |
Gao, Yang | Tsinghua University |
Keywords: Reinforcement Learning, Legged Robots
Abstract: Legged locomotion is not just about mobility; it also encompasses crucial objectives such as energy efficiency, safety, and user experience, which are vital for real-world applications. However, key factors such as battery power consumption and stepping noise are often inaccurately modeled or missing in common simulators, leaving these aspects poorly optimized or unaddressed by current sim-to-real methods. Hand-designed proxies, such as mechanical power and foot contact forces, have been used to address these challenges but are often problem-specific and inaccurate. In this paper, we propose a data-driven framework for fine-tuning locomotion policies, targeting these hard-to-simulate objectives. Our framework leverages real-world data to model these objectives and incorporates the learned model into simulation for policy improvement. We demonstrate the effectiveness of our framework on power saving for quadruped locomotion, achieving a significant 24-28% net reduction in total power consumption from the battery pack at various speeds. In essence, our approach offers a versatile solution for optimizing hard-to-simulate objectives in quadruped locomotion, providing an easy-to-adapt paradigm for continual improving with real-world knowledge.
|
|
16:40-16:45, Paper ThET5.2 | |
Think on Your Feet: Seamless and Command-Adaptive Transition between Human-Like Locomotions |
|
Huang, Huaxing | Noetix Robotics |
Cui, Wenhao | Noetix |
Zhang, Tonghe | Tsinghua University |
Li, Shengtao | Noetix |
Han, Jinchao | Noetix |
Qin, Bangyu | Noetix Robotics |
Zheng, Liang | Noetix |
Tang, Ziyang | Noetix Robotics |
Zhang, Tianchu | Noetix Robotics |
Hu, Chenxu | Tsinghua University |
Zhang, Shipu | Noetix Robotics |
Jiang, Zheyuan | NOETIX Robotics |
Keywords: Reinforcement Learning, Imitation Learning, Humanoid and Bipedal Locomotion
Abstract: While it is relatively easy to train humanoid robots to mimic specific locomotion skills, it is more challenging to learn from various motions and adhere to continuously changing commands. These robots must accurately track motion instructions, seamlessly transition between a variety of movements, and master intermediate motions not present in their reference data. In this work, we propose a novel approach that integrates human-like motion transfer with precise velocity tracking by a series of improvements to classical imitation learning. To enhance generalization, we employ the Wasserstein divergence criterion (WGAN-div). Furthermore, a Hybrid Internal Model provides structured estimates of hidden states and velocity to enhance mobile stability and environment adaptability, while a curiosity bonus fosters exploration. Our comprehensive method promises highly human-like locomotion that adapts to varying velocity requirements, direct generalization to unseen motions and multitasking, as well as zero-shot transfer to the simulator and the real world across different terrains. These advancements are validated through simulations across various robot models and extensive real-world experiments.
|
|
16:45-16:50, Paper ThET5.3 | |
RINA: Rapid Introspective Neural Adaptation for Out-Of-Distribution Payload Configurations on Quadruped Robots |
|
Youngquist, Oscar | University of Massachusetts Amherst |
Zhang, Hao | University of Massachusetts Amherst |
Keywords: Machine Learning for Robot Control, Legged Robots, Deep Learning Methods
Abstract: Adaptive locomotion is a fundamental capability for quadruped robots, particularly in real-world scenarios when they must transport novel or out-of-distribution (O.O.D.) payloads across diverse terrains. Previous learning-based methods often tightly couple a locomotion controller's learned parameters with the adaptation process, which requires extensive pre-training or slow online updates when encountering O.O.D. payloads. To enable adaptation of quadruped locomotion to O.O.D. payloads, we propose the novel Rapid Introspective Neural Adaptation (RINA) method that rapidly compensates for differences between expected and actual joint torques caused by O.O.D. payloads. RINA introduces an adaptive residual dynamics representation that decouples the learning model's parameters from those used for adaptation. A new neural operator network is introduced to learn a set of basis functions as the learning model, which are combined using linear coefficients to predict residual dynamics. Then, these residual dynamics are used to adjust the locomotion controller's output, compensating for additional torques induced by the O.O.D. payload. During execution, the mixing coefficients can be rapidly and introspectively adapted on-the-go to generate joint torque compensations for O.O.D. payloads, while keeping the learned basis functions unchanged. Experimental results have demonstrated that our RINA approach well addresses on-the-go O.O.D. payload adaptation on varied natural terrains without collecting and retraining on additional data and outperforms baseline methods.
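The decoupling of learned basis functions from the adapted mixing coefficients can be illustrated with a recursive-least-squares update over a fixed feature vector; the class below is a generic sketch under that assumption, not the RINA implementation.

import numpy as np

class ResidualMixer:
    def __init__(self, num_basis, dim, lam=0.99):
        self.W = np.zeros((num_basis, dim))   # mixing coefficients (adapted online)
        self.P = np.eye(num_basis) * 1e3      # RLS covariance
        self.lam = lam                        # forgetting factor

    def predict(self, basis_out):             # basis_out: (num_basis,) fixed-basis features
        return basis_out @ self.W             # predicted residual torques, shape (dim,)

    def adapt(self, basis_out, torque_error):
        k = self.P @ basis_out / (self.lam + basis_out @ self.P @ basis_out)
        self.W += np.outer(k, torque_error - self.predict(basis_out))
        self.P = (self.P - np.outer(k, basis_out @ self.P)) / self.lam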
|
|
16:50-16:55, Paper ThET5.4 | |
Masked Sensory-Temporal Attention for Sensor Generalization in Quadruped Locomotion |
|
Liu, Dikai | NVIDIA |
Zhang, Tianwei | Nanyang Technological University |
Yin, Jianxiong | NVIDIA |
See, Simon | NVIDIA |
Keywords: Legged Robots
Abstract: With the rising focus on quadrupeds, a generalized policy capable of handling different robot models and sensor inputs becomes highly beneficial. Although several methods have been proposed to address different morphologies, it remains a challenge for learning-based policies to manage various combinations of proprioceptive information. This paper presents Masked Sensory-Temporal Attention (MSTA), a novel transformer-based mechanism with masking for quadruped locomotion. It employs direct sensor-level attention to enhance the sensory-temporal understanding and handle different combinations of sensor data, serving as a foundation for incorporating unseen information. MSTA can effectively understand its states even with a large portion of missing information, and is flexible enough to be deployed on physical systems despite the long input sequence.
|
|
16:55-17:00, Paper ThET5.5 | |
Robust Robot Walker: Learning Agile Locomotion Over Tiny Traps |
|
Zhu, Shaoting | Tsinghua University |
Huang, Runhan | Tsinghua University |
Mou, Linzhan | University of Pennsylvania |
Zhao, Hang | Tsinghua University |
Keywords: Legged Robots, Reinforcement Learning, AI-Based Methods
Abstract: Quadruped robots must exhibit robust walking capabilities in practical applications. In this work, we propose a novel approach that enables quadruped robots to pass various small obstacles, or "tiny traps". Existing methods often rely on exteroceptive sensors, which can be unreliable for detecting such tiny traps. To overcome this limitation, our approach focuses solely on proprioceptive inputs. We introduce a two-stage training framework incorporating a contact encoder and a classification head to learn implicit representations of different traps. Additionally, we design a set of tailored reward functions to improve both the stability of training and the ease of deployment for goal-tracking tasks. To benefit further research, we design a new benchmark for the tiny trap task. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness and robustness of our method. The appendix can be found on the project page: https://robust-robot-walker.github.io/.
|
|
17:00-17:05, Paper ThET5.6 | |
FRASA: An End-To-End Reinforcement Learning Agent for Fall Recovery and Stand up of Humanoid Robots |
|
Gaspard, Clément | LaBRI - University of Bordeaux |
Duclusaud, Marc | LaBRI - University of Bordeaux |
Passault, Grégoire | LaBRI |
Daniel, Mélodie | LaBRI - Université De Bordeaux |
Ly, Olivier | LaBRI - Bordeaux University |
Keywords: Reinforcement Learning, Humanoid Robot Systems, Body Balancing
Abstract: Humanoid robotics faces significant challenges in achieving stable locomotion and recovering from falls in dynamic environments. Traditional methods, such as Model Predictive Control (MPC) and Key Frame Based (KFB) routines, either require extensive fine-tuning or lack real-time adaptability. This paper introduces FRASA, a Deep Reinforcement Learning (DRL) agent that integrates fall recovery and stand up strategies into a unified framework. Leveraging the Cross-Q algorithm, FRASA significantly reduces training time and offers a versatile recovery strategy that adapts to unpredictable disturbances. Comparative tests on Sigmaban humanoid robots demonstrate FRASA's superior performance against the KFB method deployed by the Rhoban Team, world champions of the KidSize League, at RoboCup 2023.
|
|
17:05-17:10, Paper ThET5.7 | |
DreamFLEX: Learning Fault-Aware Quadrupedal Locomotion Controller for Anomaly Situation in Rough Terrains |
|
Lee, Seunghyun | KAIST (Korea Advanced Institute of Science and Technology) |
Nahrendra, I Made Aswin | KAIST |
Lee, Dongkyu | KAIST |
Yu, Byeongho | KAIST |
Oh, Minho | KAIST |
Lee, Hyeonwoo | KAIST (Korea Advanced Institute of Science and Technology) |
Myung, Hyun | KAIST (Korea Advanced Institute of Science and Technology) |
Keywords: Legged Robots, Reinforcement Learning, Robust/Adaptive Control
Abstract: Recent advances in quadrupedal robots have demonstrated impressive agility and the ability to traverse diverse terrains. However, hardware issues, such as motor overheating or joint locking, may occur during long-distance walking or traversing rough terrains and lead to locomotion failures. Although several studies have proposed fault-tolerant control methods for quadrupedal robots, there are still challenges in traversing unstructured terrains. In this paper, we propose DreamFLEX, a robust fault-tolerant locomotion controller that enables a quadrupedal robot to traverse complex environments even under joint failure conditions. DreamFLEX integrates an explicit failure estimation and modulation network that jointly estimates the robot's joint fault vector and utilizes this information to adapt the locomotion pattern to faulty conditions in real-time, enabling quadrupedal robots to maintain stability and performance in rough terrains. Experimental results demonstrate that DreamFLEX outperforms existing methods in both simulation and real-world scenarios, effectively managing hardware failures while maintaining robust locomotion performance.
|
|
17:10-17:15, Paper ThET5.8 | |
Curriculum-Based Reinforcement Learning for Quadrupedal Jumping: A Reference-Free Design |
|
Atanassov, Vassil | University of Oxford |
Ding, Jiatao | Delft University of Technology |
Kober, Jens | TU Delft |
Havoutis, Ioannis | University of Oxford |
Della Santina, Cosimo | TU Delft |
Keywords: Legged Robots, Reinforcement Learning, Machine Learning for Robot Control
Abstract: Deep reinforcement learning (DRL) has emerged as a promising solution to mastering explosive and versatile quadrupedal jumping skills. However, current DRL-based frameworks usually rely on pre-existing reference trajectories obtained by capturing animal motions or transferring experience from existing controllers. This work aims to prove that learning dynamic jumping is possible without relying on imitating a reference trajectory by leveraging a curriculum design. Starting from a vertical in-place jump, we generalize the learned policy to forward and diagonal jumps and, finally, we learn to jump across obstacles. Conditioned on the desired landing location, orientation, and obstacle dimensions, the proposed approach yields a wide range of omnidirectional jumping motions in real-world experiments. In particular, we achieve a 90 cm forward jump, exceeding all previous records reported for similar robots in the existing literature. Additionally, the robot can reliably execute continuous jumping on soft grassy grounds, which is especially remarkable as such conditions were not included in the training stage.
|
|
ThET6 |
307 |
Perception for Manipulation 4 |
Regular Session |
Chair: Liu, Katherine | Toyota Research Institute |
Co-Chair: Gaidon, Adrien | Toyota Research Institute |
|
16:35-16:40, Paper ThET6.1 | |
OmniShape: Zero-Shot Multi-Hypothesis Shape and Pose Estimation in the Real World |
|
Liu, Katherine | Toyota Research Institute |
Zakharov, Sergey | Toyota Research Institute |
Chen, Dian | Toyota Research Institute |
Ikeda, Takuya | Woven by Toyota, Inc |
Shakhnarovich, Gregory | Toyota Technological Institute at Chicago |
Gaidon, Adrien | Toyota Research Institute |
Ambrus, Rares | Toyota Research Institute |
Keywords: Deep Learning for Visual Perception, Perception for Grasping and Manipulation
Abstract: We would like to estimate the pose and full shape of an object from a single observation, without assuming a known 3D model or category. In this work, we propose OmniShape, the first method of its kind to enable probabilistic pose and shape estimation. OmniShape is based on the key insight that shape completion can be decoupled into two multi-modal distributions: one capturing how measurements project into a normalized object reference frame defined by the dataset and the other modelling a prior over object geometries represented as triplanar neural fields. By training separate conditional diffusion models for these two distributions, we enable sampling multiple hypotheses from the joint pose and shape distribution. OmniShape demonstrates compelling performance on challenging real-world datasets.
|
|
16:40-16:45, Paper ThET6.2 | |
Self-Supervised Learning of Reconstructing Deformable Linear Objects under Single-Frame Occluded View |
|
Wang, Song | Tsinghua University |
Shen, Guanghui | Tsinghua University |
Wu, Shirui | Tsinghua University |
Wu, Dan | Tsinghua University |
Keywords: Perception for Grasping and Manipulation, RGB-D Perception, Deep Learning for Visual Perception
Abstract: Deformable linear objects (DLOs), such as ropes, cables, and rods, are common in various scenarios, and accurate occlusion reconstruction of them is crucial for effective robotic manipulation. Previous studies on DLO reconstruction either rely on supervised learning, which is limited by the availability of labeled real-world data, or geometric approaches, which fail to capture global features and often struggle with occlusions and complex shapes. This paper presents a novel DLO occlusion reconstruction framework that integrates self-supervised point cloud completion with traditional techniques like clustering, sorting, and fitting to generate ordered key points. A memory module is proposed to enhance the self-supervised training process by consolidating prototype information, while DLO shape constraints are utilized to improve reconstruction accuracy. Experimental results on both synthetic and real-world datasets demonstrate that our method outperforms state-of-the-art algorithms, particularly in scenarios involving complex occlusions and intricate self-intersections.
|
|
16:45-16:50, Paper ThET6.3 | |
PseudoTouch: Efficiently Imaging the Surface Feel of Objects for Robotic Manipulation |
|
Röfer, Adrian | University of Freiburg |
Heppert, Nick | University of Freiburg |
Ayad, Abdallah | University of Freiburg |
Chisari, Eugenio | University of Freiburg |
Valada, Abhinav | University of Freiburg |
Keywords: Perception for Grasping and Manipulation, Force and Tactile Sensing, Representation Learning
Abstract: Tactile sensing is vital for human dexterous manipulation; however, it has not been widely used in robotics. Compact, low-cost sensing platforms can facilitate a change, but unlike their popular optical counterparts, they are difficult to deploy in high-fidelity tasks due to their low signal dimensionality and lack of a simulation model. To overcome these challenges, we introduce PseudoTouch, which links high-dimensional structural information to low-dimensional sensor signals. It does so by learning a low-dimensional visual-tactile embedding, wherein we encode a depth patch from which we decode the tactile signal. We collect and train PseudoTouch on a dataset comprising aligned tactile and visual data pairs obtained through random touching of eight basic geometric shapes. We demonstrate the utility of our trained PseudoTouch model in two downstream tasks: object recognition and grasp stability prediction. In the object recognition task, we evaluate the learned embedding's performance on a set of five basic geometric shapes and five household objects. Using PseudoTouch, we achieve an object recognition accuracy of 84% after just ten touches, surpassing a proprioception baseline. For the grasp stability task, we use ACRONYM labels to train and evaluate a grasp success predictor using PseudoTouch's predictions derived from virtual depth information. Our approach yields a 32% absolute improvement in accuracy compared to the baseline relying on partial point cloud data. We make the data, code, and trained models publicly available at https://pseudotouch.cs.uni-freiburg.de.
|
|
16:50-16:55, Paper ThET6.4 | |
Segment Any Repeated Object |
|
Liu, Yushi | University Tübingen |
Graf, Christian | Robert Bosch GmbH |
Spies, Markus | Bosch Center for Artificial Intelligence |
Keuper, Margret | University of Mannheim |
Keywords: Perception for Grasping and Manipulation, Object Detection, Segmentation and Categorization, Semantic Scene Understanding
Abstract: Understanding a scene in terms of objects and their properties is fundamental for various vision-based robotic applications, including item picking. To effectively clear a bin, a robot must comprehend objects as graspable entities, often without prior access to models of the target object. This study focuses on open world object segmentation with the additional requirement of assigning identical class labels for repeated instances of the same object. This capability enables item picking tasks with homogeneous bins, filtering out packaging material, and sorting tasks. We propose a novel pipeline for detecting repeated instances of identical objects, building on recent advancements in vision foundation models and exploring approaches for estimating object similarities based on feature embeddings or keypoint correspondence matching. Through a comprehensive experimental evaluation, we establish a new state-of-the-art on ARMBench repeated objects segmentation, a particularly challenging open problem in bin-picking robotics. Additionally, we demonstrate the real-world application of our method integrated into a robot picking cell to showcase its relevance to industrial use cases.
|
|
16:55-17:00, Paper ThET6.5 | |
ViTa-Zero: Zero-Shot Visuotactile Object 6D Pose Estimation |
|
Li, Hongyu | Brown University |
Akl, James | Amazon |
Sridhar, Srinath | Brown University |
Brady, Tye | Amazon |
Padir, Taskin | Northeastern University |
Keywords: Perception for Grasping and Manipulation, Force and Tactile Sensing, Sensor Fusion
Abstract: Object 6D pose estimation is a critical challenge in robotics, particularly for manipulation tasks. While prior research combining visual and tactile (visuotactile) information has shown promise, these approaches often struggle with generalization due to the limited availability of visuotactile data. In this paper, we introduce ViTa-Zero, a zero-shot visuotactile pose estimation framework. Our key innovation lies in leveraging a visual model as its backbone and performing feasibility checking and test-time optimization based on physical constraints derived from tactile and proprioceptive observations. Specifically, we model the gripper-object interaction as a spring–mass system, where tactile sensors induce attractive forces, and proprioception generates repulsive forces. We validate our framework through experiments on a real-world robot setup, demonstrating its effectiveness across representative visual backbones and manipulation scenarios, including grasping, object picking, and bimanual handover. Compared to the visual models, our approach overcomes some drastic failure modes while tracking the in-hand object pose. In our experiments, our approach shows an average increase of 55% in AUC of ADD-S and 60% in ADD, along with an 80% lower position error compared to FoundationPose.
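As a toy picture of the spring-mass reasoning above (attractive forces from tactile contacts, repulsive forces from proprioception), the sketch below refines a candidate object position under both force types. The spherical object proxy, gains, and names are assumptions made for illustration; this is not the paper's implementation.

import numpy as np

def attractive_force(obj_center, contact_points, k_att=5.0):
    # springs pulling the object hypothesis toward each tactile contact point
    return k_att * (contact_points - obj_center).sum(axis=0)

def repulsive_force(obj_center, finger_points, obj_radius, k_rep=20.0):
    # fingertip positions (from proprioception) push the hypothesis out of penetration
    f = np.zeros(3)
    for p in finger_points:
        d = obj_center - p
        dist = np.linalg.norm(d)
        if 1e-9 < dist < obj_radius:
            f += k_rep * (obj_radius - dist) * d / dist
    return f

def refine(obj_center, contact_points, finger_points, obj_radius, steps=50, dt=0.01):
    x = obj_center.copy()
    for _ in range(steps):
        total = attractive_force(x, contact_points) + repulsive_force(x, finger_points, obj_radius)
        x = x + dt * total              # overdamped update of the pose hypothesis
    return x

# usage with made-up geometry: two contacts and two fingertip points around a 3 cm object
contacts = np.array([[0.02, 0.0, 0.0], [-0.02, 0.0, 0.0]])
fingers = np.array([[0.025, 0.0, 0.0], [-0.025, 0.0, 0.0]])
print(refine(np.array([0.0, 0.01, 0.0]), contacts, fingers, obj_radius=0.03))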
|
|
17:00-17:05, Paper ThET6.6 | |
DoorBot: Closed-Loop Task Planning and Manipulation for Door Opening in the Wild with Haptic Feedback |
|
Wang, Zhi | UIUC |
Mo, Yuchen | University of Illinois, Urbana-Champaign |
Jin, Shengmiao | University of Illinois Urbana-Champaign |
Yuan, Wenzhen | University of Illinois |
Keywords: Force and Tactile Sensing, Mobile Manipulation, Perception for Grasping and Manipulation
Abstract: Robots operating in unstructured environments face significant challenges when interacting with everyday objects like doors. They particularly struggle to generalize across diverse door types and conditions. Existing vision-based and open-loop planning methods often lack the robustness to handle varying door designs, mechanisms, and push/pull configurations. In this work, we propose a haptic-aware closed-loop hierarchical control framework that enables robots to explore and open different unseen doors in the wild. Our approach leverages real-time haptic feedback, allowing the robot to adjust its strategy dynamically based on force feedback during manipulation. We test our system on 20 unseen doors across different buildings, featuring diverse appearances and mechanical types. Our framework achieves a 90% success rate, demonstrating its ability to generalize and robustly handle varied door-opening tasks. This scalable solution offers potential applications in broader open-world articulated object manipulation tasks.
|
|
17:05-17:10, Paper ThET6.7 | |
SEDMamba: Enhancing Selective State Space Modelling with Bottleneck Mechanism and Fine-To-Coarse Temporal Fusion for Efficient Error Detection in Robot-Assisted Surgery |
|
Xu, Jialang | University College London |
Sirajudeen, Nazir | University College London |
Boal, Matthew | The Griffin Institute |
Francis, Nader | The Griffin Institute |
Stoyanov, Danail | University College London |
Mazomenos, Evangelos | UCL |
Keywords: Computer Vision for Medical Robotics, Surgical Robotics: Laparoscopy, Visual Learning
Abstract: Automated detection of surgical errors can improve robotic-assisted surgery. Despite promising progress, existing methods still face challenges in capturing rich temporal context to establish long-term dependencies while maintaining computational efficiency. In this paper, we propose a novel hierarchical model named SEDMamba, which incorporates the selective state space model (SSM) into surgical error detection, facilitating efficient long sequence modelling with linear complexity. SEDMamba enhances selective SSM with a bottleneck mechanism and fine-to-coarse temporal fusion (FCTF) to detect and temporally localize surgical errors in long videos. The bottleneck mechanism compresses and restores features within their spatial dimension, thereby reducing computational complexity. FCTF utilizes multiple dilated 1D convolutional layers to merge temporal information across diverse scale ranges, accommodating errors of varying duration. Our work also contributes the first-of-its-kind, frame-level, in-vivo surgical error dataset to support error detection in real surgical cases. Specifically, we deploy the clinically validated observational clinical human reliability assessment tool (OCHRA) to annotate the errors during suturing tasks in an open-source radical prostatectomy dataset (SAR-RARP50). Experimental results demonstrate that our SEDMamba outperforms state-of-the-art methods with at least 1.82% AUC and 3.80% AP performance gains with significantly reduced computational complexity. The corresponding error annotations, code and models will be released at https://github.com/wzjialang/SEDMamba.
|
|
ThET7 |
309 |
Deep Learning Applications |
Regular Session |
Chair: Ostyn, Frederik | Ghent University |
Co-Chair: Attali, Amnon | University of Illinois at Urbana-Champaign |
|
16:35-16:40, Paper ThET7.1 | |
Automated Generation of Transformations to Mitigate Sensor Hardware Migration in ADS |
|
Von Stein, Meriel | University of Virginia |
Elbaum, Sebastian | University of Virginia |
Wang, Hongning | University of Virginia |
Keywords: Sensor-based Control, Deep Learning Methods, Autonomous Vehicle Navigation
Abstract: Autonomous driving systems (ADSs) rely on massive amounts of sensed data to train their underlying machine-learned components. Common sensor hardware migrations can render an existing machine-learned pipeline inadequate. This necessitates the development of bespoke transformations to adapt new sensor data to the old learned model, or the retraining of a new model with new sensor data. These solutions are expensive, often performed reactively to sensor hardware migration, and rely only on empirical reconstruction and validation metrics, which lack knowledge of the features important to the learned model. To address these challenges, we propose PreFixer, a technique that can systematically generate transformations for many types of sensor hardware migration during the ADS development lifecycle. PreFixer collects small datasets using colocated new and old sensors, and then uses that data and the output of the learned model to train an augmented encoder to learn a transformation that maps new sensor data to old sensor data. The trained encoder can then be deployed as a preprocessor to the old learned model. Our study shows that, for a common set of camera sensor hardware migrations, PreFixer can match or improve the performance of the best-performing baseline technique in terms of distance travelled safely, using only 10% of the training dataset and taking at most half of the training time.
|
|
16:40-16:45, Paper ThET7.2 | |
Probabilistic Latent Variable Modeling for Dynamic Friction Identification and Estimation |
|
Vantilborgh, Victor | Ghent University |
De Witte, Sander | Ghent University |
Ostyn, Frederik | Ghent University |
Lefebvre, Tom | Ghent University |
Crevecoeur, Guillaume | Ghent University |
Keywords: Industrial Robots, Deep Learning Methods, Probabilistic Inference
Abstract: Precise identification of dynamic models in robotics is essential to support dynamic simulations, control design, friction compensation, output torque estimation, etc. A longstanding challenge remains in the development and identification of friction models for robotic joints, given the numerous physical phenomena affecting the underlying friction dynamics, which result in nonlinear characteristics and hysteresis behaviour in particular. These phenomena prove difficult to model and capture accurately using physical analogies alone. This has motivated researchers to shift from physics-based to data-driven models. Currently, these methods are still limited in their ability to generalize effectively to typical industrial robot deployment, characterized by high- and low-velocity operations and frequent direction reversals. Empirical observations motivate the use of dynamic friction models, but these remain particularly challenging to establish. To address the current limitations, we propose to account for unidentified dynamics in the robot joints using latent dynamic states. The friction model may then utilize both the dynamic robot state and additional information encoded in the latent state to evaluate the friction torque. We cast this stochastic and partially unsupervised identification problem as a standard probabilistic representation learning problem. In this work, both the friction model and latent state dynamics are parametrized as neural networks and are integrated into the conventional lumped-parameter dynamic robot model. The complete dynamics model is directly learned from the noisy encoder measurements in the robot joints. We use the Expectation-Maximisation (EM) algorithm to find a Maximum Likelihood Estimate (MLE) of the model parameters. The effectiveness of the proposed method is validated in terms of open-loop prediction accuracy in comparison with baseline methods, using the Kuka KR6 R700 as a test platform.
|
|
16:45-16:50, Paper ThET7.3 | |
Learning Three-Dimensional Bin Packing with Adjustable-Order Semi-Online Setting |
|
Yin, Hao | Southwest Jiaotong University |
Zhang, Chenxi | Southwest Jiaotong University |
Chen, Fan | Southwest Jiaotong University |
He, Hongjie | Southwest Jiaotong University |
Keywords: Reinforcement Learning, Deep Learning Methods, Industrial Robots
Abstract: The online setting brings greater flexibility and practicality to the three-dimensional bin packing problem (3D-BPP) but at the cost of algorithm performance. Existing methods mitigate the performance impact by introducing semi-online settings with look-ahead or buffer zones. However, these methods either fail to fundamentally alter the packing order or reduce packing efficiency. This paper proposes a novel semi-online setting that allows for the observation of multiple items and the selection of one for packing, thereby adjusting the packing order without reducing packing efficiency. We solve the semi-online packing problem via reinforcement learning, which faces two real-world challenges: (1) a variable and difficult-to-predict number of observed items, and (2) the obstruction of robotic arm movement by already packed items. On the one hand, we design a policy network capable of adapting to variable item quantities. On the other hand, we introduce a guided bottom-up packing reward function to free up space for robotic arm motion. We show that our method outperforms the baselines in terms of space utilization when at least two items are observed. Further experiments demonstrate the functionality of our reward function, which can guide a virtual robot to complete packing tasks.
|
|
16:50-16:55, Paper ThET7.4 | |
Multiple Rotation Averaging with Constrained Reweighting Deep Matrix Factorization |
|
Li, Shiqi | Xi'an Jiaotong University |
Zhu, Jihua | Xi'an Jiaotong University |
Xie, Yifan | Xi'an Jiaotong University |
Hu, Naiwen | Xi'an Jiaotong University |
Zhu, Mingchen | University of California, Davis |
Li, Zhongyu | Xi'an Jiaotong University |
Wang, Di | Xi'an Jiaotong University |
Lu, Huimin | Southeast University |
Keywords: SLAM, Deep Learning for Visual Perception
Abstract: Multiple rotation averaging plays a crucial role in computer vision and robotics domains. The conventional optimization-based methods optimize a nonlinear cost function based on certain noise assumptions, while most previous learning-based methods require ground truth labels in the supervised training process. Recognizing that the handcrafted noise assumption may not be reasonable in all real-world scenarios, this paper proposes an effective rotation averaging method for mining data patterns in a learning manner while avoiding the requirement of labels. Specifically, we apply deep matrix factorization to directly solve the multiple rotation averaging problem in free linear space. For deep matrix factorization, we design a neural network model, which is explicitly low-rank and symmetric to better suit the background of multiple rotation averaging. Meanwhile, we utilize a spanning tree-based edge filtering to suppress the influence of rotation outliers. Moreover, we adopt a reweighting scheme and a dynamic depth selection strategy to further improve the robustness. Our method synthesizes the merits of both optimization-based and learning-based methods. Experimental results on various datasets validate the effectiveness of our proposed method.
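The "explicitly low-rank and symmetric" design mentioned above reflects a classical structure of rotation averaging: stacking relative rotations G_ij = R_i R_j^T yields a symmetric positive semidefinite matrix of rank 3 whose leading eigenvectors recover the absolute rotations up to one shared global rotation. The sketch below demonstrates that structure with a classical spectral factorization on noise-free synthetic data; it is not the authors' deep matrix factorization, edge filtering, or reweighting pipeline.

import numpy as np

def random_rotation(rng):
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q

def project_so3(m):
    # nearest rotation matrix (cleans up numerical noise)
    u, _, vt = np.linalg.svd(m)
    if np.linalg.det(u @ vt) < 0:
        u[:, -1] *= -1
    return u @ vt

rng = np.random.default_rng(0)
n = 6
R_true = [random_rotation(rng) for _ in range(n)]

# symmetric block matrix of relative rotations G_ij = R_i R_j^T (rank 3, PSD)
G = np.zeros((3 * n, 3 * n))
for i in range(n):
    for j in range(n):
        G[3 * i:3 * i + 3, 3 * j:3 * j + 3] = R_true[i] @ R_true[j].T

w, v = np.linalg.eigh(G)                 # eigenvalues in ascending order
F = v[:, -3:] * np.sqrt(w[-3:])          # rank-3 factor with F @ F.T == G
if np.linalg.det(F[:3, :]) < 0:          # fix the reflection ambiguity of the factor
    F[:, 0] *= -1
R_est = [project_so3(F[3 * i:3 * i + 3, :]) for i in range(n)]

# all estimates match the ground truth up to one shared global rotation Q
Q = R_true[0].T @ R_est[0]
print(max(np.linalg.norm(R_true[i] @ Q - R_est[i]) for i in range(n)))   # close to zero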
|
|
16:55-17:00, Paper ThET7.5 | |
Magnetometer-Calibrated Hybrid Transformer for Robust Inertial Tracking in Robotics |
|
Zheng, Xinzhe | The University of Hong Kong |
Ji, Sijie | California Institute of Technology |
Pan, Yipeng | The University of Hong Kong |
Zhang, Kaiwen | The University of Hong Kong |
Pan, Jia | University of Hong Kong |
Wu, Chenshu | The University of Hong Kong |
Keywords: Localization, Deep Learning Methods
Abstract: Inertial tracking is vital for autonomous robots and has gained popularity with the ubiquity of low-cost Inertial Measurement Units (IMUs) and deep learning-powered tracking algorithms. Existing works, however, have not fully utilized IMU measurements, particularly magnetometers, nor maximized the potential of deep learning to achieve the desired accuracy. To bridge the gap, we introduce NeurIT, which employs a Time-Frequency Block-recurrent Transformer (TF-BRT) at its core, combining RNN and Transformer to learn representative time-frequency features. To fully utilize IMU information, we strategically employ differentiation of body-frame magnetometer measurements for orientation calibration in a sensor-fusion manner. Experiments conducted in diverse environments show that NeurIT maintains a mere 1-meter tracking error over a 300-meter distance, surpassing state-of-the-art baselines by 48.21% on unseen data. NeurIT also performs comparably to the visual-inertial approach (Tango Phone) in vision-favored conditions and surpasses it in plain environments. We share the code and data to promote further research: https://github.com/aiot-lab/NeurIT.
|
|
17:00-17:05, Paper ThET7.6 | |
MotionGlot: A Multi-Embodied Motion Generation Model |
|
Harithas, Sudarshan S | Brown University |
Sridhar, Srinath | Brown University |
Keywords: AI-Enabled Robotics, AI-Based Methods, Representation Learning
Abstract: This paper introduces MotionGlot, a model that can generate motion across multiple embodiments with different action dimensions, such as quadruped robots and human bodies. By leveraging the well-established training procedures commonly used in large language models (LLMs), we introduce an instruction-tuning template specifically designed for motion related tasks. Our approach demonstrates that the principles underlying LLM training can be successfully adapted to learn a wide range of motion generation tasks across multiple embodiments with different action dimensions. We demonstrate the various abilities of MotionGlot on a set of 6 tasks and report an average improvement of 35.3% across tasks. Additionally, we contribute two new datasets: (1) a dataset of expert controlled quadruped locomotion with approximately 48,000 trajectories paired with direction-based text annotations, and (2) a dataset of over 23,000 situational text prompts for human motion generation tasks. Finally, we conduct hardware experiments to validate the capabilities of our system in real-world applications.
|
|
17:05-17:10, Paper ThET7.7 | |
Retinex-BEVFormer: Using Retinex to Enhance Multi-View Image-Based BEV Detector in Low Light Scenes |
|
Liu, Xuan | Beihang University |
Xiong, Zhongxia | Beihang University |
Yao, Ziying | Beihang University |
Wu, Xinkai | Beihang University |
Keywords: Intelligent Transportation Systems, Deep Learning for Visual Perception
Abstract: Multi-view image-based BEV (Bird's Eye View) 3D perception is gaining attention as an alternative to high-cost LiDAR systems and has achieved notable success. However, image-based BEV autonomous driving raises a significant safety concern in low-light conditions (such as nighttime), yet research on BEV detectors for these scenes remains limited. In this paper, we attempt to enhance low-light BEV perception with illumination-guided feature fusion. We propose Retinex-BEVFormer, which uses illumination information generated by the Retinex theory to enhance the model's robustness to varying lighting conditions and improve detection performance in low-light scenes. Additionally, to address the illumination estimation discontinuity from multi-view images that can adversely affect detection, we propose the MVB-Retinex module, which balances illumination estimation by leveraging overlapping regions between adjacent images. Notably, our proposed method is a plug-and-play module that can be applied to any image-based BEV detector and does not require any additional ground truth supervision. We conduct extensive experiments on the Nuscenes dataset, validating our algorithm in nighttime and daytime scenes. Compared to the baseline, our algorithm achieves a 2.9% increase in mAP on the validation set with minimal computational cost, especially showing a 3.6% improvement in nighttime scenes. The experiments demonstrate that our Retinex-BEVFormer effectively improves detection performance in low-light conditions and enhances performance under normal illumination, indicating increased robustness of the BEV detector.
|
|
ThET8 |
311 |
Collision Avoidance 2 |
Regular Session |
Chair: Figueredo, Luis | University of Nottingham (UoN) |
Co-Chair: Bylard, Andrew | Stanford University |
|
16:35-16:40, Paper ThET8.1 | |
Reactive Collision Avoidance for Safe Agile Navigation |
|
Saviolo, Alessandro | New York University |
Picello, Niko | University of Padova |
Mao, Jeffrey | New York University |
Verma, Rishabh | New York University |
Loianno, Giuseppe | New York University |
Keywords: Aerial Systems: Perception and Autonomy, Aerial Systems: Mechanics and Control, Aerial Systems: Applications
Abstract: Reactive collision avoidance is essential for agile robots navigating complex and dynamic environments, enabling real-time obstacle response. However, this task is inherently challenging because it requires a tight integration of perception, planning, and control, which traditional methods often handle separately, resulting in compounded errors and delays. This paper introduces a novel approach that unifies these tasks into a single reactive framework using solely onboard sensing and computing. Our method combines nonlinear model predictive control with adaptive control barrier functions, directly linking perception-driven constraints to real-time planning and control. Constraints are determined by using a neural network to refine noisy RGB-D data, enhancing depth accuracy, and selecting points with the minimum time-to-collision to prioritize the most immediate threats. To maintain a balance between safety and agility, a heuristic dynamically adjusts the optimization process, preventing overconstraints in real time. Extensive experiments with an agile quadrotor demonstrate effective collision avoidance across diverse indoor and outdoor environments, without requiring environment-specific tuning or explicit mapping.
|
|
16:40-16:45, Paper ThET8.2 | |
Hardware-Accelerated Ray Tracing for Discrete and Continuous Collision Detection on GPUs |
|
Sui, Sizhe | University of Texas, Austin |
Sentis, Luis | The University of Texas at Austin |
Bylard, Andrew | Stanford University |
Keywords: Collision Avoidance, Computational Geometry, Motion and Path Planning
Abstract: This paper presents a set of simple and intuitive robot collision detection algorithms that show substantial scaling improvements for high geometric complexity and large numbers of collision queries by leveraging hardware-accelerated ray tracing on GPUs. It is the first to leverage hardware-accelerated ray tracing for direct volume mesh-to-mesh discrete collision detection and to apply it to continuous collision detection. We introduce two methods: Ray-Traced Discrete-Pose Collision Detection for exact robot mesh to obstacle mesh collision detection, and Ray-Traced Continuous Collision Detection for robot sphere representation to obstacle mesh swept collision detection, using piecewise-linear or quadratic B-splines. For robot link meshes totaling 24k triangles and obstacle meshes of over 190k triangles, our methods were up to 2.8 times faster in batched discrete-pose queries than a state-of-the-art GPU-based method using a sphere robot representation. For the same obstacle mesh scene, our sphere-robot continuous collision detection was up to 7 times faster depending on trajectory batch size. We also performed detailed measurements of the volume coverage accuracy of various sphere/mesh pose/path representations to provide insight into the tradeoffs between speed and accuracy of different robot collision detection methods.
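The primitive behind both queries above is a ray/triangle intersection test, which the paper offloads to GPU ray-tracing hardware. The CPU sketch below shows the Moller-Trumbore test and uses it for a crossing-parity point-in-mesh check on a toy tetrahedron; it is purely illustrative and not the authors' GPU implementation.

import numpy as np

def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    # Moller-Trumbore ray/triangle intersection; returns hit distance t or None
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = e1 @ p
    if abs(det) < eps:
        return None                      # ray parallel to the triangle plane
    inv = 1.0 / det
    s = origin - v0
    u = (s @ p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = (direction @ q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = (e2 @ q) * inv
    return t if t > eps else None

def point_inside_mesh(point, triangles, direction=np.array([1.0, 0.0, 0.0])):
    # a point is inside a closed mesh iff a ray from it crosses the surface an odd number of times
    hits = sum(1 for (a, b, c) in triangles
               if ray_triangle(point, direction, a, b, c) is not None)
    return hits % 2 == 1

# example: unit tetrahedron with vertices at the origin and the three axis points
V0 = np.zeros(3)
V1, V2, V3 = np.eye(3)
faces = [(V0, V1, V2), (V0, V1, V3), (V0, V2, V3), (V1, V2, V3)]
print(point_inside_mesh(np.array([0.1, 0.1, 0.1]), faces))   # True  (inside)
print(point_inside_mesh(np.array([0.9, 0.9, 0.9]), faces))   # False (outside)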
|
|
16:45-16:50, Paper ThET8.3 | |
Collision Avoidance in Model Predictive Control Using Velocity Damper |
|
Haffemayer, Arthur | LAAS-CNRS |
Jordana, Armand | New York University |
De Matteïs, Ludovic | LAAS-CNRS |
Wojciechowski, Krzysztof | LAAS-CNRS |
Righetti, Ludovic | New York University |
Lamiraux, Florent | CNRS |
Mansard, Nicolas | CNRS |
Keywords: Collision Avoidance, Optimization and Optimal Control
Abstract: We propose an advanced method for controlling the motion of a manipulator robot with strict collision avoidance in dynamic environments, leveraging a velocity damper constraint. Unlike conventional distance-based constraints, which tend to saturate near obstacles to reach optimality, the velocity damper constraint considers both distance and relative velocity, ensuring a safer separation. This constraint is incorporated into a model predictive control framework and enforced as a hard constraint through analytical derivatives supplied to the numerical solver. The approach has been fully implemented on a Franka Emika Panda robot and validated through experimental trials, demonstrating effective collision avoidance during dynamic tasks and robustness to unmodeled disturbances. An efficient open-source implementation, along with examples, is provided at https://gepettoweb.laas.fr/articles/haffemayer2025.html.
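For readers unfamiliar with the constraint, the classical velocity damper of Faverjon and Tournassoud is sketched below; the exact form used inside the paper's MPC may differ. Rather than only requiring the obstacle distance d to stay above a safety margin d_s, it bounds the approach speed once d drops below an influence distance d_i, so the allowed closing velocity shrinks to zero as d approaches d_s.

def velocity_damper_ok(d, d_dot, d_s=0.05, d_i=0.20, xi=1.0):
    # enforce d_dot >= -xi * (d - d_s) / (d_i - d_s) whenever d <= d_i
    if d > d_i:
        return True                    # outside the influence zone: no restriction
    return d_dot >= -xi * (d - d_s) / (d_i - d_s)

# example: at d = 0.10 m the robot may close the gap at most at xi * (0.05 / 0.15) = 0.33 m/s
print(velocity_damper_ok(0.10, -0.30))    # True
print(velocity_damper_ok(0.10, -0.40))    # False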
|
|
16:50-16:55, Paper ThET8.4 | |
On the Synthesis of Reactive Collision-Free Whole-Body Robot Motions: A Complementarity-Based Approach |
|
Yao, Haowen | Technical Univerity of Munich |
Laha, Riddhiman | Technical University of Munich |
Sinha, Anirban | GE Aerospace Research |
Hall, Jonas | Boston University |
Figueredo, Luis | University of Nottingham (UoN) |
Chakraborty, Nilanjan | Stony Brook University |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Optimization and Optimal Control, Whole-Body Motion Planning and Control, Reactive and Sensor-Based Planning
Abstract: This paper is about generating motion plans for high degree-of-freedom systems that account for both static and dynamic collisions along the entire body. A particular class of mathematical programs with complementarity constraints becomes useful in this regard. Optimization-based planners can tackle confined-space trajectory planning while being cognizant of robot and (mostly static) obstacle constraints. However, handling moving obstacles is non-trivial in a real-time setting. To this end, we present the FLIQC (Fast LInear Quadratic Complementarity based) motion planner. Our reactive planner employs a novel motion model that captures the entire rigid robot as well as the obstacle geometry and ensures non-penetration between the surfaces due to the imposed constraint. We perform thorough comparative studies with the state-of-the-art, which demonstrate improved performance. Extensive simulation and hardware experiments validate our claim of generating continuous and real-time motion plans at 1 kHz for modern collaborative robots with constant minimal parameters.
|
|
16:55-17:00, Paper ThET8.5 | |
Rapid Dynamic Obstacle Avoidance for UAVs Enhanced by DVS and Neuromorphic Computing |
|
Wang, Siyang | Xi'an Jiaotong University |
Yu, Sheng | Xi'an Jiaotong University |
Liang, Tingbang | Xi'an Jiaotong University |
Shi, Yilin | Xi’an Jiaotong University |
Ma, Yongqiang | Xi'an Jiaotong University |
Ren, Pengju | Xi'an Jiaotong University |
Keywords: Collision Avoidance, Aerial Systems: Applications, Force Control
Abstract: Achieving rapid and accurate dynamic obstacle avoidance is crucial for enhancing the survivability of unmanned aerial vehicles (UAVs) in hazardous conditions. To accomplish dynamic obstacle avoidance, sensors with high temporal resolution and efficient processing models are required. Dynamic vision sensors (DVS) fulfill the sensing requirements, while spiking neural networks (SNNs) address the processing demands. In this paper, we develop an end-to-end obstacle avoidance algorithm for UAVs using only a single monocular DVS as the sensor and further enhance accuracy and speed through our proposed mechanisms. The algorithm consists of three components: ego-motion compensation, an SNN model for movement analysis, and a force filter inspired by spiking neurons. In movement analysis, we propose the temporal potential pooling (TPP) and incremental event (EI) mechanisms to accelerate our SNN model. The real-flight experiments confirm that our algorithm achieves approximately 90% accuracy with a processing latency as low as 4ms on a GPU, surpassing state-of-the-art methods. Ablation studies show that the proposed method maintains high accuracy in movement detection while significantly reducing computational time. Our method operates in real-time, achieves high accuracy, and is feasible across a wide range of environments. Our code is available at https://github.com/AmperiaWang/oanet_s1 for reproducibility.
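The "force filter inspired by spiking neurons" mentioned above can be pictured with a leaky integrate-and-fire unit per steering bin, as in the toy sketch below. Everything here (bin layout, leak, threshold, 2-D repulsion) is an assumption made for illustration; it is not the paper's network or its TPP/EI mechanisms.

import numpy as np

N_DIR = 8                          # discretized bearing bins around the UAV (assumed)
v = np.zeros(N_DIR)                # membrane potentials
TAU, THRESH = 0.9, 1.0             # leak factor per step and firing threshold (assumed)

def step(evidence):
    # evidence: per-bin obstacle activation extracted from the event stream this cycle
    global v
    v = TAU * v + evidence         # leaky integration suppresses isolated noise events
    fired = v >= THRESH
    v[fired] = 0.0                 # reset units that fired
    bearings = np.linspace(0.0, 2.0 * np.pi, N_DIR, endpoint=False)
    # sum unit repulsion vectors pointing away from every bin that fired
    return -np.array([np.cos(bearings[fired]).sum(), np.sin(bearings[fired]).sum()])

# usage: feed per-bin event counts each control cycle; persistent obstacles accumulate
# potential and trigger repulsive commands, while sporadic noise decays away.
rng = np.random.default_rng(1)
force = np.zeros(2)
for _ in range(10):
    force = step(rng.poisson(0.3, N_DIR).astype(float))
print(force)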
|
|
17:00-17:05, Paper ThET8.6 | |
Efficient Collision Detection Framework for Enhancing Collision-Free Robot Motion |
|
Zhu, Xiankun | Tsinghua University |
Xin, Yucheng | Tsinghua University |
Li, Shoujie | Tsinghua Shenzhen International Graduate School |
Liu, Houde | Shenzhen Graduate School, Tsinghua University |
Xia, Chongkun | Sun Yat-Sen University |
Liang, Bin | Center for Artificial Intelligence and Robotics, Graduate School |
Keywords: Collision Avoidance, Integrated Planning and Learning, Reactive and Sensor-Based Planning
Abstract: Fast and efficient collision detection is essential for motion generation in robotics. In this paper, we propose an efficient collision detection framework based on the Signed Distance Field (SDF) of robots, seamlessly integrated with a self-collision detection module. Firstly, we decompose the robot's SDF using forward kinematics and leverage multiple extremely lightweight networks in parallel to efficiently approximate the SDF. Moreover, we introduce support vector machines to integrate the self-collision detection module into the framework, which we refer to as the SDF-SC framework. Using statistical features, our approach unifies the representation of collision distance for both SDF and self-collision detection. During this process, we maintain and utilize the differentiable properties of the framework to optimize collision-free robot trajectories. Finally, we develop a reactive motion controller based on our framework, enabling real-time avoidance of multiple dynamic obstacles. While maintaining high accuracy, our framework achieves inference speeds up to five times faster than previous methods. Experimental results on the Franka robotic arm demonstrate the effectiveness of our approach.
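To make the idea of a differentiable learned SDF concrete, the sketch below queries a tiny, untrained, randomly initialized MLP standing in for one of the lightweight per-link networks and turns a finite-difference gradient of the minimum clearance into a repulsive direction. Sizes, names, and the finite-difference gradient are assumptions; the paper's networks are trained, and its gradients come from the differentiable framework itself.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = 0.5 * rng.normal(size=(32, 3)), np.zeros(32)
W2, b2 = 0.5 * rng.normal(size=(1, 32)), np.zeros(1)

def link_sdf(p_local):
    # approximate signed distance from a point (in the link frame) to the link surface
    return float(W2 @ np.tanh(W1 @ p_local + b1) + b2)

def clearance(link_center, obstacle_points):
    # minimum predicted distance from any obstacle point to the link
    return min(link_sdf(p - link_center) for p in obstacle_points)

def repulsive_direction(link_center, obstacle_points, eps=1e-3):
    # numerical gradient of the clearance w.r.t. the link position
    g = np.zeros(3)
    for k in range(3):
        d = np.zeros(3); d[k] = eps
        g[k] = (clearance(link_center + d, obstacle_points)
                - clearance(link_center - d, obstacle_points)) / (2 * eps)
    return g                       # moving the link along +g increases predicted clearance

obstacles = rng.normal(size=(50, 3))
print(repulsive_direction(np.zeros(3), obstacles))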
|
|
17:05-17:10, Paper ThET8.7 | |
Differentiable Composite Neural Signed Distance Fields for Robot Navigation in Dynamic Indoor Environments |
|
Bukhari, Syed Talha | Purdue University |
Lawson, Daniel | Purdue University |
Qureshi, Ahmed H. | Purdue University |
Keywords: Vision-Based Navigation, RGB-D Perception
Abstract: Neural Signed Distance Fields (SDFs) provide a differentiable environment representation to readily obtain collision checks and well-defined gradients for robot navigation tasks. However, updating neural SDFs as the scene evolves entails re-training, which is tedious, time-consuming, and inefficient, making it unsuitable for robot navigation with limited field-of-view in dynamic environments. To address this, we propose a compositional framework of neural SDFs to solve robot navigation in indoor environments using only an onboard RGB-D sensor. Our framework embodies a dual-mode procedure for trajectory optimization, with different modes using complementary methods of modeling collision costs and collision avoidance gradients. The primary stage queries the robot body's SDF, swept along the route to goal, at the obstacle point cloud, enabling swift local optimization of trajectories. The secondary stage infers the visible scene's SDF by aligning and composing the SDF representations of its constituents, providing better-informed costs and gradients for trajectory optimization. The dual-mode procedure combines the best of both stages, achieving a success rate of 98%, 14.4% higher than the baseline with comparable amortized plan time on iGibson 2.0. We also demonstrate its effectiveness in adapting to real-world indoor scenarios.
|
|
17:10-17:15, Paper ThET8.8 | |
On the Evaluation of Collision Probability Along a Path |
|
Paiola, Lorenzo | Istituto Italiano Di Tecnologia |
Grioli, Giorgio | Istituto Italiano Di Tecnologia |
Bicchi, Antonio | Fondazione Istituto Italiano Di Tecnologia |
Keywords: Risk, Collision Avoidance, Probability and Statistical Methods, Robot Safety
Abstract: Characterizing the risk of operations is a fundamental requirement in robotics, and a crucial ingredient of safe planning. The problem is multifaceted, with multiple definitions arising in the vast recent literature fitting different application scenarios and leading to different computational approaches. A basic element shared by most frameworks is the definition and evaluation of the probability of collision for a mobile object in an environment with obstacles. We observe that, even in basic cases, different interpretations are possible. This paper proposes an index we call Risk Density, which offers a theoretical link between conceptually distant assumptions about the interplay of single collision events along a continuous path. We show how this index can be used to approximate the collision probability in the case where the robot evolves along a nominal continuous curve from random initial conditions. Indeed, under this hypothesis, the proposed approximation outperforms some well-established methods in either accuracy or computational cost.
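The abstract contrasts different assumptions about how single collision events along a path combine. The hedged toy numbers below illustrate three standard aggregation rules (independent events, a union bound, and a survival-function form driven by a per-unit-length collision density); the Risk Density index itself is defined in the paper and is not reproduced here, and all values are made up.

import numpy as np

p = np.array([0.01, 0.02, 0.015, 0.03, 0.005])   # per-segment collision probabilities (made up)
ds = 0.1                                          # segment length along the path [m]

# (a) segments treated as independent collision events
p_independent = 1.0 - np.prod(1.0 - p)

# (b) union (Boole) bound: an upper bound on the probability of any collision
p_union_bound = min(1.0, p.sum())

# (c) survival-function form using a collision density rho = p / ds along the path
rho = p / ds
p_survival = 1.0 - np.exp(-np.sum(rho * ds))

print(p_independent, p_union_bound, p_survival)   # approx. 0.078, 0.080, 0.077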
|
|
ThET9 |
312 |
Task and Motion Planning 4 |
Regular Session |
Chair: Bera, Aniket | Purdue University |
Co-Chair: Shkurti, Florian | University of Toronto |
|
16:35-16:40, Paper ThET9.1 | |
Fast and Accurate Task Planning Using Neuro-Symbolic Language Models and Multi-Level Goal Decomposition |
|
Kwon, Minseo | Ewha Womans University |
Kim, Yaesol | Istituto Italiano Di Tecnologia |
Kim, Young J. | Ewha Womans University |
Keywords: Task Planning, Task and Motion Planning
Abstract: In robotic task planning, symbolic planners using rule-based representations like PDDL are effective but struggle with long-sequential tasks in complicated environments due to exponentially increasing search space. Meanwhile, LLM-based approaches, which are grounded in artificial neural networks, offer faster inference and commonsense reasoning but suffer from lower success rates. To address the limitations of the current symbolic (slow speed) or LLM-based approaches (low accuracy), we propose a novel neuro-symbolic task planner that decomposes complex tasks into subgoals using LLM and carries out task planning for each subgoal using either symbolic or MCTS-based LLM planners, depending on the subgoal complexity. This decomposition reduces planning time and improves success rates by narrowing the search space and enabling LLMs to focus on more manageable tasks. Our method significantly reduces planning time while maintaining high success rates across task planning domains, as well as real-world and simulated robotics environments. More details are available at http://graphics.ewha.ac.kr/LLMTAMP/.
|
|
16:40-16:45, Paper ThET9.2 | |
OpenBench: A New Benchmark and Baseline for Semantic Navigation in Smart Logistics |
|
Wang, Junhui | Macau University of Science and Technology |
Huo, Dongjie | Beijing University of Chemical Technology |
Xu, ZeHui | Harbin Institute of Technology |
Shi, Yongliang | Tsinghua University |
Yan, Yimin | University of Chinese Academy of Sciences |
Wang, Yuanxin | Beijing Institute of Technology |
Gao, Chao | University of Cambridge |
Qiao, Yan | Macau University of Science and Technology |
Zhou, Guyue | Tsinghua University |
Keywords: Autonomous Vehicle Navigation, Task and Motion Planning, Engineering for Robotic Systems
Abstract: The increasing demand for efficient last-mile delivery in smart logistics underscores the role of autonomous robots in enhancing operational efficiency and reducing costs. Traditional navigation methods, which depend on high-precision maps, are resource-intensive, while learning-based approaches often struggle with generalization in real-world scenarios. To address these challenges, this work proposes the Openstreetmap-enhanced oPen-air sEmantic Navigation (OPEN) system that combines foundation models with classic algorithms for scalable outdoor navigation. The system leverages OpenStreetMap (OSM) for flexible map representation, thereby eliminating the need for extensive pre-mapping efforts. It also employs Large Language Models (LLMs) to comprehend delivery instructions and Vision-Language Models (VLMs) for global localization, map updates, and house number recognition. To compensate for the limitations of existing benchmarks, which are inadequate for assessing last-mile delivery, this work introduces a new benchmark specifically designed for outdoor navigation in residential areas, reflecting the real-world challenges faced by autonomous delivery systems. Extensive experiments validate the effectiveness of the proposed system in enhancing navigation efficiency and reliability. To facilitate further research, our code and benchmark are publicly available.
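As a flavor of how OSM can replace a pre-built metric map for global routing, the sketch below builds a tiny OSM-style graph with networkx and extracts a shortest route that a local, vision-based navigator would then follow. Node IDs, coordinates, and the delivery address are hypothetical; this is not the OPEN system's code.

import networkx as nx

G = nx.Graph()
G.add_node("n1", lat=22.1586, lon=113.5659)
G.add_node("n2", lat=22.1590, lon=113.5667)
G.add_node("n3", lat=22.1595, lon=113.5663)
G.add_node("goal_house_42", lat=22.1598, lon=113.5670)   # hypothetical delivery address
G.add_edge("n1", "n2", length=85.0)                      # edge weights: path length in meters
G.add_edge("n2", "n3", length=60.0)
G.add_edge("n2", "goal_house_42", length=95.0)
G.add_edge("n3", "goal_house_42", length=55.0)

# global route that the robot then tracks with local, vision-based navigation
route = nx.shortest_path(G, "n1", "goal_house_42", weight="length")
print(route)          # ['n1', 'n2', 'goal_house_42']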
|
|
16:45-16:50, Paper ThET9.3 | |
KARMA: Augmenting Embodied AI Agents with Long-And-Short Term Memory Systems |
|
Wang, Zixuan | Institute of Automation, Chinese Academy of Sciences |
Yu, Bo | Shenzhen Institute of Artificial Intelligence and Robotics for S |
Zhao, Junzhe | Alibaba |
Sun, Wenhao | Institute of Computing Technology, Chinese Academy of Sciences |
Hou, Sai | Beijing Institute of Technology |
Liang, Shuai | Institute of Computing Technology, Chinese Academy of Sciences ( |
Hu, Xing | Institute of Computing Technology, Chinese Academy of Sciences |
Han, Yinhe | Institute of Computing Technology, Chinese Academy of Sciences |
Gan, Yiming | Institute of Computing Technology, Chinese Academy of Sciences |
Keywords: AI-Based Methods, Task Planning, Motion and Path Planning
Abstract: Embodied agents tasked with executing interconnected, long-sequence household tasks often struggle with contextual memory, leading to inefficient task execution and errors. To address this issue, we introduce KARMA, an innovative memory system that integrates long-term and short-term memory modules and enhances Large Language Models (LLMs) for planning in embodied agents through memory-augmented prompting. KARMA distinguishes between long-term and short-term memory: long-term memory captures a comprehensive 3D scene graph as a representation of the environment, while short-term memory dynamically records changes in object positions and states. This dual-memory structure allows the agent to retrieve relevant past scene experiences, thereby improving the accuracy and efficiency of task planning. The short-term memory employs an effective and adaptive memory replacement strategy, ensuring the retention of critical information while discarding less relevant data.
|
|
16:50-16:55, Paper ThET9.4 | |
Socratic Planner: Self-QA-Based Zero-Shot Planning for Embodied Instruction Following |
|
Shin, Suyeon | Seoul National University |
Jeon, Sujin | Seoul National University |
Kim, Junghyun | Seoul National University |
Kang, Gi-Cheon | Seoul National University |
Zhang, Byoung-Tak | Seoul National University |
Keywords: Task and Motion Planning, AI-Based Methods, Task Planning
Abstract: Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in interactive environments. A key challenge in EIF is compositional task planning, typically addressed through supervised learning or few-shot in-context learning with labeled data. To this end, we introduce the Socratic Planner, a self-QA-based zero-shot planning method that infers an appropriate plan without any further training. The Socratic Planner first facilitates the Large Language Model (LLM) in performing self-questioning and answering, which in turn helps generate a sequence of subgoals. While executing the subgoals, an embodied agent may encounter unexpected situations, such as unforeseen obstacles. The Socratic Planner then adjusts plans based on dense visual feedback through a visually-grounded re-planning mechanism. Experiments demonstrate the effectiveness of the Socratic Planner, outperforming current state-of-the-art planning models on the ALFRED benchmark across all metrics, particularly excelling in long-horizon tasks that demand complex inference. We further demonstrate real-world applicability through deployment on a physical robot.
|
|
16:55-17:00, Paper ThET9.5 | |
Hypergraph-Based Coordinated Task Allocation and Socially-Aware Navigation for Multi-Robot Systems |
|
Wang, Weizheng | Purdue University |
Bera, Aniket | Purdue University |
Min, Byung-Cheol | Purdue University |
Keywords: Task and Motion Planning, Deep Learning Methods, Multi-Robot Systems
Abstract: A team of multiple robots seamlessly and safely working in human-filled public environments requires adaptive task allocation and socially-aware navigation that account for dynamic human behavior. Current approaches struggle with highly dynamic pedestrian movement and the need for flexible task allocation. We propose Hyper-SAMARL, a hypergraph-based system for multi-robot task allocation and socially-aware navigation, leveraging multi-agent reinforcement learning (MARL). Hyper-SAMARL models the environmental dynamics between robots, humans, and points of interest (POIs) using a hypergraph, enabling adaptive task assignment and socially-compliant navigation through a hypergraph diffusion mechanism. Our framework, trained with MARL, effectively captures interactions between robots and humans, adapting tasks based on real-time changes in human activity. Experimental results demonstrate that Hyper-SAMARL outperforms baseline models in terms of social navigation, task completion efficiency, and adaptability in various simulated scenarios.
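A standard normalized hypergraph diffusion step conveys how information propagates between robots, humans, and POIs that share a hyperedge; the learned diffusion mechanism in Hyper-SAMARL may differ. The incidence structure, weights, and features below are made up for illustration.

import numpy as np

H = np.array([            # incidence matrix: 5 nodes (robots/humans/POIs) x 3 hyperedges
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
], dtype=float)
w = np.array([1.0, 0.5, 2.0])                       # hyperedge weights
X = np.random.default_rng(0).normal(size=(5, 4))    # node feature matrix

dv = H @ w                                          # weighted node degrees
de = H.sum(axis=0)                                  # hyperedge degrees
Dv_inv_sqrt = np.diag(1.0 / np.sqrt(dv))
De_inv = np.diag(1.0 / de)
W = np.diag(w)

# one diffusion step: X' = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X
X_new = Dv_inv_sqrt @ H @ W @ De_inv @ H.T @ Dv_inv_sqrt @ X
print(X_new.shape)                                  # (5, 4)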
|
|
17:00-17:05, Paper ThET9.6 | |
Bootstrapping Object-Level Planning with Large Language Models |
|
Paulius, David | Brown University |
Agostini, Alejandro | University of Innsbruck |
Quartey, Benedict | Brown University |
Konidaris, George | Brown University |
Keywords: Task and Motion Planning, AI-Based Methods, Task Planning
Abstract: We introduce a new method that extracts knowledge from a large language model (LLM) to produce object-level plans, which describe high-level changes to object state, and uses them to bootstrap task and motion planning (TAMP). Existing work uses LLMs to directly output task plans or generate goals in representations like PDDL. However, these methods fall short because they rely on the LLM to do the actual planning or output a hard-to-satisfy goal. Our approach instead extracts knowledge from an LLM in the form of plan schemas as an object-level representation called functional object-oriented networks (FOON), from which we automatically generate PDDL subgoals. Our method markedly outperforms alternative planning strategies in completing several pick-and-place tasks in simulation.
|
|
17:05-17:10, Paper ThET9.7 | |
GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration |
|
Wake, Naoki | Microsoft |
Kanehira, Atsushi | Microsoft |
Sasabuchi, Kazuhiro | Microsoft |
Takamatsu, Jun | Microsoft |
Ikeuchi, Katsushi | Microsoft |
Keywords: Task and Motion Planning, Task Planning, Imitation Learning
Abstract: We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), to facilitate one-shot visual teaching for robotic manipulation. This system analyzes videos of humans performing tasks and outputs executable robot programs that incorporate insights into affordances. The process begins with GPT-4V analyzing the videos to obtain textual explanations of environmental and action details. A GPT-4-based task planner then encodes these details into a symbolic task plan. Subsequently, vision systems spatially and temporally ground the task plan in the videos—objects are identified using an open-vocabulary object detector, and hand-object interactions are analyzed to pinpoint moments of grasping and releasing. This spatiotemporal grounding allows for the gathering of affordance information (e.g., grasp types, waypoints, and body postures) critical for robot execution. Experiments across various scenarios demonstrate the method's efficacy in achieving real robots' operations from human demonstrations in a one-shot manner. Meanwhile, quantitative tests have revealed instances of hallucination in GPT-4V, highlighting the importance of incorporating human supervision within the pipeline. The prompts of GPT-4V/GPT-4 are available at this project page: https://microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts/
|
|
17:10-17:15, Paper ThET9.8 | |
Action Contextualization: Adaptive Task Planning and Action Tuning Using Large Language Models |
|
Gupta, Sthithpragya | Ecole Polytechnique Federale De Lausanne |
Yao, Kunpeng | Massachusetts Institute of Technology |
Niederhauser, Loïc | EPFL |
Billard, Aude | EPFL |
Keywords: Task and Motion Planning, Task Planning, AI-Based Methods
Abstract: Large Language Models (LLMs) present a promising frontier in robotic task planning by leveraging extensive human knowledge. Nevertheless, the current literature often overlooks the critical aspects of robots' adaptability and error correction. This work aims to overcome this limitation by enabling robots to modify their motions and select the most suitable task plans based on the context. We introduce a novel framework to achieve action contextualization, aimed at tailoring robot actions to the context of specific tasks, thereby enhancing adaptability through applying LLM-derived contextual insights. Our framework integrates motion metrics that evaluate robot performances for each motion to resolve redundancy in planning. Moreover, it supports online feedback between the robot and the LLM, enabling immediate modifications to the task plans and corrections of errors. An overall success rate of 81.25% has been achieved through extensive experimental validation. Finally, when integrated with dynamical system (DS)-based robot controllers, the robotic arm-hand system demonstrates its proficiency in autonomously executing LLM-generated motion plans for sequential table-clearing tasks, rectifying errors without human intervention, and showcasing robustness against external disturbances. Our proposed framework also features the potential to be integrated with modular control approaches, significantly enhancing robots' adaptability and autonomy in performing sequential tasks in the real world.
|
|
ThET10 |
313 |
Multi-Robot Systems and Tools |
Regular Session |
Chair: Wilson, Sean | Georgia Institute of Technology, Georgia Tech Research Institute |
Co-Chair: Goldberg, Ken | UC Berkeley |
|
16:35-16:40, Paper ThET10.1 | |
CognitiveOS: Large Multimodal Model Based System to Endow Any Type of Robot with Generative AI |
|
Lykov, Artem | Skolkovo Institute of Science and Technology |
Konenkov, Mikhail | Skolkovo Institute of Science and Technology |
Gbagbe, Koffivi Fidele | Skolkovo Institute of Science and Technology |
Litvinov, Mikhail | Skolkovo Institute of Science and Technology |
Davletshin, Denis | Skolkovo Institute of Science and Technology |
Fedoseev, Aleksey | Skolkovo Institute of Science and Technology |
Altamirano Cabrera, Miguel | Skolkovo Institute of Science and Technology (Skoltech), Moscow, |
Peter Vimalathas, Robinroy | Intelligent Space Robotics Laboratory, Skolkovo Institute of Sci |
Tsetserukou, Dzmitry | Skolkovo Institute of Science and Technology |
Keywords: Cognitive Control Architectures, Multi-Modal Perception for HRI, Cooperating Robots
Abstract: This paper introduces CognitiveOS, the first operating system designed for cognitive robots capable of functioning across diverse robotic platforms. CognitiveOS is structured as a multi-agent system comprising modules built upon a transformer architecture, facilitating communication through an internal monologue format. These modules collectively empower the robot to tackle intricate real-world tasks. The paper delineates the operational principles of the system along with descriptions of its nine distinct modules. The modular design endows the system with distinctive advantages over traditional end-to-end methodologies, notably in terms of adaptability and scalability. The system's modules are configurable, modifiable, or deactivatable depending on the task requirements, while new modules can be seamlessly integrated. This system serves as a foundational resource for researchers and developers in the cognitive robotics domain, alleviating the burden of constructing a cognitive robot system from scratch. Experimental findings demonstrate the system's advanced task comprehension and adaptability across varied tasks, robotic platforms, and module configurations, underscoring its potential for real-world applications. Moreover, in the Reasoning category it outperformed CognitiveDog (by 15%) and RT2 (by 31%), achieving the highest rate to date of 77%. We provide a code repository and dataset for the replication of CognitiveOS: https://github.com/Arcwy0/cognitiveos
|
|
16:40-16:45, Paper ThET10.2 | |
CLSTR: Capability-Level System for Tracking Robots |
|
Bejarano, Alexandra | Colorado School of Mines |
Bonial, Claire | US Army Research Laboratory |
Williams, Tom | Colorado School of Mines |
Keywords: Multi-Robot Systems
Abstract: For human operators to effectively task teams of robots, it is critical that they maintain situational awareness about the status of those robots. However, maintaining this situational awareness becomes particularly difficult when there are dynamic changes not only in the members of the robot team, but also in the capabilities of those robots. Prior work has shown that situational awareness can be supported through interfaces that effectively visualize task-relevant information. As such, in this work, we introduce a Capability-Level System for Tracking Robots (CLSTR), a new visualization that supports operators in maintaining an appropriate level of situational awareness over the capabilities of a dynamic robot team. In evaluating CLSTR through an online human-subject study (n=123), we found that a combination of different visual elements within an interface, such as icons to summarize robot capabilities and animations to indicate team changes, can help operators maintain awareness over robot teams.
|
|
16:45-16:50, Paper ThET10.3 | |
Mitigating Side Effects in Multi-Agent Systems Using Blame Assignment |
|
Rustagi, Pulkit | Oregon State University |
Saisubramanian, Sandhya | Oregon State University |
Keywords: Multi-Robot Systems, Planning under Uncertainty, Path Planning for Multiple Mobile Robots or Agents
Abstract: When independently trained or designed robots are deployed in a shared environment, their combined actions can lead to unintended negative side effects (NSEs). To ensure safe and efficient operation, robots must optimize task performance while minimizing the penalties associated with NSEs, balancing individual objectives with collective impact. We model the problem of mitigating NSEs in a cooperative multi-agent system as a bi-objective lexicographic decentralized Markov decision process. We assume independence of transitions and rewards with respect to the robots' tasks, but the joint NSE penalty creates a form of dependence in this setting. To improve scalability, the joint NSE penalty is decomposed into individual penalties for each robot using credit assignment, which facilitates decentralized policy computation. We empirically demonstrate, using mobile robots and in simulation, the effectiveness and scalability of our approach in mitigating NSEs. Code: https://tinyurl.com/RECON-NSE-Mitigation
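Read abstractly (our notation, not the paper's), the lexicographic bi-objective can be sketched as follows: among the policies that maximize each robot's primary task value, choose the one that minimizes the expected joint NSE penalty; scalability then comes from approximating that joint penalty with per-robot credits,
\[ R^{\text{NSE}}(s, a) \;\approx\; \sum_{i} \rho_i(s_i, a_i), \]
so that each robot \(i\) can recompute its policy against its own penalty term \(\rho_i\) in a decentralized fashion.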
|
|
16:50-16:55, Paper ThET10.4 | |
Decentralized Drone Swaps for Online Rebalancing of Drone Delivery Tasks |
|
Vakil, Kamran | Boston University |
Pierson, Alyssa | Boston University |
Keywords: Multi-Robot Systems, Networked Robots, Sensor Networks
Abstract: Recent research has seen the advancement of drone depot models as a promising way to allocate drones for large-scale task completion. Applications of these drone depot models include data collection, environmental monitoring, package delivery, and more. This paper focuses on sharing agents between static depots for task allocation based on expected demand. We model the problem as a Binary Nonlinear Program, then derive an iterative neighborhood search based on solving a series of Binary Linear Programs to drive towards the optimal configuration of agents for each depot. We show that our method is more tractable than a Branch and Bound approach for this model as problem complexity grows. We also show through simulations that, with near-optimal allocation between local depots, the overall system outperforms greedy and non-sharing approaches.
|
|
16:55-17:00, Paper ThET10.5 | |
A Fairness-Oriented Control Framework for Safety-Critical Multi-Robot Systems: Alternative Authority Control |
|
Shi, Lei | Johns Hopkins University |
Liu, Qichao | University of Wisconsin–Madison |
Zhou, Cheng | Tencent |
Li, Xiong | Tencent |
Keywords: Multi-Robot Systems, Intelligent Transportation Systems, Collision Avoidance
Abstract: This paper proposes a fair control framework for multi-robot systems, which integrates the newly introduced Alternative Authority Control (AAC) and Flexible Control Barrier Function (F-CBF). Control authority refers to a single robot that can plan its trajectory while considering the others as moving obstacles, meaning the other robots do not have authority to plan their own paths. The AAC method dynamically distributes the control authority, enabling fair and coordinated movement across the system. This approach significantly improves computational efficiency, scalability, and robustness in complex environments. The proposed F-CBF extends traditional CBFs by incorporating obstacle shape, velocity, and orientation. F-CBF enhances safety through accurate dynamic obstacle avoidance. The framework is validated through simulations in multi-robot scenarios, demonstrating its safety, robustness, and computational efficiency.
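For context, the standard control barrier function condition that F-CBF extends (textbook notation, not the paper's; the flexible variant adds obstacle shape, velocity, and orientation terms) keeps the state in a safe set \(\mathcal{C}=\{x : h(x)\ge 0\}\) by requiring the control input \(u\) of a control-affine system \(\dot{x}=f(x)+g(x)u\) to satisfy
\[ \frac{\partial h}{\partial x}\bigl(f(x)+g(x)u\bigr) \;\ge\; -\alpha\bigl(h(x)\bigr) \]
for some class-\(\mathcal{K}\) function \(\alpha\).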
|
|
17:00-17:05, Paper ThET10.6 | |
FogROS2-PLR: Probabilistic Latency-Reliability for Cloud Robotics |
|
Chen, Kaiyuan | University of California, Berkeley |
Tian, Nan | University of California, Berkeley |
Juette, Christian | Bosch Research |
Qiu, Tianshuang | University of California, Berkeley |
Ren, Liu | Robert Bosch North America Research Technology Center |
Kubiatowicz, John | UC Berkeley |
Goldberg, Ken | UC Berkeley |
Keywords: Networked Robots, Cellular and Modular Robots, Engineering for Robotic Systems
Abstract: Cloud robotics enables robots to offload complex computational tasks to cloud servers for performance, cost, and ease of management. However, the network and cloud computing infrastructure are not designed for reliable timing guarantees, leading to fluctuating Quality-of-Service (QoS). In this work, we formulate an impossibility triangle of latency reliability, singleton deployment, and commodity hardware. The theorem implies that providing replicated resources with uncorrelated failures exponentially reduces the probability of missing a deadline. We present FogROS2-Probabilistic Latency Reliability (PLR), which uses multiple independent network interfaces to send requests to replicated cloud resources and uses the first response that arrives. We design routing mechanisms to discover, connect, and route through non-default network interfaces on robots. FogROS2-PLR optimizes the selection of interfaces to servers by minimizing the probability of missing a deadline. We conduct a cloud-connected driving experiment with two 5G service providers, demonstrating that FogROS2-PLR effectively provides smooth service quality even if one of the service providers experiences low coverage and base station handover. We use 99th-percentile (P99) latency to evaluate anomalous long-tail latency behavior. In the experiment, FogROS2-PLR improves P99 latency by up to 3.7x compared to using one service provider. We deploy FogROS2-PLR on a physical Stretch 3 robot with an indoor human-tracking task. Even in a fully covered Wi-Fi and 5G environment, FogROS2-PLR improves the responsiveness of the robot, reducing mean latency by 36% and P99 latency by 33%. Code and supplementary material can be found on the project website.
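As a rough illustration of the replication argument (symbols ours, not taken from the paper): if a request is served by \(n\) replicas whose deadline misses are independent with probabilities \(p_1,\dots,p_n\), and the client accepts the first response, the request misses its deadline only when every replica misses,
\[ \Pr[\text{miss}] \;=\; \prod_{i=1}^{n} p_i, \]
which for identical replicas equals \(p^{n}\) and therefore shrinks exponentially in the number of independently deployed resources.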
|
|
17:05-17:10, Paper ThET10.7 | |
Jointly Assigning Processes to Machines and Generating Plans for Autonomous Mobile Robots in a Smart Factory |
|
Leet, Christopher | University of Southern California |
Sciortino, Aidan | University of Rochester |
Koenig, Sven | University of Southern California |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Industrial Robots
Abstract: A modern smart factory runs a manufacturing procedure using a collection of programmable machines. Typically, materials are ferried between these machines using a team of mobile robots. To embed a manufacturing procedure in a smart factory, a factory operator must a) assign its processes to the smart factory's machines and b) determine how agents should carry materials between machines. A good embedding maximizes the smart factory's throughput: the rate at which it outputs products. Existing smart factory management systems solve the aforementioned problems sequentially, limiting the throughput that they can achieve. In this paper, we introduce ACES, the Anytime Cyclic Embedding Solver, the first solver that jointly optimizes the assignment of processes to machines and the assignment of paths to agents. We evaluate ACES and show that it can scale to real industrial scenarios.
|
|
ThET11 |
314 |
Physical Human-Robot Interaction |
Regular Session |
Chair: Song, Kai-Tai | National Yang Ming Chiao Tung University |
Co-Chair: Secchi, Cristian | Univ. of Modena & Reggio Emilia |
|
16:35-16:40, Paper ThET11.1 | |
A Control Scheme for Collaborative Object Transportation between a Human and a Quadruped Robot Using the MIGHTY Suction Cup |
|
Plotas, Konstantinos | Hellenic Mediterranean University |
Papadakis, Emmanouil | Foundation for Research and Technology - Hellas |
Drosakis, Drosakis | Foundation for Research and Technology–Hellas |
Trahanias, Panos | Foundation for Research and Technology – Hellas (FORTH) |
Papageorgiou, Dimitrios | Hellenic Mediterranean University |
Keywords: Physical Human-Robot Interaction, Human-Robot Collaboration, Compliance and Impedance Control
Abstract: In this work, a control scheme for human-robot collaborative object transportation is proposed, considering a quadruped robot equipped with the MIGHTY suction cup that serves both as a gripper for holding the object and as a force/torque sensor. The proposed control scheme is based on the notion of admittance control, and incorporates a variable damping term aimed at increasing the human's control over the motion while, at the same time, decreasing her/his effort. Furthermore, to ensure that the object is not detached from the suction cup during the collaboration, an additional control signal is proposed, which is based on a barrier artificial potential. The proposed control scheme is proven to be passive and its performance is demonstrated through experimental evaluations conducted using the Unitree Go1 robot equipped with the MIGHTY suction cup.
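A minimal sketch of the kind of admittance law described, with symbols of our own choosing rather than the paper's: the commanded end-effector motion \(x\) follows
\[ M\ddot{x} + D(t)\,\dot{x} \;=\; F_h - \nabla U_b(x), \]
where \(F_h\) is the human force/torque measured through the suction cup, \(D(t)\) is the variable damping term, and \(U_b\) is a barrier-type artificial potential that grows as the interaction approaches the detachment limit of the suction cup.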
|
|
16:40-16:45, Paper ThET11.2 | |
DTRT: Enhancing Human Intent Estimation and Role Allocation for Physical Human-Robot Collaboration |
|
Liu, Haotian | Institute of Automation, Chinese Academy of Sciences |
Tong, Yuchuang | The Institute of Automation of the Chinese Academy of Sciences |
Zhang, Zhengtao | Institute of Automation, Chinese Academy of Sciences |
Keywords: Physical Human-Robot Interaction, Human-Robot Collaboration, Intention Recognition
Abstract: In physical Human-Robot Collaboration (pHRC), accurate human intent estimation and rational human-robot role allocation are crucial for safe and efficient assistance. Existing methods that rely on short-term motion data for intention estimation lack multi-step prediction capabilities, hindering their ability to sense intent changes and adjust human-robot assignments autonomously, resulting in potential discrepancies. To address these issues, we propose a Dual Transformer-based Robot Trajectron (DTRT) featuring a hierarchical architecture, which harnesses human-guided motion and force data to rapidly capture human intent changes, enabling accurate trajectory predictions and dynamic robot behavior adjustments for effective collaboration. Specifically, human intent estimation in DTRT uses two Transformer-based Conditional Variational Autoencoders (CVAEs), incorporating robot motion data in the obstacle-free case and human-guided trajectory and force for obstacle avoidance. Additionally, Differential Cooperative Game Theory (DCGT) is employed to synthesize predictions based on human-applied forces, ensuring that robot behavior aligns with human intention. Compared to state-of-the-art (SOTA) methods, DTRT incorporates human dynamics into long-term prediction, providing an accurate understanding of intention and enabling rational role allocation, achieving robot autonomy and maneuverability. Experiments demonstrate DTRT's accurate intent estimation and superior collaboration performance.
|
|
16:45-16:50, Paper ThET11.3 | |
Learning-Based Dynamic Robot-To-Human Handover |
|
Kim, Hyeonseong | Korea University |
Kim, Chanwoo | Korea University |
Pan, Matthew | Queen's University |
Lee, Kyungjae | Korea University |
Choi, Sungjoon | Korea University |
Keywords: Physical Human-Robot Interaction, Human-Aware Motion Planning, Learning from Experience
Abstract: This paper presents a novel learning-based approach to dynamic robot-to-human handover, addressing the challenges of delivering objects to a moving receiver. We hypothesize that dynamic handover, where the robot adjusts to the receiver’s movements, results in more efficient and comfortable interaction compared to static handover, where the receiver is assumed to be stationary. To validate this, we developed a nonparametric method for generating continuous handover motion, conditioned on the receiver's movements, and trained the model using a dataset of 1,000 human-to-human handover demonstrations. We integrated preference learning for improved handover effectiveness and applied impedance control to ensure user safety and adaptiveness. The approach was evaluated in both simulation and real-world settings, with user studies demonstrating that dynamic handover significantly reduces handover time and improves user comfort compared to static methods. Videos and demonstrations of our approach are available at https://zerotohero7886.github.io/dyn-r2h-handover/.
|
|
16:50-16:55, Paper ThET11.4 | |
A Novel Dynamic Motion Primitives Framework for Safe Human-Robot Collaboration |
|
Pupa, Andrea | University of Modena and Reggio Emilia |
Di Vittorio, Filippo | University of Modena and Reggio Emilia |
Secchi, Cristian | Univ. of Modena & Reggio Emilia |
Keywords: Human-Robot Collaboration, Safety in HRI, Learning from Demonstration
Abstract: Learning by demonstration techniques are gaining popularity in human-robot collaboration (HRC) scenarios. This is because they allow operators to deeply exploit the versatility of collaborative robots. In this context, dynamic motion primitives (DMPs) have become a standard method for enabling human operators to easily teach tasks to robots. However, DMPs have two main limitations. First, they may encounter difficulties in generalizing some tasks, which can lead to non-intuitive behavior. Second, it is not guaranteed that the output of DMPs is compliant with ISO/TS 15066, which provides guidelines for assessing safety in collaborative scenarios. This work aims to address these two issues by introducing a novel control pipeline. This pipeline leverages a new variant of DMPs, called Swap DMPs (SDMPs), introduced in this work. The SDMPs enable a more intuitive behavior when the robot reproduces the learned task. Subsequently, SDMPs are encoded into a new optimization problem that ensures the robot complies with the Speed and Separation Monitoring (SSM) collaborative mode. The proposed approach has been experimentally validated and compared with traditional DMPs in both simulation and a real scenario, where a UR5e and a human operator collaborate on a polishing task.
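For reference, the standard discrete DMP transformation and canonical systems that variants such as SDMPs build on (standard notation, not specific to this paper) read
\[ \tau\dot{z} = \alpha_z\bigl(\beta_z(g-y)-z\bigr) + f(s), \qquad \tau\dot{y} = z, \qquad \tau\dot{s} = -\alpha_s s, \]
where \(y\) is the motion variable, \(g\) the goal, \(s\) the phase, and \(f(s)\) the forcing term learned from the demonstration.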
|
|
16:55-17:00, Paper ThET11.5 | |
Depth Restoration of Hand-Held Transparent Objects for Human-To-Robot Handover |
|
Yu, Ran | Tsinghua University |
Yu, Haixin | Tsinghua Shenzhen International Graduate School |
Li, Shoujie | Tsinghua Shenzhen International Graduate School |
Huang, Yan | Tsinghua University |
Song, Ziwu | Tsinghua University |
Ding, Wenbo | Tsinghua University |
Keywords: Multi-Modal Perception for HRI, Perception for Grasping and Manipulation, RGB-D Perception
Abstract: Transparent objects are common in daily life, while their optical properties pose challenges for RGB-D cameras to capture accurate depth information. This issue is further amplified when these objects are hand-held, as hand occlusions further complicate depth estimation. For assistant robots, however, accurately perceiving hand-held transparent objects is critical to effective human-robot interaction. This paper presents a Hand-Aware Depth Restoration (HADR) method based on creating an implicit neural representation function from a single RGB-D image. The proposed method utilizes hand posture as an important guidance to leverage semantic and geometric information of hand-object interaction. To train and evaluate the proposed method, we create a high-fidelity synthetic dataset named TransHand-14K with a real-to-sim data generation scheme. Experiments show that our method has better performance and generalization ability compared with existing methods. We further develop a real-world human-to-robot handover system based on HADR, demonstrating its potential in human-robot interaction applications.
|
|
17:00-17:05, Paper ThET11.6 | |
Leveraging Semantic and Geometric Information for Zero-Shot Robot-To-Human Handover |
|
Liu, Jiangshan | Southern University of Science and Technology |
Dong, Wenlong | Southern University of Science and Technology |
Wang, Jiankun | Southern University of Science and Technology |
Meng, Max Q.-H. | The Chinese University of Hong Kong |
Keywords: Physical Human-Robot Interaction, Robot Companions, Grasping
Abstract: Human-robot interaction (HRI) encompasses a wide range of collaborative tasks, with handover being one of the most fundamental. As robots become more integrated into human environments, the potential for service robots to assist in handing objects to humans is increasingly promising. In robot-to-human (R2H) handover, selecting the optimal grasp is crucial for success, as it requires avoiding interference with the human's preferred grasp region and minimizing intrusion into their workspace. Existing methods either inadequately consider geometric information or rely on data-driven approaches, which often struggle to generalize across diverse objects. To address these limitations, we propose a novel zero-shot system that combines semantic and geometric information to generate optimal handover grasps. Our method first identifies grasp regions using semantic knowledge from vision-language models (VLMs) and, by incorporating customized visual prompts, achieves finer granularity in region grounding. A grasp is then selected based on grasp distance and approach angle to maximize human ease and avoid interference. We validate our approach through ablation studies and real-world comparison experiments. Results demonstrate that our system improves handover success rates and provides a more user-preferred interaction experience. Videos, appendixes and more are available at https://sites.google.com/view/vlm-handover/.
|
|
17:05-17:10, Paper ThET11.7 | |
Human-To-Robot Handover Control of an Autonomous Mobile Robot Based on Hand-Masked Object Pose Estimation |
|
Song, Kai-Tai | National Yang Ming Chiao Tung University |
Huang, Yu-Yun | National Yang Ming Chiao Tung University |
Keywords: Human-Robot Collaboration, Grasping, Visual Servoing
Abstract: This paper presents a human-to-robot handover design for an Autonomous Mobile Robot (AMR). The developed control system enables the AMR to navigate to a specific person and grasp the object that the person wants to hand over. This paper proposes a motion planning algorithm for grasping an unseen object held in hand. Through hand detection and segmentation, the hand region is masked and removed from the acquired depth image, which is used to estimate the object pose for grasping. For grasp pose determination, we propose to add the Convolutional Block Attention Module (CBAM) to the Generative Grasping Convolutional Neural Network (GGCNN) model to enhance the recognition rate. For the object-grasp task, the AMR localizes the object in the person's hand, and uses a Model Predictive Control (MPC)-based controller to simultaneously control the mobile base and manipulator to grasp the object. A laboratory-developed mobile manipulator, equipped with a 6-DoF TM5M-900, is used for experimental verification. The experimental results show an average handover success rate of 81% for five different objects.
|
|
ThET12 |
315 |
Motion Control 2 |
Regular Session |
Chair: Fan, Chuchu | Massachusetts Institute of Technology |
Co-Chair: Oh, Sehoon | DGIST |
|
16:35-16:40, Paper ThET12.1 | |
Learning Multimodal Confidence for Intention Recognition in Human-Robot Interaction |
|
Zhao, Xiyuan | Southeast University |
Li, Huijun | Southeast University |
Miao, Tianyuan | Southeast University |
Zhu, Xianyi | Southeast University |
Wei, Zhikai | Southeast University |
Tan, Lifen | China Astronaut Research and Training Center |
Song, Aiguo | Southeast University |
Keywords: Multi-Modal Perception for HRI, Human Factors and Human-in-the-Loop
Abstract: The rapid development of collaborative robotics has provided a new possibility of helping the elderly who have difficulties in daily life, allowing robots to operate according to specific intentions. However, efficient human-robot cooperation requires natural, accurate, and reliable intention recognition in shared environments. The paramount challenge is to reduce the uncertainty of the fused multimodal intention to be recognized and to adaptively reason toward a more reliable result under the current interaction conditions. In this work, we propose a novel learning-based multimodal fusion framework, Batch Multimodal Confidence Learning for Opinion Pool (BMCLOP). Our approach combines a Bayesian multimodal fusion method and a batch confidence learning algorithm to improve accuracy, uncertainty reduction, and success rate given the interaction conditions. In particular, the generic and practical multimodal intention recognition framework can easily be extended further. Our desired assistive scenarios consider three modalities (gestures, speech, and gaze), all of which produce categorical distributions over all the finite intentions. The proposed method is validated with a six-DoF robot through extensive experiments and exhibits high performance compared to baselines.
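One common Bayesian opinion-pool form for this kind of fusion (a generic sketch, not necessarily the exact BMCLOP rule) combines the per-modality posteriors with learned confidence weights,
\[ P(I \mid o_g, o_s, o_z) \;\propto\; P(I)\prod_{m\in\{g,s,z\}} P(I \mid o_m)^{w_m}, \]
where \(I\) ranges over the finite intention set, \(o_m\) is the observation from gesture, speech, or gaze, and \(w_m \ge 0\) is the confidence assigned to modality \(m\).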
|
|
16:40-16:45, Paper ThET12.2 | |
Optimize and Coordinate Multiple DMPs under Constraints to Achieve a Collaborative Manipulation Task |
|
Kordia, Ali H. | Instituto Superior Técnico |
Melo, Francisco S. | Instituto Superior Tecnico |
Keywords: Motion Control, Planning, Scheduling and Coordination, Human-Robot Collaboration
Abstract: This paper addresses a significant challenge in achieving collaborative tasks: how can a robot or multiple robots, endowed with a library of pre-learned primitive movements, generate multiple simultaneous coordinated robotic movements, adapting and optimizing those in the library, to complete one collaborative task? This work can thus be seen as a follow-up to prior work in which a motion is represented as a dynamic movement primitive (DMP), now considering collaborative tasks and the existence of multiple robots/manipulators. Specifically, we start with a simple task using one DMP and extend it to accommodate the coordinated execution of multiple DMPs in robots with multiple manipulators or, alternatively, multiple robots with a single manipulator. We investigate mechanisms to jointly optimize multiple DMPs to perform one task in a coordinated fashion. The joint trajectory is built from initial DMPs learned for a single manipulator, and its optimization must comply with task-specific constraints. We illustrate the application of our approach both in a simulated environment and on a simulated and real Baxter robot.
|
|
16:45-16:50, Paper ThET12.3 | |
A Modified Resistance Model for Magnetic Honeycomb Robots to Navigate in Low Reynolds Number Fluids |
|
Zou, Leyao | Fudan University |
Ma, Shihao | Fudan University |
Liu, Yi | Fudan University |
Dong, Xinyang | Fudan University |
Zhou, Ziqing | Fudan University |
Ouyang, Chun | Fudan University |
Gan, Zhongxue | Fudan University |
Keywords: Motion Control, Micro/Nano Robots, Motion and Path Planning
Abstract: In recent years, magnetically controlled microrobots have garnered significant attention. This paper presents the H-robot, a self-designed microrobot featuring an innovative structure. The H-robot features a honeycomb porous spherical design specifically engineered to enhance cargo capacity. A new dynamic model for this structure has been developed for low Reynolds number fluid environments, along with a robust backstepping sliding mode control (RBSMC) strategy. Experiments were conducted in a calibrated magnetic field produced by a field generator to achieve precise motion control. The results demonstrate that the H-robot accurately tracks standard trajectories, with root mean square errors (RMSE) of 9.09×10⁻⁴ m for the Number-8 path and 8.29×10⁻⁴ m for the S-shaped path. Additionally, the proposed resistance model enhances tracking accuracy by 73.61% compared to traditional models, effectively adjusting the dynamic behavior of the H-robot in low Reynolds number fluids and significantly improving its motion performance. Finally, path planning experiments in a maze demonstrate the H-robot's ability to navigate and avoid obstacles.
|
|
16:50-16:55, Paper ThET12.4 | |
Manual, Semi or Fully Autonomous Flipper Control? a Framework for Fair Comparison |
|
Číhala, Valentýn | Ceske Vysoke Uceni Technicke V Praze, FEL |
Pecka, Martin | Ceske Vysoke Uceni Technicke V Praze, FEL |
Svoboda, Tomas | Ceske Vysoke Uceni Technicke V Praze, FEL |
Zimmermann, Karel | Ceske Vysoke Uceni Technicke V Praze, FEL |
Keywords: Motion Control, Software Tools for Benchmarking and Reproducibility, Imitation Learning
Abstract: We investigated the performance of existing semi- and fully autonomous methods for controlling flipper-based skid-steer robots. Our study involves reimplementation of these methods for fair comparison, and it introduces a novel semi-autonomous control policy that provides a compelling trade-off among current state-of-the-art approaches. We also propose new metrics for assessing cognitive load and traversal quality and offer a benchmarking interface for generating Quality-Load graphs from recorded data. Our results, presented in a 2D Quality-Load space, demonstrate that the new control policy effectively bridges the gap between autonomous and manual control methods. Additionally, we reveal the surprising fact that fully manual, continuous control of all six degrees of freedom remains highly effective when performed by an experienced operator using a well-designed analog controller from a third-person view.
|
|
16:55-17:00, Paper ThET12.5 | |
Safety-Critical Locomotion of Biped Robots in Infeasible Paths: Overcoming Obstacles During Navigation Toward Destination |
|
Lee, Jaemin | North Carolina State University |
Dai, Min | California Institute of Technology |
Kim, Jeeseop | Caltech |
Ames, Aaron | California Institute of Technology |
Keywords: Motion Control, Robot Safety, Humanoid and Bipedal Locomotion
Abstract: This paper proposes a safety-critical locomotion control framework for legged robots traversing infeasible paths in obstacle-rich environments. Our research focuses on achieving safe and robust locomotion where robots confront unavoidable obstacles en route to their designated destination. By utilizing the outcomes of physical interactions with unknown objects, we establish a hierarchy among the safety-critical conditions for avoiding the obstacles. This hierarchy enables the generation of a safe reference trajectory that adeptly mitigates conflicts among safety conditions and reduces the risk while controlling the robot toward its destination without additional motion planning methods. In addition, robust bipedal locomotion is achieved by utilizing the Hybrid Linear Inverted Pendulum model, coupled with a disturbance observer that addresses disturbances arising from the physical interaction.
|
|
17:00-17:05, Paper ThET12.6 | |
Optimal Framework for Constrained Admittance Path-Following Control |
|
Besi, Giulio | University of Modena and Reggio Emilia |
Pupa, Andrea | University of Modena and Reggio Emilia |
Secchi, Cristian | Univ. of Modena & Reggio Emilia |
Ferraguti, Federica | Università Degli Studi Di Modena E Reggio Emilia |
Keywords: Motion and Path Planning, Physical Human-Robot Interaction, Compliance and Impedance Control
Abstract: In this article, an optimal controller for achieving constrained admittance control is proposed. This controller strictly adheres to the constraint boundaries while ensuring minimal variations in kinematic energy. The proposed method integrates admittance control for human-robot interaction with the Udwadia-Kalaba equations for constrained motion into a unified framework. The proposed architecture has been tested and validated both with simulations and real tests on a 6-DoF UR5e robot. The results demonstrate that the proposed architecture outperforms virtual fixtures, one of the most commonly used techniques to implement effective path-following control.
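For readers unfamiliar with the Udwadia-Kalaba formulation referenced here (standard form, our notation): for dynamics \(M(q)\ddot{q} = Q\) subject to constraints written at the acceleration level as \(A(q,\dot{q})\ddot{q} = b(q,\dot{q})\), the constrained acceleration is
\[ \ddot{q} \;=\; a + M^{-1/2}\bigl(A M^{-1/2}\bigr)^{+}\bigl(b - A a\bigr), \qquad a = M^{-1}Q, \]
where \((\cdot)^{+}\) denotes the Moore-Penrose pseudoinverse; in a framework like the one described, the unconstrained motion \(a\) would come from the admittance dynamics.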
|
|
17:05-17:10, Paper ThET12.7 | |
Robust Orientation Control of Robot Manipulator Using Orientation Disturbance Observer |
|
Choi, Kiyoung | Daegu Gyeongbuk Institute of Science and Technology |
Song, JunHo | Daegu Gyeongbuk Institute of Science and Technology |
Yun, WonBum | Korea Institute of Robotics and Technology Convergence |
Oh, Sehoon | DGIST |
Keywords: Motion Control, Dynamics
Abstract: This paper presents a robust control algorithm for precise orientation control of robot manipulators using a Disturbance Observer (DOB) specifically designed for orientation dynamics. Our approach addresses the challenges of 3D orientation control by incorporating various orientation representations, such as Euler angles, quaternions, and exponential coordinates, and analyzing their impact on DOB performance. Through theoretical analysis and experimental validation, we demonstrate the effectiveness of our method in achieving high-precision orientation control under uncertainties and disturbances. This work offers a comprehensive framework for robust orientation control, advancing the application of DOB in complex robotic tasks.
|
|
17:10-17:15, Paper ThET12.8 | |
Predictive Kinematic Coordinate Control for Aerial Manipulators Based on Modified Kinematics Learning |
|
Li, Zhengzhen | Westlake University |
Shen, Jiahao | Westlake University |
Ji, Mengyu | Westlake University |
Cao, Huazi | Beihang University |
Zhao, Shiyu | Westlake University |
Keywords: Motion Control, Aerial Systems: Mechanics and Control, Kinematics
Abstract: High-precision manipulation has always been a developmental goal for aerial manipulators. This paper investigates the kinematic coordinate control issue in aerial manipulators. We propose a predictive kinematic coordinate control method based on model learning, which includes a learning-based modified kinematic model and a model predictive control (MPC) scheme based on weight allocation. Compared to existing methods, our proposed approach offers several attractive features. First, the kinematic model incorporates closed-loop dynamics characteristics and online residual learning. Compared to methods that do not consider closed-loop dynamics and residuals, our proposed method improves accuracy by 59.6%. Second, an MPC method that considers weight allocation is proposed, which can coordinate the motion strategies of quadcopters and manipulators. Compared to methods that do not consider weight allocation, the proposed method can meet the requirements of more tasks. The proposed approach is verified through complex trajectory tracking and moving target tracking experiments. The results validate the effectiveness of the proposed method.
|
|
ThET13 |
316 |
Resiliency and Security 2 |
Regular Session |
Chair: Ueda, Jun | Georgia Institute of Technology |
Co-Chair: Chou, Glen | Georgia Institute of Technology |
|
16:35-16:40, Paper ThET13.1 | |
Affine Transformation-Based Perfectly Undetectable False Data Injection Attacks on Remote Manipulator Kinematic Control with Attack Detector |
|
Ueda, Jun | Georgia Institute of Technology |
Blevins, Jacob | Georgia Institute of Technology |
Keywords: Networked Robots, Failure Detection and Recovery, Motion Control
Abstract: This paper demonstrates the viability of perfectly undetectable affine transformation attacks against robotic manipulators where intelligent attackers can inject multiplicative and additive false data while remaining completely hidden from system users. The attacker can implement these communication line attacks by satisfying three Conditions presented in this work. These claims are experimentally validated on a FANUC 6 degree of freedom manipulator by comparing a nominal (non-attacked) trial and a detectable attack case against three perfectly undetectable trajectory attack Scenarios: scaling, reflection, and shearing. The results show similar observed end effector error for the attack Scenarios and the nominal case, indicating that the perfectly undetectable affine transformation attack method keeps the attacker perfectly hidden while enabling them to attack manipulator trajectories.
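One generic way to read "perfectly undetectable" here (a schematic sketch, not the paper's exact construction): the attacker applies an affine map to the commanded reference on the forward channel, \(\tilde{r} = A r + c\), and the corresponding inverse map \(A^{-1}(\tilde{y} - c)\) to the measurements \(\tilde{y}\) returned on the feedback channel, so the tracking error observed by the user remains indistinguishable from the nominal case even though the manipulator actually follows the scaled, reflected, or sheared trajectory.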
|
|
16:40-16:45, Paper ThET13.2 | |
CDA: Covert Deception Attacks in Multi-Agent Resource Scheduling |
|
Hao, Wei | Nanjing University |
Liu, Jia | Nanjing University |
Li, Wenjie | Nanjng University |
Chen, Lijun | Nanjing University |
Keywords: Robot Safety, Swarm Robotics, Deep Learning Methods
Abstract: In this letter, we address the critical security concerns in multi-agent systems, where illegal infiltration is commonly used to convert agents into malicious entities. Existing research predominantly focuses on explicit malicious attack patterns. Our work introduces a covert deception attack framework in the context of multi-agent resource scheduling scenarios. We first highlight vulnerabilities in scheduling strategies based on time and path costs. Exploiting these weaknesses, an infiltrated agent clandestinely gathers motion characteristics of other agents while posing as a teammate. Using these motion characteristics, the infiltrated agent employs an LSTM architecture to learn and predict congestion areas, thereby designing attack paths with greater time efficiency. This approach allows the infiltrated agent to secure additional resources and evade capture more effectively. Validation through simulation and real-world experiments demonstrates the feasibility and effectiveness of our approach, underscoring the importance of evaluating covert attacks in risk assessments within multi-agent systems.
|
|
16:45-16:50, Paper ThET13.3 | |
Early Model-Based Safety Analysis for Collaborative Robotic Systems (I) |
|
Manjunath, Meenakshi | Technical University of Applied Sciences Würzburg-Schweinfurt |
Jesus Raja, Jeshwitha | Technical University of Applied Sciences Würzburg-Schweinfurt |
Daun, Marian | Technical University of Applied Sciences Würzburg-Schweinfurt |
Keywords: Safety in HRI, Intelligent and Flexible Manufacturing, Modeling, Control, and Learning for Soft Robots
Abstract: The current era is marked by an accelerated digitization of manufacturing processes, with robotic systems increasingly integrated into various workflows. Yet, despite significant advancements, it is impractical to fully automate certain tasks due to prohibitive costs and technical constraints. As a result, there’s a growing emphasis on human-robot collaboration (HRC) for intricate operations. In HRC scenarios, humans and robots co-inhabit the same work environment, operating side by side. More than just mere coexistence in the same space, they actively collaborate on shared tasks, thus raising the stakes in terms of safety. The dynamic behavior of robots must be synchronized with the anticipated and unexpected human actions, adding another layer of complexity to the safety considerations. It is essential to conduct comprehensive safety analyses that identify potential risks that pose harm to the human operator. As a proactive measure to foster early-stage safety and risk analysis, we propose the use of goal models. The approach enables the specification of safety threats within the HRC context, thereby facilitating the development of safety tasks and supportive monitoring mechanisms. This approach helps in the refinement and implementation of safety measures, ensuring a secure and productive environment for human-robot collaboration.
|
|
16:50-16:55, Paper ThET13.4 | |
Investigating Security Threats in Multi-Tenant ROS 2 Systems |
|
Xia, Lichen | University of Delaware |
Gao, Xing | University of Delaware |
Shi, Weisong | University of Delaware |
Keywords: Multi-Robot Systems, Software Architecture for Robotic and Automation
Abstract: Robot Operating System (ROS) has been widely used to develop robotic applications. The first generation of ROS generally lacks security features, and ROS 2 is introduced with security support. However, security concerns still exist for running ROS in practical multi-tenant environments. In this paper, we conduct an in-depth investigation into the security of ROS 2. We focus on vulnerabilities in ROS nodes and topics and intend to explore methods to break the isolation and security mechanisms systematically. We devise a set of strategies that can be exploited by attackers to escalate privilege or cause information leakage in a multi-tenant environment. These attacks can bypass existing isolation and security mechanisms, including ROS 2’s native security module. To validate our findings, we employ simulations across various real-world scenarios to demonstrate how attackers could exploit these vulnerabilities to bypass existing security mechanisms. Finally, we present several defense practices to mitigate these identified threats.
|
|
16:55-17:00, Paper ThET13.5 | |
Multi-Task Robustness Enhancement Framework against Various Adversarial Patches |
|
Jing, Lihua | Chinese Academy of Sciences |
Wang, Rui | Chinese Academy of Sciences |
Li, Runbo | Chinese Academy of Sciences |
Zhu, Zixuan | Chinese Academy of Sciences |
Wei, Xingxing | Beihang University |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization, Visual Learning
Abstract: Autonomous systems leveraging visual perception face a rising threat from adversarial patches, jeopardizing their robustness. Existing defense methods adaptable to various pre-trained models typically rely on observed patch characteristics or prior attack data, having difficulty adapting to new threats. This study innovatively focuses on modeling patch attack behavior instead of existing patches, proposing a unified robustness enhancement framework against various adversarial patches. Through self-supervised learning, we accurately locate diverse adversarial patches without prior attack knowledge. Furthermore, we introduce an efficient adaptive patch inpainting method to mitigate patch impact while maintaining visual coherence. Experiments show that our methods effectively boost the robustness of visual perception models against various adversarial patches across different tasks.
|
|
17:00-17:05, Paper ThET13.6 | |
Perfectly Undetectable False Data Injection Attacks on Encrypted Bilateral Teleoperation System Based on Dynamic Symmetry and Malleability |
|
Kwon, Hyukbin | Georgia Institute of Technology |
Kawase, Hiroaki | The University of Electro-Communications |
Nieves-Vazquez, Heriberto Andres | Georgia Institute of Technology |
Kogiso, Kiminao | The University of Electro-Communications |
Ueda, Jun | Georgia Institute of Technology |
Keywords: Telerobotics and Teleoperation, Networked Robots, Dynamics
Abstract: This paper investigates the vulnerability of bilateral teleoperation systems to perfectly undetectable False Data Injection Attacks (FDIAs). Teleoperation, one of the major applications in robotics, involves a leader manipulator operated by a human and a follower manipulator at a remote site, connected via a communication channel. While this setup enables operation in challenging environments, it also introduces cybersecurity risks, particularly in the communication link. The paper focuses on a specific class of cyberattacks: perfectly undetectable FDIAs, where attackers alter signals without leaving detectable traces at all. Compared to previous research on linear and first-order nonlinear systems, this paper examines bilateral teleoperation systems with second-order nonlinear manipulator dynamics. The paper derives mathematical conditions based on Lie Group theory that enable such attacks, demonstrating how an attacker can modify the follower manipulator's motion while the operator perceives normal operation through the leader device. This vulnerability challenges conventional detection methods based on observable changes and highlights the need for advanced security measures in teleoperation systems. To validate the theoretical results, the paper presents experimental demonstrations using a teleoperation system connecting robots in the US and Japan.
|
|
ThET14 |
402 |
Hand and Gripper Design |
Regular Session |
Chair: Bekiroglu, Yasemin | Chalmers University of Technology, University College London |
Co-Chair: Plecnik, Mark | University of Notre Dame |
|
16:35-16:40, Paper ThET14.1 | |
A Novel Under-Actuated Gripper Based on Passive-Locking Mechanism for Stable Gripping under Environmental Constraints |
|
Yang, Seokjun | Kwangwoon University |
Lee, Sungon | Hanyang University |
Yang, Woosung | Kwangwoon University |
Keywords: Grippers and Other End-Effectors, Mechanism Design, Grasping
Abstract: This paper presents a novel under-actuated two-finger gripper that passively adapts to various environments and maintains its grip posture using a passive-locking mechanism. The proposed mechanism features fingers with three phalanges, each incorporating four-bar and eight-bar linkages arranged in parallel. These linkages perform crucial functions, including maintaining the grip angle and ensuring passive characteristics during pinch grips. Previous grippers with passive mechanisms and three-phalanx fingers faced issues with gripping instability, particularly when changes in the passive joint angle were caused by object inertia or external lateral forces. To address this problem, we propose a new passive-locking mechanism utilizing an eight-bar linkage. This innovative design is engineered to adapt to environmental conditions, establish a secure grip, and maintain the grip angle of the passive joint after the grip is achieved. To demonstrate the advantages of the proposed mechanism, this paper conducts a fingertip force vector analysis and a mobility analysis according to the pinch sequence. It also details the derivation process and principles of the mechanism. The gripper's operational range and gripping force are examined through kinematic analysis and verified by simulation. Furthermore, the study shows that the proposed mechanism effectively responds to environmental constraints, even in environments with obstacles surrounding the object. Comparative experiments with and without a contact bar indicate that the proposed gripper can stably secure an object in scenarios involving swing motions and external forces of approximately 5 N.
|
|
16:40-16:45, Paper ThET14.2 | |
Juzu Type Gripper That Can Change Both Shape and Firmness |
|
Hara, Shunya | Osaka University |
Fukuda, Osamu | Saga University |
Higashimori, Mitsuru | Osaka University |
Keywords: Grippers and Other End-Effectors, Mechanism Design
Abstract: This paper presents a novel gripper capable of actively changing both shape and firmness. The gripper increases its grasp ability by changing its finger posture and firmness suitable for given target objects. In the proposed gripper, each finger is constructed by serially connecting multiple Juzu units. By controlling the angles between neighboring Juzu units individually using two actuators used for sending and bending, arbitrary finger shapes can be generated. In addition, by controlling the tension of the wire that penetrates all Juzu units in each finger, the friction between Juzu units is adjusted and the firmness of the finger can be varied. A prototype gripper was designed and developed, and experiments to evaluate the capabilities of changing shape and firmness were conducted. Furthermore, through experiments of preshaping and grasping various objects with different shapes and sizes, the validity of the proposed method was demonstrated.
|
|
16:45-16:50, Paper ThET14.3 | |
A Direct-Drive Gripper Designed by Ellipse Synthesis across Two Output Modes |
|
Ramesh, Shashank | University of Notre Dame |
Plecnik, Mark | University of Notre Dame |
Keywords: Mechanism Design, Grippers and Other End-Effectors, Kinematics
Abstract: There are many ways for a gripper to estimate the forces between its fingers. If powered by direct-drive brushless motors, then one technique is to measure their current. This is not the most accurate technique, but it is simple, keeps the sensor remote, and requires no new components. The estimation involves multiplying the current signals by the torque constant and the inverse transpose of the Jacobian. The Jacobian either amplifies the signal from fingertip force to motor current (at the cost of tip force production), or diminishes it (with the gain of tip force production), indicating an inherent trade-off. However, the Jacobian is a function of configuration, and for any workspace point there are multiple configurations (multiple inverse kinematics solutions), therefore a selection of Jacobians exists. For a given workspace point, the number of Jacobian choices is just a few, but these choices can be designed (through dimensional synthesis) to overcome the trade-off. The problem can be framed as velocity ellipse synthesis over multiple output modes. In this work, we conduct optimal synthesis to compute a new gripper design. The gripper was built and tested. It transitions between two different modes: sense mode and grip mode. Sense mode can sense forces 3 times smaller than grip mode. Grip mode can exert forces 4 times greater than sense mode.
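The current-based force estimate described above can be sketched in a few lines (a minimal illustration with made-up link lengths, angles, and motor constants, not the authors' code): joint torques are recovered from measured currents via the torque constant, and fingertip forces follow from the inverse transpose of the configuration-dependent Jacobian.

import numpy as np

# Made-up two-joint planar finger (illustrative values only).
l1, l2 = 0.05, 0.04        # link lengths [m]
q1, q2 = 0.4, 0.8          # joint angles [rad]
k_t = 0.021                # motor torque constant [Nm/A], direct drive (no gearing)
i_meas = np.array([0.30, 0.12])   # measured motor currents [A]

# Planar Jacobian mapping joint velocities to fingertip velocity.
J = np.array([
    [-l1*np.sin(q1) - l2*np.sin(q1+q2), -l2*np.sin(q1+q2)],
    [ l1*np.cos(q1) + l2*np.cos(q1+q2),  l2*np.cos(q1+q2)],
])

tau = k_t * i_meas                 # joint torques estimated from currents
f_tip = np.linalg.solve(J.T, tau)  # statics: tau = J^T F  =>  F = J^{-T} tau
print(f_tip)                       # estimated fingertip force [N]

Because the Jacobian depends on which inverse-kinematics branch the finger sits in, the same fingertip position can yield either a force-amplifying or a force-attenuating estimate, which is the trade-off the synthesis in the paper is designed to exploit.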
|
|
16:50-16:55, Paper ThET14.4 | |
Mechanisms and Computational Design of Multi-Modal End-Effector with Force Sensing Using Gated Networks |
|
Tanaka, Yusuke | University of California, Los Angeles |
Zhu, Alvin | University of California Los Angeles |
Lin, Richard | UC Los Angeles |
Mehta, Ankur | UCLA |
Hong, Dennis | UCLA |
Keywords: Grippers and Other End-Effectors, Legged Robots, Climbing Robots
Abstract: In limbed robotics, end-effectors must serve dual functions, such as both feet for locomotion and grippers for grasping, which presents design challenges. This paper introduces a multi-modal end-effector capable of transitioning between flat and line foot configurations while providing grasping capabilities. MAGPIE integrates 8-axis force sensing using proposed mechanisms with hall effect sensors, enabling both contact and tactile force measurements. We present a computational design framework for our sensing mechanism that accounts for noise and interference, allowing for desired sensitivity and force ranges and generating ideal inverse models. The hardware implementation of MAGPIE is validated through experiments, demonstrating its capability as a foot and verifying the performance of the sensing mechanisms, ideal models, and gated network-based models.
|
|
16:55-17:00, Paper ThET14.5 | |
Single-Motor-Driven (4 + 2)-Fingered Robotic Gripper Capable of Expanding the Workable Space in the Extremely Confined Environment |
|
Nishimura, Toshihiro | Kanazawa University |
Akasaka, Keisuke | Kanazawa University |
Ishikawa, Subaru | Kanazawa University |
Watanabe, Tetsuyou | Kanazawa University |
Keywords: Grippers and Other End-Effectors, Mechanism Design, Grasping
Abstract: This study proposes a novel robotic gripper that can expand workable spaces in a target environment to pick up objects from confined spaces. The proposed gripper is most effective for retrieving objects from deformable environments, such as taking an object out of a drawstring bag, or for extracting target objects located behind surrounding objects. The proposed gripper achieves both work-space expansion and grasping motion by using only a single motor. The gripper is equipped with four outer fingers for expanding the environment and two inner fingers for grasping an object. The inner and outer fingers move in different directions for their respective functions of grasping and spatial expansion. To realize two different movements of the fingers, a novel self-motion switching mechanism is developed that switches between acting as a feed-screw mechanism and as a rack-and-pinion mechanism. The mechanism switches the motions according to the magnitude of the force applied to the inner fingers. This paper presents the mechanism design of the developed gripper, including the self-motion switching mechanism and the actuation strategy for expanding the workable space. The mechanical analysis is also presented, and the analysis result is validated experimentally. Moreover, an automatic object-picking system using the developed gripper is constructed to evaluate the gripper.
|
|
17:00-17:05, Paper ThET14.6 | |
A Three-Finger Adaptive Gripper with Finger-Embedded Suction Cups for Enhanced Object Grasping Mechanism |
|
Yoon, Jimin | Sungkyunkwan University |
Jeong, Heeyeon | Sungkyunkwan University |
Park, Jae Hyeong | Sungkwunkwan University |
Gong, Young Jin | SungKyunKwan University(SKKU) |
Shin, Dongsu | Sungkyunkwan University |
Seo, Hyeon-Woong | Sungkyunkwan University |
Moon, Seung Jae | Sungkyunkwan, Mechanical Engineering, Robottory |
Choi, Hyouk Ryeol | Sungkyunkwan University |
Keywords: Grippers and Other End-Effectors, Grasping, Mechanism Design
Abstract: With the growth of logistics automation, there is an increasing demand for advanced grippers. This study presents a gripper that integrates suction cups into the fingertips to overcome the limitations of traditional robotic gripping methods. Designed with a 5-degree-of-freedom (DOF) structure, the gripper allows for angle adjustment of the suction cups, facilitating effective grasping in various environments. Its adaptive grasping mechanism simplifies control by using the fingertips and distal phalanges to cage objects without manually controlling them. The versatility of the gripper was tested by performing hybrid finger-suction gripping, as well as conventional finger and suction gripping. These advanced gripping strategies are designed to enhance flexibility and efficiency in logistics automation when handling a diverse range of objects.
|
|
ThET15 |
403 |
Datasets and Benchmarking |
Regular Session |
Chair: Xiao, Ted | Google DeepMind |
Co-Chair: Sintov, Avishai | Tel-Aviv University |
|
16:35-16:40, Paper ThET15.1 | |
Syn-Mediverse: A Multimodal Synthetic Dataset for Intelligent Scene Understanding of Healthcare Facilities |
|
Mohan, Rohit | University of Freiburg |
Arce y de la Borbolla, José | University of Freiburg |
Mokhtar, Sassan | University of Freiburg |
Cattaneo, Daniele | University of Freiburg |
Valada, Abhinav | University of Freiburg |
Keywords: Data Sets for Robotic Vision, Computer Vision for Medical Robotics, Medical Robots and Systems
Abstract: Safety and efficiency are paramount in healthcare facilities where the lives of patients are at stake. Despite the adoption of robots to assist medical staff in challenging tasks such as complex surgeries, human expertise is still indispensable. The next generation of autonomous healthcare robots hinges on their capacity to perceive and understand their complex and frenetic environments. While deep learning models are increasingly used for this purpose, they require extensive annotated training data which is impractical to obtain in real-world healthcare settings. To bridge this gap, we present Syn-Mediverse, the first hyper-realistic multimodal synthetic dataset of diverse healthcare facilities. Syn-Mediverse contains over 48,000 images from a simulated industry-standard optical tracking camera and provides more than 1.5M annotations spanning five different scene understanding tasks including depth estimation, object detection, semantic segmentation, instance segmentation, and panoptic segmentation. We demonstrate the complexity of our dataset by evaluating the performance on a broad range of state-of-the-art baselines for each task. To further advance research on scene understanding of healthcare facilities, along with the public dataset we provide an online evaluation benchmark available at http://syn-mediverse.cs.uni-freiburg.de.
|
|
16:40-16:45, Paper ThET15.2 | |
STEER: Flexible Robotic Manipulation Via Dense Language Grounding |
|
Smith, Laura | UC Berkeley |
Irpan, Alexander | Google |
Gonzalez Arenas, Montserrat | Google |
Kirmani, Sean | Google DeepMind |
Kalashnikov, Dmitry | Google Brain |
Shah, Dhruv | Google DeepMind |
Xiao, Ted | Google DeepMind |
Keywords: Learning from Demonstration, Data Sets for Robot Learning, Big Data in Robotics and Automation
Abstract: The complexity of the real world demands robotic systems that can intelligently adapt to unseen situations. We present STEER, a robot learning framework that bridges high-level, commonsense reasoning with precise, flexible low-level control. Our approach translates complex situational awareness into actionable low-level behavior through training language-grounded policies with dense annotation. By structuring policy training around fundamental, modular manipulation skills expressed in natural language, STEER exposes an expressive interface for humans or Vision-Language Models (VLMs) to intelligently orchestrate the robot's behavior by reasoning about the task and context. Our experiments demonstrate the skills learned via STEER can be combined to synthesize novel behaviors to adapt to new situations or perform completely new tasks without additional data collection or training.
|
|
16:45-16:50, Paper ThET15.3 | |
MBE-ARI: A Multimodal Dataset Mapping Bi-Directional Engagement in Animal-Robot Interaction |
|
Noronha, Ian | Purdue University |
Jawaji, Advait Prasad | Purdue University |
Soto, Juan | Purdue University |
An, Jiajun | The Chinese University of Hong Kong |
Gu, Yan | Purdue University |
Kaur, Upinder | Purdue University |
Keywords: Gesture, Posture and Facial Expressions, Data Sets for Robot Learning, Multi-Modal Perception for HRI
Abstract: Animal-robot interaction (ARI) remains an unexplored challenge in robotics, as robots struggle to interpret the complex, multimodal communication cues of animals, such as body language, movement, and vocalizations. Unlike human-robot interaction, which benefits from established datasets and frameworks, animal-robot interaction lacks the foundational resources needed to facilitate meaningful bidirectional communication. To bridge this gap, we present the MBE-ARI (Multimodal Bidirectional Engagement in Animal-Robot Interaction), a novel multimodal dataset that captures detailed interactions between a legged robot and cows. The dataset includes synchronized RGB-D streams from multiple viewpoints, annotated with body pose and activity labels across interaction phases, offering an unprecedented level of detail for ARI research. Additionally, we introduce a full-body pose estimation model tailored for quadruped animals, capable of tracking 39 keypoints with a mean average precision (mAP) of 92.7%, outperforming existing benchmarks in animal pose estimation. The MBE-ARI dataset and our pose estimation framework lay a robust foundation for advancing research in animal-robot interaction, providing essential tools for developing perception, reasoning, and interaction frameworks needed for effective collaboration between robots and animals. The dataset and resources are publicly available at https://github.com/RISELabPurdue/MBE-ARI/, inviting further exploration and development in this critical area.
|
|
16:50-16:55, Paper ThET15.4 | |
A Diffusion-Based Data Generator for Training Object Recognition Models in Ultra-Range Distance |
|
Bamani Beeri, Eran | Tel Aviv University |
Nissinman, Eden | Tel-Aviv University |
Koenigsberg, Lisa | Tel-Aviv University |
Meir, Inbar | Tel Aviv University |
Sintov, Avishai | Tel-Aviv University |
Keywords: Data Sets for Robotic Vision, Gesture, Posture and Facial Expressions, Recognition
Abstract: Object recognition, commonly performed by a camera, is a fundamental requirement for robots to complete complex tasks. Some tasks require recognizing objects far from the robot's camera. A challenging example is Ultra-Range Gesture Recognition (URGR) in human-robot interaction where the user exhibits directive gestures at a distance of up to 25 m from the robot. However, training a model to recognize hardly visible objects located in ultra-range requires an exhaustive collection of a significant amount of labeled samples. The generation of synthetic training datasets is a recent solution to the lack of real-world data, but it fails to properly replicate the realistic visual characteristics of distant objects in images. In this letter, we propose the Diffusion in Ultra-Range (DUR) framework based on a Diffusion model to generate labeled images of distant objects in various scenes. The DUR generator receives a desired distance and class (e.g., gesture) and outputs a corresponding synthetic image. We apply DUR to train a URGR model with directive gestures in which fine details of the gesturing hand are challenging to distinguish. DUR is compared to other types of generative models, showcasing superiority both in fidelity and in recognition success rate when training a URGR model. More importantly, training a DUR model on a limited amount of real data and then using it to generate synthetic data for training a URGR model outperforms directly training the URGR model on real data. The synthetic-based URGR model is also demonstrated in gesture-based direction of a ground robot.
|
|
16:55-17:00, Paper ThET15.5 | |
MovingCables: Moving Cable Segmentation Method and Dataset |
|
Holesovsky, Ondrej | Czech Technical University in Prague |
Skoviera, Radoslav | Czech Institute of Informatics, Robotics, and Cybernetics; Czech |
Hlavac, Vaclav | Czech Technical University in Prague |
Keywords: Data Sets for Robotic Vision, Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization
Abstract: Manipulating cluttered cables, hoses or ropes is challenging for both robots and humans. Humans often simplify these perceptually challenging tasks by pulling or pushing tangled cables and observing the resulting motions. We propose to use a similar interactive perception principle to aid robotic cable manipulation. A fundamental building block of such an endeavor is a cable motion segmentation method that densely labels moving cable image pixels. This letter presents MovingCables, a moving cable dataset, which we hope will motivate the development and evaluation of cable motion segmentation algorithms. The dataset consists of real-world image sequences automatically annotated with ground truth segmentation masks and optical flow. In addition, we propose a cable motion segmentation method and evaluate its performance on the new dataset.
|
|
ThET16 |
404 |
Soft Sensors |
Regular Session |
Chair: Stuart, Hannah | UC Berkeley |
Co-Chair: Monje, Concepción A. | University Carlos III of Madrid |
|
16:35-16:40, Paper ThET16.1 | |
Dynamic Contact Force Estimation Via Integration of Soft Sensor Based on Fiber Bragg Grating and Series Elastic Actuator |
|
Na, Hyunbin | DGIST |
Lee, Hyunwook | Gyeongsang National University |
Park, Chang Hyun | Pusan National University |
Kim, Gyeong Hun | Pusan National University |
Kim, Chang-Seok | Pusan National University |
Oh, Sehoon | DGIST |
Keywords: Force and Tactile Sensing, Compliant Joints and Mechanisms, Flexible Robotics
Abstract: Research on interactive force measurement in robotics follows two trends: distributed force sensing using soft tactile sensors and centered force sensing using rigid sensors. This study proposes a novel force sensing mechanism and algorithm that integrate the two approaches, taking advantage of a soft tactile sensor and a spring-based rigid actuator. Soft tactile sensors allow for gentle contact with humans but have a limited recovery and measurable force range. The rigidity of the spring-based actuator is utilized to address these force estimation issues, allowing a wider range of forces to be estimated while maintaining the softness. The paper presents a novel approach for integrating the two sensors using sophisticated algorithms. Specifically, a deep neural network is developed to estimate the contact location through the tactile sensor. Subsequently, a state-space observer is proposed based on the dynamic characteristics of the robot link, which integrates the network output and the torque measurements obtained from the spring-based actuator. This algorithm provides accurate force estimation during dynamic behavior and enables a wide measurable force range across the entire area of the robot link. The efficacy of the proposed mechanism and algorithm is validated through rigorous experimentation, demonstrating fast recovery characteristics and high accuracy.
|
|
16:40-16:45, Paper ThET16.2 | |
A Piezoresistive Printable Strain Sensor for Monitoring and Control of Soft Robotic Links |
|
Sánchez, Claudia | University Carlos III of Madrid |
Rodriguez, Daniel | AIMPLAS |
Otero, Susana | AIMPLAS |
Monje, Concepción A. | University Carlos III of Madrid |
Keywords: Soft Sensors and Actuators, Additive Manufacturing, Flexible Robotics
Abstract: Integrating sensors into soft links with complex geometries without compromising their flexibility, precision, or structural integrity remains one of the main challenges in soft robotics. This article presents the design, fabrication, and electromechanical evaluation of a 3D-printed flexible strain sensor tailored for monitoring and controlling these links. By combining Fused Filament Fabrication (FFF) and Direct Ink Writing (DIW) technologies, we manufactured a sensor composed of a thermoplastic polyurethane (TPU) substrate and a pattern of silver (Ag) nanoparticles ink, ensuring high flexibility and conductivity. We performed electromechanical tests to assess the sensor's performance, including three-point bending tests, cyclic loading to evaluate its durability, and angular deflection measurements to confirm its precision in detecting bending angles. The sensor demonstrated efficient piezoresistive behavior within a defined working range between 3% and 8% of flexure strain with a Gauge Factor (GF) of 0.24 and stable repeatability. We also tested its integration into a soft link, showing that the sensor maintains flexibility and accuracy during deformation.
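To make the reported sensitivity figure concrete, the Gauge Factor relates the relative resistance change to the applied flexure strain. The sketch below shows how such a GF would be computed from raw readings; the resistance values are hypothetical illustrations, not measurements from the paper.

```python
# Minimal sketch: computing a piezoresistive Gauge Factor from raw readings.
# The numeric values below are hypothetical, chosen only to reproduce the
# reported order of magnitude (GF ~ 0.24 within the 3-8% strain range).

def gauge_factor(r0_ohm: float, r_ohm: float, strain: float) -> float:
    """GF = (dR / R0) / strain, with strain as a fraction (0.05 = 5%)."""
    return ((r_ohm - r0_ohm) / r0_ohm) / strain

# Example: resistance rising from 100.0 to 101.2 Ohm at 5% flexure strain
print(gauge_factor(100.0, 101.2, 0.05))  # -> 0.24
```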
|
|
16:45-16:50, Paper ThET16.3 | |
AnySkin: Plug-And-Play Skin Sensing for Robotic Touch |
|
Bhirangi, Raunaq Mahesh | New York University |
Pattabiraman, Venkatesh | New York University |
Erciyes, Mehmet Enes | New York University |
Cao, Yifeng | Columbia University |
Hellebrekers, Tess | Meta AI Research |
Pinto, Lerrel | New York University |
Keywords: Soft Sensors and Actuators, Sensorimotor Learning, Transfer Learning
Abstract: While tactile sensing is widely accepted as an important and useful sensing modality, its use pales in comparison to other sensory modalities like vision and proprioception. AnySkin addresses the critical challenges that impede the use of tactile sensing -- versatility, replaceability, and data reusability. Building on the simple design of ReSkin and decoupling the sensing electronics from the sensing interface, AnySkin simplifies integration, making it as straightforward as putting on a phone case and connecting a charger. Furthermore, AnySkin is the first uncalibrated tactile sensor with cross-instance generalizability of learned manipulation policies. To summarize, this work makes three key contributions: first, we introduce a streamlined fabrication process and a design tool for creating an adhesive-free, durable, and easily replaceable magnetic tactile sensor; second, we characterize slip detection and policy learning with the AnySkin sensor; and third, we demonstrate zero-shot generalization of models trained on one instance of AnySkin to new instances, and compare it with popular existing tactile solutions like DIGIT and ReSkin. Code, design files, and videos of policy experiments can be found at https://any-skin.github.io
|
|
16:50-16:55, Paper ThET16.4 | |
Proximity and Visuotactile Point Cloud Fusion for Contact Patches in Extreme Deformation |
|
Yin, Jessica | University of Pennsylvania |
Shah, Paarth | University of Oxford |
Kuppuswamy, Naveen | Toyota Research Institute |
Beaulieu, Andrew | Toyota Research Institute |
Uttamchandani, Avinash | Toyota Research Institute |
Castro, Alejandro | Toyota Research Institute |
Pikul, James | University of Pennsylvania |
Tedrake, Russ | Massachusetts Institute of Technology |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators, Soft Robot Materials and Design
Abstract: Visuotactile sensors are a popular tactile sensing strategy due to high-fidelity estimates of local object geometry. However, existing algorithms for processing raw sensor inputs to useful intermediate signals such as contact patches struggle in high-deformation regimes. This is due to physical constraints imposed by sensor hardware and small-deformation assumptions used by mechanics-based models. In this work, we propose a fusion algorithm for proximity and visuotactile point clouds for contact patch segmentation, entirely independent from membrane mechanics. This algorithm exploits the synchronous, high spatial resolution proximity and visuotactile modalities enabled by an extremely deformable, selectively transmissive soft membrane, which uses visible light for visuotactile sensing and infrared light for proximity depth. We evaluate our contact patch algorithm in low (10%), medium (60%), and high (100%+) strain states. We compare our method against three baselines: proximity-only, tactile-only, and a first principles mechanics model. Our approach outperforms all baselines with an average RMSE under 2.8 mm of the contact patch geometry across all strain ranges. We demonstrate our contact patch algorithm in four applications: varied stiffness membranes, torque and shear-induced wrinkling, closed loop control, and pose estimation.
|
|
16:55-17:00, Paper ThET16.5 | |
Spatial Sensitivity Equalization of ERT-Based Robotic Skin through Gauge Factor Distribution Optimization |
|
Cho, Junhwi | KAIST |
Chung, Hyunjo | Korea Advanced Institute of Science and Technology (KAIST) |
Park, Kyungseo | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Kim, Jung | KAIST |
Keywords: Force and Tactile Sensing, Touch in HRI, Soft Sensors and Actuators
Abstract: Electrical Resistance Tomography (ERT) has emerged as a promising technology for large-area robotic skin due to its ability to reconstruct pressure distribution over extensive regions using a few sparsely distributed electrodes. Despite ERT’s potential to reconstruct the external forces applied on 3D surfaces, the uneven distribution of spatial sensitivity leads to significant errors in identifying the physical quantities of contacts, inhibiting this technique from being an effective tactile sensor. To address this issue, this paper proposes a method to equalize the spatial sensitivity by modulating the conductivity of ERT sensors through topology optimization. In a simulation environment, the sensor's conductive domain was converted into a binary image and optimized to equalize spatial sensitivity and reduce disparities between low and high-sensitivity areas. Additionally, we present a sensor fabrication method with a complex optimized conductive patch pattern from simulation by applying screen printing techniques. The effectiveness of the implemented spatial sensitivity equalization was validated by comparing it to a conventional ERT sensor in both simulations and real-world environments. The proposed sensitivity optimization method expands the use of ERT-based sensors for distributed tactile sensing in physical human-robot interaction scenarios.
|
|
17:00-17:05, Paper ThET16.6 | |
Milli-Scale AcousTac Sensing Using Soft Helmholtz Resonators |
|
Aderibigbe, Jadesola | University of California, Berkeley |
Li, Monica | Yale University |
Lee, Jungpyo | University of California, Berkeley |
Stuart, Hannah | UC Berkeley |
Keywords: Soft Sensors and Actuators, Force and Tactile Sensing, Soft Robot Applications
Abstract: Acoustic transmission, or sound, can effectively communicate information over distances through various media. We focus on generating acoustic transmission using pneumatically driven resonators for wireless tactile sensing without the need for any electronics at the end-effector or contact point. We explore the relationship between emitted frequency and the geometry of the resonance chamber. When a normal compressive force is applied to the end cap, the compliant resonant cavity deforms, leading to an increase in frequency measurable by an external microphone. Prior work uses tube resonators with fipple attachments. In the present work, we study whether a different smaller audible cylindrical resonator with air blown across the entryway can be utilized instead. We test the utility of the Helmholtz resonator model in predicting the experimental frequency response. Resonance is often modeled for rigid cavities, presenting unique challenges in predicting resonance for the design of soft resonating taxels.
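The abstract tests the Helmholtz resonator model against the measured frequency response. For reference, the classical lumped (rigid-cavity) estimate depends only on neck area, neck length, and cavity volume; a minimal sketch is given below, with all dimensions hypothetical milli-scale values rather than the paper's designs.

```python
import math

def helmholtz_frequency(neck_area_m2: float, neck_length_m: float,
                        cavity_volume_m3: float, c: float = 343.0) -> float:
    """Classical Helmholtz estimate: f = (c / 2*pi) * sqrt(A / (V * L)).
    Assumes a rigid cavity and neglects end corrections; a compliant taxel
    deviates from this, which is the frequency shift the sensor exploits."""
    return (c / (2.0 * math.pi)) * math.sqrt(
        neck_area_m2 / (cavity_volume_m3 * neck_length_m))

# Hypothetical geometry: 1 mm diameter neck, 2 mm long, 50 mm^3 cavity
neck_area = math.pi * (0.5e-3) ** 2
print(helmholtz_frequency(neck_area, 2e-3, 50e-9))  # ~4.8 kHz, audible range
```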
|
|
17:05-17:10, Paper ThET16.7 | |
Enhanced Model-Free Dynamic State Estimation for a Soft Robot Finger Using an Embedded Optical Waveguide Sensor |
|
Krauss, Henrik | Keio University, Faculty of Science and Technology |
Takemura, Kenjiro | Keio University |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Sensors and Actuators, Machine Learning for Robot Control
Abstract: In this letter, an advanced stretchable optical waveguide sensor is implemented into a multidirectional PneuNet soft actuator to enhance dynamic state estimation through a NARX neural network. The stretchable waveguide featuring a semidivided core design from previous work is sensitive to multiple strain modes. It is integrated into a soft finger actuator with two pressure chambers that replicates human finger motions. The soft finger, designed for applications in soft robotic grippers or hands, is viewed in isolation under pneumatic actuation controlled by motorized linear stages. The research first characterizes the soft finger's workspace and sensor response. Subsequently, three dynamic state estimators are developed using NARX architecture, differing in the degree of incorporating the optical waveguide sensor response. Evaluation on a testing path reveals that the full sensor response significantly improves end effector position estimation, reducing mean error by 51% from 5.70 mm to 2.80 mm, compared to only 21% improvement to 4.53 mm using the estimator representing a single core waveguide design. The letter concludes by discussing the application of these estimators for (open-loop) model-predictive control and recommends future focus on advanced, structured soft (optical) sensors for model-free state estimation and control of soft robots.
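The estimators above follow a NARX structure, regressing the next end-effector state on lagged outputs and inputs. The sketch below only shows how such lagged regressors are assembled; the data are synthetic placeholders and a plain least-squares fit stands in for the NARX neural network used in the paper.

```python
import numpy as np

def build_narx_regressors(y: np.ndarray, u: np.ndarray, ny: int, nu: int):
    """Stack [y_{t-1..t-ny}, u_{t-1..t-nu}] as features to predict y_t.
    y: (T, dy) past outputs (e.g. fingertip position),
    u: (T, du) inputs (e.g. chamber pressures and waveguide sensor channels)."""
    lag = max(ny, nu)
    rows, targets = [], []
    for t in range(lag, len(y)):
        past_y = y[t - ny:t][::-1].ravel()   # most recent output first
        past_u = u[t - nu:t][::-1].ravel()
        rows.append(np.concatenate([past_y, past_u]))
        targets.append(y[t])
    return np.asarray(rows), np.asarray(targets)

# Hypothetical data; linear least squares as a stand-in for the NARX network
T = 500
y = np.cumsum(np.random.randn(T, 3) * 0.1, axis=0)   # fake 3-D tip position
u = np.random.randn(T, 4)                            # fake pressures + sensor channels
X, Y = build_narx_regressors(y, u, ny=3, nu=3)
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("one-step prediction:", X[-1] @ W)
```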
|
|
ThET17 |
405 |
Design and Control |
Regular Session |
Chair: Le Goff, Leni Kenneth | Edinburgh Napier University |
Co-Chair: Padir, Taskin | Northeastern University |
|
16:35-16:40, Paper ThET17.1 | |
Efficient and Diverse Generative Robot Designs Using Evolution and Intrinsic Motivation |
|
Le Goff, Leni Kenneth | UPMC |
Smith, Simón C. | Edinburgh Napier University |
Keywords: Evolutionary Robotics, Methods and Tools for Robot System Design, Embodied Cognitive Science
Abstract: Methods for generative design of robot physical configurations can automatically find optimal and innovative solutions for challenging tasks in complex environments. The vast search-space includes the physical design-space and the controller parameter-space, making it a challenging problem in machine learning and optimisation in general. Evolutionary algorithms (EAs) have shown promising results in generating robot designs via gradient-free optimisation. Morpho-evolution with learning (MEL) uses EAs to concurrently generate robot designs and learn the optimal parameters of the controllers. Two main issues prevent MEL from scaling to higher complexity tasks: i) computational cost and ii) premature convergence to sub-optimal designs. To address these issues, we propose combining morpho-evolution with intrinsic motivations. Intrinsically motivated behaviour arises from embodiment and simple learning rules without external guidance. We use a homeokinetic controller that generates exploratory behaviour in a few seconds with minimal knowledge of the robot’s design. Homeokinesis replaces costly learning phases, reducing computational time and favouring diversity, preventing premature convergence. We compare our approach with current MEL methods in several downstream tasks. The generated designs score higher in all the tasks, are more diverse, and are quickly generated compared to morpho-evolution with static parameters. Source and containers available at github.com/AutonomousRoboticEvolution.
|
|
16:40-16:45, Paper ThET17.2 | |
A Novel Hybrid Hysteresis Modeling Method for Multiloop-Asymmetry Hysteresis Behavior of Nonlinear Compliant Actuators |
|
Zhou, Libo | Zhejiang University of Technology |
Xu, Lingpeng | Zhejiang University of Technology |
Ou, Linlin | Zhejiang University of Technology |
Yu, Xinyi | Zhejiang University of Technology |
Feng, Yalei | Midea Group |
Bai, Shaoping | Aalborg University |
Keywords: Compliant Joints and Mechanisms, Prosthetics and Exoskeletons, Wearable Robotics
Abstract: Nonlinear compliant actuators are increasingly used in human-robot interaction scenarios due to their inherent flexibility. However, they exhibit nonlinear hysteresis, which degrades force/torque tracking performance if not modeled accurately. Moreover, existing methods struggle to handle multi-loop asymmetric hysteresis. In this work, we present a novel modeling method in which the hysteresis curves are decoupled into nonlinear reference lines and symmetrical hysteresis loops. A hybrid hysteresis model based on a power function and the Maxwell-slip model is then developed to fit the nonlinear reference lines and the symmetrical hysteresis loops, respectively. Experiments were conducted on a nonlinear compliant actuator, and the results show that the root-mean-square error (RMSE) of the hysteresis model decreases by 24.4% compared with the Maxwell-slip-based hysteresis model.
|
|
16:45-16:50, Paper ThET17.3 | |
Dynamic Mode Decomposition with Sonomyography and Electromyography for Predictive Modeling of Lower Limb Exoskeleton Walking |
|
Lambeth, Krysten | North Carolina State University |
Xue, Xiangming | North Carolina State University |
Singh, Mayank | North Carolina State University |
Huang, He (Helen) | North Carolina State University and University of North Carolina |
Sharma, Nitin | North Carolina State University |
Keywords: Model Learning for Control, Prosthetics and Exoskeletons, Rehabilitation Robotics
Abstract: The nonlinear dynamics required to model walking with multi-joint lower limb exoskeleton assistance result in a high computational burden. To address this, we derive a Koopman-based linearized model of the human-exoskeleton system using electromyography and ultrasound-derived metrics of volitional muscle activity during exoskeleton-assisted walking. Data are collected from one participant with spinal cord injury (SCI) and two participants with no disabilities. Various electromyography and ultrasound-derived features, in addition to normalized motor currents, are used to derive predictive models, and we identify which muscle activation metrics produce the most accurate model for each subject. For both subjects without disabilities, the most accurate model uses only ultrasound-derived echogenicity as a metric of muscle activity, while the most accurate model for the subject with SCI uses only EMG waveform length. Furthermore, the inclusion of ground reaction force increases the prediction accuracy of all models for one participant with no disabilities while decreasing the accuracy of most models for the participant with SCI. For all subjects, the most accurate subject-specific linear model has a root-mean-square error (averaged across limb segment angles) of <8°.
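The Koopman-based linearization described above amounts to fitting a best linear operator between successive (lifted) state snapshots with control inputs. A minimal dynamic-mode-decomposition-with-control sketch is shown below; the snapshot data and feature choices are hypothetical placeholders, not the paper's EMG/ultrasound pipeline.

```python
import numpy as np

def fit_linear_operator(Z, U):
    """Least-squares fit of z_{k+1} ~= A z_k + B u_k (DMD with control).
    Z: (n, T) lifted-state snapshots, U: (m, T-1) inputs (e.g. motor currents)."""
    X, Xp = Z[:, :-1], Z[:, 1:]
    Omega = np.vstack([X, U])            # stacked [state; input] snapshots
    G = Xp @ np.linalg.pinv(Omega)       # [A B] = X' * pinv([X; U])
    n = Z.shape[0]
    return G[:, :n], G[:, n:]

# Hypothetical snapshots: limb-segment angles plus muscle-activity features
Z = np.random.randn(6, 200)
U = np.random.randn(2, 199)
A, B = fit_linear_operator(Z, U)
print(A.shape, B.shape)   # (6, 6) (6, 2)
```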
|
|
16:50-16:55, Paper ThET17.4 | |
Data-Driven Sampling Based Stochastic MPC for Skid-Steer Mobile Robot Navigation |
|
Trivedi, Ananya | Northeastern University |
Prajapati, Sarvesh | Northeastern University |
Shirgaonkar, Anway Prasad | Northeastern University |
Zolotas, Mark | Toyota Research Institute |
Padir, Taskin | Northeastern University |
Keywords: Model Learning for Control, Planning under Uncertainty, Robust/Adaptive Control
Abstract: Traditional approaches to motion modeling for skid-steer robots struggle to capture nonlinear tire-terrain dynamics, especially during high-speed maneuvers. In this paper, we tackle such nonlinearities by enhancing a dynamic unicycle model with Gaussian Process (GP) regression outputs. This enables us to develop an adaptive, uncertainty-informed navigation formulation. We solve the resultant stochastic optimal control problem using a chance-constrained Model Predictive Path Integral (MPPI) control method. This approach formulates obstacle avoidance and path-following as chance constraints, accounting for residual uncertainties from the GP to ensure safety and reliability in control. Leveraging GPU acceleration, we efficiently manage the non-convex nature of the problem, ensuring real-time performance. Our approach unifies path-following and obstacle avoidance across different terrains, unlike prior works which typically focus on one or the other. We compare our GP-MPPI method against unicycle and data-driven kinematic models within the MPPI framework. In simulations, our approach shows superior tracking accuracy and obstacle avoidance. We further validate our approach through hardware experiments on a skid-steer robot platform, demonstrating its effectiveness in high-speed navigation. The GPU implementation of the proposed method and supplementary video footage are available at https://stochasticmppi.github.io.
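The motion model above augments a dynamic unicycle with GP-predicted residuals before rolling it out inside MPPI. The sketch below shows one Euler-integrated rollout step with a placeholder residual term; the first-order lag constants and the residual function are stand-ins, not the paper's trained GP or identified dynamics.

```python
import numpy as np

def unicycle_step(state, cmd, dt, residual_fn):
    """state = [x, y, yaw, v, omega]; cmd = [v_cmd, omega_cmd].
    residual_fn stands in for the GP mean correction on [v_dot, omega_dot]."""
    x, y, yaw, v, omega = state
    v_dot_nom = (cmd[0] - v) / 0.5        # hypothetical first-order velocity lag
    w_dot_nom = (cmd[1] - omega) / 0.3    # hypothetical first-order yaw-rate lag
    dv, dw = residual_fn(state, cmd)      # learned tire/terrain correction
    v += (v_dot_nom + dv) * dt
    omega += (w_dot_nom + dw) * dt
    x += v * np.cos(yaw) * dt
    y += v * np.sin(yaw) * dt
    yaw += omega * dt
    return np.array([x, y, yaw, v, omega])

zero_residual = lambda s, u: (0.0, 0.0)   # replace with a GP mean prediction
print(unicycle_step(np.zeros(5), np.array([1.0, 0.2]), 0.05, zero_residual))
```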
|
|
16:55-17:00, Paper ThET17.5 | |
Agile Mobility with Rapid Online Adaptation Via Meta-Learning and Uncertainty-Aware MPPI |
|
Kalaria, Dvij | Carnegie Mellon University |
Xue, Haoru | University of California Berkeley |
Xiao, Wenli | Carnegie Mellon University |
Tao, Tony | Carnegie Mellon University |
Shi, Guanya | Carnegie Mellon University |
Dolan, John M. | Carnegie Mellon University |
Keywords: Robust/Adaptive Control, Machine Learning for Robot Control, Representation Learning
Abstract: Modern non-linear model-based controllers require an accurate physics model and model parameters to be able to control mobile robots at their limits. Also, due to surface slipping at high speeds, the friction parameters may continually change (like tire degradation in autonomous racing), and the controller may need to adapt rapidly. Many works derive a task-specific robot model with a parameter adaptation scheme that works well for the task but requires a lot of effort and tuning for each platform and task. In this work, we design a fully model-learning-based controller based on meta pre-training that can adapt very quickly, using few-shot dynamics data, to any wheel-based robot with any model parameters, while also reasoning about model uncertainty. We demonstrate our results in small-scale numerical simulation, in the large-scale Unity simulator, and on a medium-scale hardware platform with a wide range of settings. We show that our results are comparable to domain-specific, well-engineered controllers and that our controller has excellent generalization performance across all scenarios.
|
|
17:00-17:05, Paper ThET17.6 | |
Variable Transmission Mechanisms for Robotic Applications: A Review |
|
Park, Jihyuk | Yeungnam University |
Lee, Joon | Sogang University |
Seo, Hyung-Tae | Kyonggi University |
Jeong, Seokhwan | Mechanical Eng., Sogang University |
Keywords: Mechanism Design, Actuation and Joint Mechanisms, Compliant Joints and Mechanisms
Abstract: Actuators play a crucial role in robotics, determining the force and speed capabilities necessary for varied tasks, directly affecting the performance of the robotic system. With the growing reliance on robotics in both industrial applications and daily life, innovative actuator research has expanded significantly. Despite advances, traditional actuators encounter limitations in performance and operational range due to inherent physical constraints. To address these challenges, variable transmission mechanisms (VTMs) have emerged over the past decade as an alternative solution, enhancing the adaptability and efficiency of robotic systems. However, there is currently a lack of survey articles that comprehensively cover the mechanisms and working principles of VTMs in robotics. This review article fills this gap by offering an extensive analysis of VTM applications in robotics. It categorizes VTMs based on their mechanisms and principles, presents case studies on both commercial and experimental VTMs, and provides insights into future research directions.
|
|
17:05-17:10, Paper ThET17.7 | |
Continuously Variable Transmission and Stiffness Actuator Based on Actively Variable Four-Bar Linkage for Highly Dynamic Robot Systems |
|
Hur, Jungwoo | Sogang University |
Song, Hangyeol | Georgia Institute of Technology |
Jeong, Seokhwan | Mechanical Eng., Sogang University |
Keywords: Mechanism Design, Actuation and Joint Mechanisms, Compliant Joints and Mechanisms
Abstract: This paper presents a novel actuation mechanism that combines a continuously variable transmission (CVT) mechanism with a variable stiffness actuator (VSA) for highly dynamic robot systems such as legged robots. The CVT effectively changes the input-output transmission ratio of the system, thereby extending the operational torque-speed range. Concurrently, the VSA adjusts the system stiffness, altering its compliance characteristics. Both CVT and VSA are seamlessly integrated into a single four-bar linkage mechanism, with their active features enabled by an actively variable link within this linkage. This CVT-VSA mechanism offers a range of dynamic advantages by inversely varying transmission ratio and stiffness, which includes impact mitigation, torque or speed amplification, and expanded control bandwidth. The implementation and efficacy of the CVT-VSA mechanism in a legged robot were tested and validated through a series of experiments.
|
|
ThET18 |
406 |
Planning under Uncertainty 3 |
Regular Session |
Chair: Kurniawati, Hanna | Australian National University |
Co-Chair: Fridovich-Keil, David | The University of Texas at Austin |
|
16:35-16:40, Paper ThET18.1 | |
A Data-Driven Aggressive Autonomous Racing Framework Utilizing Local Trajectory Planning with Velocity Prediction |
|
Li, Zhouheng | Zhejiang University |
Zhou, Bei | Zhejiang University |
Hu, Cheng | Zhejiang University |
Xie, Lei | Zhejiang University |
Su, Hongye | Zhejiang University |
Keywords: Integrated Planning and Learning, Integrated Planning and Control, Constrained Motion Planning
Abstract: The development of autonomous driving has boosted the research on autonomous racing. However, existing local trajectory planning methods have difficulty planning trajectories with optimal velocity profiles at racetracks with sharp corners, thus weakening the performance of autonomous racing. To address this problem, we propose a local trajectory planning method that integrates Velocity Prediction based on Model Predictive Contouring Control (VPMPCC). The optimal parameters of VPMPCC are learned through Bayesian Optimization (BO) based on a proposed novel Objective Function adapted to Racing (OFR). Specifically, VPMPCC achieves velocity prediction by encoding the racetrack as a reference velocity profile and incorporating it into the optimization problem. This method optimizes the velocity profile of local trajectories, especially at corners with significant curvature. The proposed OFR balances racing performance with vehicle safety, ensuring safe and efficient BO training. In the simulation, the number of training iterations for OFR-based BO is reduced by 42.86% compared to the state-of-the-art method. The optimal simulation-trained parameters are then applied to a real-world F1TENTH vehicle without retraining. During prolonged racing on a custom-built racetrack featuring significant sharp corners, the mean projected velocity of VPMPCC reaches 93.18% of the vehicle's handling limits. The released code is available at https://github.com/zhouhengli/VPMPCC.
|
|
16:40-16:45, Paper ThET18.2 | |
RLPP: A Residual Method for Zero-Shot Real-World Autonomous Racing on Scaled Platforms |
|
Ghignone, Edoardo | ETH |
Baumann, Nicolas | ETH |
Hu, Cheng | Zhejiang University |
Wang, Jonathan | ETH Zurich |
Xie, Lei | Zhejiang University |
Carron, Andrea | ETH Zurich |
Magno, Michele | ETH Zurich |
Keywords: Field Robots, Reinforcement Learning, Wheeled Robots
Abstract: Autonomous racing presents a complex environment requiring robust controllers capable of making rapid decisions under dynamic conditions. While traditional controllers based on tire models are reliable, they often demand extensive tuning or system identification. Reinforcement learning (RL) methods offer significant potential due to their ability to learn directly from interaction, yet they typically suffer from the sim-to-real gap, where policies trained in simulation fail to perform effectively in the real world. In this paper, we propose RLPP, a residual RL framework that enhances a Pure Pursuit (PP) controller with an RL-based residual. This hybrid approach leverages the reliability and interpretability of PP while using RL to fine-tune the controller's performance in real-world scenarios. Extensive testing on the F1TENTH platform demonstrates that RLPP improves the lap times of the baseline controllers by up to 6.37%, closing the gap to state-of-the-art (SotA) methods by more than 52% and providing reliable performance in zero-shot real-world deployment, overcoming key challenges associated with the sim-to-real transfer and reducing the performance gap from simulation to reality by more than 8-fold compared to the baseline RL controller. The RLPP framework is made available as an open-source tool, encouraging further exploration and advancement in autonomous racing research. The code is available at: www.github.com/forzaeth/rlpp.
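The residual structure keeps the geometric Pure Pursuit command and lets the learned policy add a bounded correction on top. A minimal sketch of that idea is shown below, using the standard bicycle-model PP steering law and a clipped residual; the lookahead, wheelbase, and residual bound are hypothetical values, not the paper's tuning.

```python
import math

def pure_pursuit_steer(alpha: float, lookahead: float, wheelbase: float) -> float:
    """Geometric Pure Pursuit: steer toward a lookahead point at bearing alpha."""
    return math.atan2(2.0 * wheelbase * math.sin(alpha), lookahead)

def rlpp_steer(alpha, lookahead, wheelbase, residual, max_residual=0.1):
    """Base PP command plus a clipped learned residual (radians)."""
    base = pure_pursuit_steer(alpha, lookahead, wheelbase)
    return base + max(-max_residual, min(max_residual, residual))

# Hypothetical F1TENTH-scale numbers: 0.33 m wheelbase, 1.5 m lookahead
print(rlpp_steer(alpha=0.2, lookahead=1.5, wheelbase=0.33, residual=0.05))
```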
|
|
16:45-16:50, Paper ThET18.3 | |
Uncertainty-Aware Probabilistic Risk Quantification of SOTIF for Autonomous Vehicles |
|
Yao, Botao | Harbin Institute of Technology |
Huang, Shuohan | Harbin Institute of Technology |
Liu, Chuanyi | Harbin Institute of Technology |
Han, Peiyi | Harbin Institute of Technology |
Lin, Jie | Harbin Institute of Technology |
Duan, Shaoming | Pengcheng Laboratory |
Keywords: Collision Avoidance, Intelligent Transportation Systems, Motion and Path Planning
Abstract: Ensuring the Safety of the Intended Functionality (SOTIF) for autonomous vehicles (AVs) is critical. Effective risk assessment helps AVs make decisions and avoid risks. However, existing methods face challenges due to environmental uncertainties, insufficient multi-dimensional risk quantification, and limited predictive accuracy. To address this challenge, we propose an uncertainty-aware probabilistic risk assessment framework that quantifies the risk of AVs violating safety constraints and calculates the expected average severity of such violations in uncertain environments. We first establish a general SOTIF risk model to characterize the static risk of the AV and surrounding traffic participants. Following this, we introduce a method for predicting dynamic uncertainty risks, resulting in probabilistic risk quantification. This framework accounts for multi-dimensional uncertainties and enhances safety under dynamic conditions. Extensive evaluations across typical traffic scenarios—including highways, intersections, and roundabouts—demonstrate that our method outperforms typical algorithms like Time Headway (THW) and Time-to-Collision (TTC). Empirical studies in extreme scenarios further validate the framework's ability to reduce risks and improve system generalization. The related code is available at: https://github.com/idslab-autosec/risk_uncertainty.
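The THW and TTC baselines mentioned above are simple kinematic surrogates for risk. For reference, their standard definitions are sketched below with hypothetical values; this is only an illustration of the baselines, not the paper's probabilistic risk model.

```python
def time_headway(gap_m: float, v_ego_mps: float) -> float:
    """THW: time for the ego vehicle to cover the current gap at its own speed."""
    return float('inf') if v_ego_mps <= 0 else gap_m / v_ego_mps

def time_to_collision(gap_m: float, v_ego_mps: float, v_lead_mps: float) -> float:
    """TTC: time until contact assuming constant speeds; infinite if not closing."""
    closing = v_ego_mps - v_lead_mps
    return float('inf') if closing <= 0 else gap_m / closing

print(time_headway(30.0, 15.0), time_to_collision(30.0, 15.0, 10.0))  # 2.0 s, 6.0 s
```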
|
|
16:50-16:55, Paper ThET18.4 | |
Think Deep and Fast: Learning Neural Nonlinear Opinion Dynamics from Inverse Dynamic Games for Split-Second Interactions |
|
Hu, Haimin | Princeton University |
Fernández Fisac, Jaime | Princeton University |
Leonard, Naomi | Princeton University |
Gopinath, Deepak | Northwestern University |
DeCastro, Jonathan | Cornell University |
Rosman, Guy | Massachusetts Institute of Technology |
Keywords: Motion and Path Planning, Human-Aware Motion Planning, Learning from Demonstration
Abstract: Non-cooperative interactions commonly occur in multi-agent scenarios such as car racing, where an ego vehicle can choose to overtake the rival, or stay behind it until a safe overtaking “corridor” opens. While an expert human can do well at making such time-sensitive decisions, autonomous agents are incapable of rapidly reasoning about complex, potentially conflicting options, leading to suboptimal behaviors such as deadlocks. Recently, the nonlinear opinion dynamics (NOD) model has proven to exhibit fast opinion formation and avoidance of decision deadlocks. However, NOD modeling parameters are oftentimes assumed fixed, limiting their applicability in complex and dynamic environments. It remains an open challenge to determine such parameters automatically and adaptively, accounting for the ever-changing environment. In this work, we propose for the first time a learning-based and game-theoretic approach to synthesize a Neural NOD model from expert demonstrations, given as a dataset containing (possibly incomplete) state and action trajectories of interacting agents. We demonstrate Neural NOD’s ability to make fast and deadlock-free decisions in a simulated autonomous racing example. We find that Neural NOD consistently outperforms the state-of-the-art data-driven inverse game baseline in terms of safety and overtaking performance.
|
|
16:55-17:00, Paper ThET18.5 | |
Online Risk-Bounded Graph-Based Local Planning for Autonomous Driving with Theoretical Guarantees |
|
Ahmad, Abdulrahman | Khalifa University of Science and Technology |
Khonji, Majid | Khalifa University |
Elbassioni, Khaled | Khalifa University of Science and Technology |
Dias, Jorge | Khalifa University |
Al-Sumaiti, Ameena | Khalifa University |
Keywords: Constrained Motion Planning, Collision Avoidance, Planning under Uncertainty
Abstract: Risk-bounded motion planning in dynamic environments for autonomous driving presents complex challenges, particularly in solving the nonconvex problem of ensuring continuous, safe, and real-time navigation towards a destination. This paper introduces an online graph-based local planning approach constrained by a user-defined driving style, expressed as a risk budget Δ for the entire mission. Our online approach assigns a risk bound to each motion planning decision, ensuring that the total risk consumed remains within Δ. First, we construct a spatial lattice graph that adheres to the vehicle's curvature constraints. Then, the trajectory planning problem is reformulated as an online optimization problem, where decisions must be made sequentially without prior knowledge of future events. We therefore reduce the problem to an online multiple-choice knapsack problem (ON-MCKP), where the knapsack items are candidate paths generated by solving constrained shortest-path problems. To solve the ON-MCKP, we deploy online algorithms that offer theoretical guarantees on the risk allocation throughout the entire mission. The effectiveness of our method is demonstrated empirically, showing significant improvements in the objective without violating safety constraints.
|
|
17:00-17:05, Paper ThET18.6 | |
Dashing for the Golden Snitch: Multi-Drone Time-Optimal Motion Planning with Multi-Agent Reinforcement Learning |
|
Wang, Xian | Zhejiang University |
Zhou, Jin | Zhejiang University |
Feng, Yuanli | Zhejiang University |
Mei, Jiahao | Zhejiang University of Technology |
Chen, Jiming | Zhejiang University |
Li, Shuo | Zhejiang University |
Keywords: Reinforcement Learning, Motion and Path Planning
Abstract: Recent innovations in autonomous drones have facilitated time-optimal flight in single-drone configurations, and enhanced maneuverability in multi-drone systems by applying optimal control and learning-based methods. However, few studies have achieved time-optimal motion planning for multi-drone systems, particularly during highly agile maneuvers or in dynamic scenarios. This paper presents a decentralized policy network using multi-agent reinforcement learning for time-optimal multi-drone flight. To strike a balance between flight efficiency and collision avoidance, we introduce a soft collision-free mechanism inspired by optimization-based methods. By customizing PPO in a centralized training, decentralized execution (CTDE) fashion, we unlock higher efficiency and stability in training while ensuring lightweight implementation. Extensive simulations show that, despite slight performance trade-offs compared to single-drone systems, our multi-drone approach maintains near-time-optimal performance with a low collision rate. Real-world experiments validate our method, with two quadrotors using the same network as in simulation achieving a maximum speed of 13.65 m/s and a maximum body rate of 13.4 rad/s in a 5.5 m × 5.5 m × 2.0 m space across various tracks, relying entirely on onboard computation.
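The "soft collision-free mechanism" above blends a smooth proximity penalty into the reward instead of imposing a hard constraint. One plausible form of such a penalty is sketched below; the quadratic hinge shape, safety radius, and weight are assumptions for illustration, since the paper's exact formulation is not reproduced here.

```python
def soft_collision_penalty(dist_m: float, safe_radius_m: float = 0.6,
                           weight: float = 5.0) -> float:
    """Smooth penalty: zero outside the safety radius, growing quadratically
    as two drones approach each other inside it."""
    if dist_m >= safe_radius_m:
        return 0.0
    return -weight * (1.0 - dist_m / safe_radius_m) ** 2

print(soft_collision_penalty(0.3), soft_collision_penalty(1.0))  # -1.25, 0.0
```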
|
|
17:05-17:10, Paper ThET18.7 | |
Kernel-Based Metrics Learning for Uncertain Opponent Vehicle Trajectory Prediction in Autonomous Racing |
|
Lee, Hojin | Ulsan National Institute of Science and Technology |
Nam, Youngim | Ulsan National Institute of Science and Technology |
Lee, Sanghun | Ulsan Institute of Science and Technology |
Kwon, Cheolhyeon | Ulsan National Institute of Science and Technology |
Keywords: Planning under Uncertainty, Integrated Planning and Learning, Machine Learning for Robot Control
Abstract: Autonomous racing confronts significant challenges in safely overtaking Opponent Vehicles (OVs) that exhibit uncertain trajectories, stemming from unknown driving policies. To address these challenges, this study proposes heterogeneous kernel metrics for Deep Kernel Learning (DKL), designed to robustly capture the diverse driving policies of OVs, and carry out precise trajectory predictions along with the associated uncertainties. A key virtue of the proposed kernel metrics lies in their ability to align similar driving policies and disjoin dissimilar ones in an unsupervised manner, given the observed interactions between the Ego Vehicle (EV) and OVs. The efficacy of the proposed method is substantiated through experimental studies on a 1/10th scale racecar platform, demonstrating improved prediction accuracy and thereby safely overtaking against OVs. Furthermore, our method is computationally efficient for onboard computing units, affirming its viability in fast-paced racing environments.
|
|
17:10-17:15, Paper ThET18.8 | |
Inferring Occluded Agent Behavior in Dynamic Games from Noise Corrupted Observations |
|
Qiu, Tianyu | University of Texas at Austin |
Fridovich-Keil, David | The University of Texas at Austin |
Keywords: Planning under Uncertainty, Optimization and Optimal Control, Multi-Robot Systems
Abstract: In mobile robotics and autonomous driving, it is natural to model agent interactions as the Nash equilibrium of a noncooperative, dynamic game. These methods inherently rely on observations from sensors such as lidars and cameras to identify agents participating in the game and, therefore, have difficulty when some agents are occluded. To address this limitation, this paper presents an occlusion-aware game-theoretic inference method to estimate the locations of potentially occluded agents, and simultaneously infer the intentions of both visible and occluded agents, which best accounts for the observations of visible agents. Additionally, we propose a receding horizon planning strategy based on an occlusion-aware contingency game designed to navigate in scenarios with potentially occluded agents. Monte Carlo simulations validate our approach, demonstrating that it accurately estimates the game model and trajectories for both visible and occluded agents using noisy observations of visible agents. Our planning pipeline also significantly enhances navigation safety compared to an occlusion-ignorant baseline.
|
|
ThET19 |
407 |
Manufacturing and Processes |
Regular Session |
Chair: Lennartson, Bengt | Chalmers University of Technology |
Co-Chair: Zhou, Zhengxue | University of Liverpool |
|
16:35-16:40, Paper ThET19.1 | |
Domain Randomization for Object Detection in Manufacturing Applications Using Synthetic Data: A Comprehensive Study |
|
Zhu, Xiaomeng | KTH and Scania CV AB |
Henningsson, Jacob | Uppsala University |
Li, Duruo | Scania CV AB |
Mårtensson, Pär | Scania |
Hanson, Lars | Skövde University |
Björkman, Mårten | KTH |
Maki, Atsuto | KTH Royal Institute of Technology |
Keywords: Computer Vision for Manufacturing, Data Sets for Robotic Vision, Computer Vision for Automation
Abstract: This paper addresses key aspects of domain randomization in generating synthetic data for manufacturing object detection applications. To this end, we present a comprehensive data generation pipeline that reflects different factors: object characteristics, background, illumination, camera settings, and post-processing. We also introduce the Synthetic Industrial Parts Object Detection dataset (SIP15-OD), consisting of 15 objects from three industrial use cases under varying environments, as a test bed for the study, while also employing a publicly available industrial dataset for robotic applications. In our experiments, we present extensive results and insights into the feasibility as well as the challenges of sim-to-real object detection. In particular, we identify material properties, rendering methods, post-processing, and distractors as important factors. Leveraging these, our method achieves top performance on the public dataset with Yolov8 models trained exclusively on synthetic data: mAP@50 scores of 96.4% on the robotics dataset, and 94.1%, 99.5%, and 95.3% across three of the SIP15-OD use cases, respectively. The results showcase the effectiveness of the proposed domain randomization, suggesting that the generated data closely covers the distribution of real data for these applications.
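The pipeline randomizes object, background, illumination, camera, and post-processing factors for each rendered frame. A minimal sketch of such a per-frame parameter sampler is given below; all parameter names, choices, and ranges are hypothetical illustrations, not the paper's configuration.

```python
import random

def sample_render_config():
    """Draw one randomized scene configuration; ranges are illustrative only."""
    return {
        "object": {"metallic": random.uniform(0.0, 1.0),
                   "roughness": random.uniform(0.1, 0.9)},
        "background": random.choice(["plain", "factory_hdr", "random_texture"]),
        "illumination": {"intensity_lux": random.uniform(200, 2000),
                         "color_temp_k": random.uniform(3000, 6500)},
        "camera": {"focal_mm": random.uniform(20, 50),
                   "distance_m": random.uniform(0.4, 1.5)},
        "post_processing": {"gaussian_noise_std": random.uniform(0.0, 0.02),
                            "motion_blur_px": random.randint(0, 3)},
        "distractors": random.randint(0, 5),
    }

print(sample_render_config())
```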
|
|
16:40-16:45, Paper ThET19.2 | |
Component-Aware Unsupervised Logical Anomaly Generation for Industrial Anomaly Detection |
|
Tong, Xuan | Fudan University |
Chang, Yang | Fudan University |
Zhao, Qing | Fudan University |
Yu, Jiawen | Fudan University |
Wang, Boyang | Fudan University |
Lin, Junxiong | Fudan University |
Lin, Yuxuan | Fudan University |
Mai, Xinji | Fudan University |
Wang, Haoran | Fudan University |
Tao, Zeng | Fudan University |
Wang, Yan | Fudan University |
Zhang, Wenqiang | Fudan University |
Keywords: Computer Vision for Manufacturing, Computer Vision for Automation, Deep Learning for Visual Perception
Abstract: Anomaly detection is critical in industrial manufacturing for ensuring product quality and improving efficiency in automated processes. The scarcity of anomalous samples limits traditional detection methods, making anomaly generation essential for expanding the data repository. However, recent generative models often produce unrealistic anomalies that increase false positives, or they require real-world anomaly samples for training. In this work, we treat anomaly generation as a compositional problem and propose ComGEN, a component-aware and unsupervised framework that addresses the gap in logical anomaly generation. Our method comprises a multi-component learning strategy to disentangle visual components, followed by subsequent generation-editing procedures. The disentangled text-to-component pairs, which reveal intrinsic logical constraints, guide attention-based residual mapping and model training with iteratively matched references across multiple scales. Experiments on the MVTecLOCO dataset confirm the efficacy of ComGEN, which achieves the best AUROC score of 91.2%. Additional experiments on a real-world diesel-engine scenario and the widely used MVTecAD dataset demonstrate significant performance improvements when integrating simulated anomalies generated by ComGEN into automated production workflows.
|
|
16:45-16:50, Paper ThET19.3 | |
Use the Force, Bot! - Force-Aware ProDMP with Event-Based Replanning |
|
Lödige, Paul Werner | Karlsruhe Institute of Technology |
Li, Maximilian Xiling | Karlsruhe Institute of Technology |
Lioutikov, Rudolf | Karlsruhe Institute of Technology |
Keywords: Learning from Demonstration, Imitation Learning
Abstract: Movement Primitives (MPs) are a well-established method for representing and generating modular robot trajectories. This work presents FA-ProDMP, a novel approach that introduces force awareness to Probabilistic Dynamic Movement Primitives (ProDMP). FA-ProDMP adapts trajectories during runtime to account for measured and desired forces, offering smooth trajectories and capturing position and force correlations across multiple demonstrations. FA-ProDMPs support multiple axes of force, making them agnostic to Cartesian or joint space control. This versatility makes FA-ProDMP a valuable tool for learning contact-rich manipulation tasks, such as power plug insertion. To reliably evaluate FA-ProDMP, this work additionally introduces a modular, 3D-printed task suite called POEMPEL, inspired by the popular Lego Technic pins. POEMPEL mimics industrial peg-in-hole assembly tasks with force requirements and offers multiple parameters of adjustment, such as position, orientation, and plug stiffness level, thereby varying the direction and amount of required forces. Our experiments demonstrate that FA-ProDMP outperforms other MP formulations on the POEMPEL setup and an electrical power plug insertion task, thanks to its replanning capabilities based on measured forces. These findings highlight how FA-ProDMP enhances the performance of robotic systems in contact-rich manipulation tasks.
|
|
16:50-16:55, Paper ThET19.4 | |
Reinforcement Learning on Reconfigurable Hardware: Overcoming Material Variability in Laser Material Processing |
|
Masinelli, Giulio | EPFL |
Rajani, Chang | Swiss Federal Laboratories for Materials Science and Technology |
Hoffmann, Patrik | Empa |
Wasmer, Kilian | EMPA |
Atienza, David | Epfl Sti Imt Esl |
Keywords: Manufacturing, Maintenance and Supply Chains, Reinforcement Learning, Hardware-Software Integration in Robotics
Abstract: Ensuring consistent processing quality is challenging in laser processes due to varying material properties and surface conditions. Although some approaches have shown promise in solving this problem via automation, they often rely on predetermined targets or are limited to simulated environments. To address these shortcomings, we propose a novel real-time reinforcement learning approach for laser process control, implemented on a Field Programmable Gate Array to achieve real-time execution. Our experimental results from laser welding tests on stainless steel samples with a range of surface roughnesses validated the method's ability to adapt autonomously, without relying on reward engineering or prior setup information. Specifically, the algorithm learned the optimal power profile for each unique surface characteristic, demonstrating significant improvements over hand-engineered optimal constant power strategies — up to 23% better performance on rougher surfaces and 7% on mixed surfaces. This approach represents a significant advancement in automating and optimizing laser processes, with potential applications across multiple industries.
|
|
16:55-17:00, Paper ThET19.5 | |
GenCo: A Dual LVLM Generate-Correct Framework for Adaptive Peg-In-Hole Robotics |
|
Zhou, Zhengxue | University of Liverpool |
Veeramani, Satheeshkumar | University of Liverpool |
Fakhruldeen, Hatem | University of Liverpool |
Uyanik, Seda | University of Liverpool |
Cooper, Andrew Ian | University of Liverpool |
Keywords: Perception-Action Coupling, Cognitive Control Architectures, Industrial Robots
Abstract: Recent advances in Vision Language Models (VLMs) have enhanced their application in robotics, encompassing both high-level task planning and low-level action control. Despite their strong performance across various robotic tasks, even in zero-shot scenarios, most VLM applications remain open-loop, adhering to a plan-and-execute paradigm without mechanisms to assess task completion. To address this limitation, we propose GenCo, a Generate-Correct framework designed to automate a peg-in-hole task using a UR5e robot. This framework integrates a VLM-based motion generator and a motion expert, working collaboratively to refine and correct actions during robotic task execution. Both VLM agents are fine-tuned using the pre-trained LLaVA, enhancing adaptability and scaling efficiently to diverse tasks. Our experiments demonstrate the adaptiveness of the framework, improving the success rate for the peg-in-hole task by 12.75% compared to a single VLM open-loop method. Notably, in unseen scenarios, the success rate for a triangular peg was increased by 15%, and for a random-shaped peg by 17%, underscoring the system's effectiveness in handling novel tasks. Adaptive testing under varied camera positions demonstrated robust performance, affirming reliability despite shifts in the visual input. The framework is also designed to be lightweight and efficient, facilitating broader adoption and practical deployment. Access to our code and model is provided here: https://github.com/Zhengxuez/generate_correct
|
|
17:00-17:05, Paper ThET19.6 | |
ASCENT: Autonomous Skill Learning Toward Complex Embodied Tasks with Foundation Models |
|
Wu, Haolin | Sun Yat-Sen University |
Liu, Yuecheng | Huawei Noah's Ark Lab |
Dong, Junyi | Cornell University |
Zhang, Heng | Huawei |
Mao, Sitong | ShenZhen Huawei Cloud Computing Technologies Co., Ltd |
Wang, Hesheng | Shanghai Jiao Tong University |
Wu, Weigang | Sun Yat-Sen University |
Zhou, Shunbo | Huawei |
Keywords: Domestic Robotics
Abstract: Collecting data from simulated scenarios for training robotic skills provides a safer and more controllable alternative to real-world environments. However, it demands considerable effort, including the manual construction of simulation environments, the careful design of tasks, and the difficulty of obtaining effective trajectories. These limitations hinder the efficiency of data collection from simulated scenarios. In this paper, we leverage the prior knowledge of Large Language Models (LLMs) and Large Multimodal Models (LMMs) to generate simulated scenarios and embodied tasks. We introduce a novel framework, ASCENT (Autonomous Skill learning toward Complex Embodied tasks with fouNdaTion models), designed to efficiently accomplish these tasks and generate trajectory data. ASCENT features a fully autonomous skill learning mechanism based on an AI agent. During task training, the AI agent identifies suitable atomic skills from an atomic skill library to either directly complete the task or serve as an initial policy for further training. Newly acquired atomic skills are subsequently added to the library. To address training failures and enhance efficiency, the AI agent uses an LLM to automatically optimize the skill training process based on feedback received from simulations. Experimental results indicate that the number of training steps required for learning new tasks can be reduced by up to 65.9%.
|
|
17:05-17:10, Paper ThET19.7 | |
Ms. NAMI: Multimodal Semantic Navigation on Relative Metric Intention Graph |
|
Zhai, Shichao | Zhejiang University |
Cui, Yuxiang | Zhejiang University |
Ye, Shuhao | Zhejiang University |
Yu, Xuan | Zhejiang University |
Mao, Sitong | ShenZhen Huawei Cloud Computing Technologies Co., Ltd |
Zhou, Shunbo | Huawei |
Xiong, Rong | Zhejiang University |
Wang, Yue | Zhejiang University |
Keywords: Domestic Robotics, Autonomous Vehicle Navigation, Reinforcement Learning
Abstract: Embodied navigation in unknown environments presents the significant challenge of integrating tasks with multimodal goals into a unified framework. In this paper, we propose the Multimodal Semantic Navigation on Relative Metric Intention Graph (Ms. NAMI), a framework that integrates various navigation tasks with multimodal goals based on a relative topo-metric intention graph. A reinforcement learning based policy with a concise action space, consisting of frontier nodes and intention nodes, is designed to guide the agent to select reasonable sub-goals. A sparse reward design is introduced to reduce bias during training. Additionally, several engineering optimizations are implemented to enhance overall performance. The experimental results indicate that our method can achieve robust navigation performance in a variety of unknown environments.
|
|
ThET20 |
408 |
Agricultural Automation 4 |
Regular Session |
Chair: Hauser, Kris | University of Illinois at Urbana-Champaign |
Co-Chair: Behley, Jens | University of Bonn |
|
16:35-16:40, Paper ThET20.1 | |
Towards Autonomous Crop Monitoring: Inserting Sensors in Cluttered Environments |
|
Lee, Moonyoung | Carnegie Mellon University |
Berger, Aaron | Harvard University |
Guri, Dominic | Carnegie Mellon University |
Zhang, Kevin | Carnegie Mellon University |
Coffey, Lisa | Iowa State University |
Kantor, George | Carnegie Mellon University |
Kroemer, Oliver | Carnegie Mellon University |
Keywords: Agricultural Automation, Robotics and Automation in Agriculture and Forestry, Hardware-Software Integration in Robotics
Abstract: Monitoring crop nutrients can aid farmers in optimizing fertilizer use. Many existing robots, however, rely on vision-based phenotyping, which can only indirectly estimate nutrient deficiencies once crops have undergone visible color changes. We present a contact-based phenotyping robot platform that can directly insert nitrate sensors into cornstalks to proactively monitor macronutrient levels in crops. This task is challenging because inserting such sensors requires subcentimeter precision in an environment that contains high levels of clutter, lighting variation, and occlusion. To address these challenges, we develop a robust perception-action pipeline to grasp stalks, and create a custom robot gripper which mechanically aligns the sensor before inserting it into the stalk. Through experimental validation on 48 unique stalks in a cornfield in Iowa, we demonstrate our platform’s capability of detecting a stalk with 94% success, grasping a stalk with 90% success, and inserting a sensor with 60% success. In addition to developing an autonomous phenotyping research platform, we share key challenges and insights obtained from deployment in the field. Our research platform is open-sourced, with additional information available at https://kantor-lab.github.io/cornbot.
|
|
16:40-16:45, Paper ThET20.2 | |
A Dataset and Benchmark for Shape Completion of Fruits for Agricultural Robotics |
|
Magistri, Federico | University of Bonn |
Läbe, Thomas | University of Bonn |
Marks, Elias Ariel | University of Bonn |
Nagulavancha, Sumanth | University of Bonn |
Pan, Yue | University of Bonn |
Smitt, Claus | University of Bonn |
Klingbeil, Lasse | University of Bonn |
Halstead, Michael Allan | Bonn University |
Kuhlmann, Heiner | University of Bonn |
McCool, Christopher Steven | University of Bonn |
Behley, Jens | University of Bonn |
Stachniss, Cyrill | University of Bonn |
Keywords: Robotics and Automation in Agriculture and Forestry, Data Sets for Robotic Vision, Agricultural Automation
Abstract: As the world population is expected to reach 10 billion by 2050, our agricultural production system needs to double its productivity despite a decline of human workforce in the agricultural sector. Autonomous robotic systems are one promising pathway to increase productivity by taking over labor-intensive manual tasks like fruit picking. To be effective, such systems need to monitor and interact with plants and fruits precisely, which is challenging due to the cluttered nature of agricultural environments causing, for example, strong occlusions. Thus, being able to estimate the complete 3D shapes of objects in presence of occlusions is crucial for automating operations such as fruit harvesting. In this paper, we propose the first publicly available 3D shape completion dataset for agricultural vision systems. We provide an RGB-D dataset for estimating the 3D shape of fruits. Specifically, our dataset contains RGB-D frames of single sweet peppers in lab conditions but also in a commercial greenhouse. For each fruit, we additionally collected high-precision point clouds that we use as ground truth. For acquiring the ground truth shape, we developed a measuring process that allows us to record data of real sweet pepper plants, both in the lab and in the greenhouse with high precision, and determine the shape of the sensed fruits. We release our dataset, consisting of almost 7,000 RGB-D frames belonging to more than 100 different fruits. We provide segmented RGB-D frames, with camera intrinsics to easily obtain colored point clouds, together with the corresponding high-precision, occlusion-free point clouds obtained with a high-precision laser scanner. We additionally enable evaluation of shape completion approaches on a hidden test set through a public challenge on a benchmark server.
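The dataset ships segmented RGB-D frames together with camera intrinsics so that users can back-project colored point clouds themselves. A minimal pinhole back-projection sketch is shown below; the image size and intrinsic values are hypothetical, not the dataset's calibration.

```python
import numpy as np

def depth_to_points(depth_m: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
    """Back-project a depth image (H, W) in metres to an (N, 3) point cloud."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]          # drop invalid (zero-depth) pixels

# Hypothetical 640x480 frame and intrinsics
depth = np.random.uniform(0.3, 1.0, size=(480, 640))
print(depth_to_points(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0).shape)
```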
|
|
16:45-16:50, Paper ThET20.3 | |
A Novel Control Strategy for Offset Points Tracking in the Context of Agricultural Robotics |
|
Ngnepiepaye Wembe, Stephane | University of Clermont Auvergne, French National Research Instit |
Rousseau, Vincent | IRSTEA |
Laconte, Johann | French National Research Institute for Agriculture, Food and The |
Lenain, Roland | INRAE |
Keywords: Agricultural Automation, Motion Control, Robotics and Automation in Agriculture and Forestry
Abstract: In this paper, we present a novel method to control a rigidly connected location on the vehicle, such as a point on the implement in the case of agricultural tasks. Agricultural robots are transforming modern farming by enabling precise and efficient operations, replacing humans in arduous tasks while reducing the use of chemicals. Traditionally, path-following algorithms are designed to guide the vehicle's center along a predefined trajectory. However, since the actual agronomic task is performed by the implement, it is essential to control a specific point on the tool itself rather than the vehicle's center. As such, we present in this paper two approaches for achieving control of an offset point on the robot. The first approach adapts existing control laws, initially intended for the rear axle's midpoint, to manage the desired lateral deviation. The second approach employs backstepping control techniques to create a control law that directly targets the implement. We conduct real-world experiments, highlighting the limitations of traditional approaches for offset point control, and demonstrating the strengths and weaknesses of the proposed methods.
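For readers unfamiliar with offset-point control, a minimal kinematic sketch (not the paper's control laws, which handle lateral deviation and backstepping) shows how the position and velocity of a point rigidly attached to a unicycle-like vehicle follow from the vehicle state; the symbols below are illustrative:

```python
import numpy as np

def offset_point_state(x, y, theta, v, omega, dx, dy):
    """Position and velocity of a point rigidly offset from the vehicle frame.

    (x, y, theta): vehicle pose, v: forward speed, omega: yaw rate.
    (dx, dy): offset of the controlled point (e.g., on the implement),
    expressed in the vehicle frame.
    """
    c, s = np.cos(theta), np.sin(theta)
    px = x + c * dx - s * dy
    py = y + s * dx + c * dy
    # Differentiating the rigid-body transform gives the offset-point velocity.
    vx = v * c - omega * (s * dx + c * dy)
    vy = v * s + omega * (c * dx - s * dy)
    return np.array([px, py]), np.array([vx, vy])
```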
|
|
16:50-16:55, Paper ThET20.4 | |
Towards Over-Canopy Autonomous Navigation: Crop-Agnostic LiDAR-Based Crop-Row Detection in Arable Fields |
|
Liu, Ruiji | Carnegie Mellon University |
Yandun, Francisco | Carnegie Mellon University |
Kantor, George | Carnegie Mellon University |
Keywords: Agricultural Automation, Reactive and Sensor-Based Planning, Field Robots
Abstract: Autonomous navigation is crucial for various robotics applications in agriculture. However, many existing methods depend on RTK-GPS devices, which can be susceptible to loss of radio signal or intermittent reception of corrections from the internet. Consequently, research has increasingly focused on using RGB cameras for crop-row detection, though challenges persist when dealing with grown plants. This paper introduces a LiDAR-based navigation system that can achieve crop-agnostic over-canopy autonomous navigation in row-crop fields, even when the canopy fully blocks the inter-row spacing. Our algorithm can detect crop rows across diverse scenarios, encompassing various crop types, growth stages, illumination conditions, the presence of weeds, curved rows, and discontinuities. Without utilizing a global localization method (i.e., based on GPS), our navigation system can perform autonomous navigation in these challenging scenarios, detect the end of the crop rows, and navigate to the next crop row autonomously, providing a crop-agnostic approach to navigate an entire field. The proposed navigation system has undergone tests in various simulated and real agricultural fields, achieving an average cross-track error of 3.55 cm without human intervention. The system has been deployed on a customized UGV robot, which can be reconfigured depending on the field conditions.
|
|
16:55-17:00, Paper ThET20.5 | |
Safe Leaf Manipulation for Accurate Shape and Pose Estimation of Occluded Fruits |
|
Yao, Shaoxiong | University of Illinois Urbana-Champaign |
Pan, Sicong | University of Bonn |
Bennewitz, Maren | University of Bonn |
Hauser, Kris | University of Illinois at Urbana-Champaign |
Keywords: Agricultural Automation
Abstract: Fruit monitoring plays an important role in crop management, and rising global fruit consumption combined with labor shortages necessitates automated monitoring with robots. However, occlusions from plant foliage often hinder accurate shape and pose estimation. Therefore, we propose an active fruit shape and pose estimation method that physically manipulates occluding leaves to reveal hidden fruits. This paper introduces a framework that plans robot actions to maximize visibility and minimize leaf damage. We developed a novel scene-consistent shape completion technique to improve fruit estimation under heavy occlusion and utilize a perception-driven deformation graph model to predict leaf deformation during planning. Experiments on artificial and real sweet pepper plants demonstrate that our method enables robots to safely move leaves aside, exposing fruits for accurate shape and pose estimation, outperforming baseline methods. Project page: https://shaoxiongyao.github.io/lmap-ssc/.
|
|
17:00-17:05, Paper ThET20.6 | |
Autonomous Sensor Exchange and Calibration for Cornstalk Nitrate Monitoring Robot |
|
Lee, Janice Seungyeon | Carnegie Mellon University |
Detlefsen, Thomas | Carnegie Mellon University |
Lawande, Shara | Carnegie Mellon University |
Ghatge, Saudamini | Carnegie Mellon University |
Ramesh Shanthi, Shrudhi | Carnegie Mellon University |
Mukkamala, Sruthi | Carnegie Mellon University |
Kantor, George | Carnegie Mellon University |
Kroemer, Oliver | Carnegie Mellon University |
Keywords: Robotics and Automation in Agriculture and Forestry, Grippers and Other End-Effectors, Agricultural Automation
Abstract: Interactive sensors are an important component of robotic systems but often require manual replacement due to wear and tear. Automating this process can enhance system autonomy and facilitate long-term deployment. We developed an autonomous sensor exchange and calibration system for an agriculture crop monitoring robot that inserts a nitrate sensor into cornstalks. A novel gripper and replacement mechanism, featuring a reliable funneling design, were developed to enable efficient and reliable sensor exchanges. To maintain consistent nitrate sensor measurement, an on-board sensor calibration station was integrated to provide in-field sensor cleaning and calibration. The system was deployed at the Ames Curtis Farm in June 2024, where it successfully inserted nitrate sensors with high accuracy into 30 cornstalks with a 77% success rate.
|
|
17:05-17:10, Paper ThET20.7 | |
Enhancing Agricultural Environment Perception Via Active Vision and Zero-Shot Learning |
|
La Greca, Michele Carlo | Politecnico Di Milano |
Usuelli, Mirko | Politecnico Di Milano |
Matteucci, Matteo | Politecnico Di Milano |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, RGB-D Perception
Abstract: Agriculture, fundamental for human sustenance, faces unprecedented challenges. The need for efficient, human-cooperative, and sustainable farming methods has never been greater. The core contributions of this work involve leveraging Active Vision (AV) techniques and Zero-Shot Learning (ZSL) to improve the robot's ability to perceive and interact with the agricultural environment in the context of fruit harvesting. The AV Pipeline implemented within ROS 2 integrates Next-Best View (NBV) Planning for 3D environment reconstruction through a dynamic 3D Occupancy Map. Our system allows the robotic arm to dynamically plan and move to the most informative viewpoints and explore the environment, updating the 3D reconstruction using semantic information produced through ZSL models. Simulation and real-world experimental results demonstrate our system's effectiveness in complex visibility conditions, outperforming traditional and static predefined planning methods. The ZSL segmentation models employed, such as YOLO World + EfficientViT SAM, exhibit high-speed performance and accurate segmentation, allowing flexibility when dealing with semantic information in unknown agricultural contexts without requiring any fine-tuning process.
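As a hedged illustration of the NBV idea the abstract describes (the actual ROS 2 pipeline, occupancy-map update, and ZSL semantics are not reproduced here), one can rank candidate viewpoints by how many currently unknown voxels they are predicted to observe; `visible_voxels` is a placeholder for ray casting against the map:

```python
def next_best_view(candidate_views, occupancy, visible_voxels):
    """Pick the viewpoint expected to reveal the most unknown space.

    candidate_views: list of viewpoint identifiers (e.g., poses).
    occupancy: dict voxel -> 'free' | 'occupied' | 'unknown'.
    visible_voxels: callable(view) -> iterable of voxel keys predicted
                    visible from that view (e.g., by ray casting).
    """
    def info_gain(view):
        return sum(1 for vox in visible_voxels(view)
                   if occupancy.get(vox, 'unknown') == 'unknown')
    return max(candidate_views, key=info_gain)
```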
|
|
17:10-17:15, Paper ThET20.8 | |
CitDet: A Benchmark Dataset for Citrus Fruit Detection |
|
James, Jordan | University of Texas at Arlington |
Manching, Heather K. | North Carolina State University |
Mattia, Matthew R. | USDA Agricultural Research Service |
Bowman, Kim D. | USDA Agricultural Research Service |
Hulse-Kemp, Amanda M. | US Department of Agriculture |
Beksi, William J. | The University of Texas at Arlington |
Keywords: Agricultural Automation, Data Sets for Robotic Vision, Deep Learning for Visual Perception
Abstract: In this letter, we present a new dataset to advance the state of the art in detecting citrus fruit and accurately estimating yield on trees affected by the Huanglongbing (HLB) disease in orchard environments via imaging. Despite the fact that significant progress has been made in solving the fruit detection problem, the lack of publicly available datasets has complicated direct comparison of results. For instance, citrus detection has long been of interest to the agricultural research community, yet there is an absence of work, particularly involving public datasets of citrus affected by HLB. To address this issue, we enhance state-of-the-art object detection methods for use in typical orchard settings. Concretely, we provide high-resolution images of citrus trees located in an area known to be highly affected by HLB, along with high-quality bounding box annotations of citrus fruit. Fruit on both the trees and the ground are labeled to allow for identification of fruit location, which contributes to advancements in yield estimation and a potential measure of HLB impact via fruit drop. The dataset consists of over 32,000 bounding box annotations for fruit instances contained in 579 high-resolution images. In summary, our contributions are the following: (i) we introduce a novel dataset along with baseline performance benchmarks on multiple contemporary object detection algorithms, (ii) we show the ability to accurately capture fruit location on tree or on ground, and finally (iii) we present a correlation of our results with yield estimations.
|
|
ThET21 |
410 |
Integrating Motion Planning and Learning 3 |
Regular Session |
Chair: Balakirsky, Stephen | Georgia Tech |
Co-Chair: Solovey, Kiril | Technion--Israel Institute of Technology |
|
16:35-16:40, Paper ThET21.1 | |
Transformer-Enhanced Motion Planner: Attention-Guided Sampling for State-Specific Decision Making |
|
Zhuang, Lei | Harbin Institute of Technology |
Zhao, Jingdong | Harbin Institute of Technology |
Li, Yuntao | Harbin Institute of Technology |
Xu, Zichun | Harbin Institute of Technology, School of Mechatronics Engineeri |
Zhao, Liangliang | Harbin Institute of Technology |
Liu, Hong | Harbin Institute of Technology |
Keywords: Motion and Path Planning, Deep Learning Methods
Abstract: Sampling-based motion planning (SBMP) algorithms are renowned for their robust global search capabilities. However, the inherent randomness in their sampling mechanisms often results in inconsistent path quality and limited search efficiency. In response to these challenges, this work proposes a novel deep learning-based motion planning framework, named Transformer-Enhanced Motion Planner (TEMP), which synergizes a Co-Regulation Environmental Information Encoder (CEIE) with a Motion Planning Transformer (MPT). CEIE converts scenario data into encoded environmental information (EEI), providing MPT with an insightful understanding of the environment. MPT leverages an attention mechanism to dynamically recalibrate its focus on EEI, task objectives, and historical planning data, refining the sampling node generation. To demonstrate the capabilities of TEMP, we train our model using a dataset consisting of planning results produced by RRT*. CEIE and MPT are collaboratively trained, enabling CEIE to autonomously learn and extract patterns from environmental data, thereby forming informative representations that MPT can more effectively interpret and utilize for motion planning. Subsequently, we systematically evaluate TEMP's efficacy across diverse dimensions and assess it in out-of-distribution real-world scenarios, demonstrating that TEMP achieves exceptional performance metrics and a heightened degree of generalizability compared to state-of-the-art SBMPs.
|
|
16:40-16:45, Paper ThET21.2 | |
From Configuration-Space Clearance to Feature-Space Margin: Sample Complexity in Learning-Based Collision Detection |
|
Tubul, Sapir | Technion - Israel Institute of Technology |
Tamar, Aviv | Technion |
Solovey, Kiril | Technion--Israel Institute of Technology |
Salzman, Oren | Technion |
Keywords: Integrated Planning and Learning, Probability and Statistical Methods, Collision Avoidance
Abstract: Motion planning is a central challenge in robotics, with learning-based approaches gaining significant attention in recent years. Our work focuses on a specific aspect of these approaches: using machine-learning techniques, particularly Support Vector Machines (SVM), to evaluate whether robot configurations are collision free, an operation termed “collision detection”. Despite the growing popularity of these methods, there is a lack of theory supporting their efficiency and prediction accuracy. This is in stark contrast to the rich theoretical results of machine-learning methods in general and of SVMs in particular. Our work bridges this gap by analyzing the sample complexity of an SVM classifier for learning-based collision detection in motion planning. We bound the number of samples needed to achieve a specified accuracy at a given confidence level. This result is stated in terms relevant to robot motion planning such as the system’s clearance. Building on these theoretical results, we propose a collision-detection algorithm that can also provide statistical guarantees on the algorithm’s error in classifying robot configurations as collision-free or not.
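To make the setting concrete (this is a generic sketch, not the paper's algorithm or its statistical guarantees), learning-based collision detection fits a classifier to configurations labeled by an exact collision checker; the sampler and checker below are placeholders, and both labels are assumed to occur among the samples:

```python
import numpy as np
from sklearn.svm import SVC

def learn_collision_detector(sample_config, is_collision_free, n_samples=2000):
    """Fit an SVM that predicts whether a configuration is collision-free.

    sample_config: callable() -> configuration vector (placeholder sampler).
    is_collision_free: exact but expensive collision checker, used only for labels.
    Assumes both free and in-collision configurations appear among the samples.
    The learned margin plays a role analogous to the configuration-space
    clearance that the paper relates to sample complexity.
    """
    X = np.array([sample_config() for _ in range(n_samples)])
    y = np.array([1 if is_collision_free(q) else 0 for q in X])
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")
    clf.fit(X, y)
    return clf  # clf.predict(q.reshape(1, -1)) gives a fast approximate check
```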
|
|
16:45-16:50, Paper ThET21.3 | |
CTSAC: Curriculum-Based Transformer Soft Actor-Critic for Goal-Oriented Robot Exploration |
|
Yang, Chunyu | China University of Mining and Technology |
Bi, Shengben | China University of Mining and Technology |
Xu, Yihui | China University of Mining and Technology |
Zhang, Xin | China University of Mining and Technology |
Keywords: Integrated Planning and Learning, Reinforcement Learning, Planning under Uncertainty
Abstract: With the increasing demand for efficient and flexible robotic exploration solutions, Reinforcement Learning (RL) is becoming a promising approach in the field of autonomous robotic exploration. However, current RL-based exploration algorithms often face limited environmental reasoning capabilities, slow convergence rates, and substantial challenges in Sim-To-Real (S2R) transfer. To address these issues, we propose a Curriculum Learning-based Transformer Reinforcement Learning Algorithm (CTSAC) aimed at improving both exploration efficiency and transfer performance. To enhance the robot's reasoning ability, a Transformer is integrated into the perception network of the Soft Actor-Critic (SAC) framework, leveraging historical information to improve the farsightedness of the strategy. A periodic review-based curriculum learning strategy is proposed, which enhances training efficiency while mitigating catastrophic forgetting during curriculum transitions. Training is conducted on the ROS-Gazebo continuous robotic simulation platform, with LiDAR clustering optimization to further reduce the S2R gap. Experimental results demonstrate that CTSAC outperforms state-of-the-art non-learning and learning-based algorithms in terms of success rate and success rate-weighted exploration time. Moreover, real-world experiments validate the strong S2R transfer capabilities of CTSAC.
|
|
16:50-16:55, Paper ThET21.4 | |
Guiding Long-Horizon Task and Motion Planning with Vision Language Models |
|
Yang, Zhutian | Massachusetts Institute of Technology |
Garrett, Caelan | NVIDIA |
Fox, Dieter | University of Washington |
Lozano-Perez, Tomas | MIT |
Kaelbling, Leslie | MIT |
Keywords: Integrated Planning and Learning, Task and Motion Planning, Mobile Manipulation
Abstract: Vision-Language Models (VLM) can generate plausible high-level plans when prompted with a goal, the context, an image of the scene, and any planning constraints. However, there is no guarantee that the predicted actions are geometrically and kinematically feasible for a particular robot embodiment. As a result, many prerequisite steps such as opening drawers to access objects are often omitted. Task and motion planners can generate motion trajectories that respect the geometric feasibility of actions and insert physically necessary actions, but do not scale to everyday problems that require common-sense knowledge and involve large state spaces composed of many variables. We leverage the VLM for 1) system dynamics (i.e., the recipe) and 2) search guidance. We propose VLM-TAMP, a hierarchical planning algorithm that leverages a VLM to generate intermediate subgoals that guide the sampling of a task and motion planner. When a subgoal or action cannot be refined, the VLM is queried again for replanning. We evaluate VLM-TAMP on kitchen tasks where a robot must accomplish cooking goals that require performing 30-50 actions in sequence and interacting with up to 21 objects. We found that VLM-TAMP substantially outperforms baselines that rigidly and independently execute VLM-generated action sequences (success rate 50 to 100% versus 0%, average task completion percentage 72 to 100% versus 15 to 45%). See the project site https://zt-yang.github.io/vlm-tamp-robot/ for more information.
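A schematic sketch of the hierarchical loop described in the abstract (the function names are hypothetical placeholders; the paper's actual prompting, refinement, and state handling are richer than this):

```python
def vlm_guided_tamp(goal, state, query_vlm_subgoals, tamp_refine, max_replans=3):
    """Hierarchical loop in the spirit of VLM-guided task and motion planning.

    query_vlm_subgoals: callable(goal, state) -> list of symbolic subgoals
                        (placeholder for the VLM prompt/response step).
    tamp_refine: callable(state, subgoal) -> motion plan, or None if the
                 subgoal is geometrically/kinematically infeasible.
    """
    plan = []
    for _ in range(max_replans):
        subgoals = query_vlm_subgoals(goal, state)
        for sg in subgoals:
            traj = tamp_refine(state, sg)
            if traj is None:
                break              # infeasible subgoal: re-query the VLM and replan
            plan.append(traj)
            state = sg             # simplification: treat the achieved subgoal as the new state summary
        else:
            return plan            # all subgoals refined successfully
    return None
```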
|
|
16:55-17:00, Paper ThET21.5 | |
CrowdSurfer: Sampling Optimization Augmented with Vector-Quantized Variational AutoEncoder for Dense Crowd Navigation |
|
Kumar, Naman | Robotics Research Center, IIIT Hyderabad, India |
Singha, Antareep | Robotics Research Center, IIIT Hyderabad |
Nanwani, Laksh | Robotics Research Center, IIIT Hyderabad, India |
Potdar, Dhruv | Robotics Research Center, IIIT Hyderabad, India |
Ramakrishnan, Tarun | Robotics Research Center, IIIT Hyderabad, India |
Rastgar, Fatemeh | Örebro University |
Idoko, Simon | University of Tartu |
Singh, Arun Kumar | University of Tartu |
Krishna, Madhava | IIIT Hyderabad |
Keywords: Integrated Planning and Learning, Collision Avoidance, Motion and Path Planning
Abstract: Navigation amongst densely packed crowds remains a challenge for mobile robots. The complexity increases further if the environment layout changes, making the prior computed global plan infeasible. In this paper, we show that it is possible to dramatically enhance crowd navigation by just improving the local planner. Our approach combines generative modelling with inference time optimization to generate sophisticated long-horizon local plans at interactive rates. More specifically, we train a Vector Quantized Variational AutoEncoder to learn a prior over the expert trajectory distribution conditioned on the perception input. At run-time, this is used as an initialization for a sampling-based optimizer for further refinement. Our approach does not require any sophisticated prediction of dynamic obstacles and yet provides state-of-the-art performance. In particular, we compare against the recent DRL-VO approach and show a 40% improvement in success rate and a 6% improvement in travel time.
|
|
17:00-17:05, Paper ThET21.6 | |
CLIMB: Language-Guided Continual Learning for Task Planning with Iterative Model Building |
|
Byrnes, Walker | Georgia Institute of Technology |
Bogdanovic, Miroslav | University of Toronto |
Balakirsky, Avi | The Ohio State University |
Balakirsky, Stephen | Georgia Tech |
Garg, Animesh | Georgia Institute of Technology |
Keywords: Integrated Planning and Learning, Continual Learning, Incremental Learning
Abstract: Intelligent and reliable task planning is a core capability for generalized robotics, which requires a descriptive domain representation that sufficiently models all object and state information for the scene. We present CLIMB, a continual learning framework for robot task planning that leverages foundation models and feedback from execution to guide the construction of domain models. CLIMB can build a model from a natural language description, learn non-obvious predicates while solving tasks, and store that information for future problems. We demonstrate the ability of CLIMB to improve performance in common planning environments compared to baseline methods. We also developed the BlocksWorld++ domain, a simulated environment with an easily usable real counterpart, together with a curriculum of tasks with progressing difficulty to evaluate continual learning.
|
|
17:05-17:10, Paper ThET21.7 | |
Safe Multi-Agent Navigation Guided by Goal-Conditioned Safe Reinforcement Learning |
|
Feng, Meng | MIT |
Parimi, Viraj | Massachusetts Institute of Technology |
Williams, Brian | MIT |
Keywords: Integrated Planning and Learning, Robot Safety, Reinforcement Learning
Abstract: Safe navigation is essential for autonomous systems operating in hazardous environments. Traditional planning methods are effective for solving long-horizon tasks but depend on the availability of a graph representation with predefined distance metrics. In contrast, safe Reinforcement Learning (RL) is capable of learning complex behaviors without relying on manual heuristics but fails to solve long-horizon tasks, particularly in goal-conditioned and multi-agent scenarios. In this paper, we introduce a novel method that integrates the strengths of both planning and safe RL. Our method leverages goal-conditioned RL (GCRL) and safe RL to learn a goal-conditioned policy for navigation while concurrently estimating cumulative distance and safety levels using learned value functions via an automated self-training algorithm. By constructing a graph with states from the replay buffer, our method prunes unsafe edges and generates a waypoint-based plan that the agent then executes by following those waypoints sequentially until their goal locations are reached. This graph pruning and planning approach via the learned value functions allows our approach to flexibly balance the trade-off between faster and safer routes especially over extended horizons. Utilizing this unified high-level graph and a shared low-level safe GCRL policy, we extend this approach to address the multi-agent safe navigation problem. In particular, we leverage Conflict-Based Search (CBS) to create waypoint-based plans for multiple agents allowing for their safer navigation over extended horizons. This integration enhances the scalability of goal-conditioned safe RL in multi-agent scenarios, enabling efficient coordination among agents. Extensive benchmarking against state-of-the-art baselines demonstrates the effectiveness of our method in achieving distance goals safely for multiple agents in complex and hazardous environments. More details can be found at https://safe-visual-mapf-mers.mit.csail.mit.
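The graph-pruning-and-planning step can be sketched as follows (an illustrative simplification using networkx; the learned distance and safety value functions are placeholders, and the self-training procedure and CBS multi-agent extension are omitted):

```python
import networkx as nx

def build_waypoint_plan(states, edges, dist_value, safety_value,
                        start, goal, safety_threshold=0.9):
    """Waypoint planning over replay-buffer states with unsafe edges pruned.

    states: list of state identifiers drawn from the replay buffer.
    edges: iterable of (u, v) candidate transitions between states.
    dist_value: callable(u, v) -> learned cumulative-distance estimate.
    safety_value: callable(u, v) -> learned probability of staying safe on (u, v).
    """
    g = nx.DiGraph()
    g.add_nodes_from(states)
    for u, v in edges:
        if safety_value(u, v) >= safety_threshold:   # prune unsafe edges
            g.add_edge(u, v, weight=dist_value(u, v))
    # Raises NetworkXNoPath if pruning disconnects start from goal.
    return nx.shortest_path(g, source=start, target=goal, weight="weight")
```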
|
|
17:10-17:15, Paper ThET21.8 | |
Motion Planning for 2-DOF Transformable Wheel Robots Using Reinforcement Learning |
|
Park, Inha | Hanyang University |
Ryu, Sijun | Hanyang University |
Won, Jeeho | Hanyang University |
Yoon, Hyeongyu | Hanyang University |
Kim, SangGyun | Hanyang University |
Kim, Hwa Soo | Kyonggi University |
Seo, TaeWon | Hanyang University |
Keywords: Motion and Path Planning, Reinforcement Learning, Model Learning for Control
Abstract: Transformable robots have been developed to perform various tasks using flexible methods. However, the transformation properties present challenges in controlling and planning motion strategies, as the system model changes when transformations occur. To address this issue, we propose a planning framework based on artificial intelligence, called Geometric Manipulability Reinforcement Learning (GM-RL). GM-RL consists of two components: the manipulability estimator and the motion planner. The manipulability estimator employs graph neural networks (GNN) to provide action guidelines based on the dynamic manipulability of the transformable robots. The motion planner generates transformation plans using reinforcement learning (RL). The activation ratio alpha adjusts the ratio of the guideline accepted between the two components. In experiments utilizing a 2-DoF transformable wheel called STEP, GM-RL with alpha=0.5 generated an optimal transformation plan with an average dynamic manipulability measure of 0.0424, the highest measure compared to pure dynamic manipulability and reinforcement learning. A real-world experiment demonstrated that the transformation plan is efficient for overcoming stairs.
|
|
ThET22 |
411 |
Imitation Learning for Manipulation 2 |
Regular Session |
Chair: Martín-Martín, Roberto | University of Texas at Austin |
Co-Chair: Hou, Mengxue | University of Notre Dame |
|
16:35-16:40, Paper ThET22.1 | |
Towards Effective Utilization of Mixed-Quality Demonstrations in Robotic Manipulation Via Segment-Level Selection and Optimization |
|
Chen, Jingjing | Shanghai Jiao Tong University |
Fang, Hongjie | Shanghai Jiao Tong University |
Fang, Hao-Shu | Massachusetts Institute of Technology |
Lu, Cewu | Shanghai Jiao Tong University |
Keywords: Learning from Demonstration, Imitation Learning, Deep Learning in Grasping and Manipulation
Abstract: Data is crucial for robotic manipulation, as it underpins the development of robotic systems for complex tasks. While high-quality, diverse datasets enhance the performance and adaptability of robotic manipulation policies, collecting extensive expert-level data is resource-intensive. Consequently, many current datasets suffer from quality inconsistencies due to operator variability, highlighting the need for methods to utilize mixed-quality data effectively. To mitigate these issues, we propose "Select Segments to Imitate" (S2I), a framework that selects and optimizes mixed-quality demonstration data at the segment level, while ensuring plug-and-play compatibility with existing robotic manipulation policies. The framework has three components: demonstration segmentation, which divides the original data into meaningful segments; segment selection, which uses contrastive learning to find high-quality segments; and trajectory optimization, which refines suboptimal segments for better policy learning. We evaluate S2I through comprehensive experiments in simulation and real-world environments across six tasks, demonstrating that with only 3 expert demonstrations for reference, S2I can improve the performance of various downstream policies when trained with mixed-quality demonstrations. Project website: https://tonyfang.net/s2i/.
|
|
16:40-16:45, Paper ThET22.2 | |
DABI: Evaluation of Data Augmentation Methods Using Downsampling in Bilateral Control-Based Imitation Learning with Images |
|
Kobayashi, Masato | Osaka University |
Buamanee, Thanpimon | Osaka University |
Uranishi, Yuki | Osaka University |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation, Learning from Demonstration
Abstract: Autonomous robot manipulation is a complex and continuously evolving robotics field. This paper focuses on data augmentation methods in imitation learning. Imitation learning consists of three stages: data collection from experts, model learning, and execution. However, collecting expert data requires manual effort and is time-consuming. Additionally, as sensors have different data acquisition intervals, preprocessing such as downsampling to match the lowest frequency is necessary. Downsampling enables data augmentation and also contributes to the stabilization of robot operations. In light of this background, this paper proposes the Data Augmentation Method for Bilateral Control-Based Imitation Learning with Images, called "DABI". DABI collects robot joint angles, velocities, and torques at 1000 Hz, and uses images from gripper and environmental cameras captured at 100 Hz as the basis for data augmentation. This enables a tenfold increase in data. In this paper, we collected just 5 expert demonstration datasets. We trained the bilateral control Bi-ACT model with the unaltered dataset and two augmentation methods for comparative experiments and conducted real-world experiments. The results confirmed a significant improvement in success rates, thereby proving the effectiveness of DABI. For additional material, please check: https://mertcookimg.github.io/dabi
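The core augmentation idea can be sketched in a few lines (illustrative only; the array shapes and the alignment with bilateral-control data are assumptions, and the Bi-ACT model is not shown): each of the ten phase offsets of the 1000 Hz robot stream yields one 100 Hz sequence aligned with the camera frames.

```python
import numpy as np

def offset_downsample_augment(robot_data_1khz, images_100hz, factor=10):
    """Tenfold augmentation by offset downsampling, in the spirit of DABI.

    robot_data_1khz: (T*factor, D) joint angles/velocities/torques at 1000 Hz.
    images_100hz:    (T, ...) camera frames at 100 Hz.
    Returns a list of `factor` aligned (robot, image) sequences at 100 Hz.
    """
    T = images_100hz.shape[0]
    augmented = []
    for offset in range(factor):
        robot_seq = robot_data_1khz[offset::factor][:T]   # one 100 Hz slice per phase offset
        augmented.append((robot_seq, images_100hz[:len(robot_seq)]))
    return augmented
```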
|
|
16:45-16:50, Paper ThET22.3 | |
Learning from Imperfect Demonstrations with Self-Supervision for Robotic Manipulation |
|
Wu, Kun | Syracuse University |
Liu, Ning | Beijing Innovation Center of Humanoid Robotics |
Zhao, Zhen | Midea Group |
Qiu, Di | Peking University |
Li, Jinming | Shanghai University |
Che, Zhengping | X-Humanoid |
Xu, Zhiyuan | Midea Group |
Tang, Jian | Midea Group (Shanghai) Co., Ltd |
Keywords: Learning from Demonstration, Imitation Learning, Deep Learning in Grasping and Manipulation
Abstract: Improving data utilization, especially for imperfect data from task failures, is crucial for robotic manipulation due to the challenging, time-consuming, and expensive data collection process in the real world. Current imitation learning (IL) typically discards imperfect data, focusing solely on successful expert data. While reinforcement learning (RL) can learn from explorations and failures, the sim2real gap and its reliance on dense reward and online exploration make it difficult to apply effectively in real-world scenarios. In this work, we aim to conquer the challenge of leveraging imperfect data without the need for reward information to improve the model performance for robotic manipulation in an offline manner. Specifically, we introduce a Self-Supervised Data Filtering framework (SSDF) that combines expert and imperfect data to compute quality scores for failed trajectory segments. High-quality segments from the failed data are used to expand the training dataset. Then, the enhanced dataset can be used with any downstream policy learning method for robotic manipulation tasks. Extensive experiments on the ManiSkill2 benchmark built on the high-fidelity Sapien simulator and real-world robotic manipulation tasks using the Franka robot arm demonstrated that the SSDF can accurately expand the training dataset with high-quality imperfect data and improve the success rates for all robotic manipulation tasks.
|
|
16:50-16:55, Paper ThET22.4 | |
MATCH POLICY: A Simple Pipeline from Point Cloud Registration to Manipulation Policies |
|
Huang, Haojie | Northeastern University |
Liu, Haotian | Worcester Polytechnic Institute |
Wang, Dian | Northeastern University |
Walters, Robin | Northeastern University |
Platt, Robert | Northeastern University |
Keywords: Learning from Demonstration, Imitation Learning, Transfer Learning
Abstract: Many manipulation tasks require the robot to rearrange objects relative to one another. Such tasks can be described as a sequence of relative poses between parts of a set of rigid bodies. In this work, we propose Match Policy, a simple but novel pipeline for solving high-precision pick and place tasks. Instead of predicting actions directly, our method registers the pick and place targets to the stored demonstrations. This transfers action inference into a point cloud registration task and enables us to realize nontrivial manipulation policies without any training. Match Policy is designed to solve high-precision tasks with a key-frame setting. By leveraging the geometric interaction and the symmetries of the task, it achieves extremely high sample efficiency and generalizability to unseen configurations. We demonstrate its state-of-the-art performance across various tasks on RLbench benchmark compared with several strong baselines and test it on a real robot with six tasks.
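A minimal sketch of the registration-based transfer idea (assuming known point correspondences, which the real pipeline does not require; the names below are illustrative): align the stored demonstration cloud to the observed target cloud, then push the demonstrated grasp pose through the recovered transform.

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rigid transform (R, t) mapping src points onto dst points.

    src, dst: (N, 3) corresponding points (demo cloud and observed cloud).
    A simplified stand-in for the registration step; full point cloud
    registration must also solve for correspondences.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # enforce a proper rotation (no reflection)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t

def transfer_grasp(demo_grasp_R, demo_grasp_t, R, t):
    """Map a demonstrated grasp pose through the recovered demo-to-scene transform."""
    return R @ demo_grasp_R, R @ demo_grasp_t + t
```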
|
|
16:55-17:00, Paper ThET22.5 | |
Self-Improving Autonomous Underwater Manipulation |
|
Liu, Ruoshi | Columbia University |
Ha, Huy | Columbia University |
Hou, Mengxue | University of Notre Dame |
Song, Shuran | Stanford University |
Vondrick, Carl | Columbia |
Keywords: Sensorimotor Learning, Marine Robotics, Imitation Learning
Abstract: Underwater robotic manipulation faces significant challenges due to complex fluid dynamics and unstructured environments, causing most manipulation systems to rely heavily on human teleoperation. In this paper, we introduce AquaBot, a fully autonomous manipulation system that combines behavior cloning from human demonstrations with self-learning optimization to improve beyond human teleoperation performance. With extensive real-world experiments, we demonstrate AquaBot's versatility across diverse manipulation tasks, including object grasping, trash sorting, and rescue retrieval. Our real-world experiments show that AquaBot's self-optimized policy outperforms a human operator by 41% in speed. AquaBot represents a promising step towards autonomous and self-improving underwater manipulation systems.
|
|
17:00-17:05, Paper ThET22.6 | |
DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation Via Imitation Learning |
|
Jiang, Zhenyu | The University of Texas at Austin |
Xie, Yuqi | University of Texas at Austin |
Lin, Kevin | Stanford |
Xu, Zhenjia | Columbia University |
Wan, Weikang | Peking University |
Mandlekar, Ajay Uday | NVIDIA |
Fan, Linxi | Stanford University |
Zhu, Yuke | The University of Texas at Austin |
Keywords: Imitation Learning, Big Data in Robotics and Automation, Learning from Demonstration
Abstract: Imitation learning from human demonstrations is an effective means to teach robots manipulation skills. But data acquisition is a major bottleneck in applying this paradigm more broadly, due to the high costs and human efforts involved. There has been significant interest in imitation learning for bimanual dexterous robots, like humanoids. Unfortunately, data collection is even more challenging here due to the difficulty of simultaneously controlling the two arms and multi-fingered hands. Automated data generation in simulation is a compelling, scalable alternative to fuel this need for training data. To this end, we introduce DexMimicGen, a large-scale automated data generation system that synthesizes trajectories from a handful of human demonstrations for bimanual robots with dexterous hands. We present a collection of simulation environments in the setting of bimanual dexterous manipulation, spanning a range of manipulation behaviors and different requirements for coordination among the two arms. We generate 21K demos across these tasks from just 60 source human demos and study the effect of several data generation and policy learning decisions on agent performance. Finally, we present a real-to-sim-to-real pipeline and deploy it on a real-world humanoid can sorting task. Generated datasets, simulation environments and additional results are at dexmimicgen.github.io.
|
|
17:05-17:10, Paper ThET22.7 | |
The Art of Imitation: Learning Long-Horizon Manipulation Tasks from Few Demonstrations |
|
von Hartz, Jan Ole | University of Freiburg |
Welschehold, Tim | Albert-Ludwigs-Universität Freiburg |
Valada, Abhinav | University of Freiburg |
Boedecker, Joschka | University of Freiburg |
Keywords: Imitation Learning, Learning from Demonstration, Sensorimotor Learning
Abstract: Task Parametrized Gaussian Mixture Models (TP-GMM) are a sample-efficient method for learning object-centric robot manipulation tasks. However, there are several open challenges to applying TP-GMMs in the wild. In this work, we tackle three crucial challenges synergistically. First, end-effector velocities are non-Euclidean and thus hard to model using standard GMMs. We thus propose to factorize the robot's end-effector velocity into its direction and magnitude, and model them using Riemannian GMMs. Second, we leverage the factorized velocities to segment and sequence skills from complex demonstration trajectories. Through the segmentation, we further align skill trajectories and hence leverage time as a powerful inductive bias. Third, we present a method to automatically detect relevant task parameters per skill from visual observations. Our approach enables learning complex manipulation tasks from just five demonstrations while using only RGB-D observations. Extensive experimental evaluations on RLBench demonstrate that our approach achieves state-of-the-art performance with 20-fold improved sample efficiency. Our policies generalize across different environments, object instances, and object positions, while the learned skills are reusable.
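The velocity factorization itself is simple (sketch below, illustrative only; the Riemannian GMM modeling of the direction component is not shown):

```python
import numpy as np

def factorize_velocity(v, eps=1e-8):
    """Split an end-effector velocity into a unit direction and a scalar magnitude."""
    speed = np.linalg.norm(v)
    direction = v / speed if speed > eps else np.zeros_like(v)
    return direction, speed     # direction lives on the sphere, speed on the positive reals

def recompose_velocity(direction, speed):
    """Inverse of the factorization."""
    return speed * direction
```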
|
|
17:10-17:15, Paper ThET22.8 | |
ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos |
|
Shi, Junyao | University of Pennsylvania |
Zhao, Zhuolun | University of Pennsylvania, Skild AI |
Wang, Tianyou | University of Pennsylvania |
Pedroza, Ian | University of Pennsylvania |
Luo, Amy | University of Pennsylvania |
Wang, Jie | University of Pennsylvania |
Ma, Yecheng Jason | University of Pennsylvania |
Jayaraman, Dinesh | University of Pennsylvania |
Keywords: Imitation Learning, Sensorimotor Learning, Transfer Learning
Abstract: Many recent advances in robotic manipulation have come through imitation learning, yet these rely largely on mimicking a particularly hard-to-acquire form of demonstrations: those collected on the same robot in the same room with the same objects as the trained policy must handle at test time. In contrast, large pre-recorded human video datasets demonstrating manipulation skills in-the-wild already exist, which contain valuable information for robots. Is it possible to distill a repository of useful robotic skill policies out of such data without any additional requirements on robot-specific demonstrations or exploration? We present the first such system ZeroMimic, that generates immediately deployable image goal-conditioned skill policies for several common categories of manipulation tasks (opening, closing, pouring, pick&place, cutting, and stirring) each capable of acting upon diverse objects and across diverse unseen task setups. ZeroMimic is carefully designed to exploit recent advances in semantic and geometric visual understanding of human videos, together with modern grasp affordance detectors and imitation policy classes. After training ZeroMimic on the popular EpicKitchens dataset of ego-centric human videos, we evaluate its out-of-the-box performance in varied real-world and simulated kitchen settings with two different robot embodiments, demonstrating its impressive abilities to handle these varied tasks. To enable plug-and-play reuse of ZeroMimic policies on other task setups and robots, we release software and policy checkpoints of our skill policies.
|
|
ThET23 |
412 |
Autonomous Vehicle Perception 7 |
Regular Session |
Chair: Sun, Shunqiao | The University of Alabama |
Co-Chair: Zhou, MengChu | New Jersey Institute of Technology |
|
16:35-16:40, Paper ThET23.1 | |
Object Importance Estimation Using Counterfactual Reasoning for Intelligent Driving |
|
Gupta, Pranay | Carnegie Mellon University |
Biswas, Abhijat | Carnegie Mellon University |
Admoni, Henny | Carnegie Mellon University |
Held, David | Carnegie Mellon University |
Keywords: Autonomous Vehicle Navigation, Intelligent Transportation Systems
Abstract: The ability to identify important objects in a complex and dynamic driving environment is essential for autonomous driving agents to make safe and efficient driving decisions. It also helps assistive driving systems decide when to alert drivers. We tackle object importance estimation in a data-driven fashion and introduce HOIST (Human-annotated Object Importance in Simulated Traffic). HOIST contains driving scenarios with human-annotated importance labels for vehicles and pedestrians. We additionally propose a novel approach that relies on counterfactual reasoning to estimate an object's importance. We generate counterfactual scenarios by modifying the motion of objects and ascribe importance based on how the modifications affect the ego vehicle's driving. Our approach outperforms strong baselines for the task of object importance estimation on HOIST. We also perform ablation studies to justify our design choices and show the significance of the different components of our proposed approach.
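The counterfactual scoring can be sketched as follows (all callables are hypothetical placeholders for the paper's simulator, driving policy, and cost; the actual method differs in its details):

```python
def object_importance(scene, objects, rollout_ego, modify_motion, driving_cost):
    """Score each object by how much perturbing its motion changes ego driving.

    rollout_ego: callable(scene) -> ego trajectory under the driving policy.
    modify_motion: callable(scene, obj) -> counterfactual scene with the
                   object's motion altered (e.g., removed or slowed).
    driving_cost: callable(trajectory) -> scalar safety/progress cost.
    """
    baseline = driving_cost(rollout_ego(scene))
    scores = {}
    for obj in objects:
        counterfactual = modify_motion(scene, obj)
        scores[obj] = abs(driving_cost(rollout_ego(counterfactual)) - baseline)
    return scores   # larger score -> more important object
```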
|
|
16:40-16:45, Paper ThET23.2 | |
3D Multi-Modal Object Detection Based on Cross-Attention Feature Fusion |
|
Jhong, Sin-Ye | Tamkang University |
Ho, Min-Hsuan | National Taiwan University of Science and Technology |
Lu, Si-Yu | National Taiwan University |
Chen, Yung-Yao | National Taiwan University of Science and Technology |
Keywords: Object Detection, Segmentation and Categorization, Sensor Fusion
Abstract: In Advanced Driver Assistance Systems (ADAS), environmental perception and object detection are crucial for ensuring safe autonomous driving. Single-modality systems often struggle under adverse weather conditions, underscoring the need for multi-modal approaches. Current fusion methods typically rely on simplistic concatenation of multi-modal features, which neglects semantic alignment and does not fully exploit inter-modal correlations. This paper proposes a cross-attention feature fusion specifically designed to enhance the global correlation between camera and radar features. By dynamically adjusting feature weights through cross-attention, our approach significantly improves feature integration. Furthermore, we propose a depth-weighted voting fusion strategy to select the most accurate sensor depth, thereby enhancing decision-making stability. Experimental results on the nuScenes dataset show substantial improvements, with mean Average Precision (mAP) of 0.399 and mean Average Translation Error (mATE) of 0.602, highlighting the effectiveness of our approach in enhancing the robustness and accuracy of multi-modal fusion.
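A minimal PyTorch-style sketch of cross-attention fusion between camera and radar tokens (illustrative only; the paper's actual architecture, token construction, and depth-weighted voting are not reproduced):

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Minimal camera-radar cross-attention fusion (illustrative only)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, camera_tokens, radar_tokens):
        # Camera features query the radar features, so each camera token is
        # re-weighted by its correlation with the radar modality.
        fused, _ = self.attn(query=camera_tokens, key=radar_tokens, value=radar_tokens)
        return self.norm(camera_tokens + fused)   # residual connection

# Example: batch of 2, 100 camera tokens and 60 radar tokens of width 256.
fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 100, 256), torch.randn(2, 60, 256))
```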
|
|
16:45-16:50, Paper ThET23.3 | |
Multi-Modality Test-Time Adaptation for Semantic Segmentation in Robotic Perception |
|
Liu, Yan | Sun Yat-Sen Univerisity |
Zhu, Hongyuan | A*STAR |
Zhang, Ye | Sun Yat-Sen University |
Lei, Yinjie | Sichuan University |
Guo, Yulan | Sun Yat-Sen University |
Keywords: Object Detection, Segmentation and Categorization, Computer Vision for Automation, Sensor Fusion
Abstract: Test-Time Adaptation (TTA) adjusts pre-trained models among unlabeled unseen environments during the test phase, making it more practical for robotic applications. However, the constant changes of the physical world create significant domain gaps between the received data during robot deployment and the source data used for training. In addition, existing methods mainly focus on a single modality, e.g., RGB images, limiting the application of these methods in multi-modality input scenarios. In this work, we propose a Deep Multi-modality Aggregation Test-time Adaptation (DMATA) method to address the above-mentioned issues. To prevent the domain shifts from disrupting the adaptation process, we first propose a Momentum-based Teacher-Student (MTS) framework. Since the teacher model and the student model contain complementary information, we design an Uncertainty-Guide (UG) feature fusion block to fuse the teacher model and student model of each modality. Finally, we introduce a 3D-Guide-2D (3G2) feature fusion block to extract spatial information from RGB images. In this way, 2D feature extraction is enhanced.
|
|
16:50-16:55, Paper ThET23.4 | |
MDC-Seg: Multi-Directional Convolution-Based Semantic Segmentation for LiDAR Point Clouds |
|
Ouyang, Xin | Northeastern University |
Qian, Xiaolong | Northeastern University, China |
Zhang, Yunzhou | Northeastern University |
Shen, You | Northeastern University |
Wang, Guiyuan | Jiangsu Shuguang Optoelectronics Co., Ltd., Yangzhou, China |
Liu, Wei | Jiangsu Shuguang Optoelectronics Co., Ltd., Yangzhou, China |
Keywords: Semantic Scene Understanding, Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception
Abstract: LiDAR point cloud 3D semantic segmentation enables efficient and accurate environmental sensing for intelligent vehicles and autonomous robots, greatly advancing these domains. Existing advanced methods using 3D sparse convolution often suffer from a small Effective Receptive Field (ERF), limiting context sensing and challenging high-performance segmentation. Building on this observation, we propose MDC-Seg for efficient ERF enlargement. We design Multi-directional Convolution (MDConv), which simultaneously performs sparse feature encoding on the Bird's Eye View (BEV) and Range View (RV) planes to enlarge the ERF of 3D sparse convolution. To enhance feature fusion in MDConv, we introduce an attention mechanism and design an efficient multi-feature fusion (EMFF) module suitable for both 3D and 2D sparse features. To improve segmentation accuracy, we design a point-voxel constraint (PVC) module to handle edge voxels containing multiple point cloud categories, optimizing the final inference results. These modules add minimal memory and inference time but significantly improve performance compared to the baseline. Extensive experiments on the SemanticKITTI benchmark achieve excellent performance, while supplementary experiments on nuScenes also yield good results, demonstrating the superiority of MDC-Seg. The source code is available at https://github.com/OYgreat-river/MDC-Seg.
|
|
16:55-17:00, Paper ThET23.5 | |
Illumination Adaptation for SAM to Achieve Accurate Segmentation of Images Taken in Low-Light Scenes |
|
Mu, Hongmin | Beijing University of Chemical Technology |
Zhou, MengChu | New Jersey Institute of Technology |
Cao, Zhengcai | Harbin Institute of Technology |
Keywords: Semantic Scene Understanding, Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception
Abstract: Achieving accurate segmentation in low-light scenes is challenging due to 1) severe domain shift encountered when models trained on daylight data are applied to such scenes and 2) lack of large-scale fine-grained labels in low-light conditions. A natural idea is to use the generalization capabilities of segmentation foundation models like the Segment Anything Model (SAM) to address the scarcity of annotated data. However, applying SAM to low-light scenes faces a severe domain shift issue due to the lack of inductive bias in effectively transforming low-light features into natural-light ones. To address this issue, we propose to adapt SAM for low-light scenes. To reduce the reliance on labels of low-light data, we develop a self-training method that makes SAM generate source-free predictions. To reduce the domain gap between low-light target data and SAM's natural-light trained data, we design a transformation head that enhances low-light features prior to the application of SAM. We further propose a domain shift compensation loss that trains our model to select a domain-adaptation-optimal illumination-enhanced feature map. Experimental results demonstrate that our method clearly outperforms the state of the art on the Dark Zurich and Nighttime Driving datasets.
|
|
17:00-17:05, Paper ThET23.6 | |
4DRadDet: Cluster-Queried Enhanced 3D Object Detection with 4D Radar |
|
Weng, Caien | Tongji University |
Bi, Xin | College of Automotive Studies,Tongji University |
Tong, Panpan | Tongji University |
Eichberger, Arno | Graz University of Technology |
Keywords: Object Detection, Segmentation and Categorization, Intelligent Transportation Systems, Computer Vision for Automation
Abstract: 3D object detection plays a critical role in advancing autonomous driving technology. To improve perception capabilities while maintaining low costs and ensuring performance in adverse weather conditions, 4D radar has emerged as a promising alternative for 3D object detection. However, current methods fail to fully exploit raw data and density information of 4D radar point clouds to tackle challenges like sparse data and noise. To address these limitations and make use of the unique Doppler velocity information provided by 4D radar, we propose a novel approach called 4DRadDet, which uses cross-attention fusion with cluster-queried techniques for 3D object detection. The 4DRadDet model uses a specially designed incremental clustering method to cluster potential object point clouds, reducing measurement errors from limited radar angular resolution and signal multipath effects. The cross-attention feature fusion (CAFF) module enhances network performance by querying the clustered point cloud feature map, allowing the network to leverage reliable prior information from the clustered point cloud to better detect potential objects. Our experimental evaluations on the View-of-Delft (VoD) dataset demonstrate the effectiveness of 4DRadDet, showcasing state-of-the-art performance. Specifically, 4DRadDet achieves a 3D mean average precision (mAP3D) of 51.44% and a bird's-eye view mean average precision (mAPBEV) of 57.07%. Our proposed method demonstrates impressive inference times and achieves real-time detection capabilities.
|
|
17:05-17:10, Paper ThET23.7 | |
Robust Visual Localization System with HD Map Based on Joint Probabilistic Data Association |
|
Gu, Zizhen | Harbin Institute of Technology |
Cheng, Shaowu | Harbin Institute of Technology |
Wang, Chuan | Harbin Institute of Technology |
Wang, Ruihan | Harbin Institute of Technology |
Zhao, Yong | Harbin Institute of Technology |
Keywords: Autonomous Vehicle Navigation, Vision-Based Navigation, Localization
Abstract: Localization based on a high-definition (HD) map is a pivotal technology for autonomous driving. Nonetheless, establishing precise data association (DA) between detected landmarks and map landmarks presents a formidable challenge when leveraging prior information on maps. Traditional DA algorithms relying on nearest-neighbor methods only partially mitigate the ambiguity in DA caused by missed or false detections from the perception module, especially in complex and challenging environments. In this letter, we propose a novel joint probability data association (JPDA) algorithm. By integrating joint probability encompassing semantic likelihood, local spatial likelihood, and global structural likelihood of landmarks, alongside incorporating inter-frame temporal continuity of DA, the proposed algorithm can effectively rectify the erroneous DA. Additionally, we also introduce a max-mixture factor graph optimization framework, which couples the measurements of landmarks and odometry for pose estimation. Building upon these methods, a high-precision and robust visual semantic localization system employing consumer-level sensors has been developed. Experiments conducted on public datasets and real urban roads validate the efficacy of the proposed system in providing more robust and accurate localization results for autonomous driving vehicles.
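A simplified sketch of combining the three likelihood terms into a joint association score (the likelihood functions are placeholders for the paper's semantic, local spatial, and global structural terms; the temporal-continuity term and max-mixture factor-graph optimization are omitted):

```python
import numpy as np

def associate_landmarks(detections, map_landmarks,
                        semantic_lik, spatial_lik, structural_lik,
                        min_log_lik=-10.0):
    """Assign each detected landmark to a map landmark by joint likelihood.

    semantic_lik, spatial_lik, structural_lik: callables mapping a
    (detection, landmark) pair to a probability in (0, 1].
    Returns {detection index: map landmark or None (rejected association)}.
    """
    associations = {}
    for i, det in enumerate(detections):
        log_liks = [np.log(semantic_lik(det, lm))
                    + np.log(spatial_lik(det, lm))
                    + np.log(structural_lik(det, lm))
                    for lm in map_landmarks]
        best = int(np.argmax(log_liks))
        # Reject ambiguous or false detections whose joint likelihood is too low.
        associations[i] = map_landmarks[best] if log_liks[best] > min_log_lik else None
    return associations
```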
|
|
17:10-17:15, Paper ThET23.8 | |
SALON: Self-Supervised Adaptive Learning for Off-Road Navigation |
|
Sivaprakasam, Matthew | Carnegie Mellon University |
Triest, Samuel | Carnegie Mellon University |
Ho, Cherie | Carnegie Mellon University |
Aich, Shubhra | Carnegie Mellon University Robotics Institute |
Lew, Jeric Jieyi | National University of Singapore |
Adu, Isaiah | Pennsylvania State University |
Wang, Wenshan | Carnegie Mellon University |
Scherer, Sebastian | Carnegie Mellon University |
Keywords: Vision-Based Navigation, Learning from Experience, Field Robots
Abstract: Autonomous robot navigation in off-road environments presents a number of challenges due to its lack of structure, making it difficult to handcraft robust heuristics for diverse scenarios. While learned methods using hand labels or self-supervised data improve generalizability, they often require a tremendous amount of data and can be vulnerable to domain shifts. To improve generalization in novel environments, recent works have incorporated adaptation and self-supervision to develop autonomous systems that can learn from their own experiences online. However, current works often rely on significant prior data, for example minutes of human teleoperation data for each terrain type, which is difficult to scale with more environments and robots. To address these limitations, we propose SALON, a perception-action framework for fast adaptation of traversability estimates with minimal human input. SALON rapidly learns online from experience while avoiding out of distribution terrains to produce adaptive and risk-aware cost and speed maps. Within seconds of collected experience, our results demonstrate comparable navigation performance over kilometer-scale courses in diverse off-road terrain as methods trained on 100-1000x more data. We additionally show promising results on significantly different robots in different environments. Our code is available at https://theairlab.org/SALON.
|
| |